Editor's Note: Wojciech Rosinski, a machine learning scientist at the University of Warsaw, introduces the main methods of categorical feature encoding.
This is the first article in a series on feature engineering methods. In machine learning practice, feature engineering is one of the most important and loosely defined aspects. It can be considered an art, with no strict rules; creativity is key.
Feature engineering aims to create better information representations for machine learning models. Even with non-linear algorithms, we cannot model all interactions (relationships) between variables in a dataset using raw data. Therefore, we need to manually explore and process the data.
This raises a question: where does deep learning fit in? Deep learning aims to minimize the need for manual data processing, enabling models to learn appropriate data representations on their own. Deep learning performs better on data like images, speech, and text, where no additional "metadata" is provided. On tabular data, however, nothing beats gradient boosting tree methods such as XGBoost or LightGBM. Machine learning competitions prove this: in almost all winning solutions for tabular data, decision tree-based models come out on top, while deep learning models typically don't achieve such good results (although they work very well when combined with decision tree-based models ;-)).
Feature engineering leans heavily on domain knowledge. This is why different feature engineering methods should be used for each dataset, depending on the problem being solved. Still, there are some widely used methods worth trying to see if they improve model performance. HJ van Veen's presentation provides a wealth of practical information, and some of the methods below are implemented based on the descriptions in that presentation.
This article uses the KaggleDays dataset as an example, and the encoding methods are introduced with reference to that presentation.
Dataset
The data comes from Reddit and includes questions and answers. The goal is to predict the number of upvotes for each answer. This dataset is used as an example because it contains text and standard features.
Import the necessary libraries:
import gc
import numpy as np
import pandas as pd
Loading data:
X = pd.read_csv('../input/train.csv', sep="\t", index_col='id')
The dataset's columns:
['question_id',
'subreddit',
'question_utc',
'question_text',
'question_score',
'answer_utc',
'answer_text',
'answer_score']
Each `question_id` corresponds to a specific question (see `question_text`). Each `question_id` may appear multiple times because each row contains a different answer to that question (see `answer_text`). The date and time of the question and answer are provided by the `_utc` column. Information about the subreddit where the question was posted is also included. `question_score` is the number of upvotes for the question, and `answer_score` is the number of upvotes for the answer. `answer_score` is the target variable.
The data needs to be divided into training and validation subsets based on `question_id`, similar to Kaggle's division of the data into training and testing sets.
question_ids = X.question_id.unique()
question_ids_train = set(pd.Series(question_ids).sample(frac=0.8))
question_ids_valid = set(question_ids).difference(question_ids_train)

X_train = X[X.question_id.isin(question_ids_train)]
X_valid = X[X.question_id.isin(question_ids_valid)]
Categorical features and numerical features
Machine learning models can only handle numbers. Numerical (continuous, quantitative) variables are variables that can take any value within a finite or infinite range; they can be naturally represented by numbers and therefore can be used directly in the model. Raw categorical variables typically exist as strings and need to be transformed before being fed into the model.
`subreddit` is a good example of a categorical variable, containing 41 different categories, for example:
['AskReddit','Jokes','politics','explainlikeimfive','gaming']
Let's take a look at the most popular categories (X.subreddit.value_counts()[:5]):
AskReddit    275667
politics     123003
news          42271
worldnews     40016
gaming        32117
Name: subreddit, dtype: int64
An example of a numeric variable is `question_score`, whose information can be viewed using `X.question_score.describe()`.
mean      770.891169
std      3094.752794
min         1.000000
25%         2.000000
50%        11.000000
75%       112.000000
max     48834.000000
Name: question_score, dtype: float64
Categorical feature encoding
Two basic methods of categorical encoding are one-hot encoding and label encoding. One-hot encoding can be performed using pandas.get_dummies. The encoding result of a variable with K classes is a K-column binary matrix, where a value of 1 in the i-th column means that the observation belongs to the i-th class.
Label encoding directly converts categories into numbers. This functionality is provided by `pandas.factorize`, or, in pandas, columns of type `category` can be represented by `cat.codes`. Using label encoding preserves the original dimensions.
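As a quick illustration of both baseline methods, here is a minimal sketch on a made-up three-row column (the toy `df` below is ours, not from the dataset):

```python
import pandas as pd

df = pd.DataFrame({'subreddit': ['AskReddit', 'politics', 'AskReddit']})

# One-hot encoding: K binary columns for a variable with K classes
one_hot = pd.get_dummies(df['subreddit'])

# Label encoding: each category is mapped to an integer code,
# assigned in order of first appearance
codes, uniques = pd.factorize(df['subreddit'])
df['subreddit_label'] = codes  # [0, 1, 0]
```

Note that one-hot encoding multiplies the number of columns by K, while label encoding preserves the original dimensionality.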
There are also some less standard encoding methods worth trying, which may improve the model's performance. Three methods will be introduced here:
Frequency encoding
labelcount encoding
Target encoding
Frequency encoding
Frequency encoding replaces each category with its frequency, calculated on the training set. This method is sensitive to outliers, so the results can be normalized or transformed (e.g., using a logarithmic transformation). Unknown categories can be replaced with 1.
While the likelihood isn't very high, some variables might have the same frequency, which could lead to collisions—two categories encoding the same value. It's impossible to say whether this will cause model degradation or improvement, but in principle, we don't want this to happen.
def count_encode(X, categorical_features, normalize=False):
    print('Count encoding: {}'.format(categorical_features))
    X_ = pd.DataFrame()
    for cat_feature in categorical_features:
        X_[cat_feature] = X[cat_feature].astype(
            'object').map(X[cat_feature].value_counts())
        if normalize:
            X_[cat_feature] = X_[cat_feature] / np.max(X_[cat_feature])
    X_ = X_.add_suffix('_count_encoded')
    if normalize:
        X_ = X_.astype(np.float32)
        X_ = X_.add_suffix('_normalized')
    else:
        X_ = X_.astype(np.uint32)
    return X_
Let's code the subreddit column:
train_count_subreddit=count_encode(X_train,['subreddit'])
And view the results. The 5 most popular subreddits:
AskReddit    221941
politics      98233
news          33559
worldnews     32010
gaming        25567
Name: subreddit, dtype: int64
The encoding is:
221941    221941
98233      98233
33559      33559
32010      32010
25567      25567
Name: subreddit_count_encoded, dtype: int64
Essentially, this replaces the subreddit category with frequency. We can also divide by the frequency of the most frequent category to get a normalized value:
1.000000    221941
0.442609     98233
0.151207     33559
0.144228     32010
0.115197     25567
Name: subreddit_count_encoded_normalized, dtype: int64
LabelCount encoding
The method we will describe below is called LabelCount encoding, which sorts categories (ascending or descending) based on their frequency in the training set. Compared to standard frequency encoding, LabelCount has a specific advantage—it is insensitive to outliers and does not give the same encoding to different values.
def labelcount_encode(X, categorical_features, ascending=False):
    print('LabelCount encoding: {}'.format(categorical_features))
    X_ = pd.DataFrame()
    for cat_feature in categorical_features:
        cat_feature_value_counts = X[cat_feature].value_counts()
        value_counts_list = cat_feature_value_counts.index.tolist()
        if ascending:
            # Ascending order: the most frequent category gets the highest code
            value_counts_range = list(
                reversed(range(len(cat_feature_value_counts))))
        else:
            # Descending order: the most frequent category gets 0
            value_counts_range = list(range(len(cat_feature_value_counts)))
        labelcount_dict = dict(zip(value_counts_list, value_counts_range))
        X_[cat_feature] = X[cat_feature].map(labelcount_dict)
    X_ = X_.add_suffix('_labelcount_encoded')
    if ascending:
        X_ = X_.add_suffix('_ascending')
    else:
        X_ = X_.add_suffix('_descending')
    X_ = X_.astype(np.uint32)
    return X_
Encoding:
train_lc_subreddit=labelcount_encode(X_train,['subreddit'])
Descending order is used by default. The five most popular categories in the subreddit column are encoded as:
0    221941
1     98233
2     33559
3     32010
4     25567
Name: subreddit_labelcount_encoded_descending, dtype: int64
AskReddit is the most frequent category, so it was converted to 0, which is the first position.
If using ascending order, the codes for these 5 categories are as follows:
40    221941
39     98233
38     33559
37     32010
36     25567
Name: subreddit_labelcount_encoded_ascending, dtype: int64
Target encoding
Finally, the trickiest method: target encoding. It uses the mean of the target variable to encode a categorical variable. We compute a statistic (here, the mean) of the target variable for each group in the training set, and then map it onto the validation and test sets to capture the relationship between the groups and the target.
To give a more concrete example, we can calculate the average answer_score for each subreddit, so we can have a rough estimate of how many upvotes we can expect to get for a post on a particular subreddit.
When using a target variable, it is crucial to avoid revealing any information about the validation set. All features based on the target encoding should be computed on the training set, and then the validation and test sets should only be merged or concatenated. Even if the target variable is present in the validation set, it should not be used for any encoding computations, otherwise it will give an overly optimistic estimate of the validation error.
If K-fold cross-validation is used, target-based features should be computed within each fold. If only a single split is performed, then target encoding should be done after separating the training and validation sets.
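A sketch of the K-fold variant, assuming scikit-learn is available (the helper `target_encode_oof` is our own illustration, not part of the article's code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(X, cat_feature, target_feature, n_splits=5):
    # Each row is encoded with a mean computed on the *other* folds,
    # so its own target value never leaks into its encoding
    encoded = pd.Series(np.nan, index=X.index, dtype=np.float64)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, transform_idx in kf.split(X):
        fold_means = X.iloc[fit_idx].groupby(cat_feature)[target_feature].mean()
        encoded.iloc[transform_idx] = (
            X.iloc[transform_idx][cat_feature].map(fold_means).values)
    # Categories absent from a fitting fold fall back to the global mean
    return encoded.fillna(X[target_feature].mean())
```

The validation and test sets would still be encoded with means computed on the full training set, as in the function below.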
Furthermore, smoothing can be applied so that rare categories are not encoded with unreliable, extreme statistics. Another approach is to add random noise to counter potential overfitting.
When handled properly, target encoding is often the best encoding method for both linear and nonlinear models.
def target_encode(X, X_valid, categorical_features, X_test=None,
                  target_feature='target'):
    print('Target encoding: {}'.format(categorical_features))
    X_ = pd.DataFrame()
    X_valid_ = pd.DataFrame()
    if X_test is not None:
        X_test_ = pd.DataFrame()
    for cat_feature in categorical_features:
        # Group statistics are computed on the training set only
        group_target_mean = X.groupby([cat_feature])[target_feature].mean()
        X_[cat_feature] = X[cat_feature].map(group_target_mean)
        X_valid_[cat_feature] = X_valid[cat_feature].map(group_target_mean)
        if X_test is not None:
            X_test_[cat_feature] = X_test[cat_feature].map(group_target_mean)
    X_ = X_.astype(np.float32)
    X_ = X_.add_suffix('_target_encoded')
    X_valid_ = X_valid_.astype(np.float32)
    X_valid_ = X_valid_.add_suffix('_target_encoded')
    if X_test is not None:
        X_test_ = X_test_.astype(np.float32)
        X_test_ = X_test_.add_suffix('_target_encoded')
        return X_, X_valid_, X_test_
    return X_, X_valid_
Encoding:
train_tm_subreddit, valid_tm_subreddit = target_encode(
    X_train, X_valid, categorical_features=['subreddit'],
    target_feature='answer_score')
If we examine the encoded values, we'll find a significant difference in the average number of upvotes across different subreddits:
23.406061    220014
13.082699     98176
19.020845     33916
17.521887     31869
18.235424     25520
21.535477     24692
18.640282     20416
23.688890     20009
3.159401      18695
Name: subreddit_target_encoded, dtype: int64
AskReddit        220014
politics          98176
news              33916
worldnews         31869
gaming            25520
todayilearned     24692
funny             20416
videos            20009
teenagers         18695
Name: subreddit, dtype: int64
Answers on AskReddit average 23.4 upvotes, while answers on politics average only 13.1 and answers on teenagers just 3.2. Such features can be very powerful because they allow us to explicitly encode target information within the feature set.
Retrieving each category's encoded value
Without modifying the encoding function, we can map the encodings learned on the training set onto the validation or test set in the following way:
encoded = train_lc_subreddit.subreddit_labelcount_encoded_descending.value_counts().index.values
raw = X_train.subreddit.value_counts().index.values
encoding_dict = dict(zip(raw, encoded))

X_valid['subreddit_labelcount_encoded_descending'] = X_valid.loc[:, 'subreddit'].map(encoding_dict)