Editor's Note: Wojciech Rosinski, a machine learning scientist at the University of Warsaw, introduces the main methods of categorical feature encoding.
This is the first article in a series on feature engineering methods. In machine learning practice, feature engineering is one of the most important and loosely defined aspects. It can be considered an art, with no strict rules; creativity is key.
Feature engineering aims to create better information representations for machine learning models. Even with non-linear algorithms, we cannot model all interactions (relationships) between variables in a dataset using raw data. Therefore, we need to manually explore and process the data.
This raises a question: where does deep learning fit in? Deep learning aims to minimize the need for manual data processing, enabling models to learn appropriate data representations on their own. Deep learning performs better on data like images, speech, and text, where no additional "metadata" is provided. On tabular data, however, nothing beats gradient boosting tree methods such as XGBoost or LightGBM. Machine learning competitions prove this: in almost all winning solutions for tabular data, decision tree-based models come out on top, while deep learning models typically don't achieve such good results (although they work very well when combined with decision tree-based models ;-)).
Feature engineering leans heavily on domain knowledge. This is why different feature engineering methods should be used for each dataset, depending on the problem being solved. Still, there are some widely used methods worth trying to see if they improve model performance. HJ van Veen's presentation provides a wealth of practical information, and some of the methods below are implemented based on the descriptions in that presentation.
This article uses the KaggleDays dataset as an example, and the encoding methods are introduced with reference to that presentation.
Dataset
The data comes from Reddit and includes questions and answers. The goal is to predict the number of upvotes for each answer. This dataset is used as an example because it contains text and standard features.
Import the necessary libraries:
import gc
import numpy as np
import pandas as pd
Loading data:
X = pd.read_csv('../input/train.csv', sep="\t", index_col='id')
The dataset's columns:
['question_id',
'subreddit',
'question_utc',
'question_text',
'question_score',
'answer_utc',
'answer_text',
'answer_score']
Each `question_id` corresponds to a specific question (see `question_text`). Each `question_id` may appear multiple times because each row contains a different answer to that question (see `answer_text`). The date and time of the question and answer are provided by the `_utc` column. Information about the subreddit where the question was posted is also included. `question_score` is the number of upvotes for the question, and `answer_score` is the number of upvotes for the answer. `answer_score` is the target variable.
The data needs to be divided into training and validation subsets based on `question_id`, similar to Kaggle's division of the data into training and testing sets.
question_ids = X.question_id.unique()
question_ids_train = set(pd.Series(question_ids).sample(frac=0.8))
question_ids_valid = set(question_ids).difference(question_ids_train)

X_train = X[X.question_id.isin(question_ids_train)]
X_valid = X[X.question_id.isin(question_ids_valid)]
Categorical features and numerical features
Machine learning models can only handle numbers. Numerical (continuous, quantitative) variables are variables that can take any value within a finite or infinite range; they can be naturally represented by numbers and therefore can be used directly in the model. Raw categorical variables typically exist as strings and need to be transformed before being fed into the model.
`subreddit` is a good example of a categorical variable, containing 41 different categories, for example:
['AskReddit','Jokes','politics','explainlikeimfive','gaming']
Let's take a look at the most popular categories (X.subreddit.value_counts()[:5]):
AskReddit    275667
politics     123003
news          42271
worldnews     40016
gaming        32117
Name: subreddit, dtype: int64
An example of a numeric variable is `question_score`, whose information can be viewed using `X.question_score.describe()`.
mean      770.891169
std      3094.752794
min         1.000000
25%         2.000000
50%        11.000000
75%       112.000000
max     48834.000000
Name: question_score, dtype: float64
Categorical feature encoding
Two basic methods of categorical encoding are one-hot encoding and label encoding. One-hot encoding can be performed using pandas.get_dummies. The encoding result of a variable with K classes is a K-column binary matrix, where a value of 1 in the i-th column means that the observation belongs to the i-th class.
Label encoding directly converts categories into numbers. This functionality is provided by `pandas.factorize`, or, in pandas, columns of type `category` can be represented by `cat.codes`. Using label encoding preserves the original dimensions.
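As a quick illustration of both baseline methods, here is a minimal sketch on a made-up three-row column (the toy `df` below is ours, not from the dataset):

```python
import pandas as pd

df = pd.DataFrame({'subreddit': ['AskReddit', 'politics', 'AskReddit']})

# One-hot encoding: K binary columns for a variable with K classes
one_hot = pd.get_dummies(df['subreddit'])

# Label encoding: each category is mapped to an integer code,
# assigned in order of first appearance
codes, uniques = pd.factorize(df['subreddit'])
df['subreddit_label'] = codes  # [0, 1, 0]
```

Note that one-hot encoding multiplies the number of columns by K, while label encoding preserves the original dimensionality.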
There are also some less standard encoding methods worth trying, which may improve the model's performance. Three methods will be introduced here:
Frequency encoding
labelcount encoding
Target encoding
Frequency encoding
Frequency encoding replaces each category with its frequency, calculated on the training set. This method is sensitive to outliers, so the results can be normalized or transformed (e.g., using a logarithmic transformation). Unknown categories can be replaced with 1.
While the likelihood isn't very high, some variables might have the same frequency, which could lead to collisions—two categories encoding the same value. It's impossible to say whether this will cause model degradation or improvement, but in principle, we don't want this to happen.
def count_encode(X, categorical_features, normalize=False):
    print('Count encoding: {}'.format(categorical_features))
    X_ = pd.DataFrame()
    for cat_feature in categorical_features:
        X_[cat_feature] = X[cat_feature].astype(
            'object').map(X[cat_feature].value_counts())
        if normalize:
            X_[cat_feature] = X_[cat_feature] / np.max(X_[cat_feature])
    X_ = X_.add_suffix('_count_encoded')
    if normalize:
        X_ = X_.astype(np.float32)
        X_ = X_.add_suffix('_normalized')
    else:
        X_ = X_.astype(np.uint32)
    return X_
Let's code the subreddit column:
train_count_subreddit=count_encode(X_train,['subreddit'])
And view the results. The 5 most popular subreddits:
AskReddit    221941
politics      98233
news          33559
worldnews     32010
gaming        25567
Name: subreddit, dtype: int64
The encoding is:
221941    221941
98233      98233
33559      33559
32010      32010
25567      25567
Name: subreddit_count_encoded, dtype: int64
Essentially, this replaces the subreddit category with frequency. We can also divide by the frequency of the most frequent category to get a normalized value:
1.000000    221941
0.442609     98233
0.151207     33559
0.144228     32010
0.115197     25567
Name: subreddit_count_encoded_normalized, dtype: int64
LabelCount encoding
The method we will describe below is called LabelCount encoding, which sorts categories (ascending or descending) based on their frequency in the training set. Compared to standard frequency encoding, LabelCount has a specific advantage—it is insensitive to outliers and does not give the same encoding to different values.
def labelcount_encode(X, categorical_features, ascending=False):
    print('LabelCount encoding: {}'.format(categorical_features))
    X_ = pd.DataFrame()
    for cat_feature in categorical_features:
        cat_feature_value_counts = X[cat_feature].value_counts()
        value_counts_list = cat_feature_value_counts.index.tolist()
        if ascending:
            # Ascending order: the most frequent category gets the highest code
            value_counts_range = list(
                reversed(range(len(cat_feature_value_counts))))
        else:
            # Descending order: the most frequent category gets 0
            value_counts_range = list(range(len(cat_feature_value_counts)))
        labelcount_dict = dict(zip(value_counts_list, value_counts_range))
        X_[cat_feature] = X[cat_feature].map(labelcount_dict)
    X_ = X_.add_suffix('_labelcount_encoded')
    if ascending:
        X_ = X_.add_suffix('_ascending')
    else:
        X_ = X_.add_suffix('_descending')
    X_ = X_.astype(np.uint32)
    return X_
Encoding:
train_lc_subreddit=labelcount_encode(X_train,['subreddit'])
Descending order is used by default. The five most popular categories in the subreddit column are encoded as:
0    221941
1     98233
2     33559
3     32010
4     25567
Name: subreddit_labelcount_encoded_descending, dtype: int64
AskReddit is the most frequent category, so it was converted to 0, which is the first position.
If using ascending order, the codes for these 5 categories are as follows:
40    221941
39     98233
38     33559
37     32010
36     25567
Name: subreddit_labelcount_encoded_ascending, dtype: int64
Target encoding
Finally, the trickiest method: target encoding. It uses the mean of the target variable to encode a categorical variable. We compute a statistic (here, the mean) of the target variable for each group in the training set, and then map it onto the validation and test sets to capture the relationship between the groups and the target.
To give a more concrete example, we can calculate the average answer_score for each subreddit, so we can have a rough estimate of how many upvotes we can expect to get for a post on a particular subreddit.
When using a target variable, it is crucial to avoid revealing any information about the validation set. All features based on the target encoding should be computed on the training set, and then the validation and test sets should only be merged or concatenated. Even if the target variable is present in the validation set, it should not be used for any encoding computations, otherwise it will give an overly optimistic estimate of the validation error.
If K-fold cross-validation is used, target-based features should be computed within each fold. If only a single split is performed, then target encoding should be done after separating the training and validation sets.
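A sketch of the K-fold variant, assuming scikit-learn is available (the helper `target_encode_oof` is our own illustration, not part of the article's code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(X, cat_feature, target_feature, n_splits=5):
    # Each row is encoded with a mean computed on the *other* folds,
    # so its own target value never leaks into its encoding
    encoded = pd.Series(np.nan, index=X.index, dtype=np.float64)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, transform_idx in kf.split(X):
        fold_means = X.iloc[fit_idx].groupby(cat_feature)[target_feature].mean()
        encoded.iloc[transform_idx] = (
            X.iloc[transform_idx][cat_feature].map(fold_means).values)
    # Categories absent from a fitting fold fall back to the global mean
    return encoded.fillna(X[target_feature].mean())
```

The validation and test sets would still be encoded with means computed on the full training set, as in the function below.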
Furthermore, smoothing can be applied so that rare categories are not encoded with unreliable, extreme statistics. Another approach is to add random noise to counter potential overfitting.
When handled properly, target encoding is often the best encoding method for both linear and nonlinear models.
def target_encode(X, X_valid, categorical_features, X_test=None,
                  target_feature='target'):
    print('Target encoding: {}'.format(categorical_features))
    X_ = pd.DataFrame()
    X_valid_ = pd.DataFrame()
    if X_test is not None:
        X_test_ = pd.DataFrame()
    for cat_feature in categorical_features:
        # Group statistics are computed on the training set only
        group_target_mean = X.groupby([cat_feature])[target_feature].mean()
        X_[cat_feature] = X[cat_feature].map(group_target_mean)
        X_valid_[cat_feature] = X_valid[cat_feature].map(group_target_mean)
        if X_test is not None:
            X_test_[cat_feature] = X_test[cat_feature].map(group_target_mean)
    X_ = X_.astype(np.float32)
    X_ = X_.add_suffix('_target_encoded')
    X_valid_ = X_valid_.astype(np.float32)
    X_valid_ = X_valid_.add_suffix('_target_encoded')
    if X_test is not None:
        X_test_ = X_test_.astype(np.float32)
        X_test_ = X_test_.add_suffix('_target_encoded')
        return X_, X_valid_, X_test_
    return X_, X_valid_
Encoding:
train_tm_subreddit, valid_tm_subreddit = target_encode(
    X_train, X_valid, categorical_features=['subreddit'],
    target_feature='answer_score')
If we examine the encoded values, we'll find a significant difference in the average number of upvotes across different subreddits:
23.406061    220014
13.082699     98176
19.020845     33916
17.521887     31869
18.235424     25520
21.535477     24692
18.640282     20416
23.688890     20009
3.159401      18695
Name: subreddit_target_encoded, dtype: int64
AskReddit        220014
politics          98176
news              33916
worldnews         31869
gaming            25520
todayilearned     24692
funny             20416
videos            20009
teenagers         18695
Name: subreddit, dtype: int64
Answers on AskReddit average 23.4 upvotes, while answers on politics average only 13.1 and answers on teenagers just 3.2. Such features can be very powerful because they allow us to explicitly encode target information within the feature set.
Retrieving each category's encoded value
Without modifying the encoding function, we can map the encodings learned on the training set onto the validation or test set in the following way:
encoded = train_lc_subreddit.subreddit_labelcount_encoded_descending.value_counts().index.values
raw = X_train.subreddit.value_counts().index.values
encoding_dict = dict(zip(raw, encoded))

X_valid['subreddit_labelcount_encoded_descending'] = X_valid.loc[:, 'subreddit'].map(encoding_dict)