Advanced categoric encoding like a pro: OneHot, MeanTarget, WOE, Frequency, Factorization

A common task for most traditional Data Science projects is transformation of categoric variables.

Most times categoric variables are represented by some text values. E.g. [‘red’, ‘green’, ‘blue’], which need to be represented by some meaningful numeric equivalent.

There are various different ways, and some of them depend on the type of final model to be trained.

In this article I will describe the top 5 best practice options with explanation and code in python. We will be working with titanic dataset from Kaggle. Download it to follow along with the examples.

import pandas as pd
train = pd.read_csv('titanic/train.csv')
test = pd.read_csv('titanic/test.csv')

train.head()

Okay. So all the further examples will be demonstrated on encoding the categoric column ‘Embarked’ with 3 distinct categories ‘S’, ‘C’, ‘Q’. It includes missing values too…

Let’s go over all the possible options.

OneHotEncoding

First logic, then code

Logic

One-hot-encoding is simply a representation of a single categoric column by a number of binary columns. This is a simples way to encode a categoric column that would fit all the machine learning algorithms out there. It is not a very strong technique, but certainly the safest, which will ensure less overfitting and preserve true categoric nature of the feature. So here is how the feature will change eventually:

Main drawback is a possibility to dramatically increase you dataset size (in width). Say you’ve had 10 columns with 20 unique categories in each — this will add 200 (10*20) new columns to your dataset increasing it 20-fold.

Other things to consider:

Missing values can be either represented by a dedicated column or just omitted.
One of the three categories could be safely dropped, because if it is not one of the two remaining (0 and 0), then logically it is the third state.

Code

There are multiple different options to develop a function, including the built in methods in pandas, but all of them require a few extra steps that not many will appreciate.

Fortunately there is one option that does it all.

$ pip install verstack

This framework is designed to specifically work with a pandas DataFrame. You just need to pass data (containing the column that needs encoding) and a ‘column name’ to fit_transform and receive that same dataset with the variable encoded.

Another important feature is an option to use the fitted encoder to transform your test set. If you don’t have a test set right away (as it usually is in real life), you can serialize the fitted encoder with pickle and use it whenever you need.

Factorization

Logic

This is the simples encoding option — just represent string categories by integers. The reason I didn’t describe it at the start, is because this technique should be used with caution. When you represent ‘S’, ‘C’, ‘Q by 0,1,2 — this column becomes ordinal == order matters. Like grades in school 0 < 1 < 2. So linear algorithms will treat it like an ordinal feature. When in reality you can not say that ‘S’ > ‘Q’.

It is not a problem for tree based algorithms though.

Code

BTW: Factorizer is also applicable for binary string variables transformation. ‘Male’ / ‘Female’ in the ‘Sex’ column.

FrequencyEncoding

Logic

As the name infers — categorical variables are encoded by their frequency in the dataset. So basically this can be represented as their value counts.

print(train['Embarked'].value_counts())

>>> S    644
>>> C    168
>>> Q     77

train['Embarked'].value_counts()/len(train)

>>> S    0.722783
>>> C    0.188552
>>> Q    0.086420

But again one has to deal with NaNs, with potentially missing or additional categories in the test set, etc.

So we just use our secret weapon.

Such encoding is also suitable for most algorithms. It is worth a try among others and compare the validationi results.

MeanTargetEncoding

One of the more powerful techniques that represents each category as a mean of the target value for all the rows which are represented with this category. This option is suitable for training any ML algorithm.

# given that our target is 'Survived'
print(train.groupby('Embarked')['Survived'].mean())

>>> C    0.553571
>>> Q    0.389610
>>> S    0.336957

Not useful for multiclass classification problems at all — this method is way better for regression problems.

But even for a binary classification (from our example above we) we can see right away that the probability of ‘Survived’ == 1 is highest for the passengers that had Embarked aboard Titanic in port ‘C’.

This encoding type is a model of it’s own that tells us what the target should be (on average) for a given category. Plain application of means to categories will most definitely lead to overfitting the model. Thus some kind of smoothing (noize) must be applied to the encoded valus for the train set. Note: test set doesn’t need smoothing.

There are different encoding patterns for smoothed mean target encoding: adding random noize with a defined distribution, leave-one-out encoding, expanding mean encoding, etc… verstack uses k-fold encoding: train set is broken into 5 folds and each of the folds is encoded by means calculated on the other 4 folds.

And one feature I forgot to mentioned. With any of the encoders in this packageverstack can inverse transform train_encoded and test_encoded with a built in method. If for easier encoding methods like verstack.Factorizer it should not be a problem, then for OneHotEncoder which generates multiple columns, not to mention the more advanced methods which use noize generation techniques, doing this maniually is not a trivial task.

Code

WeightOrEvidenceEncoder

Logic

Powerful encoding option — suitable for binary classification problems only.

Categoric variable’s Weight of Evidence in regards to the binary target represents the probability (sort of) of the observation to belong to one of the classes (0 or 1). If encoded value is negative — it represents a category that is more heavily enclided to the negative target class (0) and positive WOE leans towards the positive class (1).

for each unique value of the feature consider the corresponding rows in the training set
compute what percentage of positives is in these rows, compared to the whole set
compute what percentage of negatives is in these rows, compared to the whole set
take the ratio of these percentages
take the natural logarithm of that ratio to get the weight of evidence corresponding to this feature. WOE is either positive or negative according to whether the feature category is more representative of positives or negatives

This strong technique also needs some kind of smoothing in order not to overfit the model to the train set.

Code

These methods will get you all set for dealing with most of the categoric variables out there. Try different encoding options, combine different encodings of the same variable in one dataset, validate and eliminate the ones that are less useful.

Detailed documentation for all of these methods is linked below.

And be sure to check out the rest of the tools verstack has to offer.

The package includes solutions to some day-to-day tasks that didn’t have convenient solutions before.

Current modules:

verstack.PandasOptimizer — automatic memory optimization when reading data into pandas. One-liner for 5-fold memory footprint reduction & significant training time decrease Medium article
verstack.FeatureSelector — automated feature selection tool based on quick recursive feature elimination by various ML models Medium article
verstack.Multicore — parallelise any function with a single line of code (by far the most popular tool) Medium article
verstack.Stacker — automated ensembling factory; create multilayer stacking ensembles with a few lines of code Medium article
verstack.DateParser — ultimate DateParser class that automatically finds and parses datetime feats from all the possible datetime formats in you dataframe Medium article
verstack.LGBMTuner — full blown LGBM tuning with with a single line of code (with optuna under the hood) Medium article
verstack.NaNImputer — impute all the NaN values by machine learning with a single line of code Medium article
verstack.ThreshTuner — automatic threshold selection for getting most out of the binary classification predicted probabilities Medium article
stratified_continuous_split — continuous data stratification Medium article
categoric encoders Factorizer OneHotEncoder FrequencyEncoder MeanTargetEncoder WeightOfEvidenceEncoder
timer — convenient timer to measure any function execution

Feel free to check it out.

Links

categoric encoders documentation

verstack documentation