Kamala Kanta MISHRA (Kamal)

Summary

The article provides an in-depth exploration of encoding techniques in the context of feature engineering for machine learning, emphasizing the importance of converting categorical data into a format that algorithms can process.

Abstract

The article is part of a series on Exploratory Data Analysis (EDA) and Feature Engineering, with a focus on encoding techniques. It underscores the critical role of data preprocessing in the CRISP-DM methodology, particularly during the Data Understanding and Data Preparation phases. The author discusses various encoding methods, including One Hot Encoding, Label Encoding, Ordinal Encoding, Helmert Encoding, and Binary Encoding, and provides practical examples using Python. The article aims to guide data scientists on choosing appropriate encoding techniques based on the data and the machine learning task at hand. It also includes illustrations and code snippets to demonstrate the application of these techniques in a use case involving server data and chiller temperature predictions.

Opinions

  • The author believes that feature engineering, especially encoding, is a vital step in the data science process, as it enables machine learning algorithms to interpret and process categorical data.
  • The article suggests that the choice of encoding technique should be informed by the nature of the categorical data (nominal, ordinal) and the specific requirements of the machine learning model being used.
  • One Hot Encoding is presented as a very popular method that creates one additional column per category, at the cost of increased dimensionality.
  • Label Encoding is seen as a straightforward approach to convert categorical data into numerical form, though it may not always preserve the ordinal nature of the data.
  • Ordinal Encoding is preferred when the categories have a natural order that should be maintained in the encoding process.
  • Helmert Encoding is suggested for ordered categorical data, where the mean of the dependent variable for a level is compared to the mean of all previous levels.
  • Binary Encoding is proposed as an efficient method for handling high cardinality categorical features, serving as a middle ground between One Hot and hash encoding by reducing dimensionality.
  • The author emphasizes the practical utility of these encoding techniques through examples and encourages readers to engage with the provided Python implementation for a deeper understanding.
  • The article concludes with a disclaimer that the insights shared are based on the author's personal experiences and analysis, inviting feedback and further discussion on the topic.

EDA and Feature Engg Series: Encoding

Exploratory Data Analysis and Feature Engineering Series: Encoding techniques with multiple approaches in practical use case considerations

The CRISP-DM methodology in any Data Science program involves key stages around the “Data Understanding” and “Data Preparation” phases. We spend the majority of our time and effort during these stages. This series is in continuation with other methods that I have published.

Please refer here for the post around “handling missing values” and here for the post around “handling outliers”.

In this article, we will focus on encoding techniques. Feature Engineering is an extremely critical step in the Data Science process, and encoding is a very useful part of data pre-processing. Handling non-numeric data for use by machine learning algorithms is not something we can avoid, and it presents its own set of challenges.

Image by Author

We will attempt to focus on the following:

· What is Encoding?

· Different types of Encoding techniques

· Consider a few of them such as One Hot Encoding, Label Encoding, Ordinal Encoding, Helmert Encoding, Binary Encoding

· Encoding overall approach — when to use what?

A) What is Encoding:

Encoding helps us convert categorical data, whether nominal (e.g. Green, Yellow, Blue, Red, or Dog, Cat, Cow) or ordinal (e.g. High, Medium, Low), into a numeric format that the machine can process. Machine learning algorithms operate on mathematical vectors with little context, so categorical variables must be converted to a numeric format before they can be used.

Image by Author

We will see various types of encoding techniques and how to use them.

B) Different Types of Encoding:

There are many different types of Encoding. I have attempted to illustrate some of the important types such as One Hot Encoding, Label Encoding, Ordinal Encoding, Helmert Encoding, WoE (Weight of Evidence) Encoding, Leave-One-Out Encoding, James-Stein Encoding, M-Estimator Encoding, Binary Encoding, etc.

Image by Author

We will pick five of these and understand them through illustrations and examples using Python.

C) Example of a few Encoding techniques:

We will focus on five of them with an example to compare.

  1. One Hot Encoding:

One Hot Encoding is a very popular method. It is used to split each of the categories into an additional column. It increases dimensionality as part of encoding since it creates additional columns/attributes for various categories.

We have considered an example of servers, their respective chiller temperatures, and model type. The label is the target feature which is either 1 or 0. The goal is to predict chiller temperatures by formulating a machine learning predictive model.

Image by Author

It creates N columns, one for each category. In every row, the column that matches the row's category is set to 1 and the rest of the columns are set to 0.

  • For regression, we use N-1 columns (drop the first or last of the new One Hot Encoded columns).
  • For classification, the recommendation is to use all N columns without dropping any, since most tree-based algorithms build their trees from all available variables.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create sample data in the dataframe
my_data = {'Id':[7493,7494,7495,7496,7497,7498,7499,7500,7501,7502],
           'Chiller_Temp':['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Hot'],
           'Model':['RX','YX','BX','BX','RX','YX','RX','BX','YX','YX'],
           'Target':[1,1,1,0,1,0,1,0,1,1]}
df2 = pd.DataFrame(my_data)
df2
Image by Author
# Create the encoder; categories not seen during fit are encoded as all zeros
ohe = OneHotEncoder(handle_unknown='ignore')

# Fit on the Chiller_Temp column (reshaped to 2D) and convert the sparse result to a dense array
ohet = ohe.fit_transform(df2.Chiller_Temp.values.reshape(-1,1)).toarray()

# Name each new column after its category, e.g. ChillerTemp_Hot
dfOneHot = pd.DataFrame(ohet, columns=["ChillerTemp_" + str(cat) for cat in ohe.categories_[0]])

# Append the encoded columns to the original dataframe
dfh = pd.concat([df2, dfOneHot], axis=1)

dfh
Image by Author

Linear regression has access to all of the features while it is being trained and hence examines the whole set of dummy variables together. This means that N-1 binary variables give complete information about the original categorical variable, because the dropped category is implied when all the others are 0. This approach can be adopted for any machine learning algorithm that looks at all the features simultaneously during training, for example support vector machines, neural networks, and clustering algorithms.
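To make the N versus N-1 choice concrete, here is a minimal sketch using pandas `get_dummies` rather than the `OneHotEncoder` shown above. The `ChillerTemp` prefix and the variable names `full` and `reduced` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Same chiller temperature values as in the sample data.
df = pd.DataFrame({
    "Chiller_Temp": ["Hot", "Cold", "Very Hot", "Warm", "Hot",
                     "Warm", "Warm", "Hot", "Hot", "Hot"],
})

# All N = 4 dummy columns, one per category.
full = pd.get_dummies(df["Chiller_Temp"], prefix="ChillerTemp", dtype=int)

# N-1 = 3 columns: the first category in sorted order ("Cold") is dropped
# and becomes the baseline, represented by a row of all zeros.
reduced = pd.get_dummies(df["Chiller_Temp"], prefix="ChillerTemp",
                         drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

In `reduced`, a server with a Cold chiller is the row where every dummy column is 0, so no information is lost by dropping that column.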

2. Label Encoding:

It is used to transform categorical features into numerical features by assigning a numerical value to each of the categories. If we consider the same example of our Servers and their chiller temperature, we can do the following:

Image by Author

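It does not increase dimensionality, as it does not create additional features for each category. In the example above, [Hot, Cold, Very Hot, Warm] will be encoded to [1, 0, 2, 3], because scikit-learn's LabelEncoder assigns the codes in sorted (alphabetical) order of the category names.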

It can be used for ordinal variables as well; however, it does not guarantee that the codes follow the natural order of the categories. Therefore, it may not preserve the ordinal sequence.
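The point about ordering can be checked with a small sketch. It fits a `LabelEncoder` on the four temperature levels and prints the mapping it learned; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

temps = ["Hot", "Cold", "Very Hot", "Warm"]

encoder = LabelEncoder()
codes = encoder.fit_transform(temps)

# classes_ holds the categories in sorted (alphabetical) order;
# the position of a category in this array is its assigned code.
mapping = {label: int(code) for label, code in zip(temps, codes)}

print(list(encoder.classes_))
print(mapping)
```

Here Warm receives a larger code than Very Hot, which is why the encoded values should not be read as a temperature ranking.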

There could be many examples and a few are illustrated below.

Image by Author

The code example is illustrated below.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create sample data in the dataframe
my_data = {'Id':['AIX-7493','AIX-7494','AIX-7495','AIX-7496','AIX-7497','AIX-7498','AIX-7499','AIX-7500','AIX-7501','AIX-7502'],
           'Chiller_Temp':['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Hot'],
           'Model':['RX','YX','BX','BX','RX','YX','RX','BX','YX','YX'],
           'Target':[1,1,1,0,1,0,1,0,1,1]}
df = pd.DataFrame(my_data)
df
Image by Author
# Creating instance of labelencoder
labelencoder = LabelEncoder()

# Assigning numerical values and storing in another column
df['ChillerTempEncoded'] = labelencoder.fit_transform(df['Chiller_Temp'])
df
Image by Author

3. Ordinal Encoding:

It is used to transform categorical features into numerical features by assigning a numerical value to each of the categories. It preserves the ordinal sequence.

Image by Author

A few examples are illustrated below.

Image by Author
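A minimal sketch of ordinal encoding on the same chiller data is given below. It uses a pandas ordered `Categorical` instead of a dedicated encoder class, and the order in `temp_order` (coldest to hottest) is my assumption about the intended ranking, not something stated in the data.

```python
import pandas as pd

df = pd.DataFrame({
    "Chiller_Temp": ["Hot", "Cold", "Very Hot", "Warm", "Hot",
                     "Warm", "Warm", "Hot", "Hot", "Hot"],
})

# Assumed order, from coldest to hottest.
temp_order = ["Cold", "Warm", "Hot", "Very Hot"]

# An ordered Categorical assigns codes by position in temp_order.
ordered = pd.Categorical(df["Chiller_Temp"], categories=temp_order, ordered=True)
df["ChillerTempOrdinal"] = ordered.codes

print(df)
```

Any value that is not listed in `temp_order` would get the code -1, so it is worth checking for that before training. scikit-learn's `OrdinalEncoder` accepts an explicit `categories` list and can be used in the same way inside a pipeline.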

4. Helmert Encoding:

Like One Hot Encoding, it increases dimensionality. We have taken the same example of servers with different levels of chiller temperature.

In this technique, the mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels.

Image by Author

It can be solved in matrix form using RREF (Reduced Row-Echelon Form). A matrix is in RREF if it satisfies the following: in each row, the left-most nonzero entry is 1, and the column that contains this 1 has all other entries equal to 0. This 1 is called a leading 1. You can check on RREF here.

This method is typically useful when the levels of the categorical feature (e.g. Chiller_Temp in the above example) are ordered in a meaningful way.

Please refer here for usage with category_encoders, a package of scikit-learn-compatible encoders.

class category_encoders.helmert.HelmertEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')
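To show what "compared to the mean over all previous levels" looks like in numbers, here is a minimal NumPy sketch that builds a Helmert-style contrast matrix by hand. It is not the category_encoders implementation. The helper `helmert_contrast_matrix`, the level order in `temp_order`, and the column names are all hypothetical, and the library's output may differ (for example in column naming, level ordering, or an added intercept column).

```python
import numpy as np
import pandas as pd

def helmert_contrast_matrix(n_levels):
    """Return an (n_levels x n_levels-1) Helmert-style contrast matrix."""
    matrix = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        matrix[: j + 1, j] = -1   # all previous levels
        matrix[j + 1, j] = j + 1  # the level being compared against them
    return matrix

# Assumed order of the levels, from coldest to hottest.
temp_order = ["Cold", "Warm", "Hot", "Very Hot"]
contrasts = pd.DataFrame(
    helmert_contrast_matrix(len(temp_order)),
    index=temp_order,
    columns=["ChillerTemp_H1", "ChillerTemp_H2", "ChillerTemp_H3"],
)

# Encode a column by looking up the contrast row of each value.
temps = pd.Series(["Hot", "Cold", "Very Hot", "Warm"], name="Chiller_Temp")
encoded = contrasts.loc[temps].reset_index(drop=True)

print(contrasts)
print(encoded)
```

Each column sums to zero, and column k contrasts level k+1 with the k levels before it, so four temperature levels produce three new features.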

5. Binary Encoding:

Image by Author

The binary encoding technique is a combination of One Hot encoding and hash encoding. Each category is first mapped to an integer, and that integer is then written in binary, with one column per binary digit. It is a good idea to use this method when there is a high number of categories. It also acts as a kind of dimensionality reduction, since it uses fewer features than One Hot encoding.

For example, if there are 40–50 models in the server category and we would like to encode them, binary encoding is a good choice: it needs only 6 columns, where One Hot encoding would need 40–50. Another example: in an airline analysis, we may want to explore source-to-destination flights and operational effectiveness from a specific source airport. Since the destinations span a large number of cities, we can use binary encoding instead of One Hot encoding to avoid creating too many additional features.
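The column savings can be seen in a minimal pandas sketch. This is a hand-written illustration of the idea, not the category_encoders `BinaryEncoder`. The helper `binary_encode`, the 50 made-up model names, and the choice to number categories from 1 in sorted order are all assumptions for the example.

```python
import pandas as pd

def binary_encode(series, prefix):
    """Encode a categorical Series as binary digit columns (a sketch)."""
    categories = sorted(series.unique())
    # Step 1: map each category to an integer, starting at 1.
    codes = series.map({cat: i for i, cat in enumerate(categories, start=1)})
    # Step 2: number of binary digits needed for the largest integer.
    n_bits = len(categories).bit_length()
    # Step 3: one column per binary digit, most significant digit first.
    columns = {}
    for bit in range(n_bits):
        shift = n_bits - 1 - bit
        columns[f"{prefix}_{bit}"] = (codes // (2 ** shift)) % 2
    return pd.DataFrame(columns, index=series.index)

# 50 hypothetical server models: M01, M02, ..., M50.
models = pd.Series([f"M{i:02d}" for i in range(1, 51)], name="Model")
encoded = binary_encode(models, "Model")

print(encoded.shape)
print(encoded.head())
```

Fifty categories fit into six columns here, compared with fifty columns for One Hot encoding. The trade-off is that individual columns no longer correspond to a single category.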

D) Encoding overall approach — when to use what?

Here is a summary of encoding techniques based on certain situations.

Image by Author

That’s all for now. I will come back with more techniques as part of the “EDA and Feature Engineering” series. Please feel free to provide your valuable feedback or comments and clap if you like and if there is some value for you.

The python implementation file can be referred to here.

Disclaimer: The postings here are a personal point of view from my experiences, analysis, thoughts, and readings from various sources.
