Dr. Soumen Atta, Ph.D.

Summary

The provided content is a comprehensive guide on various categorical data encoding techniques in Python, including label encoding, one-hot encoding, count encoding, and target encoding, with examples using the scikit-learn library and the category_encoders library.

Abstract

The article "Categorical Data Encoding Techniques in Python: A Complete Guide" offers an in-depth exploration of methods to convert categorical data into a numerical format suitable for machine learning algorithms. It begins by emphasizing the necessity of encoding non-numerical data, such as colors or types of fruits, into numerical values to be processed by machine learning models. The guide uses a sample dataset containing information about different fruits to demonstrate encoding techniques. It covers label encoding, which assigns unique integers to categories; one-hot encoding, which creates binary vectors; count encoding, which replaces categories with their occurrence count; and target encoding, which uses the mean of the target variable for encoding. The tutorial provides step-by-step Python code using libraries like pandas, scikit-learn, and category_encoders, and concludes by discussing the merits and appropriate use cases for each encoding method.

Opinions

  • The author suggests that the choice of encoding technique should be informed by the specific requirements of the machine learning problem at hand.
  • The article implies that label encoding is straightforward but may introduce an ordinal relationship that does not exist in the data.
  • One-hot encoding is presented as a method that avoids implying an ordinal relationship but can increase the dimensionality of the dataset.
  • Count encoding is described as a simple approach that reflects the frequency of categories, which can be useful for understanding category distribution.
  • Target encoding is highlighted as a powerful technique that incorporates information about the target variable, potentially leading to better model performance, but it may introduce target leakage if not used carefully.
  • The author encourages readers to explore additional related tutorials to further their understanding of categorical data encoding techniques.

Categorical Data Encoding Techniques in Python: A Complete Guide

Categorical data encoding is an important step in preparing data for machine learning algorithms. Categorical data refers to data that represents non-numerical values, such as colors, types of fruits, or gender. In order to use categorical data in machine learning models, it needs to be encoded as numerical values. In this tutorial, we will explore various techniques for categorical data encoding in Python.

We will be using pandas and the scikit-learn library for most of our examples, and the category_encoders library for target encoding. Scikit-learn is a popular library for machine learning in Python.

Let’s start by importing the necessary libraries:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

We will be using the following dataset for our examples. This dataset contains information about different types of fruits and their characteristics.

data = {'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
        'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
        'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
        'Weight': [100, 120, 80, 110, 130, 90]}

df = pd.DataFrame(data)

print(df) 

The above print statement will show the DataFrame:

    Fruit   Color  Price  Weight
0   Apple     Red   0.50     100
1  Banana  Yellow   0.25     120
2  Orange  Orange   0.30      80
3   Apple   Green   0.60     110
4  Banana  Yellow   0.35     130
5  Orange  Orange   0.40      90

Our dataset has two categorical columns: ‘Fruit’ and ‘Color’. We will explore different encoding techniques for these columns.
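As a quick check, the categorical columns can also be found programmatically rather than by eye. The following is a minimal sketch; the variable names `categorical_cols` and `cardinality` are illustrative and do not appear in the rest of the tutorial.

```python
import pandas as pd

data = {'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
        'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
        'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
        'Weight': [100, 120, 80, 110, 130, 90]}
df = pd.DataFrame(data)

# Treat every non-numeric column as categorical
categorical_cols = df.select_dtypes(exclude='number').columns.tolist()

# Number of distinct categories per categorical column
cardinality = df[categorical_cols].nunique()

print(categorical_cols)
print(cardinality)
```

The number of distinct categories matters later on, since one-hot encoding adds one column per category.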

Label Encoding

Label encoding is a technique for encoding categorical data into numerical data by assigning each category a unique integer value. Label encoding can be done using the LabelEncoder class from scikit-learn.

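# Use one encoder per column so that each fitted mapping stays available
le_fruit = LabelEncoder()
le_color = LabelEncoder()

df['Fruit'] = le_fruit.fit_transform(df['Fruit'])
df['Color'] = le_color.fit_transform(df['Color'])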

print(df)

The above print statement will show the modified DataFrame:

   Fruit  Color  Price  Weight
0      0      2   0.50     100
1      1      3   0.25     120
2      2      1   0.30      80
3      0      0   0.60     110
4      1      3   0.35     130
5      2      1   0.40      90

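In the above example, we have encoded the ‘Fruit’ and ‘Color’ columns using the LabelEncoder class. LabelEncoder sorts the categories and numbers them from 0, so the ‘Fruit’ column now has values 0, 1, and 2 corresponding to Apple, Banana, and Orange respectively, and the ‘Color’ column has values 0, 1, 2, and 3 corresponding to Green, Orange, Red, and Yellow respectively. These integers carry no real ordering, yet a model may treat them as ordered, which is the main drawback of label encoding for categories without a natural order.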
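The scikit-learn documentation describes LabelEncoder as a tool for target labels, and OrdinalEncoder is its counterpart for feature columns. The sketch below is an alternative to the code above, not the tutorial's own method. It encodes both columns in one call and shows how to recover the category-to-integer mapping; the names `fruit_mapping` and `color_mapping` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
    'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
})

# OrdinalEncoder encodes several feature columns at once and keeps
# one sorted category list per column in `categories_`
oe = OrdinalEncoder()
encoded = oe.fit_transform(df[['Fruit', 'Color']])

# Build readable category -> integer mappings from the fitted encoder
fruit_mapping = {cat: code for code, cat in enumerate(oe.categories_[0])}
color_mapping = {cat: code for code, cat in enumerate(oe.categories_[1])}

# Recover the original strings from the integer codes
decoded = oe.inverse_transform(encoded)

print(fruit_mapping)
print(color_mapping)
```

The integer codes match those produced by LabelEncoder above, but a single fitted object now holds the mapping for both columns.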

One-Hot Encoding

One-hot encoding is a technique for encoding categorical data into numerical data by creating a binary vector for each category. Each vector has a 1 in the position corresponding to the category and 0s in all other positions. One-hot encoding can be done using the OneHotEncoder class from scikit-learn.

Again, we will use the same data.

data = {'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
        'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
        'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
        'Weight': [100, 120, 80, 110, 130, 90]}

# Create a new DataFrame using the dictionary 'data'
df = pd.DataFrame(data)

One-hot encoding can be done in the following way:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# create a union of categories for both 'Fruit' and 'Color'
categories = np.union1d(df['Fruit'].unique(), df['Color'].unique())

# instantiate OneHotEncoder with the union of categories
onehot = OneHotEncoder(categories=[categories])

# fit and transform 'Fruit' column
onehot_fruit = pd.DataFrame(onehot.fit_transform(df[['Fruit']]).toarray(),
                             columns=[f'fruit_{i}' for i in categories])

# fit and transform 'Color' column
onehot_color = pd.DataFrame(onehot.fit_transform(df[['Color']]).toarray(),
                             columns=[f'color_{i}' for i in categories])

# concatenate the original DataFrame with the one-hot encoded columns
df = pd.concat([df, onehot_fruit, onehot_color], axis=1)

# drop the original 'Fruit' and 'Color' columns
df = df.drop(['Fruit', 'Color'], axis=1)

print(df)

The output of the above print statement is shown below:

   Price  Weight  fruit_Apple  fruit_Banana  fruit_Green  fruit_Orange   
0   0.50     100          1.0           0.0          0.0           0.0  \
1   0.25     120          0.0           1.0          0.0           0.0   
2   0.30      80          0.0           0.0          0.0           1.0   
3   0.60     110          1.0           0.0          0.0           0.0   
4   0.35     130          0.0           1.0          0.0           0.0   
5   0.40      90          0.0           0.0          0.0           1.0   

   fruit_Red  fruit_Yellow  color_Apple  color_Banana  color_Green   
0        0.0           0.0          0.0           0.0          0.0  \
1        0.0           0.0          0.0           0.0          0.0   
2        0.0           0.0          0.0           0.0          0.0   
3        0.0           0.0          0.0           0.0          1.0   
4        0.0           0.0          0.0           0.0          0.0   
5        0.0           0.0          0.0           0.0          0.0   

   color_Orange  color_Red  color_Yellow  
0           0.0        1.0           0.0  
1           0.0        0.0           1.0  
2           1.0        0.0           0.0  
3           0.0        0.0           0.0  
4           0.0        0.0           1.0  
5           1.0        0.0           0.0  

In the above example, the OneHotEncoder class creates a binary column for each category of the ‘Fruit’ and ‘Color’ columns of the Pandas DataFrame ‘df’. The code works as follows:

  1. Import ‘numpy’ and, from the ‘sklearn.preprocessing’ module, ‘OneHotEncoder’.
  2. Combine the unique values of the ‘Fruit’ and ‘Color’ columns into a single sorted array using ‘numpy.union1d()’. This array holds every category that occurs in either column.
  3. Create a ‘OneHotEncoder’ with the array from the previous step as its category list.
  4. Fit and transform the ‘Fruit’ column using the ‘fit_transform()’ method of the ‘OneHotEncoder’ instance. The one-hot encoded result is stored in a new DataFrame called ‘onehot_fruit’, whose column names are built from the ‘categories’ array created in Step 2.
  5. Fit and transform the ‘Color’ column using the same ‘OneHotEncoder’ instance. The result is stored in a new DataFrame called ‘onehot_color’, whose column names are also built from the ‘categories’ array.
  6. Concatenate the original DataFrame ‘df’ with ‘onehot_fruit’ and ‘onehot_color’ along the columns axis using the Pandas ‘concat()’ function.
  7. Drop the original ‘Fruit’ and ‘Color’ columns from the DataFrame ‘df’ using the ‘drop()’ method of Pandas DataFrame.
  8. Print the final DataFrame ‘df’ after one-hot encoding.

The resulting DataFrame ‘df’ has one-hot encoded columns in place of the original ‘Fruit’ and ‘Color’ columns. Because both columns share the combined category list, the output also contains columns for categories that never occur in a given column, such as ‘fruit_Green’ or ‘color_Apple’, and these are always 0.

Count Encoding

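Count encoding is a technique for encoding categorical data into numerical data by replacing each category with the number of times it appears in the dataset. The example below uses the CountVectorizer class from scikit-learn, which is designed for text: it splits each value into lowercase tokens and counts those tokens per row. For single-word categories such as ours, this yields one 0/1 column per category rather than a single column of dataset-wide counts.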

Again, we will start with the same data.

data = {'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
        'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
        'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
        'Weight': [100, 120, 80, 110, 130, 90]}

# Create a new DataFrame using the dictionary 'data'
df = pd.DataFrame(data)

Count Encoding can be achieved in the following way:

# Perform count vectorization on the 'Fruit' column of the DataFrame 'df'
cv_fruit = CountVectorizer()
cv_fruit.fit(df['Fruit'])
count_fruit = pd.DataFrame(cv_fruit.transform(df['Fruit']).toarray(), columns=cv_fruit.get_feature_names_out())

# Perform count vectorization on the 'Color' column of the DataFrame 'df'
cv_color = CountVectorizer()
cv_color.fit(df['Color'])
count_color = pd.DataFrame(cv_color.transform(df['Color']).toarray(), columns=cv_color.get_feature_names_out())

# Concatenate the original DataFrame with the count-vectorized columns
df = pd.concat([df, count_fruit, count_color], axis=1)

# Drop the original 'Fruit' and 'Color' columns
df = df.drop(['Fruit', 'Color'], axis=1)

print(df)

The output of the above print statement is shown below:

   Price  Weight  apple  banana  orange  green  orange  red  yellow
0   0.50     100      1       0       0      0       0    1       0
1   0.25     120      0       1       0      0       0    0       1
2   0.30      80      0       0       1      0       1    0       0
3   0.60     110      1       0       0      1       0    0       0
4   0.35     130      0       1       0      0       0    0       1
5   0.40      90      0       0       1      0       1    0       0

The above code is performing count vectorization on two categorical variables, ‘Fruit’ and ‘Color’, in a Pandas DataFrame ‘df’ using the scikit-learn CountVectorizer class.

First, a new Pandas DataFrame ‘df’ is created using a dictionary ‘data’. Then, two instances of CountVectorizer, ‘cv_fruit’ and ‘cv_color’, are created to perform count vectorization on ‘Fruit’ and ‘Color’ columns, respectively.

The fit method of CountVectorizer is called on each column to learn the vocabulary of the corpus, i.e., the lowercased tokens found in that column, which here coincide with its unique values. Then, the transform method is called on each column to convert the text data into a sparse matrix of token counts, with one row per input row and one column per token. The toarray method is then used to convert the sparse matrix to a dense matrix of token counts.

The output dense matrices are then converted to DataFrames, ‘count_fruit’ and ‘count_color’, using the feature names output by get_feature_names_out() method of the CountVectorizer objects.

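Finally, the original DataFrame ‘df’ is concatenated with ‘count_fruit’ and ‘count_color’ using the pd.concat() function, and the original ‘Fruit’ and ‘Color’ columns are dropped using the drop() method of the DataFrame. Since each row holds a single word, every count is 0 or 1, so the result resembles a one-hot encoding with lowercased column names. Note that the column name ‘orange’ appears twice in the output, once from ‘Fruit’ and once from ‘Color’.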
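Count encoding in the sense defined at the start of this section, where each category is replaced by the number of rows it appears in, can be sketched with plain pandas. This is an alternative to the CountVectorizer code above; the column names ‘Fruit_count’ and ‘Color_count’ are illustrative.

```python
import pandas as pd

data = {'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
        'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
        'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
        'Weight': [100, 120, 80, 110, 130, 90]}
df = pd.DataFrame(data)

# Count how many rows each category occurs in
fruit_counts = df['Fruit'].value_counts()
color_counts = df['Color'].value_counts()

# Replace each category with its count
df['Fruit_count'] = df['Fruit'].map(fruit_counts)
df['Color_count'] = df['Color'].map(color_counts)

print(df[['Fruit', 'Fruit_count', 'Color', 'Color_count']])
```

Each fruit occurs twice, so ‘Fruit_count’ is 2 in every row, while ‘Color_count’ is 1 for Red and Green and 2 for Yellow and Orange. Categories with equal counts become indistinguishable after this encoding, which is a known limitation of the approach.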

Target Encoding

Target encoding is a technique for encoding categorical data into numerical data by replacing each category with the mean of the target variable for that category. Target encoding can be done using the TargetEncoder class from the category_encoders library.

You may need to install the following before going ahead:

pip install --upgrade category_encoders

We will again use the same data.

data = {'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
        'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
        'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
        'Weight': [100, 120, 80, 110, 130, 90]}

# Create a new DataFrame using the dictionary 'data'
df = pd.DataFrame(data) 

Target encoding can be done as follows:

from category_encoders import TargetEncoder

te = TargetEncoder()

df['Fruit'] = te.fit_transform(df['Fruit'], df['Price'])
df['Color'] = te.fit_transform(df['Color'], df['Price'])

print(df)

The output of the above print statement is shown below:

      Fruit     Color  Price  Weight
0  0.421278  0.413011   0.50     100
1  0.385815  0.385815   0.25     120
2  0.392907  0.392907   0.30      80
3  0.421278  0.426022   0.60     110
4  0.385815  0.385815   0.35     130
5  0.392907  0.392907   0.40      90

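In the above example, we have encoded the ‘Fruit’ and ‘Color’ columns using target encoding, with ‘Price’ as the target variable. Note that the encoded values are not the raw per-category means: Apple has a mean price of 0.55 but is encoded as about 0.42. TargetEncoder applies smoothing, blending each category's mean with the overall mean of ‘Price’ (0.40), and with only one or two rows per category the result stays close to the overall mean. Because the encoding is computed from the target, fitting it on the same rows that are later used to evaluate a model can leak target information.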
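The relationship between the raw category means and the values in the table can be sketched by hand with pandas. The smoothing below uses a sigmoid weight with `min_samples_leaf = 20` and `smoothing = 10.0`. These parameter names and values are assumptions chosen because they reproduce the ‘Fruit’ column of the table above; the actual defaults depend on the installed version of category_encoders.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
    'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
})

# Overall mean of the target, used as the prior
prior = df['Price'].mean()

# Raw mean and row count of the target for each category
stats = df.groupby('Fruit')['Price'].agg(['mean', 'count'])

# Assumed smoothing parameters (see the note above)
min_samples_leaf = 20
smoothing = 10.0

# The weight grows with the number of rows in a category
weight = 1.0 / (1.0 + np.exp(-(stats['count'] - min_samples_leaf) / smoothing))
smoothed = prior * (1.0 - weight) + stats['mean'] * weight

df['Fruit_mean'] = df['Fruit'].map(stats['mean'])
df['Fruit_smoothed'] = df['Fruit'].map(smoothed)

print(df)
```

The ‘Fruit_mean’ column is the plain target encoding described in the definition, while ‘Fruit_smoothed’ is pulled toward the overall mean because each category has only two rows.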

Conclusion

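In this tutorial, we have discussed four techniques for encoding categorical data into numerical data in Python: label encoding, one-hot encoding, count encoding, and target encoding. Label encoding is compact but can imply an order that does not exist in the data. One-hot encoding avoids that implied order at the cost of adding one column per category. Count encoding reflects how frequent each category is, and target encoding brings in information about the target variable, which can help a model but risks target leakage if not used carefully. The choice of technique depends on the specific requirements of the problem at hand, and choosing appropriately helps ensure the accuracy and reliability of the results.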
