Categorical Data Encoding Techniques in Python: A Complete Guide

Categorical data encoding is an important step in preparing data for machine learning. Categorical data represents non-numerical values, such as colors, types of fruit, or gender. Most machine learning models expect numerical input, so categorical data must first be encoded as numbers. In this tutorial, we will explore several techniques for encoding categorical data in Python.

We will be using pandas and the scikit-learn library for our examples, plus the category_encoders package for target encoding. Scikit-learn is a popular library for machine learning in Python.

Let’s start by importing the necessary libraries:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

We will be using the following dataset for our examples. This dataset contains information about different types of fruits and their characteristics.

data = {'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
        'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
        'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
        'Weight': [100, 120, 80, 110, 130, 90]}

df = pd.DataFrame(data)

print(df)

The above print statement will show the DataFrame:

    Fruit   Color  Price  Weight
0   Apple     Red   0.50     100
1  Banana  Yellow   0.25     120
2  Orange  Orange   0.30      80
3   Apple   Green   0.60     110
4  Banana  Yellow   0.35     130
5  Orange  Orange   0.40      90

Our dataset has two categorical columns: ‘Fruit’ and ‘Color’. We will explore different encoding techniques for these columns.
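
On a larger dataset the categorical columns may not be obvious at a glance. The following is a minimal sketch of one way to list them, assuming that every non-numeric column should be treated as categorical (an assumption that holds for this toy dataset but not for every dataset). The name categorical_cols is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
    'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
    'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
    'Weight': [100, 120, 80, 110, 130, 90],
})

# Assumption: any column that is not numeric is categorical
categorical_cols = df.select_dtypes(exclude='number').columns.tolist()
print(categorical_cols)

# Number of distinct categories in each categorical column
print(df[categorical_cols].nunique())
```

The number of distinct categories matters later, because some encodings create one new column per category.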

Label Encoding

Label encoding is a technique for encoding categorical data into numerical data by assigning each category a unique integer value. Label encoding can be done using the LabelEncoder class from scikit-learn.

le = LabelEncoder()

df['Fruit'] = le.fit_transform(df['Fruit'])
df['Color'] = le.fit_transform(df['Color'])

print(df)

The above print statement will show the modified DataFrame:

   Fruit  Color  Price  Weight
0      0      2   0.50     100
1      1      3   0.25     120
2      2      1   0.30      80
3      0      0   0.60     110
4      1      3   0.35     130
5      2      1   0.40      90

In the above example, we have encoded the ‘Fruit’ and ‘Color’ columns using the LabelEncoder class, which numbers the categories in sorted order. The ‘Fruit’ column now has values 0, 1, and 2 corresponding to Apple, Banana, and Orange respectively. The ‘Color’ column now has values 0, 1, 2, and 3 corresponding to Green, Orange, Red, and Yellow respectively. Because the same encoder object is fitted twice, it retains only the ‘Color’ mapping afterwards.
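
If the integer codes need to be mapped back to the original labels later, one option is to keep a separate encoder for each column. The sketch below is one possible way to do this; the encoders dictionary and the ‘_code’ column suffix are illustrative names, not part of scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
    'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
})

# Keep one fitted encoder per column so that each mapping stays available
encoders = {}
for col in ['Fruit', 'Color']:
    enc = LabelEncoder()
    df[col + '_code'] = enc.fit_transform(df[col])
    encoders[col] = enc

# classes_ holds the categories in sorted order; the position is the integer code
print(list(encoders['Color'].classes_))

# inverse_transform converts integer codes back to the original labels
recovered = encoders['Fruit'].inverse_transform(df['Fruit_code'])
print(list(recovered))
```

Writing the codes to new columns also keeps the original labels in the DataFrame for comparison.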

One-Hot Encoding

One-hot encoding is a technique for encoding categorical data into numerical data by creating a binary vector for each category. Each vector has a 1 in the position corresponding to the category and 0s in all other positions. One-hot encoding can be done using the OneHotEncoder class from scikit-learn.

Again, we will use the same data.

data = {'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
        'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
        'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
        'Weight': [100, 120, 80, 110, 130, 90]}

# Create a new DataFrame using the dictionary 'data'
df = pd.DataFrame(data)

One-hot encoding can be done in the following way:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# create a union of categories for both 'Fruit' and 'Color'
categories = np.union1d(df['Fruit'].unique(), df['Color'].unique())

# instantiate OneHotEncoder with the union of categories
onehot = OneHotEncoder(categories=[categories])

# fit and transform 'Fruit' column
onehot_fruit = pd.DataFrame(onehot.fit_transform(df[['Fruit']]).toarray(),
                             columns=[f'fruit_{i}' for i in categories])

# fit and transform 'Color' column
onehot_color = pd.DataFrame(onehot.fit_transform(df[['Color']]).toarray(),
                             columns=[f'color_{i}' for i in categories])

# concatenate the original DataFrame with the one-hot encoded columns
df = pd.concat([df, onehot_fruit, onehot_color], axis=1)

# drop the original 'Fruit' and 'Color' columns
df = df.drop(['Fruit', 'Color'], axis=1)

print(df)

The output of the above print statement is shown below:

   Price  Weight  fruit_Apple  fruit_Banana  fruit_Green  fruit_Orange   
0   0.50     100          1.0           0.0          0.0           0.0  \
1   0.25     120          0.0           1.0          0.0           0.0   
2   0.30      80          0.0           0.0          0.0           1.0   
3   0.60     110          1.0           0.0          0.0           0.0   
4   0.35     130          0.0           1.0          0.0           0.0   
5   0.40      90          0.0           0.0          0.0           1.0   

   fruit_Red  fruit_Yellow  color_Apple  color_Banana  color_Green   
0        0.0           0.0          0.0           0.0          0.0  \
1        0.0           0.0          0.0           0.0          0.0   
2        0.0           0.0          0.0           0.0          0.0   
3        0.0           0.0          0.0           0.0          1.0   
4        0.0           0.0          0.0           0.0          0.0   
5        0.0           0.0          0.0           0.0          0.0   

   color_Orange  color_Red  color_Yellow  
0           0.0        1.0           0.0  
1           0.0        0.0           1.0  
2           1.0        0.0           0.0  
3           0.0        0.0           0.0  
4           0.0        0.0           1.0  
5           1.0        0.0           0.0

In the above example, we have encoded the ‘Fruit’ and ‘Color’ columns using one-hot encoding, with the OneHotEncoder class creating a binary vector for each row. Because both columns share a single category list, the result also contains columns that are always zero, such as ‘fruit_Green’ and ‘color_Apple’.

Here is what the code does:

1. Import ‘numpy’, and ‘OneHotEncoder’ from the ‘sklearn.preprocessing’ module.
2. Combine the unique values of the ‘Fruit’ and ‘Color’ columns into a single sorted array using ‘numpy.union1d()’. This array holds every category that occurs in either column.
3. Create a ‘OneHotEncoder’ with the category array from Step 2.
4. Fit and transform the ‘Fruit’ column using the ‘fit_transform()’ method of the ‘OneHotEncoder’ instance. This creates a one-hot encoded representation of the ‘Fruit’ column and stores it in a new DataFrame called ‘onehot_fruit’. Its column names are built from the ‘categories’ array created in Step 2.
5. Fit and transform the ‘Color’ column using the same ‘OneHotEncoder’ instance. This creates a one-hot encoded representation of the ‘Color’ column and stores it in a new DataFrame called ‘onehot_color’. Its column names are also built from the ‘categories’ array created in Step 2.
6. Concatenate the original DataFrame ‘df’ with the one-hot encoded DataFrames (‘onehot_fruit’ and ‘onehot_color’) along the columns axis using the ‘pd.concat()’ function.
7. Drop the original ‘Fruit’ and ‘Color’ columns from the DataFrame ‘df’ using its ‘drop()’ method.
8. Print the final DataFrame ‘df’ after one-hot encoding.

The resulting DataFrame ‘df’ will have one-hot encoded columns for both ‘Fruit’ and ‘Color’ columns, and the original columns will be dropped.

Count Encoding

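Count encoding is usually understood as replacing each category with the number of times it appears in the dataset. The example in this section instead uses the CountVectorizer class from scikit-learn, which is designed for text: it splits each value into tokens and counts, for every row, how often each token occurs in that row. For single-word categories such as ours, this yields one 0/1 indicator column per category rather than a dataset-wide frequency.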

Again, we will start with the same data.

data = {'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
        'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
        'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
        'Weight': [100, 120, 80, 110, 130, 90]}

# Create a new DataFrame using the dictionary 'data'
df = pd.DataFrame(data)

Count Encoding can be achieved in the following way:

# Perform count vectorization on the 'Fruit' column of the DataFrame 'df'
cv_fruit = CountVectorizer()
cv_fruit.fit(df['Fruit'])
count_fruit = pd.DataFrame(cv_fruit.transform(df['Fruit']).toarray(), columns=cv_fruit.get_feature_names_out())

# Perform count vectorization on the 'Color' column of the DataFrame 'df'
cv_color = CountVectorizer()
cv_color.fit(df['Color'])
count_color = pd.DataFrame(cv_color.transform(df['Color']).toarray(), columns=cv_color.get_feature_names_out())

# Concatenate the original DataFrame with the count-vectorized columns
df = pd.concat([df, count_fruit, count_color], axis=1)

# Drop the original 'Fruit' and 'Color' columns
df = df.drop(['Fruit', 'Color'], axis=1)

print(df)

The output of the above print statement is shown below:

   Price  Weight  apple  banana  orange  green  orange  red  yellow
0   0.50     100      1       0       0      0       0    1       0
1   0.25     120      0       1       0      0       0    0       1
2   0.30      80      0       0       1      0       1    0       0
3   0.60     110      1       0       0      1       0    0       0
4   0.35     130      0       1       0      0       0    0       1
5   0.40      90      0       0       1      0       1    0       0

The above code is performing count vectorization on two categorical variables, ‘Fruit’ and ‘Color’, in a Pandas DataFrame ‘df’ using the scikit-learn CountVectorizer class.

First, a new Pandas DataFrame ‘df’ is created using a dictionary ‘data’. Then, two instances of CountVectorizer, ‘cv_fruit’ and ‘cv_color’, are created to perform count vectorization on ‘Fruit’ and ‘Color’ columns, respectively.

The fit method of CountVectorizer is called on each column to learn the vocabulary of the corpus, i.e., the unique tokens in each column. By default the values are lowercased first, which is why the resulting column names are in lowercase. Then, the transform method is called on each column to convert the text data into a sparse matrix of token counts, with one row per original row. The toarray method is then used to convert the sparse matrix to a dense matrix of token counts.

The output dense matrices are then converted to DataFrames, ‘count_fruit’ and ‘count_color’, using the feature names output by get_feature_names_out() method of the CountVectorizer objects.

Finally, the original DataFrame ‘df’ is concatenated with ‘count_fruit’ and ‘count_color’ using the pd.concat() function. The original ‘Fruit’ and ‘Color’ columns are then dropped using the drop() method of the DataFrame. Note that the result contains two columns named ‘orange’, one from each vocabulary.
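
For comparison, count encoding in the sense of replacing each category with its frequency in the dataset can be sketched with pandas alone. This is a minimal illustration rather than the method used above, and the ‘_count’ column suffix is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
    'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
    'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
    'Weight': [100, 120, 80, 110, 130, 90],
})

for col in ['Fruit', 'Color']:
    # value_counts gives the number of rows for each category
    counts = df[col].value_counts()
    # map replaces each category with its count
    df[col + '_count'] = df[col].map(counts)

print(df[['Fruit', 'Fruit_count', 'Color', 'Color_count']])
```

Each fruit appears twice, so ‘Fruit_count’ is 2 in every row, while ‘Color_count’ is 1 for Red and Green and 2 for Yellow and Orange. Categories with equal frequency receive the same code, which is a known limitation of this encoding.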

Target Encoding

Target encoding is a technique for encoding categorical data into numerical data by replacing each category with the mean of the target variable for that category. Target encoding can be done using the TargetEncoder class from the category_encoders library.

You may need to install the following before going ahead:

pip install --upgrade category_encoders

We will again use the same data.

data = {'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
        'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
        'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
        'Weight': [100, 120, 80, 110, 130, 90]}

# Create a new DataFrame using the dictionary 'data'
df = pd.DataFrame(data)

Target encoding can be done as follows:

from category_encoders import TargetEncoder

te = TargetEncoder()

df['Fruit'] = te.fit_transform(df['Fruit'], df['Price'])
df['Color'] = te.fit_transform(df['Color'], df['Price'])

print(df)

The output of the above print statement is shown below:

      Fruit     Color  Price  Weight
0  0.421278  0.413011   0.50     100
1  0.385815  0.385815   0.25     120
2  0.392907  0.392907   0.30      80
3  0.421278  0.426022   0.60     110
4  0.385815  0.385815   0.35     130
5  0.392907  0.392907   0.40      90

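In the above example, we have encoded the ‘Fruit’ and ‘Color’ columns using target encoding, with ‘Price’ as the target variable. The TargetEncoder class starts from the mean of the target for each category, but by default it smooths that mean toward the overall mean of ‘Price’ (0.40), with more smoothing for categories that have few rows. This is why ‘Apple’ is encoded as 0.421278 rather than its raw mean of 0.55. Each category in the ‘Fruit’ and ‘Color’ columns is then replaced by its smoothed value.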
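
To see the plain per-category means that target encoding is built on, they can be computed directly with pandas. The sketch below does only that and applies no smoothing, so its values are not expected to match the TargetEncoder output; the ‘Fruit_mean_price’ column name is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
    'Color': ['Red', 'Yellow', 'Orange', 'Green', 'Yellow', 'Orange'],
    'Price': [0.5, 0.25, 0.3, 0.6, 0.35, 0.4],
    'Weight': [100, 120, 80, 110, 130, 90],
})

# Mean of the target ('Price') within each 'Fruit' category, aligned to the rows
df['Fruit_mean_price'] = df.groupby('Fruit')['Price'].transform('mean')

# Overall mean of the target, for reference
global_mean = df['Price'].mean()

print(df[['Fruit', 'Price', 'Fruit_mean_price']])
print(global_mean)
```

In practice the means should be learned from training data only, since the encoding uses the target variable itself.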

Conclusion

In this tutorial, we have discussed four techniques for encoding categorical data into numerical data in Python: label encoding, one-hot encoding, count encoding, and target encoding. Each technique has trade-offs. Label encoding is compact but implies an order among categories that may not exist, one-hot encoding avoids that at the cost of one column per category, and target encoding keeps a single column but depends on the target variable. The right choice depends on the model and the specific requirements of the problem at hand.
