Misha Sv

Summary

This text provides a tutorial on label encoding in Python, a technique used in machine learning to convert categorical data into numerical data.

Abstract

The tutorial explains the concept of label encoding, its advantages and disadvantages, and provides a step-by-step guide on how to implement it in Python using two popular methods: LabelEncoder() from the scikit-learn library and pandas.factorize() from the pandas library. The tutorial also discusses the importance of the order in which numerical values are assigned to categorical data, and how it can affect the results of machine learning algorithms.

Opinions

  • Label encoding is a useful technique for converting categorical data into numerical data in machine learning.
  • The order in which numerical values are assigned to categorical data can have a significant impact on the results of machine learning algorithms.
  • Label encoding is easy to implement and interpret, but can skew estimation results if not used carefully.
  • The tutorial recommends using label encoding with a smaller number of unique categorical values and considering other encoding techniques for larger datasets.
  • The tutorial provides a clear and concise guide on how to implement label encoding in Python using two popular methods.

Label Encoding in Python — Machine Learning

In this tutorial, we will discuss label encoding in Python.

Table of Contents:

  • Introduction
  • Label encoding explained
  • Advantages and disadvantages
  • Label encoding in Python
  • Conclusion

Introduction

In data science, we often work with datasets that contain categorical variables, where the values are represented by strings. For example, in datasets for salary estimation based on different sets of features, job titles are often entered as words: Manager, Director, Vice-President, President, and so on. The complication is that machine learning algorithms can work with categorical features, but the features have to be in numeric form.

There are multiple ways to solve this problem, and a lot depends on the algorithm you will be working with and how sensitive it is to the ranges and distributions of numerical features.

Two of the most common approaches are:

  • Label encoding
  • One hot encoding

Both techniques allow for conversion from categorical/text data to numeric format. These are valid solutions with their own benefits and costs. In this article, we will focus on label encoding and its variations. We will also outline cases when it should/shouldn’t be applied.

To continue following this tutorial we will need the following two Python libraries: scikit-learn (imported as sklearn) and pandas. If you don’t have them installed, please open “Command Prompt” (on Windows) and install them using the following commands:

pip install scikit-learn
pip install pandas

Label encoding explained

To get a sense of how label encoding works, let’s take a look at the following dataset:

Image by Author

Assume it is the data that we would like to feed into some machine learning algorithm. Every row represents a position that an individual holds and the corresponding annual salary.

The “Position” feature is all text and it is what we will need to convert into a model-friendly numeric format.

The question that arises is: how do we assign numeric values to text categorical data?

Note: for this article, assume the numbers we can assign range from 0 to +∞, with 0 being the smallest.

Here are a few possible ways we can assign numeric values to the “Position” feature:

Option 1: Using current order

Image by Author

Option 2: Using alphabetical order

Image by Author

Option 3: Using “Salary” feature order

Image by Author

Which one is correct to use?

Well, there is no definite answer. It depends on the algorithm you are going to feed these features to, and also on your dataset.

For example, if you are going to use simple linear regression (OLS) to estimate an individual’s salary as a function of their position, you should only use option 3. Here is why: once you convert this feature to a numerical format, the algorithm no longer knows anything about your hierarchy. It treats the values as plain numbers and simply assumes that 0 < 1 < 2 < 3.

If you followed option 1, the algorithm would see the position of “Assistant Manager” as superior to “Manager”. If you followed option 2, the algorithm would see the position of “Customer Service” as superior to “Assistant Manager”.

This is not what you want, right? It can produce misleading estimates and false predictions (which can even appear statistically significant), simply because the algorithm treats the numeric feature as ordered from smallest to largest.
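To make this concrete, here is a rough sketch that compares how well a straight line fits the salaries under each of the three options. It is an illustration only: the codes are typed in by hand from the options above, and the helper name r_squared is made up for this example.

```python
import numpy as np

# Salaries in the original row order:
# Customer Service, Manager, Assistant Manager, Director
salary = np.array([44000, 75000, 65000, 90000])

# Codes assigned to the same four rows under each option (entered by hand)
options = {
    "option 1 (current order)": np.array([0, 1, 2, 3]),
    "option 2 (alphabetical)": np.array([1, 3, 0, 2]),
    "option 3 (salary order)": np.array([0, 2, 1, 3]),
}


def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (the squared correlation)."""
    return np.corrcoef(x, y)[0, 1] ** 2


scores = {name: r_squared(codes, salary) for name, codes in options.items()}
for name, score in scores.items():
    print(f"{name}: R^2 = {score:.2f}")
```

On this toy data, the salary-based codes give the closest linear fit and the alphabetical codes the weakest, which is the effect described above. With only four rows, treat it as a demonstration rather than evidence.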

Now, the interesting part comes when you decide to implement this in Python. There are a handful of ways to achieve the same result, and we will discuss a few of them below.

Advantages and disadvantages of label encoding

It is important to understand the benefits and drawbacks of label encoding and also consider other available encoding techniques.

Advantages:

  • It is easy to implement and interpret
  • It is visually user friendly
  • It works best with a smaller number of unique categorical values

Disadvantages:

  • It can skew the estimation results if an algorithm is very sensitive to feature magnitude (like SVM). In such a case, you may consider standardizing or normalizing values after encoding.
  • It can skew the estimation results if there is a large number of unique categorical values. In our case, it was 4, but if it’s 10 or more, you should keep this in mind. In such a case, you should look into other encoding techniques, for example, one hot encoding.
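The two remedies mentioned in the list above can be sketched as follows. This is a minimal example under assumptions: it uses StandardScaler from scikit-learn for the standardizing step and pandas.get_dummies() for one hot encoding, which are common choices but not the only ones, and the column name “code_scaled” is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Position": ["Customer Service", "Manager", "Assistant Manager", "Director"],
    "Salary": [44000, 75000, 65000, 90000],
})

# Label-encode, then standardize the codes to zero mean and unit variance
df["code"] = pd.factorize(df["Position"])[0]
df["code_scaled"] = StandardScaler().fit_transform(df[["code"]]).ravel()

# Alternative: one hot encoding creates one indicator column per category
one_hot = pd.get_dummies(df["Position"], prefix="Position")

print(df)
print(one_hot)
```

Note that one hot encoding adds one column per unique value, so it avoids imposing an order at the cost of a wider dataset.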

Label Encoding using Python

In this part, we will cover a few different ways to do label encoding in Python.

Two of the most popular approaches:

  • LabelEncoder() from scikit-learn library
  • pandas.factorize() from pandas library

Once the libraries are downloaded and installed, we can proceed with Python code implementation.

Step 1: Create a dataframe with the required data

First, we import the pandas library, which we need to create a dataframe. Then we define a Python dictionary with the data and convert it into a dataframe df.
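A minimal sketch of this step could look like the following. The variable name data is an assumption; the values are taken from the dataset used throughout this tutorial.

```python
import pandas as pd

# Dictionary with the raw data (the name "data" is illustrative)
data = {
    "Position": ["Customer Service", "Manager", "Assistant Manager", "Director"],
    "Salary": [44000, 75000, 65000, 90000],
}

# Convert the dictionary to a pandas dataframe
df = pd.DataFrame(data)

print(df)
```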

Let’s take a look at the result:

Output:
            Position  Salary
0   Customer Service   44000
1            Manager   75000
2  Assistant Manager   65000
3           Director   90000

Step 2.1: Label encoding in Python using current order

We create a new feature “code” and assign to it the categorical feature “Position” in numerical format, using pandas.factorize().

The sequence of numbers in “code” by default follows the order in which the values first appear in the original dataframe df:
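One way to write this step, assuming the pandas.factorize() approach, is sketched below. factorize() returns a tuple of (codes, unique values), so we keep only the first element.

```python
import pandas as pd

df = pd.DataFrame({
    "Position": ["Customer Service", "Manager", "Assistant Manager", "Director"],
    "Salary": [44000, 75000, 65000, 90000],
})

# Codes are assigned in order of first appearance: 0, 1, 2, ...
df["code"] = pd.factorize(df["Position"])[0]

print(df)
```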

Output:
            Position  Salary  code
0   Customer Service   44000     0
1            Manager   75000     1
2  Assistant Manager   65000     2
3           Director   90000     3

Step 2.2: Label encoding in Python using alphabetical order

This case is a little more interesting, as we can achieve the same result using either of the methods mentioned earlier.

scikit-learn method

We first import LabelEncoder from the scikit-learn library and define le as an instance of it. Then we apply it to the “Position” feature to convert it to numerical format, and store the result as a new feature “code”.

What’s interesting about this method is that LabelEncoder() sorts the values alphabetically by default, without us having to specify anything.
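A minimal sketch of the scikit-learn approach, assuming the same dataframe as in Step 1:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Position": ["Customer Service", "Manager", "Assistant Manager", "Director"],
    "Salary": [44000, 75000, 65000, 90000],
})

# Create the encoder, then fit it and transform the column in one call
le = LabelEncoder()
df["code"] = le.fit_transform(df["Position"])

print(df)

# The learned categories, in the sorted order used for the codes
print(list(le.classes_))
```

The fitted encoder keeps the mapping in le.classes_, and le.inverse_transform() can turn codes back into the original labels.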

Let’s take a look at what we arrived at:

Output:
            Position  Salary  code
0   Customer Service   44000     1
1            Manager   75000     3
2  Assistant Manager   65000     0
3           Director   90000     2

LabelEncoder() correctly ordered the values in the “Position” feature and generated the corresponding numerical values in the following sequence: Assistant Manager (0), Customer Service (1), Director (2), Manager (3).

pandas method

Unlike in Step 2.1, where we worked with the original order, here we add the “sort=True” parameter to indicate that the codes should follow the sorted (alphabetical) order of the “Position” feature.
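A minimal sketch of the pandas approach with sorting enabled, assuming the same dataframe as in Step 1:

```python
import pandas as pd

df = pd.DataFrame({
    "Position": ["Customer Service", "Manager", "Assistant Manager", "Director"],
    "Salary": [44000, 75000, 65000, 90000],
})

# sort=True assigns codes by the sorted unique values, not by first appearance
codes, uniques = pd.factorize(df["Position"], sort=True)
df["code"] = codes

print(df)
```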

Let’s take a look at the result:

Output:
            Position  Salary  code
0   Customer Service   44000     1
1            Manager   75000     3
2  Assistant Manager   65000     0
3           Director   90000     2

We can see that both the scikit-learn method and the pandas method generate the same result.

Step 2.3: Label encoding in Python using “Salary” feature order

As we discussed in the “Label encoding explained” section, this will most likely be the most algorithm-friendly way to convert categorical features to numeric format.

In general, many algorithms benefit when there is some logic behind the numerical value assignment, such as a sequence or a hierarchy. It also makes your results easier to interpret.

Since we already know that the sequence of numbers in “code” by default follows the order of the original dataframe df (Step 2.1), what we will do first is sort the original df by “Salary” feature values and then convert “Position” feature to numerical format and store it as “code”.
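A minimal sketch of this step, assuming the same dataframe as in Step 1 and the pandas.factorize() approach from Step 2.1:

```python
import pandas as pd

df = pd.DataFrame({
    "Position": ["Customer Service", "Manager", "Assistant Manager", "Director"],
    "Salary": [44000, 75000, 65000, 90000],
})

# Sort rows by salary first, so the order of appearance reflects the hierarchy
df = df.sort_values("Salary")

# Codes now follow the salary order: lowest salary gets 0
df["code"] = pd.factorize(df["Position"])[0]

print(df)
```

The original index labels are kept after sorting, so the rows can still be matched back to the unsorted dataframe.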

Let’s take a look at the result:

Output:
            Position  Salary  code
0   Customer Service   44000     0
2  Assistant Manager   65000     1
1            Manager   75000     2
3           Director   90000     3

Conclusion

In this tutorial, we discussed how to perform label encoding in Python.

Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Machine Learning articles.

Originally published at https://pyshark.com on March 25, 2020.
