Label Encoding in Python — Machine Learning
In this tutorial, we will discuss label encoding in Python.
Table of Contents:
- Introduction
- Label encoding explained
- Advantages and disadvantages
- Label encoding in Python
- Conclusion
Introduction
In data science, we often work with datasets that contain categorical variables whose values are stored as strings. For example, in datasets used for salary estimation, job titles are usually entered as words: Manager, Director, Vice-President, President, and so on. The complication is that most machine learning algorithms can work with categorical features only after they have been converted to numeric form.
There are multiple ways to solve this problem, and the right choice depends largely on the algorithm you will be working with and on how sensitive it is to the ranges and distributions of numerical features.
Two of the most common approaches are:
- Label encoding
- One-hot encoding
Both techniques allow for conversion from categorical/text data to numeric format. These are valid solutions with their own benefits and costs. In this article, we will focus on label encoding and its variations. We will also outline cases when it should/shouldn’t be applied.
To follow this tutorial, we will need two Python libraries: scikit-learn (imported as sklearn) and pandas. If you don’t have them installed, please open “Command Prompt” (on Windows) and install them using the following commands:
pip install scikit-learn
pip install pandas
Label encoding explained
To get a sense of how label encoding works, let’s take a look at the following dataset:

Assume it is the data that we would like to feed into some machine learning algorithm. Every row represents a position that an individual holds and the corresponding annual salary.
The “Position” feature is all text and it is what we will need to convert into a model-friendly numeric format.
The question is: how do we assign numeric values to categorical text data?
Note: in this article, we assume the numbers we assign are non-negative integers, with 0 being the smallest.
Here are a few possible ways we can assign numeric values to the “Position” feature:
Option 1: Using current order

Option 2: Using alphabetical order

Option 3: Using “Salary” feature order

Which one is correct to use?
Well, there is no definite answer. It depends on the algorithm you are going to feed these features to, and also on your dataset.
For example, if you are going to use simple linear regression (OLS) to estimate an individual’s salary as a function of their position, you should only use option 3. Here is why: once you convert this feature to a numerical format, the algorithm no longer knows anything about the structure of your hierarchy. It treats the values as plain numbers and simply assumes the ordering 0 < 1 < 2 < 3.
If you followed option 1, then the algorithm will see the position of “Assistant Manager” being superior to “Manager”. If you followed option 2, then the algorithm will see the position of “Customer Service” being superior to “Assistant Manager”.
This is not what you want, right? Such an encoding can produce misleading estimates and false predictions (which may even appear statistically significant), simply because the algorithm treats the numeric feature as ordered from smallest to largest.
Now, the interesting part comes when you decide to implement this in Python. There are a handful of ways to achieve the same result, and we will discuss a few of them below.
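To make the three options concrete, here is a minimal sketch using pandas. The original table is not reproduced here, so the four positions and the salary figures below are made-up placeholders, chosen only to be consistent with the discussion above; only the column names "Position" and "Salary" come from the text.

```python
import pandas as pd

# Hypothetical data: positions and salaries are illustrative placeholders,
# not the values from the article's original table.
df = pd.DataFrame({
    "Position": ["Customer Service", "Manager", "Assistant Manager", "Director"],
    "Salary": [30000, 90000, 60000, 120000],
})

# Option 1: codes follow the order in which categories first appear.
df["opt1_current"] = pd.factorize(df["Position"])[0]

# Option 2: codes follow the alphabetical order of the category names.
df["opt2_alpha"] = df["Position"].astype("category").cat.codes

# Option 3: codes follow the salary ranking (lowest salary gets 0).
salary_rank = df.groupby("Position")["Salary"].mean().sort_values().index
mapping = {position: code for code, position in enumerate(salary_rank)}
df["opt3_salary"] = df["Position"].map(mapping)

print(df)
```

With this placeholder data, only the third column places “Assistant Manager” below “Manager”, which is the ordering a linear model would need.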
Advantages and disadvantages of label encoding
It is important to understand the benefits and drawbacks of label encoding and also consider other available encoding techniques.
Advantages:
- It is easy to implement and interpret
- It is easy to read: each category maps to a single number in a single column
- It works best with a smaller number of unique categorical values
Disadvantages:
- It can skew the estimation results if an algorithm is very sensitive to feature magnitude (like SVM). In such a case, you may consider standardizing or normalizing values after encoding.
- It can skew the estimation results if there is a large number of unique categorical values. In our example, there were 4, but if there are 10 or more, you should keep this in mind. In such a case, you should look into other encoding techniques, for example, one-hot encoding.
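For comparison, the one-hot alternative mentioned above can be sketched with pandas.get_dummies(). This is only an illustration on a made-up “Position” column, not part of the label encoding workflow itself.

```python
import pandas as pd

# Hypothetical column for illustration only.
df = pd.DataFrame({
    "Position": ["Customer Service", "Manager", "Assistant Manager", "Director"],
})

# One indicator column per category; no ordering between categories is implied.
one_hot = pd.get_dummies(df["Position"], prefix="Position")

print(one_hot)
```

Each row has exactly one active indicator, so no category is treated as larger or smaller than another. The cost is one extra column per unique value.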
Label Encoding using Python
In this part, we will cover a few different ways to do label encoding in Python.
Two of the most popular approaches are:
- LabelEncoder() from the scikit-learn library
- pandas.factorize() from the pandas library
Once the libraries are downloaded and installed, we can proceed with Python code implementation.
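The original code for this section is not available here, so what follows is a minimal sketch of both approaches rather than the article's exact implementation. The data is the same made-up placeholder table as before, and the output column names (Position_le, Position_fact, Position_by_salary) are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data; the article's original table is not reproduced here.
df = pd.DataFrame({
    "Position": ["Customer Service", "Manager", "Assistant Manager", "Director"],
    "Salary": [30000, 90000, 60000, 120000],
})

# Approach 1: LabelEncoder sorts the unique values, so codes are alphabetical.
encoder = LabelEncoder()
df["Position_le"] = encoder.fit_transform(df["Position"])
print(list(encoder.classes_))  # categories listed in the order of their codes

# Approach 2: pandas.factorize assigns codes in order of first appearance.
codes, uniques = pd.factorize(df["Position"])
df["Position_fact"] = codes

# To get the salary-based order, sort by salary before factorizing.
# The index is kept so the codes line up with the original rows.
sorted_df = df.sort_values("Salary")
salary_codes = pd.factorize(sorted_df["Position"])[0]
df["Position_by_salary"] = pd.Series(salary_codes, index=sorted_df.index)

print(df)
```

Note that LabelEncoder always produces the alphabetical ordering (option 2), while factorize() follows the row order, so sorting the rows first is one way to reach option 3. The scikit-learn documentation describes LabelEncoder as intended for target labels; for input features it provides OrdinalEncoder, which also accepts an explicit category order.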






