Understanding Ordinal Encoding
A Guide to Handling Categorical Variables in Machine Learning
In the realm of machine learning, data comes in various forms and types, including both numerical and categorical variables. While numerical data is straightforward to handle, categorical data presents unique challenges, particularly when building predictive models. Ordinal encoding is a technique used to address these challenges, especially when dealing with categorical variables with a natural order or hierarchy.
In this article, we’ll delve into the concept of ordinal encoding, its significance, implementation, and best practices in machine learning.
What is Ordinal Encoding?
Ordinal encoding is a method for transforming categorical variables into numerical representations while preserving the inherent order or hierarchy of the categories. Unlike one-hot encoding, which creates one binary column per category, ordinal encoding assigns a single integer to each category based on its natural order.
Consider a categorical variable “Size” with categories: “Small,” “Medium,” and “Large.” Through ordinal encoding, these categories might be transformed into numerical values: 0 for “Small,” 1 for “Medium,” and 2 for “Large.” This encoding implicitly captures the ordinal relationship between the categories.
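To make the contrast with one-hot encoding concrete, here is a minimal sketch in plain Python, with no library involved. The names size_order, size_to_code, and one_hot are illustrative only and are not part of any package.

```python
# Intended order of the categories, lowest to highest.
size_order = ["Small", "Medium", "Large"]

# Ordinal encoding: one integer per category, following that order.
size_to_code = {label: code for code, label in enumerate(size_order)}


def one_hot(label):
    """One-hot encoding, for contrast: one binary column per category."""
    return [1 if label == candidate else 0 for candidate in size_order]


sizes = ["Medium", "Small", "Large", "Small"]
ordinal_encoded = [size_to_code[label] for label in sizes]
one_hot_encoded = [one_hot(label) for label in sizes]

print(ordinal_encoded)  # [1, 0, 2, 0]
print(one_hot_encoded)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]
```

The ordinal version keeps a single column in which larger numbers mean larger sizes, while the one-hot version needs three columns and carries no notion of order.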
When to Use Ordinal Encoding?
Ordinal encoding is suitable for categorical variables whose categories have a clear order or ranking. Examples include risk levels (“Low,” “Medium,” “High”), performance ratings (“Good,” “Better,” “Best”), and temperature ranges (“Cold,” “Warm,” “Hot”). In these scenarios, assigning integers in the same order as the categories gives a numerical representation that preserves their meaningful order.
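One caveat is worth sketching here: the order has to be stated explicitly. When no order is given, scikit-learn's OrdinalEncoder sorts the labels alphabetically, and alphabetical order rarely matches the natural one. The plain-Python snippet below imitates that behavior with sorted(); the variable names are illustrative.

```python
# Intended order of the risk levels, lowest to highest.
risk_levels = ["Low", "Medium", "High"]

# Codes obtained by sorting the labels alphabetically.
alphabetical_codes = {label: code for code, label in enumerate(sorted(risk_levels))}

# Codes obtained by following the intended order.
intended_codes = {label: code for code, label in enumerate(risk_levels)}

print(alphabetical_codes)  # {'High': 0, 'Low': 1, 'Medium': 2}
print(intended_codes)      # {'Low': 0, 'Medium': 1, 'High': 2}
```

With the alphabetical codes, “High” would rank below “Low,” which is the opposite of what the variable means.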
Implementation of Ordinal Encoding
In Python, the popular libraries scikit-learn and category_encoders provide efficient tools for ordinal encoding.
Let’s illustrate how to perform ordinal encoding using scikit-learn:
from sklearn.preprocessing import OrdinalEncoder
# Sample data
data = [['Small'], ['Medium'], ['Large']]
# Initialize OrdinalEncoder with the intended category order.
# Without `categories`, scikit-learn sorts the labels alphabetically
# (Large, Medium, Small), which would reverse the codes here.
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
# Fit and transform the data
encoded_data = encoder.fit_transform(data)
print(encoded_data)
The resulting encoded data would be:
[[0.]
 [1.]
 [2.]]
If you have multiple columns on which you need to apply ordinal encoding, you can handle each column separately and apply ordinal encoding independently to each one. Here’s how you can do it using Python and the scikit-learn library:
from sklearn.preprocessing import OrdinalEncoder
# Sample data: each row is [risk level, temperature]
data = [['Low', 'Cold'], ['Medium', 'Warm'], ['High', 'Hot']]
# Intended category order for each column index to encode
category_orders = {0: ['Low', 'Medium', 'High'], 1: ['Cold', 'Warm', 'Hot']}
# Copy every row so the original data is left untouched
encoded_data = [row.copy() for row in data]
# Fit and transform the data for each column
for column_idx, order in category_orders.items():
    # Use a separate encoder per column, with that column's order
    encoder = OrdinalEncoder(categories=[order])
    column_data = [[row[column_idx]] for row in data]  # Extract column data
    encoded_column_data = encoder.fit_transform(column_data)  # Apply ordinal encoding
    for i, row in enumerate(encoded_data):
        # Store the code as a plain Python float
        row[column_idx] = float(encoded_column_data[i][0])
print("Encoded Data:")
for row in encoded_data:
    print(row)
Output:
Encoded Data:
[0.0, 0.0]
[1.0, 1.0]
[2.0, 2.0]
In this example, category_orders maps the index of each column you want to encode to the intended order of its categories. The script iterates over these columns, extracts the column data, applies ordinal encoding using OrdinalEncoder, and writes the encoded values into a copy of the data, leaving the original rows unchanged.
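The per-column loop is not strictly necessary. As a sketch, assuming a scikit-learn version whose OrdinalEncoder accepts a list of category lists (one per column), all columns can be encoded in a single call, and the same fitted encoder can map the codes back to labels:

```python
from sklearn.preprocessing import OrdinalEncoder

# Each row is [risk level, temperature].
data = [["Low", "Cold"], ["Medium", "Warm"], ["High", "Hot"]]

# One category list per column, each in its intended order.
encoder = OrdinalEncoder(
    categories=[["Low", "Medium", "High"], ["Cold", "Warm", "Hot"]]
)

# Encode every column in one call.
encoded = encoder.fit_transform(data)
print(encoded.tolist())  # [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

# Recover the original labels from the codes.
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())
```

Keeping one fitted encoder for all columns also makes it easier to apply exactly the same mapping to new data later.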
Alternatively, you can use the category_encoders library, which provides a more convenient way to handle encoding for multiple columns:
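import pandas as pd
import category_encoders as ce
# Sample data as a DataFrame with named columns
data = pd.DataFrame(
    [['Low', 'Cold'], ['Medium', 'Warm'], ['High', 'Hot']],
    columns=['risk', 'temperature'],
)
# Integer code for each category, per column.
# Without a mapping, the encoder cannot know the intended order
# and picks the integers itself.
mapping = [
    {'col': 'risk', 'mapping': {'Low': 0, 'Medium': 1, 'High': 2}},
    {'col': 'temperature', 'mapping': {'Cold': 0, 'Warm': 1, 'Hot': 2}},
]
# Initialize OrdinalEncoder
encoder = ce.OrdinalEncoder(cols=['risk', 'temperature'], mapping=mapping)
# Fit and transform the data
encoded_data = encoder.fit_transform(data)
print("Encoded Data:")
print(encoded_data)
Output (a pandas DataFrame, along these lines):
Encoded Data:
   risk  temperature
0     0            0
1     1            1
2     2            2
In this case, you name the columns to encode using the cols parameter of OrdinalEncoder from category_encoders and supply the intended order through its mapping parameter. This approach simplifies the process by handling multiple columns at once.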
Best Practices and Considerations
While ordinal encoding offers a straightforward approach to handle ordinal categorical variables, it’s essential to consider some best practices:
- Ordinality Preservation: Ensure that the assigned integer values reflect the true ordinal relationship among the categories. Incorrect encoding may lead to misleading interpretations by the model.
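- Impact on Algorithms: Models treat ordinal codes as ordinary numbers. Linear models and distance-based methods such as k-nearest neighbors assume that the gap between 0 and 1 is the same as the gap between 1 and 2, which may not hold for the underlying categories. Tree-based models depend only on the order of the values and are less sensitive to the spacing. Choose algorithms with this in mind.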
- Handling Unknown Categories: The test data may contain categories that were not present during training. Decide in advance how to handle them, for example by mapping them to a reserved value, to avoid runtime errors.
- Consideration for One-Hot Encoding: For categorical variables without a natural order, one-hot encoding may be more suitable to represent the categories as binary columns.
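As an illustration of the unknown-category point, the sketch below assumes scikit-learn 0.24 or later, where OrdinalEncoder accepts handle_unknown="use_encoded_value" together with unknown_value. The “Critical” label and the choice of -1 as the reserved code are made up for this example.

```python
from sklearn.preprocessing import OrdinalEncoder

train = [["Low"], ["Medium"], ["High"]]
test = [["Medium"], ["Critical"]]  # "Critical" never appears in training

encoder = OrdinalEncoder(
    categories=[["Low", "Medium", "High"]],
    handle_unknown="use_encoded_value",  # do not raise on unseen labels
    unknown_value=-1,                    # reserved code for unseen labels
)
encoder.fit(train)

encoded_test = encoder.transform(test)
print(encoded_test.tolist())  # [[1.0], [-1.0]]
```

With the default setting, handle_unknown="error", the same transform call would raise an error instead. Whether a reserved code such as -1 is appropriate depends on the model, since it places unseen categories below the lowest known one.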
Conclusion
Ordinal encoding provides a practical solution for transforming categorical variables with inherent order or hierarchy into numerical representations. By preserving the ordinal relationships among categories, ordinal encoding facilitates the utilization of such variables in machine learning models effectively. However, it’s essential to apply ordinal encoding judiciously, considering the characteristics of the data and the requirements of the modeling task.
In summary, understanding and correctly applying ordinal encoding empower data scientists and machine learning practitioners to handle categorical variables seamlessly, contributing to the development of robust and accurate predictive models.



