Satyam Kumar

Summary

The article advocates against using one-hot encoding for time-based cyclic features and recommends a trigonometry-based feature transformation method as a more effective alternative.

Abstract

The article titled "Stop One-Hot Encoding your Time-based Features" serves as a guide for transforming cyclic features in data science. It emphasizes that while one-hot encoding is suitable for categorical features, its application to time-based features like day of the month, day of the week, and day of the year is suboptimal due to the high dimensionality it introduces, potentially leading to the curse of dimensionality. The author proposes using sine and cosine functions to encode these cyclic features, which preserves their periodic nature while avoiding the high dimensionality of one-hot encoding. This method results in a two-dimensional representation for each cyclic feature, as demonstrated with scatter plots for the day of the week, day of the month, and day of the year. The trigonometric approach is presented as a more elegant solution that keeps the cyclic structure of the data intact.

Opinions

  • One-hot encoding is not recommended for features with a large number of unique values, especially cyclic time-based features, due to the risk of dimensionality issues.
  • The author suggests that data scientists should avoid one-hot encoding for cyclic features and instead use trigonometric transformations to better capture the periodic nature of the data.
  • The use of sine and cosine functions for encoding time-based features is seen as an efficient method to reduce the risk of overfitting and improve the model's ability to generalize from the data.
  • The article implies that a significant portion of a data scientist's time is spent on feature engineering, highlighting the importance of selecting appropriate encoding methods for model performance.
  • The author provides visual evidence through scatter plots to illustrate the effectiveness of the proposed trigonometric transformation method.
  • By advocating for the trigonometric approach, the author promotes a technique that is both more efficient in terms of dimensionality and more representative of the cyclic nature of time-based features.

Stop One-Hot Encoding your Time-based Features

Essential guide to feature transformation for cyclic features

Image by Sarah Lötscher from Pixabay

Feature engineering is an essential component of the data science model development pipeline. A data scientist spends most of their time analyzing and preparing features to train a robust model. A raw dataset consists of various types of features, including categorical, numerical, and time-based features.

A machine learning or deep learning model understands only numerical vectors, so categorical and time-based features need to be encoded into a numerical format. There are various feature engineering strategies for encoding categorical features, including One-Hot Encoding, Count Vectorizer, and many more.

Time-based features include the day of month, day of week, day of year, and time of day. These features are cyclic or seasonal in nature. In this article, we will discuss why one-hot encoding (also called dummy encoding) should be avoided for cyclic features, and then discuss and implement a more elegant solution.

Why NOT One-Hot Encoding?

One-hot encoding is a feature encoding strategy that converts a categorical feature into a numerical vector. For each distinct feature value, the one-hot transformation creates a new binary feature marking the presence or absence of that value.

(Image by Author), One-hot encoding sample illustration

One-hot encoding creates a d-dimensional vector for each instance, where d is the number of unique feature values in the dataset.
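As a minimal sketch of this idea, the snippet below one-hot encodes a single day-of-week value in plain Python. The helper name `one_hot` and the 0-to-6 labelling of the days are assumptions made for illustration; in a real pipeline a library encoder such as scikit-learn's `OneHotEncoder` would typically do this work.

```python
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 everywhere else."""
    return [1 if value == category else 0 for category in categories]


# Assumed labelling: the seven days of the week are numbered 0..6.
days = list(range(7))

encoded = one_hot(2, days)
print(encoded)       # [0, 0, 1, 0, 0, 0, 0]
print(len(encoded))  # 7, one dimension per unique value
```

The length of the output grows with the number of unique values, which is the cost discussed in the rest of this section.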

For a feature having a large number of unique feature values or categories, one-hot encoding is not a great choice. There are various other techniques to encode the categorical (ordinal or nominal) features.


Time-based features such as day of month, day of week, and day of year are cyclic in nature and have many feature values. One-hot encoding the day of month feature results in a 31-dimensional vector, and day of year results in a 366-dimensional vector. It’s not a great choice to one-hot encode these features, as it may lead to the curse of dimensionality.

Idea:

An elegant way to encode these cyclic features is to use basic trigonometry. In this article, we will encode the cyclic features by computing the sine and cosine of the feature values.

The day of week feature has 7 unique values. Taking the sine and cosine of these values produces a 2-dimensional feature.

(Image by Author), Computing Sin and Cosine of Day of Week feature
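The computation can be sketched roughly as follows. The helper name `encode_cyclic` and the scaling of each value to an angle of 2π × value / period are assumptions, not something the text spells out; the scaling is included so that the 7 values spread evenly around one full circle and the last day lands next to the first.

```python
import math


def encode_cyclic(value, period):
    """Map a cyclic value to the (sine, cosine) of its angle on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# Assumed labelling: day of week runs from 0 to 6, so the period is 7.
for day in range(7):
    sin_day, cos_day = encode_cyclic(day, 7)
    print(f"day={day}  sin={sin_day:+.3f}  cos={cos_day:+.3f}")
```

Both components are needed: the sine alone takes the same value at two different points of the cycle, and the cosine disambiguates them.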

Now, instead of creating a 7-dimensional feature vector using one-hot encoding, a 2-dimensional transformed feature vector is enough to represent the entire feature. Let's visualize the new 2-dimensional feature vector with a scatter plot.

(Image by Author), Scatter plot for Sin and Cosine of Day of Week feature

The scatter plot clearly shows the cyclic nature of the day of week feature. The 7 feature values (from 0 to 6) are now encoded into a 2-dimensional vector.
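One way to check the cyclic structure numerically, without a plot, is to compare distances between encoded days. This is a sketch under the same assumptions as before (hypothetical helpers `encode_cyclic` and `distance`, values scaled by 2π / period): the last and first day of the week should be as close to each other as any two neighbouring days.

```python
import math


def encode_cyclic(value, period):
    """Map a cyclic value to the (sine, cosine) of its angle on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a, b, period=7):
    """Euclidean distance between the encoded points of two cyclic values."""
    return math.dist(encode_cyclic(a, period), encode_cyclic(b, period))


print(distance(0, 1))  # two neighbouring days
print(distance(6, 0))  # wrap-around from the last day to the first day
print(distance(0, 3))  # two days far apart within the week
```

The first two distances match, while the third is larger. With the raw integer labels 0 to 6, days 6 and 0 would instead look like the two most distant values.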

The day of month and day of year features are also cyclic in nature and have 31 and 366 feature values respectively. One-hot encoding them would greatly increase the dimensionality of the final dataset, so the same trigonometric transformation can be used to encode these cyclic features as well.

(Image by Author), Day of Month feature. Left: computing sine and cosine; Right: scatter plot of sine and cosine
(Image by Author), Day of Year feature. Left: computing sine and cosine; Right: scatter plot of sine and cosine
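The same transformation extends to the other two features. The sketch below uses only the Python standard library; the helper names `encode_cyclic` and `encode_date` are hypothetical, and the fixed periods of 31 and 366 follow the value counts quoted above, which is a simplification for shorter months and non-leap years.

```python
import math
from datetime import date


def encode_cyclic(value, period):
    """Map a cyclic value to the (sine, cosine) of its angle on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def encode_date(d):
    """Return sine/cosine pairs for day of week, day of month and day of year."""
    return {
        # weekday() already runs 0..6 (Monday is 0)
        "day_of_week": encode_cyclic(d.weekday(), 7),
        # day runs 1..31, shifted to start at 0
        "day_of_month": encode_cyclic(d.day - 1, 31),
        # tm_yday runs 1..366, shifted to start at 0
        "day_of_year": encode_cyclic(d.timetuple().tm_yday - 1, 366),
    }


features = encode_date(date(2024, 3, 15))
for name, (sin_value, cos_value) in features.items():
    print(f"{name}: sin={sin_value:+.3f}  cos={cos_value:+.3f}")

# Total number of encoded columns, compared with 7 + 31 + 366 for one-hot encoding.
print(sum(len(pair) for pair in features.values()))  # 6
```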

Conclusion:

The trigonometry-based feature transformation discussed here can be applied to any cyclically occurring feature. One-hot encoding works well for a relatively small number of categorical values, but it’s not recommended for features that have many feature values or categories.



Thank You for Reading
