Roman Orac

Don’t Make This Mistake with Scaling Data

MinMaxScaler can return values smaller than 0 and greater than 1.

Photo by Kelly Sikkema on Unsplash

MinMaxScaler is one of the most commonly used scaling techniques in Machine Learning (right after StandardScaler).

From scikit-learn's documentation:

Transform features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

Usually, when we use MinMaxScaler, we scale values between 0 and 1.

Did you know that MinMaxScaler can return values smaller than 0 and greater than 1? I didn’t know this and it surprised me.

The problem

Let’s look at an example. I fit a scaler on a small dataset with two features.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
scaler.fit(data)

Now, let’s check the minimum and maximum values for these two features:

scaler.data_min_
# [-1.  2.]
scaler.data_max_
# [ 1. 18.]

Both are estimated as expected.

Now, let's try to input values greater than the max:

scaler.transform(np.array([[2, 20]]))
# array([[1.5  , 1.125]])

The scaler returns values greater than 1.

Or values lower than the min:

scaler.transform(np.array([[-2, 1]]))
# array([[-0.5   , -0.0625]])

The scaler returns values lower than 0.
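
Both results follow from the min-max formula, which uses the minimum and maximum learned from the training data, not from the data being transformed. Here is a minimal sketch that reproduces the numbers by hand, assuming the default feature range of (0, 1). The helper name min_max_scale is made up for this illustration and is not part of scikit-learn.

```python
import numpy as np

# Training data from the example in this article.
train = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Per-feature statistics, the same values fit() stores in data_min_ and data_max_.
data_min = train.min(axis=0)
data_max = train.max(axis=0)


def min_max_scale(x):
    """Hypothetical helper: scale x with the training min/max.

    Nothing in the formula bounds the result to [0, 1].
    """
    return (x - data_min) / (data_max - data_min)


above = min_max_scale(np.array([[2, 20]]))   # values 1.5 and 1.125
below = min_max_scale(np.array([[-2, 1]]))   # values -0.5 and -0.0625
print(above)
print(below)
```

Any input outside the training range lands outside [0, 1] by the same proportion.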

No big deal, right?

The problem can occur when we train a linear classifier, which multiplies each scaled feature by a learned coefficient.

During training, the classifier never saw a negative value for any feature. When a scaled feature turns negative at prediction time, the sign of that feature's contribution (coefficient times value) flips, which can push the prediction in the wrong direction.
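
To make this concrete, here is a small sketch of how an out-of-range value changes a linear model's score. The weights and bias are made up for illustration (they are not fitted to any data), and linear_score is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler().fit(data)

# Hypothetical weights and bias of an already-trained linear classifier
# (assumed values, chosen only to illustrate the effect).
weights = np.array([2.0, 1.0])
bias = -1.0


def linear_score(x_scaled):
    """Raw score of a linear model: x . weights + bias."""
    return x_scaled @ weights + bias


train_scaled = scaler.transform(np.array(data))        # every value lies in [0, 1]
out_of_range = scaler.transform(np.array([[-2, 1]]))   # below the training minimum

# With positive weights, every training contribution is non-negative.
# The out-of-range sample produces negative contributions instead.
print(train_scaled * weights)
print(out_of_range * weights)
print(linear_score(out_of_range))
```

With these assumed weights, the out-of-range sample gets a score lower than anything possible inside the training range.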

The solution

I suggest you cap the outputs of MinMaxScaler between 0 and 1, for example with np.clip:

np.clip(scaler.transform(np.array([[-2, 1]])), 0, 1)
# array([[0., 0.]])
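
The capping can also be delegated to the scaler itself. As far as I know, scikit-learn 0.24 and later accept a clip parameter on MinMaxScaler, so the sketch below assumes one of those versions. The variable names are my own.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

# clip=True asks the scaler to cap transformed values to feature_range,
# which is (0, 1) by default. Assumes scikit-learn >= 0.24.
clipping_scaler = MinMaxScaler(clip=True).fit(data)

below = clipping_scaler.transform(np.array([[-2, 1]]))   # capped at 0
above = clipping_scaler.transform(np.array([[2, 20]]))   # capped at 1
print(below)
print(above)
```

Keep in mind that capping discards how far outside the training range a value was, so it is worth checking whether out-of-range inputs point to a data problem upstream.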

