Data Analysis
How to Check if a Classification Model is Overfitted using scikit-learn
A ready-to-run Python tutorial that helps you identify and reduce overfitting
One of the hardest problems when dealing with Machine Learning algorithms is evaluating whether the trained model performs well on unseen samples. For example, a model may behave very well on a given dataset but fail to predict the correct values once deployed. This gap between training and test performance can be due to different problems, and one of the most common is overfitting.
A model that fits the training set well but the testing set poorly is said to be overfit to the training set, and a model that fits both sets poorly is said to be underfit. Extracted from this very interesting article by Joe Kadi.
In other words, overfitting means that the Machine Learning model learns the training set too well, capturing its noise instead of the general pattern.
In this tutorial I exploit the Python scikit-learn library to check whether a classification model is overfitted. The same procedure can also be applied to other models, such as regressors. The proposed strategy involves the following steps:
- split the dataset into training and test sets
- train the model with the training set
- test the model on the training and test sets
- calculate the Mean Absolute Error (MAE) for training and test sets
- plot and interpret results
The previous steps must be executed for different training and test sets.
As an example dataset, I use the Heart Attack dataset, available in the Kaggle repository. All the code can be downloaded from my GitHub repository.
Load Data
Firstly, I load the dataset as a DataFrame through the Python pandas library. The dataset contains 303 records, 13 input features and 1 output class, which can be either 0 or 1.
import pandas as pd
df = pd.read_csv('source/heart.csv')
df.head()
I build the dataset. I define two variables, X and y, corresponding to the input and output, respectively.
features = []
for column in df.columns:
    if column != 'output':
        features.append(column)
X = df[features]
y = df['output']
Build and test the model
Usually, X and y are split into two datasets: a training set and a test set. In scikit-learn this can be done through the train_test_split() function, which returns the training and test data. The model is then fitted on the training data and its performance is measured on the test data. However, this strategy alone is not enough to check whether the model is overfitted, because it evaluates the model on a single split.
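For reference, the conventional single-split workflow looks roughly like the sketch below (the test_size, random_state and classifier settings here are arbitrary illustrative choices, not part of the procedure proposed in this tutorial):
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# a single 80/20 split; test_size and random_state are arbitrary choices
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = KNeighborsClassifier(n_neighbors=2)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on this single held-out split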
Because a single split does not reveal overfitting, I do not use the train_test_split() function here, but K-Fold cross-validation instead.
K-Fold cross-validation splits the dataset into k subsets (folds) and trains and tests the model k times, each time on a different split: the training set is built from k-1 folds, while the remaining fold is used as the test set.
The scikit-learn library provides a class for this purpose, called KFold(), which receives the number of folds k as input (through the n_splits parameter). For each (training set, test set) pair, I can build the model and calculate the Mean Absolute Error (MAE) on both the training and the test set. In this specific example I exploit the KNeighborsClassifier().
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt
kf = KFold(n_splits=4)
mae_train = []
mae_test = []
for train_index, test_index in kf.split(X):
    # build the training and test sets for the current fold
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # fit the classifier on the training fold
    model = KNeighborsClassifier(n_neighbors=2)
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    # MAE on the training fold and on the held-out fold
    mae_train.append(mean_absolute_error(y_train, y_train_pred))
    mae_test.append(mean_absolute_error(y_test, y_test_pred))
Then I can plot the training and test MAEs and compare them.
folds = range(1, kf.get_n_splits() + 1)
plt.plot(folds, mae_train, 'o-', color='green', label='train')
plt.plot(folds, mae_test, 'o-', color='red', label='test')
plt.legend()
plt.grid()
plt.xlabel('Number of fold')
plt.ylabel('Mean Absolute Error')
plt.show()
I note that the training MAE is quite small (around 0.2) for all the folds, while the testing MAE is much larger, ranging from 0.3 to 0.8. Since the output class is either 0 or 1, the MAE here is simply the fraction of misclassified samples: the model misclassifies roughly 20% of the training samples but between 30% and 80% of the test samples. Because the training error is low and the testing error is high, I can conclude that the model is overfitted.
I group all the previous operations into a single function, called test_model(), which receives as input the model and the X and y variables.
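For completeness, here is a minimal sketch of what test_model() might look like, assuming it simply wraps the K-Fold loop and the plot from the previous snippets (the n_splits default is my assumption; the full version is in the GitHub repository):
def test_model(model, X, y, n_splits=4):
    # sketch of the helper: K-Fold training/testing plus the MAE plot
    kf = KFold(n_splits=n_splits)
    mae_train = []
    mae_test = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        model.fit(X_train, y_train)
        mae_train.append(mean_absolute_error(y_train, model.predict(X_train)))
        mae_test.append(mean_absolute_error(y_test, model.predict(X_test)))
    folds = range(1, n_splits + 1)
    plt.plot(folds, mae_train, 'o-', color='green', label='train')
    plt.plot(folds, mae_test, 'o-', color='red', label='test')
    plt.legend()
    plt.grid()
    plt.xlabel('Number of fold')
    plt.ylabel('Mean Absolute Error')
    plt.show()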
Limit overfitting
Overfitting can be (potentially) limited by following three strategies:
- reduce complexity
- tune parameters
- change the model.
1. Reduce Complexity
I try to improve the model by scaling all the input features into the range between 0 and 1. I exploit the MinMaxScaler() provided by the scikit-learn library.
from sklearn.preprocessing import MinMaxScaler
import numpy as np
for column in X.columns:
    feature = np.array(X[column]).reshape(-1,1)
    scaler = MinMaxScaler()
    scaler.fit(feature)
    feature_scaled = scaler.transform(feature)
    X[column] = feature_scaled.reshape(1,-1)[0]
I build a new model and invoke the test_model() function. The performance on both the training and test sets now improves, but the testing MAE still remains large. Thus, the model is still overfitted.
model = KNeighborsClassifier(n_neighbors=2)
test_model(model, X, y)
Another possibility to reduce complexity is to reduce the number of features. This can be achieved through Principal Component Analysis (PCA). For example, I could reduce the number of input features from 13 to 2. The scikit-learn library provides the PCA() class for this purpose.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
X_pca = pca.transform(X)
X_pca = pd.DataFrame(X_pca)
I test the model on the new features and note that the performance improves: the testing MAE now ranges from 0.20 to 0.45.
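A minimal sketch of this check, assuming the same KNeighborsClassifier configuration as before and the test_model() helper described above:
model = KNeighborsClassifier(n_neighbors=2)
test_model(model, X_pca, y)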
2. Tune parameters
Another possibility is to tune the algorithm's parameters. The scikit-learn library provides the GridSearchCV() class, which searches for the best parameters of a specific model. The parameters to be tuned must be passed as a dict, mapping each parameter name to the list of values to be explored. In addition, GridSearchCV() also exploits cross-validation internally. After fitting, the best estimator is available in the best_estimator_ attribute.
from sklearn.model_selection import GridSearchCV
model = KNeighborsClassifier()
param_grid = {
'n_neighbors': np.arange(1,30),
'algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute'],
'metric' : ['euclidean','manhattan','chebyshev','minkowski']
}
grid = GridSearchCV(model, param_grid = param_grid, cv=4)
grid.fit(X, y)
best_estimator = grid.best_estimator_
I test the best estimator and note that the testing MAE now ranges from about 0.25 to 0.45.
test_model(best_estimator, X, y)
3. Change the model
The previous attempts reduce overfitting, but the model's performance still remains poor. Thus, I try to change the model and use a GaussianNB() instead. The performance improves considerably: the testing MAE now ranges between 0.20 and 0.35, so I can conclude that the model is not overfitted.
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
test_model(model, X, y)
Summary
In this tutorial, I have illustrated how to check whether a classification model is overfitted or not. In addition, I have proposed three strategies to limit overfitting: reduce complexity, tune parameters and change the model.
As this specific example shows, we often use the wrong model for a problem. Thus, my suggestion is: try different models!
If you want to stay updated on my research and other activities, you can follow me on Twitter, YouTube and GitHub.