Angelica Lo Duca

Data Analysis

A complete Data Analysis workflow in Python and scikit-learn

Ready-to-run code covering preprocessing, parameter tuning, model training and evaluation.

Image by Buffik from Pixabay

In this short tutorial I illustrate a complete data analysis process based on the scikit-learn Python library. The process includes:

  • preprocessing, which covers feature selection, normalization and balancing
  • model selection with parameter tuning
  • model evaluation

The code of this tutorial can be downloaded from my GitHub repository.

Load Dataset

First, I load the dataset with the Python pandas library. I use the heart.csv dataset, available on Kaggle.

import pandas as pd
df = pd.read_csv('source/heart.csv')
df.head()
Image by Author

I calculate the number of records and the number of columns in the dataset:

df.shape

which gives the following output:

(303, 14)

Feature selection

Now, I split the columns of the dataset into input (X) and output (Y). I use all the columns except output as input features.

features = []
for column in df.columns:
    if column != 'output':
        features.append(column)
X = df[features]
Y = df['output']
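
The same split can also be written more compactly with the drop() function of a pandas DataFrame. The following is a minimal sketch on a tiny made-up DataFrame: the values are illustrative and are not taken from heart.csv, and the names demo_df, X_demo and Y_demo are hypothetical. Only the column name output comes from the dataset above.

```python
import pandas as pd

# Tiny made-up frame that mimics the layout of the dataset: some input
# columns plus the 'output' target column (values are illustrative only).
demo_df = pd.DataFrame({
    'age': [63, 37, 41],
    'chol': [233, 250, 204],
    'output': [1, 1, 0],
})

# Every column except 'output' becomes an input feature
X_demo = demo_df.drop(columns=['output'])
Y_demo = demo_df['output']

print(list(X_demo.columns))
print(Y_demo.tolist())
```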

To select the minimum set of input features, I calculate the Pearson correlation coefficient between each pair of features through the corr() function provided by the pandas DataFrame.

Image by Author

I note that all the features have a low correlation with each other, thus I can keep all of them as input features.
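
The correlation check can be sketched as follows. This is only an illustration on made-up data, not the code used to produce the figure: the DataFrame demo, the 0.9 threshold and the high_pairs list are assumptions introduced here. The idea is to compute the correlation matrix with corr() and list the pairs of features whose absolute correlation exceeds a chosen cut-off, since one feature of each such pair could be a candidate for removal.

```python
import numpy as np
import pandas as pd

# Made-up data: 'b' is an exact linear function of 'a', 'c' is unrelated
demo = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = demo.corr()  # Pearson correlation is the default method

threshold = 0.9  # hypothetical cut-off, to be tuned for the problem at hand
# Keep only the upper triangle so that each pair is listed once
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > threshold
]
print(high_pairs)
```

On the real dataset the same steps would be applied to X instead of demo. An empty list would agree with the observation that no pair of features is strongly correlated.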

Data Normalization

Data Normalization scales all the features to the same interval. I use the MinMaxScaler() provided by the scikit-learn library. I dealt with Data Normalization in scikit-learn in my previous article, while in this article I described the general process of Data Normalization without scikit-learn.

X.describe()
Image by Author

Looking at the minimum and maximum value of each feature, I note that many features fall outside the range [0,1], thus I need to scale them.

For each input feature I fit a MinMaxScaler() and I store the scaled values in the same X column. The MinMaxScaler() must first be fitted through the fit() function and can then be applied through the transform() function. Note that I must reshape every feature to the format (-1,1) before passing it to the scaler. For example, reshape(-1,1) transforms the array [0,1,2,3,5] into [[0],[1],[2],[3],[5]].

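import numpy as np
from sklearn.preprocessing import MinMaxScaler
X = X.copy()  # work on a copy so that assigning columns does not affect df
for column in X.columns:
    # the scaler expects a 2D array with shape (n_samples, 1)
    feature = np.array(X[column]).reshape(-1,1)
    scaler = MinMaxScaler()
    scaler.fit(feature)
    feature_scaled = scaler.transform(feature)
    # flatten back to 1D before storing the scaled values
    X[column] = feature_scaled.reshape(1,-1)[0]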
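
To make the reshape and scaling steps concrete, here is a small sketch on the example array mentioned above. It compares the min-max formula written by hand, (x - min) / (max - min), with the output of MinMaxScaler(). The variable names values, column, manual and scaled are introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.array([0, 1, 2, 3, 5])

# reshape(-1, 1) turns the 1D array into a single column: shape (5, 1)
column = values.reshape(-1, 1)

# Min-max formula written by hand: (x - min) / (max - min)
manual = (column - column.min()) / (column.max() - column.min())

# fit_transform() combines the fit() and transform() calls
scaled = MinMaxScaler().fit_transform(column)

print(column.shape)
print(scaled.ravel())
```

MinMaxScaler() scales each column independently, so it can also be applied to a whole table of features in a single call instead of looping over the columns.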

Split the dataset into Training and Test

Now I split the dataset into two parts: training set and test set. The test set size is 20% of the whole dataset. I use the scikit-learn function train_test_split(). I will use the training set to train the model and the test set to measure the performance of the model.

import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42)
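
As a quick check of the resulting sizes, the sketch below applies the same call to placeholder arrays with 303 rows, the number of records in the dataset. The arrays X_demo and y_demo are made up and contain no real data. With test_size=0.20 the test set gets 61 rows and the training set 242.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (303)
X_demo = np.arange(303 * 2).reshape(303, 2)
y_demo = np.arange(303) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

print(X_tr.shape, X_te.shape)
```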

Balancing

I check whether the dataset is balanced or not, i.e. if the output classes in the training set are equally represented. I can use the value_counts() function to calculate the number of records in each output class.

y_train.value_counts()

which gives the following output:

1    133
0    109

The output classes are not balanced, so I balance them. I use the imblearn library to perform balancing. I try both oversampling the minority class and undersampling the majority class. More details related to the Imbalanced Learn library can be found here. Firstly, I perform oversampling through the RandomOverSampler(). I create the sampler and then I fit it on the training set. The fit_resample() function returns the balanced training set.

from imblearn.over_sampling import RandomOverSampler
over_sampler = RandomOverSampler(random_state=42)
X_bal_over, y_bal_over = over_sampler.fit_resample(X_train, y_train)

I calculate the number of records in each class through the value_counts() function and I note that now the dataset is balanced.

y_bal_over.value_counts()

which gives the following output:

1    133
0    133

Secondly, I perform undersampling through the RandomUnderSampler().

from imblearn.under_sampling import RandomUnderSampler
under_sampler = RandomUnderSampler(random_state=42)
X_bal_under, y_bal_under = under_sampler.fit_resample(X_train, y_train)
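
To show the idea behind the two strategies, the following sketch reproduces them by hand with pandas on a made-up label series that has the same class counts as the training set (133 and 109). This is not the imblearn implementation, only an illustration of the principle: oversampling draws extra minority records with replacement, while undersampling keeps a random subset of the majority records. All the variable names are hypothetical.

```python
import pandas as pd

# Labels with the same class counts as the training set above
y_demo = pd.Series([1] * 133 + [0] * 109)
counts = y_demo.value_counts()
minority, majority = counts.idxmin(), counts.idxmax()

# Oversampling: draw extra minority rows with replacement
n_extra = int(counts.loc[majority] - counts.loc[minority])
extra = y_demo[y_demo == minority].sample(
    n=n_extra, replace=True, random_state=42
)
y_over = pd.concat([y_demo, extra])

# Undersampling: keep a random subset of the majority rows
n_keep = int(counts.loc[minority])
kept = y_demo[y_demo == majority].sample(n=n_keep, random_state=42)
y_under = pd.concat([kept, y_demo[y_demo == minority]])

print(y_over.value_counts().to_dict())
print(y_under.value_counts().to_dict())
```

After oversampling both classes have 133 records, while after undersampling both have 109, so the undersampled training set is smaller than the original one.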

Model Selection and Training

Now, I’m ready to train the model. I choose a KNeighborsClassifier and first I train it on the imbalanced data. I use the fit() function to train the model and then the predict_proba() function to predict the class probabilities for the test set.

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
y_score = model.predict_proba(X_test)
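
The output of predict_proba() has one row per sample and one column per class. With n_neighbors=3 and the default uniform weights, each probability is the fraction of the three nearest neighbors that belong to that class. The sketch below illustrates this on made-up one-dimensional data. The arrays and the name knn are assumptions introduced for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up one-dimensional training data with two well separated classes
X_demo = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_demo, y_demo)

# One row per sample, one column per class (ordered as in knn.classes_)
proba = knn.predict_proba(np.array([[1.2], [6.4], [11.0]]))

print(knn.classes_)
print(proba)
```

For the point 6.4 the three nearest neighbors are 10, 2 and 11, so two of them belong to class 1 and the predicted probability of class 1 is 2/3. This is why y_score[:, 1] is used later as the score of class 1.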

I calculate the performance of the model. In particular, I calculate the ROC curve through roc_curve() and the precision-recall curve, and then I plot them. I use the scikitplot library to plot the curves.

From the plot I note that there is a ROC curve for each class. With respect to the precision-recall curve, class 1 performs better than class 0, probably because it is represented by a greater number of samples.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
from scikitplot.metrics import plot_roc
from scikitplot.metrics import plot_precision_recall
# ROC points for class 1, computed from the probabilities in column 1
fpr0, tpr0, thresholds = roc_curve(y_test, y_score[:, 1])
# Plot metrics
plot_roc(y_test, y_score)
plt.show()

plot_precision_recall(y_test, y_score)
plt.show()
Image by Author
Image by Author
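
The curves can also be summarized with a number, which makes it easier to compare the models trained in the next steps. The following is a minimal sketch that uses only functions from sklearn.metrics, on made-up labels and scores rather than on the predictions above. The names y_true and scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_curve

# Made-up labels and class-1 scores, only for illustration
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Area under the ROC curve: 1.0 is a perfect ranking, 0.5 is random
fpr, tpr, _ = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)

# Points of the precision-recall curve
precision, recall, _ = precision_recall_curve(y_true, scores)

print(roc_auc)
```

On the real data, y_true would be replaced by y_test and scores by y_score[:, 1].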

Now, I repeat the same evaluation with the oversampled training set. I note that the precision-recall curve of class 0 rises, while that of class 1 drops.

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_bal_over, y_bal_over)
y_score = model.predict_proba(X_test)
fpr0, tpr0, thresholds = roc_curve(y_test, y_score[:, 1])
# Plot metrics 
plot_roc(y_test, y_score)
plt.show()
    
plot_precision_recall(y_test, y_score)
plt.show()
Image by Author
Image by Author

Finally, I train the model on the undersampled data and I note a general deterioration of the performance.

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_bal_under, y_bal_under)
y_score = model.predict_proba(X_test)
fpr0, tpr0, thresholds = roc_curve(y_test, y_score[:, 1])
# Plot metrics 
plot_roc(y_test, y_score)
plt.show()
    
plot_precision_recall(y_test, y_score)
plt.show()
Image by Author
Image by Author

Parameter Tuning

In the last part of this tutorial, I try to improve the performance of the model by searching for the best parameters. I use the GridSearchCV mechanism provided by the scikit-learn library. I select a range of values for each parameter to be tested and I put them in the param_grid variable. I create a GridSearchCV() object, I fit it on the training set and then I retrieve the best estimator, contained in the best_estimator_ variable.

from sklearn.model_selection import GridSearchCV
model = KNeighborsClassifier()
param_grid = {
    'n_neighbors': np.arange(2,8),
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'metric': ['euclidean','manhattan','chebyshev','minkowski']
}
grid = GridSearchCV(model, param_grid=param_grid)
grid.fit(X_train, y_train)
best_estimator = grid.best_estimator_
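
The grid above contains 6 x 4 x 4 = 96 parameter combinations, and each of them is evaluated with cross validation (5 folds by default in recent scikit-learn versions). The sketch below shows how the result of the search can be inspected through the best_params_ and best_score_ attributes. It runs on synthetic data generated with make_classification() and on a reduced grid, both chosen only to keep the example small and fast. The names X_demo, y_demo, demo_grid and search are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training set (not the heart.csv data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, random_state=42
)

# Reduced grid: 6 values of n_neighbors x 2 metrics = 12 combinations
demo_grid = {
    'n_neighbors': np.arange(2, 8),
    'metric': ['euclidean', 'manhattan'],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid=demo_grid)
search.fit(X_demo, y_demo)

print(search.best_params_)   # best combination found
print(search.best_score_)    # mean cross-validated score of that combination
print(len(search.cv_results_['params']))  # number of combinations tried
```

With the default refit=True, best_estimator_ is already trained on the whole set passed to fit(), so it can be used directly for predictions.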

I use the best estimator as the model for my predictions and I calculate the performance of the algorithm.

best_estimator.fit(X_train, y_train)
y_score = best_estimator.predict_proba(X_test)
fpr0, tpr0, thresholds = roc_curve(y_test, y_score[:, 1])
# Plot metrics 
plot_roc(y_test, y_score)
plt.show()
    
plot_precision_recall(y_test, y_score)
plt.show()
Image by Author
Image by Author

I note that the ROC curve has improved. Now I try with the oversampled training set. I omit the code because it is the same as before. In this case I obtain the best performance.

Image by Author
Image by Author

Summary

In this tutorial I have illustrated the full workflow to build a good model for data analysis. The workflow includes:

  • data preprocessing, with feature selection, normalization and balancing
  • model selection and parameter tuning through Grid Search with Cross Validation
  • model evaluation, through the ROC curve and the precision-recall curve.

In this tutorial I have not dealt with Outlier Detection. If you want to learn more about this aspect, you can take a look at my previous article.

If you want to stay updated on my research and other activities, you can follow me on Twitter, YouTube and GitHub.
