Yash Prakash

Summary

The article provides a comprehensive guide on building Scikit-learn pipelines for data preprocessing, model training, and hyperparameter tuning using GridSearchCV, demonstrated through a heart failure dataset project.

Abstract

The article "How To Build Scikit-learn Pipelines Like A Pro" by Yash Prakash offers a hands-on tutorial for creating efficient machine learning workflows using Scikit-learn. It guides readers through the process of handling a Kaggle dataset for heart failure prediction, detailing the steps to preprocess data with pipelines that include MinMaxScaler and SimpleImputer, model training using a Random Forest classifier, and hyperparameter optimization with GridSearchCV. The author emphasizes the convenience of pipelines for streamlining the machine learning process and demonstrates the approach's effectiveness by achieving high accuracy in predictions. The article concludes with an invitation to follow the author on Medium and GitHub for more data science content and a call to action for readers to engage with more Scikit-learn-based articles in the future.

Opinions

  • The author believes that using pipelines is a "neat" and "convenient" way to handle machine learning workflows in code.
  • The use of a real-world Kaggle dataset for heart failure prediction is presented as a practical and engaging way to learn about building pipelines.
  • The author suggests that the ability to perform hyperparameter searches within the pipeline framework is "cool" and enhances the model's performance.
  • Yash Prakash encourages readers to star and bookmark the provided GitHub repository, indicating a desire for community engagement and recognition of the repository's value.
  • The author expresses enthusiasm for future Scikit-learn-based articles, indicating a commitment to continuous learning and sharing within the data science community.
  • The recommendation to become a Medium member to access the author's weekly data science articles implies the author's confidence in the quality and relevance of their content to the data science community.

How To Build Scikit-learn Pipelines Like A Pro

Learn to build preprocessing, model, and grid search pipelines the easy way with a mini project

Photo by Mark Boss on Unsplash

Every time you pick up a dataset for a project, you are tasked with cleaning and preprocessing the data, dealing with missing data and outliers, modelling, and even performing hyperparameter searches to find the optimal set of hyperparameters to use for your estimators.

As it turns out, there is a convenient and neat way to do all of this in code with Pipelines.

In this article, we will go through a fairly popular Kaggle dataset and perform all of these steps and build a real sklearn pipeline to learn from.

Let’s get started👇

Exploring the dataset

The dataset we will be using for this mini project is from Kaggle — the Heart Failure Detection Tabular Dataset, available under the Creative Commons license. Grab it from the Kaggle link below:

Let’s import it and see what it looks like!
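A minimal loading sketch, assuming the dataset has been downloaded from Kaggle as heart_failure_clinical_records_dataset.csv (the file name and path are assumptions):

import pandas as pd

# load the heart failure CSV (file name assumed from the Kaggle download)
df = pd.read_csv('heart_failure_clinical_records_dataset.csv')

# preview the first few rows
df.head()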

image by author — data preview

The next step is to split the dataset into training and test sets. All columns except the last one, “Death Event”, are our training features. Looking at that last column, we can see that this is a binary classification task.
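Here is one way to do the split, assuming the target column is named DEATH_EVENT in the CSV; the test_size and random_state values below are assumptions chosen to match the shapes that follow:

from sklearn.model_selection import train_test_split

# all columns except the target are features
X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']

# hold out roughly 30% of the rows for testing (split parameters assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# check the resulting shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape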

The shape of the data:
Output:
((209, 12), (90, 12), (209,), (90,))

Finally, we explore all the numerical columns of our dataset:

X_train.describe().T
image by author — describe the dataset

Looking for categorical features, we verify that there are none:

# there are no categorical features
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()
categorical_features
image by author — no cat features

And now, we can move on to building our pipeline!

Our Scikit-learn Pipeline

The preprocessing pipeline

First, we build our preprocessing pipeline. It will consist of two components — 1) a MinMaxScaler instance for scaling the data to the (0, 1) range, and 2) a SimpleImputer instance for filling missing values with the mean of the existing values in each column.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

col_transformation_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])

We put them together using a ColumnTransformer.

A ColumnTransformer takes tuples describing the different column transformations we need to apply to our data, along with the list of columns each transformation should be applied to. Since we only have numeric columns here, we supply all of our columns to the column transformer object.

Let’s put it all together then:
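Here is a sketch of that step; columns_transformer is the name the model pipeline below expects, while numeric_features and the 'num_pipeline' step name are illustrative choices:

from sklearn.compose import ColumnTransformer

# every feature is numeric, so one transformation covers all columns
numeric_features = X_train.select_dtypes(include='number').columns.tolist()

columns_transformer = ColumnTransformer(transformers=[
    ('num_pipeline', col_transformation_pipeline, numeric_features)
])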

Awesome! The first part of our pipeline is done!

Let’s go and build our model now.

The model pipeline

We choose a Random Forest classifier for this task. Let’s spin up a quick classifier object:

from sklearn.ensemble import RandomForestClassifier

# random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=11, criterion='entropy', random_state=0)

And, we can combine our preprocessing and models in a single pipeline:

rf_model_pipeline = Pipeline(steps=[
    ('preprocessing', columns_transformer),
    ('rf_model', rf_classifier),
])

Now, fitting on our training data is simple enough:

rf_model_pipeline.fit(X_train, y_train)

And finally, we can predict on our test set and calculate our accuracy score:

# predict on test set
y_pred = rf_model_pipeline.predict(X_test)

Putting it together:
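As one consolidated sketch, here are the model-pipeline steps from this section chained together, with accuracy_score added at the end to compute the test-set score mentioned above:

from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# preprocessing + model in a single pipeline
rf_model_pipeline = Pipeline(steps=[
    ('preprocessing', columns_transformer),
    ('rf_model', rf_classifier),
])

# fit on the training data and predict on the test set
rf_model_pipeline.fit(X_train, y_train)
y_pred = rf_model_pipeline.predict(X_test)

# accuracy on the held-out data
accuracy_score(y_test, y_pred)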

This is all well and good. However, what if I said that you could perform Grid Search for finding optimal hyperparameters with this pipeline as well? Wouldn’t that be cool?

Let’s explore that next!

Using GridSearch with our pipeline

We have already built our model and used it for prediction on our dataset. We will now focus on finding the best hyperparameters for our random forest model.

Let’s build up our grid of parameters first:

import numpy as np

params_dict = {
    'rf_model__n_estimators': np.arange(5, 100, 1),
    'rf_model__criterion': ['gini', 'entropy'],
    'rf_model__max_depth': np.arange(10, 200, 5)
}

In this case, we focus on tuning three parameters for our model:

  1. n_estimators: the number of trees in the random forest,
  2. criterion: the function to measure the quality of a split, and
  3. max_depth: the maximum depth of each tree.

One important thing to note here: instead of simply using n_estimators as the parameter name in our grid, we use rf_model__n_estimators. The rf_model__ prefix comes from the name we chose for our random forest model in our pipeline (refer to the previous section).

Next, we simply use GridSearchCV to train our classifier:

from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(rf_model_pipeline, params_dict, cv=10, n_jobs=-1)
grid_search.fit(X_train, y_train)

Let’s put it all together into one:
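A consolidated sketch of the grid search; the closing best_params_ and best_score_ lines are standard GridSearchCV attributes, included here to inspect the winning configuration:

import numpy as np
from sklearn.model_selection import GridSearchCV

# parameter grid, using the pipeline step name 'rf_model' as a prefix
params_dict = {
    'rf_model__n_estimators': np.arange(5, 100, 1),
    'rf_model__criterion': ['gini', 'entropy'],
    'rf_model__max_depth': np.arange(10, 200, 5)
}

# 10-fold cross-validated grid search over the whole pipeline
grid_search = GridSearchCV(rf_model_pipeline, params_dict, cv=10, n_jobs=-1)
grid_search.fit(X_train, y_train)

# best hyperparameters and cross-validated score
grid_search.best_params_, grid_search.best_score_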

Now, it is easy enough to predict using our grid_search object like so:
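A minimal sketch, where grid_search.predict uses the best estimator found during the search and accuracy_score is the same metric as before:

# predict with the best estimator found by the grid search
y_pred = grid_search.predict(X_test)

# accuracy on the test set
accuracy_score(y_test, y_pred)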

image by author — accuracy score

Awesome! We have now built a full pipeline for our project!

A few parting words…

So, there you have it! A full sklearn pipeline consisting of a preprocessor, a model, and grid search, all exercised on a mini project from Kaggle. I hope you found this tutorial illuminating and easy to follow along.

It’s time to give yourself a pat on the back! 😀

Find the entire code for this tutorial here. This is the code repository of all of my data science articles. Star and bookmark it if you please!

In the future, I’ll be coming back and doing some more Scikit-learn based articles. So follow me on Medium and stay in the loop!

I also recommend becoming a Medium member to never miss any of the Data Science articles I publish every week. Join here 👇

Get connected!

Follow me on Twitter. Check out the full code repository of all of my Data Science posts!

A few other articles of mine you might be interested in:
