The article provides a comprehensive guide on building Scikit-learn pipelines for data preprocessing, model training, and hyperparameter tuning using GridSearchCV, demonstrated through a heart failure dataset project.
Abstract
The article "How To Build Scikit-learn Pipelines Like A Pro" by Yash Prakash offers a hands-on tutorial for creating efficient machine learning workflows using Scikit-learn. It guides readers through the process of handling a Kaggle dataset for heart failure prediction, detailing the steps to preprocess data with pipelines that include MinMaxScaler and SimpleImputer, model training using a Random Forest classifier, and hyperparameter optimization with GridSearchCV. The author emphasizes the convenience of pipelines for streamlining the machine learning process and demonstrates the approach's effectiveness by achieving high accuracy in predictions. The article concludes with an invitation to follow the author on Medium and GitHub for more data science content and a call to action for readers to engage with more Scikit-learn-based articles in the future.
Opinions
The author believes that using pipelines is a "neat" and "convenient" way to handle machine learning workflows in code.
The use of a real-world Kaggle dataset for heart failure prediction is presented as a practical and engaging way to learn about building pipelines.
The author suggests that the ability to perform hyperparameter searches within the pipeline framework is "cool" and enhances the model's performance.
Yash Prakash encourages readers to star and bookmark the provided GitHub repository, indicating a desire for community engagement and recognition of the repository's value.
The author expresses enthusiasm for future Scikit-learn-based articles, indicating a commitment to continuous learning and sharing within the data science community.
The recommendation to become a Medium member to access the author's weekly data science articles implies the author's confidence in the quality and relevance of their content to the data science community.
How To Build Scikit-learn Pipelines Like A Pro
Learn to build preprocessing, model, and grid search pipelines the easy way with a mini project
Every time you pick up a dataset for a project, you are tasked with cleaning and preprocessing the data, dealing with missing values and outliers, modelling, and even performing hyperparameter searches to find the optimal settings for your estimators.
Fortunately, there is a convenient and neat way to do all of this in code: Pipelines.
In this article, we will work through a fairly popular Kaggle dataset, perform all of these steps, and build a real sklearn pipeline to learn from.
Let’s get started👇
Exploring the dataset
The dataset we will be using for this mini project is the Heart Failure Detection tabular dataset from Kaggle, available under a Creative Commons license. Grab it from the Kaggle link below:
The next step is to split the dataset into training and test sets. Every column except the last one, “Death Event”, is a feature we can train on; the last column is the target, which makes this a binary classification task.
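Here is a minimal sketch of that split. The CSV file name, the DEATH_EVENT column name, the 70/30 split, and the random_state are assumptions chosen to match the shapes reported below:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Kaggle CSV (file name is an assumption)
df = pd.read_csv('heart_failure_clinical_records_dataset.csv')

# 'DEATH_EVENT' is the assumed name of the target column
X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']

# 70/30 split; the random_state is an arbitrary choice for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)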
The shape of the data:
Output:
((209, 12), (90, 12), (209,), (90,))
Finally, we explore all the numerical columns of our dataset:
X_train.describe().T
image by author — describe the dataset
Looking for categorical columns, we verify that there are none:
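One quick way to check this (a minimal sketch):

# Any non-numeric (object) columns would show up here; an empty list means there are none
print(X_train.select_dtypes(include='object').columns.tolist())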
First, we build our preprocessing pipeline. It will consist of two components — 1) a MinMaxScaler instance for scaling the data to the range (0, 1), and 2) a SimpleImputer instance for filling missing values using the mean of the existing values in each column.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Impute missing values first, then scale each feature to the (0, 1) range
col_transformation_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])
We put them together using a ColumnTransformer.
A ColumnTransformer takes tuples describing the different column transformations we need to apply to our data, along with the list of columns each transformation should be applied to. Since we only have numeric columns here, we supply all of our columns to the column transformer object.
Let’s put it all together then:
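A sketch of that step, assuming every column of X_train is numeric; the step name num_transform is illustrative:

from sklearn.compose import ColumnTransformer

# Apply the impute + scale pipeline to every column of the dataset
numeric_cols = X_train.columns.tolist()

columns_transformer = ColumnTransformer(transformers=[
    ('num_transform', col_transformation_pipeline, numeric_cols)
])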
Awesome! The first part of our pipeline is done!
Let’s go and build our model now.
The model pipeline
We choose a Random Forest classifier for this task. Let’s spin up a quick classifier object:
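A minimal sketch of the classifier object; the constructor arguments are assumptions, since the article does not spell them out:

from sklearn.ensemble import RandomForestClassifier

# Default forest with a fixed seed for reproducibility (the seed value is an assumption)
rf_classifier = RandomForestClassifier(random_state=42)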
And, we can combine our preprocessing and models in a single pipeline:
rf_model_pipeline = Pipeline(steps=[
('preprocessing', columns_transformer),
('rf_model', rf_classifier),
])
Now, fitting on our training data is simple enough:
rf_model_pipeline.fit(X_train, y_train)
And finally, we can predict on our test set and calculate our accuracy score:
# predict on test set
y_pred = rf_model_pipeline.predict(X_test)
Putting it together:
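A minimal sketch of the combined steps; the accuracy_score call is an assumption, as the article only states that an accuracy score is computed:

from sklearn.metrics import accuracy_score

# Fit the full preprocessing + model pipeline on the training set
rf_model_pipeline.fit(X_train, y_train)

# Predict on the test set and compute the accuracy
y_pred = rf_model_pipeline.predict(X_test)
print(accuracy_score(y_test, y_pred))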
This is all well and good. However, what if I told you that you could also perform a grid search for the optimal hyperparameters with this pipeline? Wouldn’t that be cool?
Let’s explore that next!
Using GridSearchCV with our pipeline
We have already built our model and used it to make predictions on our dataset. We will now focus on finding the best hyperparameters for our random forest model.
In this case, we focus on tuning three parameters for our model:
1. n_estimators: The number of trees in the random forest,
2. criterion: The function to measure the quality of a split, and
3. max_depth: The maximum depth of each tree.
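A plausible parameter grid for these three settings is sketched below; the candidate values are illustrative, not taken from the article:

# Keys are prefixed with 'rf_model__' to target the random forest step of the pipeline
param_grid = {
    'rf_model__n_estimators': [100, 200, 500],
    'rf_model__criterion': ['gini', 'entropy'],
    'rf_model__max_depth': [None, 5, 10],
}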
One important thing to note here: instead of simply using n_estimators as a parameter name in our grid, we use rf_model__n_estimators. The rf_model__ prefix comes from the name we chose for the random forest step in our pipeline (refer to the previous section).
Next, we simply use GridSearchCV to train our classifier:
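A sketch of that search, assuming the param_grid above; the cv and n_jobs settings are assumptions:

from sklearn.model_selection import GridSearchCV

# Cross-validated search over the whole pipeline (preprocessing + model)
grid_search = GridSearchCV(rf_model_pipeline, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)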
Now, it is easy enough to predict using our grid_search object like so:
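For example (a sketch; accuracy_score is reused from the earlier snippet):

# GridSearchCV refits the best estimator on the full training set by default,
# so these predictions use the best hyperparameters found
y_pred = grid_search.predict(X_test)
print(accuracy_score(y_test, y_pred))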
image by author — accuracy score
Awesome! We have now built a full pipeline for our project!
A few parting words…
So, there you have it! A full sklearn pipeline consisting of a preprocessor, a model, and a grid search, all built around a mini project from Kaggle. I hope you found this tutorial illuminating and easy to follow.