Yes, You Can Build Your Own Custom Sklearn Transformers. Here Is How

Transformers for any preprocessing scenario

Learn to write custom Sklearn preprocessing transformers that make your code exceptional.

Introduction

Single fit, single predict—how awesome would that be?

You get the data, fit your pipeline just one time, and it takes care of everything: preprocessing, feature engineering, modeling. All you have to do is call predict and get the output.

What kind of pipeline is that powerful? Sklearn has many transformers, but it doesn’t have one for every imaginable preprocessing scenario. So, is such a pipeline a pipe dream?

Absolutely not. Today, we will learn how to create custom Sklearn transformers that enable you to integrate virtually any function or data transformation into Sklearn’s Pipeline classes.

What are Sklearn pipelines?

Below is a simple pipeline that imputes the missing values in numeric data, scales them, and fits an XGBRegressor to X, y:
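Here is a minimal sketch of such a pipeline. The synthetic data and the hyperparameters are assumptions made for illustration, and the code falls back to Sklearn's GradientBoostingRegressor if xgboost is not installed:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

try:
    from xgboost import XGBRegressor as Regressor
except ImportError:  # stand-in so the sketch still runs without xgboost
    from sklearn.ensemble import GradientBoostingRegressor as Regressor

# Synthetic numeric data with a few missing values (illustrative only)
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Impute -> scale -> model, all behind a single estimator
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    Regressor(n_estimators=50),
)
pipeline.fit(X, y)
preds = pipeline.predict(X)
```

A single fit call runs every step in order, and predict reuses the fitted imputer and scaler before calling the model.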

I have talked at length about the nitty-gritty of Sklearn pipelines and their benefits in an older post, "How to Use Sklearn Pipelines For Ridiculously Neat Code" (towardsdatascience.com).

The most notable advantages of pipelines are that they collapse all preprocessing and modeling steps into a single estimator and prevent data leakage by never calling fit on validation sets. As an added bonus, they make the code concise, reproducible, and modular.

But this whole idea of atomic, neat pipelines breaks when we need to perform operations that are not built into Sklearn as estimators. For example, what if you need to extract regex patterns to clean text data? What do you do if you want to create a new feature combining existing ones based on domain knowledge?

To keep all the benefits that come with pipelines, you need a way to integrate your custom preprocessing and feature engineering logic into Sklearn. That’s where custom transformers come into play.

Integrating simple functions with FunctionTransformer

In the September 2021 TPS competition on Kaggle, one of the ideas that boosted model performance significantly was adding the number of missing values in a row as a new feature. This is a custom operation, not implemented in Sklearn. So, after importing the data, let’s create a function that takes a DataFrame as input and implements the above operation:
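A minimal sketch of such a function is below. The name num_missing_row and the toy DataFrame are assumptions for illustration; the competition data itself is not loaded here:

```python
import numpy as np
import pandas as pd


def num_missing_row(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X with a column counting missing values per row."""
    num_missing = X.isnull().sum(axis=1)
    # assign() returns a new DataFrame, so the input is left untouched
    return X.assign(num_missing=num_missing)


# Toy stand-in for the competition data
toy = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [np.nan, np.nan, 6.0]})
result = num_missing_row(toy)
```

Returning a copy rather than modifying the input in place keeps the function safe to call repeatedly, for example inside cross-validation.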

Now, adding this function into a pipeline is just as easy as passing it to the FunctionTransformer:
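As a rough sketch, the wrapped function becomes the first step of a pipeline. The function name, the toy data, and the LinearRegression model are stand-ins chosen to keep the example small:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def num_missing_row(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X with a column counting missing values per row."""
    return X.assign(num_missing=X.isnull().sum(axis=1))


# Toy data (illustrative only)
X = pd.DataFrame({"f1": [1.0, np.nan, 3.0, 4.0], "f2": [np.nan, np.nan, 6.0, 8.0]})
y = np.array([1.0, 2.0, 3.0, 4.0])

pipeline = make_pipeline(
    FunctionTransformer(num_missing_row),  # custom step goes first
    SimpleImputer(strategy="mean"),        # then fill the gaps
    LinearRegression(),                    # stand-in model
)
pipeline.fit(X, y)
preds = pipeline.predict(X)
```

The custom step has to come before the imputer, since the missing-value count is gone once the gaps are filled.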

Passing a custom function to FunctionTransformer creates an estimator with fit, transform and fit_transform methods:
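A quick way to see those methods in action, again with a hypothetical function name and toy data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def num_missing_row(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X with a column counting missing values per row."""
    return X.assign(num_missing=X.isnull().sum(axis=1))


toy = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [np.nan, np.nan, 6.0]})

ft = FunctionTransformer(num_missing_row)

# The wrapper exposes the usual transformer interface
has_methods = all(hasattr(ft, m) for m in ("fit", "transform", "fit_transform"))

fitted = ft.fit(toy)                   # returns the transformer itself
via_transform = fitted.transform(toy)  # applies num_missing_row
via_fit_transform = ft.fit_transform(toy)
```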

Since we have a simple function, there is no need to call fit as it just returns the estimator untouched. The only requirement of FunctionTransformer is that the passed function should accept the data as its first argument. Any extra arguments can be supplied through the kw_args parameter. Note that recent Sklearn versions do not forward the target array to the function, so logic that needs the target belongs in a custom transformer class.
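For extra arguments beyond the data, FunctionTransformer takes a kw_args dictionary and forwards it to the function; to my knowledge, recent Sklearn versions pass only the data and these keyword arguments, not the target. A small sketch, where the col_name parameter is a hypothetical addition for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def num_missing_row(X: pd.DataFrame, col_name: str = "num_missing") -> pd.DataFrame:
    """Return a copy of X with a per-row missing-value count in col_name."""
    return X.assign(**{col_name: X.isnull().sum(axis=1)})


toy = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [np.nan, np.nan, 6.0]})

# kw_args is forwarded to the function as keyword arguments
ft = FunctionTransformer(num_missing_row, kw_args={"col_name": "n_nan"})
out = ft.fit_transform(toy)
```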

FunctionTransformer also accepts an inverse of the passed function if you ever need to revert the changes:
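A small sketch of the inverse_func argument, using NumPy's log1p and expm1 as an assumed function pair:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[0.0, 1.0], [9.0, 99.0]])

# log1p(x) = log(1 + x); expm1(x) = exp(x) - 1 undoes it
ft = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X_transformed = ft.fit_transform(X)
X_restored = ft.inverse_transform(X_transformed)
```

With both functions supplied, inverse_transform maps the transformed array back to the original values.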

Check out the documentation for details on other arguments.

Integrating more complex preprocessing steps with custom transformers

One of the most common scaling options for skewed data is a logarithmic transform. But here is a caveat: if a feature has even a single 0, the common np.log function returns -inf for it (along with a runtime warning), and most estimators reject infinite values. So, as a workaround, Kagglers add 1 to all samples and then apply the logarithmic transform.

Custom transformations like that require inverse transformations as well. For logarithms, you need to apply the exponential function to the transformed array and subtract 1. Here is what it looks like in code:
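A minimal NumPy sketch of the add-one log transform and its inverse, on a made-up array:

```python
import numpy as np

# Made-up skewed data containing a zero
X = np.array([[0.0, 1.0], [9.0, 99.0]])

# Forward: shift by 1 so zeros are safe, then take the log
X_log = np.log(X + 1)

# Inverse: exponentiate, then undo the shift
X_restored = np.exp(X_log) - 1
```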

This works, but we have the same old problem: we can’t include this in a pipeline out of the box. Sure, we could use our newfound friend FunctionTransformer, but it is not well-suited for more complex preprocessing steps such as this.

Instead, we will write a custom transformer class and implement the fit and transform methods manually. In the end, we will again have a Sklearn-compatible estimator that we can pass into a pipeline. Let's start:

We first create a class that inherits from the BaseEstimator and TransformerMixin classes of sklearn.base. Inheriting from these classes lets our class behave like any built-in Sklearn estimator inside a pipeline.

Then, we will write the __init__ method, where we initialize an instance of PowerTransformer:
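A sketch of the class at this stage is below. The class name CustomLogTransformer and the attribute name _estimator are assumptions. Keep in mind that PowerTransformer defaults to the Yeo-Johnson method with standardization, so it is a log-like transform rather than a literal np.log:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer


class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        # Internal estimator that will do the actual transformation
        self._estimator = PowerTransformer()


transformer = CustomLogTransformer()
```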

Next, we write the fit method, where we add 1 to all features in the data and fit the PowerTransformer:

The fit method should return the transformer itself, which is done by returning self. Let's test what we have done so far:
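Here is a sketch of the class with fit added, followed by a quick check on made-up data. The class and attribute names are assumptions:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer


class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        # Shift a copy by 1 so the caller's array is not modified
        X_shifted = np.copy(X) + 1
        self._estimator.fit(X_shifted)
        return self  # returning self allows method chaining


# Made-up skewed data (illustrative only)
rng = np.random.default_rng(0)
X = rng.exponential(size=(100, 2))

transformer = CustomLogTransformer()
fitted = transformer.fit(X)
```

The y=None argument is unused here, but keeping it in the signature lets a pipeline call fit(X, y) on the step without errors.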

Working as expected, so far.

Next, we have the transform method, in which we add 1 to the passed data and then call the transform method of PowerTransformer:

Let’s make another check:
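A sketch of the class with transform added, plus the check on made-up data (names are assumptions, as before):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer


class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        X_shifted = np.copy(X) + 1
        self._estimator.fit(X_shifted)
        return self

    def transform(self, X):
        # Apply the same shift used in fit before transforming
        X_shifted = np.copy(X) + 1
        return self._estimator.transform(X_shifted)


rng = np.random.default_rng(0)
X = rng.exponential(size=(100, 2))

transformer = CustomLogTransformer().fit(X)
X_transformed = transformer.transform(X)
```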

Working as expected. Now, as I said earlier, we need a method for reverting the transform:

Had we used a plain np.log in transform, np.exp would work here instead of inverse_transform. Now, let's make a final check:
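Below is a sketch of the finished class, including the inverse_transform method described above, followed by a final check that also round-trips the data. Class name, attribute name, and data are assumptions:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer


class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        self._estimator.fit(np.copy(X) + 1)
        return self

    def transform(self, X):
        return self._estimator.transform(np.copy(X) + 1)

    def inverse_transform(self, X):
        # Undo the power transform first, then undo the +1 shift
        X_reversed = self._estimator.inverse_transform(np.copy(X))
        return X_reversed - 1


rng = np.random.default_rng(0)
X = rng.exponential(size=(100, 2))
X[0, 0] = 0.0  # include a zero on purpose

transformer = CustomLogTransformer()
X_transformed = transformer.fit_transform(X)  # provided by TransformerMixin
X_restored = transformer.inverse_transform(X_transformed)
```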

But wait! We didn’t write fit_transform - where did that come from?

It is simple: TransformerMixin gives you a fit_transform method for free, which calls fit and then transform. BaseEstimator, in turn, adds get_params and set_params.

After the inverse transform, you can compare it with the original data:

Now, we have a custom transformer ready to be included in a pipeline. Let’s put everything together:
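Here is a sketch of the full setup. The synthetic data and the LinearRegression model are stand-ins, so the scores it produces will not match the ones discussed in this post:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer


class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        self._estimator.fit(np.copy(X) + 1)
        return self

    def transform(self, X):
        return self._estimator.transform(np.copy(X) + 1)

    def inverse_transform(self, X):
        return self._estimator.inverse_transform(np.copy(X)) - 1


# Synthetic skewed features and a noisy target (illustrative only)
rng = np.random.default_rng(0)
X = rng.exponential(size=(200, 3))
y = np.log1p(X).sum(axis=1) + rng.normal(scale=0.1, size=200)

pipeline = make_pipeline(CustomLogTransformer(), LinearRegression())

# Each fold refits the transformer on its own training split only
scores = cross_val_score(pipeline, X, y, cv=5, scoring="r2")
```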

Even though the log transform hurt the score, we got our custom pipeline working!

In short, the signature of your custom transformer class should be like this:
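A bare-bones template along those lines; the class name and the pass-through method bodies are placeholders to replace with your own logic:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        # Store hyperparameters or helper objects here
        pass

    def fit(self, X, y=None):
        # Learn anything needed from the training data
        return self

    def transform(self, X):
        # Placeholder: return the transformed data
        return X

    def inverse_transform(self, X):
        # Placeholder: return the data in its original scale
        return X


X = np.arange(6, dtype=float).reshape(3, 2)
out = CustomTransformer().fit_transform(X)  # inherited from TransformerMixin
```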

This way, you get fit_transform for free. The __init__ and inverse_transform methods are optional, so omit them if you don't need them. The fit and transform methods must be defined, though, because fit_transform calls both. The logic of these methods is entirely up to your needs and your skills.

Wrapping up…

Writing good code is a skill developed over time. You will realize that a big part of it comes from using the existing tools and libraries at the right time and place without reinventing the wheel.

One such tool is Sklearn pipelines, and custom transformers are just extensions of them. Use them well, and you will produce quality code with little effort.

Thank you for reading!