Nicolas Bertagnolli

Summary

This article provides a tutorial on how to extract feature importances from any sklearn pipeline, focusing on pipelines and feature unions.

Abstract

The article begins with an introduction to the usefulness of pipelines in data science projects and the difficulty of extracting feature importances from them. It then proceeds to demonstrate how to access individual elements in a pipeline, such as a TfidfVectorizer and a LinearSVC classifier, and extract their feature names and coefficients. The author provides code examples and visualizations to illustrate the process.

The tutorial then moves on to a more complex example involving a FeatureUnion, which concatenates the results of multiple transformers. The author shows how to extract features from this model and visualize them.

Finally, the author presents a generalized solution for extracting feature importances from any pipeline using a depth-first search algorithm. The algorithm handles three cases: featurization steps, pipelines, and feature unions. The author provides code examples and explanations for each case.

Bullet points

  • Pipelines are useful in data science projects but extracting feature importances can be difficult.
  • Individual elements in a pipeline can be accessed using the named_steps parameter.
  • Feature names and coefficients can be extracted from featurizers and classifiers, respectively.
  • FeatureUnion concatenates the results of multiple transformers.
  • A depth-first search algorithm can be used to extract feature importances from any pipeline.
  • The algorithm handles three cases: featurization steps, pipelines, and feature unions.


How to Get Feature Importances from Any Sklearn Pipeline

Pipelines can be hard to navigate; here's some code that works in general.


Introduction

Pipelines are amazing! I use them in basically every data science project I work on. But, easily getting the feature importance is way more difficult than it needs to be. In this tutorial, I’ll walk through how to access individual feature names and their coefficients from a Pipeline. After that, I’ll show a generalized solution for getting feature importance for just about any pipeline.

Pipelines

Let’s start with a super simple pipeline that applies a single featurization step followed by a classifier.

from datasets import list_datasets, load_dataset, list_metrics
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
# Load the imdb sentiment dataset
imdb_data = load_dataset('imdb')
classifier = svm.LinearSVC(C=1.0, class_weight="balanced")
model = Pipeline(
    [
        ("vectorizer", TfidfVectorizer()),
        ("classifier", classifier),
    ]
)
x_train = [x["text"]for x in imdb_data["train"]]
y_train = [x["label"]for x in imdb_data["train"]]
model.fit(x_train, y_train)

Here we use the excellent datasets python package to quickly access the imdb sentiment data. This package, put together by HuggingFace, has a ton of great datasets, and they are all ready to go so you can get straight to the fun model building.

The above pipeline defines two steps in a list. It first takes input and passes it through a TfidfVectorizer which takes in text and returns the TF-IDF features of the text as a vector. It then passes that vector to the SVM classifier.

Notice how this happens in order: first the TF-IDF step, then the classifier. You can chain as many featurization steps as you’d like. For example, the above pipeline is equivalent to:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

model = Pipeline(
    [
        ("vectorizer", CountVectorizer()),
        ("transformer", TfidfTransformer()),
        ("classifier", classifier),
    ]
)

Here we do things even more manually. First, we get counts of every word, second, we apply the TF-IDF transformation, and finally, we pass this feature vector to the classifier. The TfidfVectorizer does those two in one step. But this illustrates the point. In a raw pipeline, things execute in order. We’ll discuss how to stack features together a little later. For now, let’s work on getting the feature importance for our first example model.

Feature Importances

Pipelines make it easy to access the individual elements. If you print out the model after training you’ll see:

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(...)),
                ('classifier',
                 LinearSVC(...))],
         verbose=False)

This is saying there are two steps: one named vectorizer and the other named classifier. We can access these by looking at the named_steps parameter of the pipeline like so:

model.named_steps["vectorizer"]

This will return our fitted TfidfVectorizer. Pretty neat! Most featurization steps in Sklearn also implement a get_feature_names() method which we can use to get the names of each feature by running:

# Get the names of each feature
feature_names = model.named_steps["vectorizer"].get_feature_names()

This will give us a list of every feature name in our vectorizer. Then we just need to get the coefficients from the classifier. For most classifiers in Sklearn this is as easy as grabbing the .coef_ attribute. (Ensemble methods are a little different; they have a feature_importances_ attribute instead.)

# Get the coefficients of each feature
coefs = model.named_steps["classifier"].coef_.flatten()
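If the classifier at the end of the pipeline were a tree-based ensemble instead, say a RandomForestClassifier rather than the LinearSVC we actually trained above, the equivalent would be to read that attribute; a minimal sketch:

# Only applies if the pipeline's classifier is an ensemble such as a
# RandomForestClassifier, not the LinearSVC used above
importances = model.named_steps["classifier"].feature_importances_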

Now we have the coefficients in the classifier and also the feature names. Let’s put them together into a nice plot.

import pandas as pd
# Zip coefficients and names together and make a DataFrame
zipped = zip(feature_names, coefs)
df = pd.DataFrame(zipped, columns=["feature", "value"])
# Sort the features by the absolute value of their coefficient
df["abs_value"] = df["value"].apply(lambda x: abs(x))
df["colors"] = df["value"].apply(lambda x: "green" if x > 0 else "red")
df = df.sort_values("abs_value", ascending=False)

And visualize:

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(1, 1, figsize=(12, 7))
sns.barplot(x="feature",
            y="value",
            data=df.head(20),
            palette=df.head(20)["colors"])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=20)
ax.set_title("Top 20 Features", fontsize=25)
ax.set_ylabel("Coef", fontsize=22)
ax.set_xlabel("Feature Name", fontsize=22)

So we can see that negative unigrams seem to be the most impactful. That’s pretty cool. Getting these feature importances was easy. Let’s try a slightly more complicated example.

Get Feature Importances from a FeatureUnion

In most real applications I find I’m combining lots of features together in intricate ways. Earlier we saw how a pipeline executes each step in order. How do we handle multiple simultaneous steps? The answer is the FeatureUnion class. Let’s say we want to build a model where we take in TF-IDF bigram features but have some hand-curated unigrams as well. (See my blog post on using models to find good unigrams here.) We can define this pipeline using a FeatureUnion. A FeatureUnion takes a transformer_list, which can be a list of transformers, pipelines, classifiers, etc., and then concatenates their results.

classifier = svm.LinearSVC(C=1.0, class_weight="balanced")
vocab = {"worst": 0, "awful": 1, "waste": 2,
         "boring": 3, "excellent": 4}
model = Pipeline([
    ("union", FeatureUnion(transformer_list=[
        ("handpicked", TfidfVectorizer(vocabulary=vocab)),
        ("bigrams", TfidfVectorizer(ngram_range=(2, 2))),
    ])),
    ("classifier", classifier),
])

As you can see, at a high level our model has two steps: a union and a classifier. Inside the union we do two distinct featurization steps: we find a set of hand-picked unigram features and then all bigram features.

Extracting the features from this model is slightly more complicated. We have to go into the union and then get all the individual features. Let’s try and do this by hand and then see if we can generalize to any arbitrary Pipeline. We already know how to access members of a pipeline: the named_steps attribute. To get inside of the FeatureUnion we can look directly at the transformer_list and step through each element. So the code would look something like this.

handpicked = (model
              .named_steps["union"]
              .transformer_list[0][1]
              .get_feature_names())
bigrams = (model
           .named_steps["union"]
           .transformer_list[1][1]
           .get_feature_names())
# The union concatenates in transformer_list order: handpicked first, then bigrams
feature_names = handpicked + bigrams

Since the classifier is an SVM that operates on a single concatenated vector, the coefficients come from the same place and are in the same order as the union's output. We can visualize our results again.
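Grabbing the coefficients and pairing them with these names works exactly like before; a quick sketch reusing the earlier DataFrame recipe:

# Same recipe as before: flatten the SVM coefficients and pair them with
# the concatenated feature names from the union
coefs = model.named_steps["classifier"].coef_.flatten()
df = pd.DataFrame(zip(feature_names, coefs), columns=["feature", "value"])
df["abs_value"] = df["value"].abs()
df = df.sort_values("abs_value", ascending=False)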

Looks like our bigrams were much more informative than our hand selected unigrams.

The General Case

So we’ve done some simple examples but now we want a way to do this for any (roughly any) Pipeline and FeatureUnion combination. For that we turn to our old friend Depth First Search (DFS). We are going to view a Pipeline as a tree. Each layer can have an arbitrary number of FeatureUnions but they will all stack up to a single feature vector in the end. There are roughly three cases to consider when traversing. The first is the base case where we are in an actual transformer or classifier that will generate our features. The second is if we are in a Pipeline. The third and final case is when we are inside of a FeatureUnion. Let’s talk about these in a little more depth.

Case 1: Featurization Step

Here we want to write a function which, given a featurizer of some kind, will return the names of the features. This is the base case in our DFS. In Sklearn there are a number of different types of things which can be used for generating features: clustering techniques, dimensionality reduction methods, traditional classifiers, and preprocessors, to name a few. Each one lets you access the feature names in a different way. For example, the text preprocessor TfidfVectorizer implements a get_feature_names method like we saw above. However, most clustering methods don’t have any named features; they produce arbitrary clusters, but they do have a fixed number of them. Let’s write a helper function that, given a Sklearn featurization method, will return a list of features.
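A minimal sketch of such a helper is below. The exact set of hasattr checks is an assumption on my part; extend it for whatever featurizers your pipelines actually use.

from typing import List


def extract_feature_names(model, name) -> List[str]:
    """Extracts feature names from an arbitrary sklearn featurization step.

    If the step has no named features (e.g. clustering or dimensionality
    reduction), names are constructed from the step's name in the pipeline.
    """
    if hasattr(model, "get_feature_names"):
        # Vectorizers and most text featurizers name their own features
        return model.get_feature_names()
    elif hasattr(model, "n_clusters"):
        # Clustering methods expose a fixed number of unnamed clusters
        return [f"{name}_{i}" for i in range(model.n_clusters)]
    elif hasattr(model, "n_components"):
        # Dimensionality reduction (PCA, TruncatedSVD, ...) works the same way
        return [f"{name}_{i}" for i in range(model.n_components)]
    elif hasattr(model, "classes_"):
        # A fitted classifier used as a featurizer outputs one value per class
        return [f"{name}_{c}" for c in model.classes_]
    else:
        # Fall back to a single feature named after the step
        return [name]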

Here we try and enumerate a number of potential cases that can occur inside of Sklearn. We use hasattr to check if the provided model has the given attribute, and if it does we call it to get the feature names. If the method is something like clustering and doesn’t involve actual named features, we construct our own feature names by using a provided name. For example, let’s say we apply this method to PCA with two components and we’ve named the step pca; then the resultant feature names returned would be [pca_0, pca_1].
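For instance, calling the helper on a PCA step named pca (n_components is set at construction, so this works even on an unfitted instance):

from sklearn.decomposition import PCA

extract_feature_names(PCA(n_components=2), "pca")
# ['pca_0', 'pca_1']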

DFS

Now we can implement the DFS.
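Below is a minimal sketch of that traversal, building on the extract_feature_names helper from the previous section; it handles exactly the three cases described next, and anything else (like the final classifier) contributes no feature names.

def get_feature_names(model, names: List[str], name: str) -> List[str]:
    """Walks a (possibly nested) Pipeline and returns feature names in order.

    Args:
        model: a Pipeline, FeatureUnion, or individual step to traverse.
        names: the names of the featurization steps whose features we want.
        name: the name of the current step (None on the first call).
    """
    if name in names:
        # Base case: a leaf featurization step we explicitly asked for
        return extract_feature_names(model, name)
    elif isinstance(model, Pipeline):
        # Recurse through every named step of the Pipeline, in order
        feature_names = []
        for step_name, step in model.named_steps.items():
            feature_names += get_feature_names(step, names, step_name)
        return feature_names
    elif isinstance(model, FeatureUnion):
        # Recurse through every sub-transformer in the union, in order
        feature_names = []
        for step_name, step in model.transformer_list:
            feature_names += get_feature_names(step, names, step_name)
        return feature_names
    # Anything else (e.g. the final classifier) contributes no feature names
    return []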

Let’s step through this together. This function will take three things. The first is the model we want to analyze. This model should be a Pipeline. The second is a list of all named featurization steps we want to pull out. In our last example this was bigrams and handpicked. These are the names of the individual steps that we used in our model. The last parameter is the current name we are looking at. This is necessary for the recursion and doesn’t matter on first pass. (I should make a helper method to hide this from the end user but this is less code to explain for now).

  • The base case deals with the situation when the name of the step matches a name in our list of desired names. This corresponds to a leaf node that actually does featurization and whose feature names we want to collect.
  • The Pipeline case: when we are at a Pipeline we get the names of each step by accessing the named_steps parameter and then recurse through them, combining all the feature names we collect into one list.
  • The FeatureUnion case: when we are at a FeatureUnion we get each sub-transformer from the transformer_list parameter and then recurse through them to collect the features.

With this in hand we can now take an arbitrarily nested pipeline, say for example the below code, and get the feature names in the correct order!

from sklearn.decomposition import TruncatedSVD

classifier = svm.LinearSVC(C=1.0, class_weight="balanced")
vocab = {"worst": 0, "awful": 1, "waste": 2,
         "boring": 3, "excellent": 4}
model = Pipeline([
    ("union", FeatureUnion(transformer_list=[
        ("h1", TfidfVectorizer(vocabulary={"worst": 0})),
        ("h2", TfidfVectorizer(vocabulary={"best": 0})),
        ("h3", TfidfVectorizer(vocabulary={"awful": 0})),
        ("tfidf_cls", Pipeline([
            ("vectorizer", CountVectorizer()),
            ("transformer", TfidfTransformer()),
            ("tsvd", TruncatedSVD(n_components=2)),
        ])),
    ])),
    ("classifier", classifier),
])

In this example, we construct three hand-written rule featurizers and also a sub-pipeline which does multiple steps and results in dimensionality-reduced features. We can get all the feature names from this pipeline using one line!

get_feature_names(model, ["h1", "h2", "h3", "tsvd"], None)

Which will return

['worst', 'best', 'awful', 'tsvd_0', 'tsvd_1']

Exactly what we’d expect.

Conclusion

There are a lot of ways to mix and match steps in a pipeline, and getting the feature names can be kind of a pain. If we use DFS we can extract them all in the correct order. This method will work for most cases in scikit-learn’s ecosystem, but I haven’t tested everything. To extend it you just need to look at the documentation of whatever class you’re trying to pull names from and update the extract_feature_names method with a new conditional checking if the desired attribute is present. I hope this helps make Pipelines easier to use and explore : ). You can find a Jupyter notebook with some of the code samples for this piece here. As with all my posts, if you get stuck please comment here or message me on LinkedIn; I’m always interested to hear from folks. Happy Coding!

Python
Machine Learning
Data Science
Editors Pick
Getting Started