How to Get Feature Importances from Any Sklearn Pipeline
Pipelines can be hard to navigate; here’s some code that works in general.
Introduction
Pipelines are amazing! I use them in basically every data science project I work on. But easily getting the feature importances out of a pipeline is more difficult than it needs to be. In this tutorial, I’ll walk through how to access individual feature names and their coefficients from a Pipeline. After that, I’ll show a generalized solution for getting feature importances from just about any pipeline.
Pipelines
Let’s start with a super simple pipeline that applies a single featurization step followed by a classifier.
from datasets import load_dataset
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
# Load the IMDB sentiment dataset
imdb_data = load_dataset('imdb')
classifier = svm.LinearSVC(C=1.0, class_weight="balanced")
model = Pipeline(
[
("vectorizer", TfidfVectorizer()),
("classifier", classifier),
]
)
x_train = [x["text"] for x in imdb_data["train"]]
y_train = [x["label"] for x in imdb_data["train"]]
model.fit(x_train, y_train)
Here we use the excellent datasets Python package to quickly access the IMDB sentiment data. This package, put together by HuggingFace, has a ton of great datasets that are all ready to go, so you can get straight to the fun model building.
The above pipeline defines two steps in a list. It first takes input and passes it through a TfidfVectorizer which takes in text and returns the TF-IDF features of the text as a vector. It then passes that vector to the SVM classifier.
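Once fit, the whole chain runs end to end on raw text, so making a prediction looks like this (a quick sanity check; the example sentence is just made up):

# The pipeline applies the TF-IDF featurization and then the SVM for us,
# returning one predicted label per input string
model.predict(["This movie was a complete waste of time."])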
Notice how this happens in order: first the TF-IDF step, then the classifier. You can chain as many featurization steps as you’d like. For example, the above pipeline is equivalent to:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

model = Pipeline(
    [
        ("vectorizer", CountVectorizer()),
        ("transformer", TfidfTransformer()),
        ("classifier", classifier),
    ]
)
Here we do things even more manually. First we get counts of every word, second we apply the TF-IDF transformation, and finally we pass this feature vector to the classifier. The TfidfVectorizer does both of those in one step, but this version illustrates the point: in a plain pipeline, things execute in order. We’ll discuss how to stack features together a little later. For now, let’s work on getting the feature importances for our first example model.
Feature Importances
Pipelines make it easy to access the individual elements. If you print out the model after training you’ll see:
Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(...)),
                ('classifier',
                 LinearSVC(...))],
         verbose=False)
This is saying there are two steps, one named vectorizer and the other named classifier. We can access these by looking at the named_steps attribute of the pipeline like so:
model.named_steps["vectorizer"]
This will return our fitted TfidfVectorizer. Pretty neat! Most featurization steps in Sklearn also implement a get_feature_names() method, which we can use to get the names of each feature by running:
# Get the names of each feature
feature_names = model.named_steps["vectorizer"].get_feature_names()
This will give us a list of every feature name in our vectorizer. Then we just need to get the coefficients from the classifier. For most classifiers in Sklearn this is as easy as grabbing the .coef_ attribute. (Ensemble methods are a little different: they have a feature_importances_ attribute instead.)
# Get the coefficients of each feature
coefs = model.named_steps["classifier"].coef_.flatten()
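As an aside, if the final step were an ensemble model instead of a linear one, the access pattern is the same but goes through feature_importances_. Here’s a minimal sketch, assuming we swapped the SVM for a RandomForestClassifier (the forest_model name and the swap itself are just for illustration, and fitting a forest on the full IMDB TF-IDF matrix is slow):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical variant of the pipeline with an ensemble classifier
forest_model = Pipeline(
    [
        ("vectorizer", TfidfVectorizer()),
        ("classifier", RandomForestClassifier()),
    ]
)
forest_model.fit(x_train, y_train)

# Ensembles expose feature_importances_ rather than coef_
importances = forest_model.named_steps["classifier"].feature_importances_

Unlike coefficients, these importances are always non-negative, so there’s no sign to interpret. We’ll stick with the linear SVM’s coefficients for the rest of this post.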
Now we have the coefficients in the classifier and also the feature names. Let’s put them together into a nice plot.
import pandas as pd
# Zip coefficients and names together and make a DataFrame
zipped = zip(feature_names, coefs)
df = pd.DataFrame(zipped, columns=["feature", "value"])
# Sort the features by the absolute value of their coefficient
df["abs_value"] = df["value"].apply(lambda x: abs(x))
df["colors"] = df["value"].apply(lambda x: "green" if x > 0 else "red")
df = df.sort_values("abs_value", ascending=False)
And visualize:
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(1, 1, figsize=(12, 7))
sns.barplot(x="feature",
y="value",
data=df.head(20),
palette=df.head(20)["colors"])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=20)
ax.set_title("Top 20 Features", fontsize=25)
ax.set_ylabel("Coef", fontsize=22)
ax.set_xlabel("Feature Name", fontsize=22)
So we can see that negative unigrams seem to be the most impactful. That’s pretty cool. Getting these feature importances was easy. Let’s try a slightly more complicated example.
Get Feature Importances from a FeatureUnion
In most real applications I find I’m combining lots of features together in intricate ways. Earlier we saw how a pipeline executes each step in order. How do we handle multiple simultaneous steps? The answer is the FeatureUnion class. Let’s say we want to build a model where we take in TF-IDF bigram features but have some hand-curated unigrams as well. (See my blog post on using models to find good unigrams here.) We can define this pipeline using a FeatureUnion. A FeatureUnion takes a transformer_list, which can be a list of transformers, pipelines, classifiers, etc., and then concatenates their results.
classifier = svm.LinearSVC(C=1.0, class_weight="balanced")
vocab = {"worst": 0, "awful": 1, "waste": 2,
"boring": 3, "excellent": 4}
model = Pipeline([
("union", FeatureUnion(transformer_list=[
("handpicked", TfidfVectorizer(vocabulary=vocab)),
("bigrams", TfidfVectorizer(ngram_range=(2, 2)))])
),
("classifier", classifier),
])
As you can see, at a high level our model has two steps: a union and a classifier. Inside the union we do two distinct featurization steps: we find a set of hand-picked unigram features and then all bigram features.
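One step the snippet above glosses over: the union pipeline still needs to be fit before we can pull feature names out of it, since the vectorizers only learn their vocabularies during fitting. Same one-liner as before:

# Fit the FeatureUnion pipeline on the same IMDB training data
model.fit(x_train, y_train)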
Extracting the features from this model is slightly more complicated. We have to go into the union, and then get all the individual features. Let’s try to do this by hand and then see if we can generalize to any arbitrary Pipeline. We already know how to access members of a pipeline: the named_steps attribute.
To get inside of the FeatureUnion we can look directly at the transformer_list and step through each element. So the code would look something like this:
handpicked = (model
.named_steps["union"]
.transformer_list[0][1]
.get_feature_names())
bigrams = (model
.named_steps["union"]
.transformer_list[1][1]
.get_feature_names())
feature_names = handpicked + bigrams
Since the classifier is an SVM that operates on a single concatenated vector, the coefficients line up with the feature names in the same order as the transformer_list: the handpicked unigrams first, then the bigrams. We can visualize our results again.
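Rebuilding the coefficient DataFrame is the same recipe as before (shown again here for completeness):

import pandas as pd

# Pair every concatenated feature name with its SVM coefficient
coefs = model.named_steps["classifier"].coef_.flatten()
zipped = zip(feature_names, coefs)
df = pd.DataFrame(zipped, columns=["feature", "value"])
df["abs_value"] = df["value"].apply(lambda x: abs(x))
df["colors"] = df["value"].apply(lambda x: "green" if x > 0 else "red")
df = df.sort_values("abs_value", ascending=False)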
Looks like our bigrams were much more informative than our hand-selected unigrams.
The General Case
So we’ve done some simple examples, but now we want a way to do this for (roughly) any Pipeline and FeatureUnion combination. For that we turn to our old friend Depth First Search (DFS). We are going to view a Pipeline as a tree. Each layer can have an arbitrary number of FeatureUnions, but they will all stack up to a single feature vector in the end. There are roughly three cases to consider when traversing. The first is the base case, where we are in an actual transformer or classifier that will generate our features. The second is when we are in a Pipeline. The third and final case is when we are inside a FeatureUnion. Let’s talk about these in a little more depth.
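Before digging into each case, here’s a rough sketch of the traversal we’re aiming for. The function name and the hasattr fallback in the base case are just placeholders; the base case is exactly what the next section fleshes out:

from sklearn.pipeline import FeatureUnion, Pipeline

def get_feature_names(step):
    # Case 2: a Pipeline. Walk it backwards and take the names from the
    # last step that produces any (the classifier at the end won't).
    if isinstance(step, Pipeline):
        for _, sub_step in reversed(step.steps):
            names = get_feature_names(sub_step)
            if names:
                return names
        return []
    # Case 3: a FeatureUnion. Concatenate the names from every transformer,
    # in transformer_list order.
    if isinstance(step, FeatureUnion):
        names = []
        for _, sub_step in step.transformer_list:
            names.extend(get_feature_names(sub_step))
        return names
    # Case 1: an actual featurizer. This placeholder only covers vectorizers;
    # the helper developed below handles more types.
    if hasattr(step, "get_feature_names"):
        return list(step.get_feature_names())
    return []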
Case 1: Featurization Step
Here we want to write a function which, given a featurizer of some kind, will return the names of the features. This is the base case in our DFS. In Sklearn there are a number of different types of things which can be used for generating features. Some examples are clustering techniques, dimensionality reduction methods, traditional classifiers, and preprocessors, to name a few. Each one lets you access the feature names in a different way. For example, the text preprocessor TfidfVectorizer implements a get_feature_names method, like we saw above. However, most clustering methods don’t have any named features; they are arbitrary clusters, but they do have a fixed number of clusters. Let’s write a helper function that, given a Sklearn featurization method, will return a list of features.
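Here’s one possible sketch of such a helper. The name and the attribute checks are just heuristics that cover vectorizers, clusterers, and dimensionality reduction methods; a real version would grow with the estimators you actually use:

def get_featurizer_names(featurizer):
    # Vectorizers and many preprocessors expose their names directly
    if hasattr(featurizer, "get_feature_names"):
        return list(featurizer.get_feature_names())
    # Clustering methods have no named features, just a fixed number of
    # clusters, so label them generically
    if hasattr(featurizer, "n_clusters"):
        return [f"cluster_{i}" for i in range(featurizer.n_clusters)]
    # Dimensionality reduction methods similarly expose n_components
    if hasattr(featurizer, "n_components"):
        return [f"component_{i}" for i in range(featurizer.n_components)]
    # Give up gracefully when nothing is recoverable
    return []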