# “Approximate-Predictions” Make Feature Selection Radically Faster

## Feature selection is slow because it requires training many models. Find out how to make it blazingly fast thanks to approximate-predictions

When developing a machine learning model, we usually start with a large set of features resulting from our feature engineering efforts.

**Feature selection is the process of choosing a smaller subset of features that are optimal for our ML model**.

Why do that instead of just keeping all the features?

- **Memory**. Big data takes big space. Dropping features means you need less memory to handle your data. Sometimes there are also external constraints.
- **Time**. Retraining a model on less data can save you a lot of time.
- **Accuracy**. Less is more: this also goes for machine learning. Including redundant or irrelevant features means including unnecessary noise. Frequently, a model trained on less data performs better.
- **Explainability**. A smaller model is more easily explainable.
- **Debugging**. A smaller model is easier to maintain and troubleshoot.

Now, the main problem with feature selection is that it is **very slow because it requires training many models**.

In this article, we will see a trick that makes feature selection dramatically faster thanks to “approximate-predictions”.

# A very hard problem

Let’s try to visualize the problem of feature selection. We start with *N* features, where *N* is typically hundreds or thousands.

Thus, the output of feature selection can be seen as an array of length *N* made of “yes”/“no”, where each element of the array tells us whether the corresponding feature is selected or not.

The process of feature selection consists of trying different “candidates” and finally picking the best one (according to our performance metric).

Since we have *N* features and each of them can be either selected or not selected, this means that **we have 2^N possible candidates**.

This number becomes huge quickly. For example, with just 50 features and supposing that evaluating a candidate requires on average 1 second, trying all the possible candidates would require 35 million years!

Thus, it should be clear why, in most practical cases, the number of candidates that are evaluated is just a tiny fraction of all the possible candidates.
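The arithmetic behind that 35-million-year figure is easy to verify (a back-of-the-envelope check, assuming 1 second per candidate):

```
n_features = 50
n_candidates = 2 ** n_features  # 1,125,899,906,842,624 possible feature subsets

# At 1 second per candidate evaluation:
seconds_per_year = 60 * 60 * 24 * 365
years = n_candidates / seconds_per_year
print(f"{years / 1e6:.1f} million years")  # roughly 35.7 million years
```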

# Candidate proposal and evaluation

Many feature selection methods exist, but all of them can be framed as an iterative process made of two steps:

- Proposing a new candidate
- Evaluating the candidate

Usually, all the attention is placed on the first step. The purpose of Step 1 is usually to find candidates that will likely perform well, based on what we have learned so far.

In this article, however, we will completely disregard Step 1 to focus solely on Step 2: candidate evaluation. To this aim, **we will propose new candidates completely at random**. In particular, we will use the following function to propose new candidates:

```
import numpy as np

def get_random_candidate(features):
    """Get a random, non-empty subset of features."""
    candidate_size = np.random.randint(1, len(features) + 1)
    candidate = np.random.choice(features, replace=False, size=candidate_size)
    return candidate
```

Regarding Step 2, we will compare two strategies of evaluation based on different types of predictions:

- Exact-predictions.
- Approximate-predictions.

Don’t worry if you are not familiar with these terms: the next paragraphs will make things clearer.

# Feature selection based on “exact-predictions”

Given a new candidate, all the feature selection methods you have seen so far probably follow this structure:

- Train a model on the data frame made of the training observations and the candidate features.
- Use the model to get the predictions on the validation set.
- Compute the performance metric on the validation set.

Graphically, this is equivalent to:

This approach is based on **“exact-predictions” because we obtain the actual predictions made by the model trained exclusively on the candidate features**.

This process, translated in Python, would look more or less like this:

```
from sklearn.metrics import average_precision_score

for iteration in range(n_iterations):
    # Step 1: Propose a candidate (i.e. a list of features).
    candidate = get_random_candidate(X_train.columns)
    # Step 2: Train a model on those features.
    model = model.fit(X_train.loc[:, candidate], y_train)
    # Step 3: Get the model's predictions (positive-class probabilities)
    # on the validation set.
    preds_valid = model.predict_proba(X_valid.loc[:, candidate])[:, 1]
    # Step 4: Compute the performance metric on the validation set.
    ap_valid = average_precision_score(y_valid, preds_valid)
```

As you can see, a new model is trained at each iteration, making this process very slow.

So, is there a way to exploit the information we have about the features, without having to train a new model at each iteration?

This is where “approximate-predictions” come into play.

# The intuition behind approximate-predictions

To help ourselves understand “approximate-predictions”, let’s use an example dataset from Pycaret (a Python library under MIT license). The dataset is called “Concrete”, and the task is to predict the strength of concrete given some of its features.

Let’s start by dividing the observations into a training set and a validation set.

```
from pycaret.datasets import get_data
from sklearn.model_selection import train_test_split
df = get_data("concrete", verbose=False)
X, y = df.drop("strength", axis=1), df["strength"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y)
```

We can train a model on the training dataset (I will use LightGBM, but any model will work):

```
from lightgbm import LGBMRegressor
model = LGBMRegressor(verbose=-1).fit(X_train, y_train)
```

Since we have a trained model, we can compute the SHAP values (if you don’t know about the topic, you can read my introduction to SHAP values):

```
from shap import TreeExplainer
shap_explainer = TreeExplainer(model)
shap_expected_value = shap_explainer.expected_value
shap_valid = shap_explainer.shap_values(X_valid)
```

We can easily display the SHAP values:

Each SHAP value represents the contribution brought by a feature to the final prediction for that observation. For example, if we take the first feature of the first row (-14.708), this means that that feature lowers the final prediction by 14.708.

**The most important property of SHAP values is that they are additive**. This means that if we sum the SHAP values of the first row (-14.708185 +7.572576 -0.366994 +…), we obtain exactly the prediction made by the model for that row.

This holds true for all the rows. Don’t believe me? You can check it yourself with the following piece of code:

```
import numpy as np
pred_valid = model.predict(X_valid)
pred_valid_proxy = shap_expected_value + shap_valid.sum(axis=1)
assert (np.abs(pred_valid_proxy - pred_valid) < 1e-10).all()
```

This proves that by summing the SHAP values of any row we obtain exactly the model prediction for that row (there is actually a tiny floating-point difference, smaller than 10^-10, which is negligible).

**We can exploit the additive property of SHAP values to simulate the predictions that a model trained on a particular subset of features would produce**.

Let’s say that we want to answer the following question: “What predictions would the model make if it had just the features *Fly Ash*, *Water*, and *Age*?”

SHAP values allow us to answer this question. In fact, since they are additive and take feature interactions into account, **it’s enough to sum the SHAP values relative to those features to obtain our estimate of the predictions that a model trained only on those features would produce**.

```
candidate_features = [...]
# shap_valid is a numpy array, so we select the candidate columns by position.
candidate_idx = [X_valid.columns.get_loc(f) for f in candidate_features]
approx_pred_valid = shap_expected_value + shap_valid[:, candidate_idx].sum(axis=1)
```

Of course, this is just an approximation! If we wanted the exact-predictions, we would need to train a new model specialized on the candidate features only. This is why I am calling the **predictions obtained in this way “approximate predictions”**.
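To build intuition for why summing a subset of additive contributions is a sensible proxy, consider a purely linear model, where per-feature contributions are exactly additive (a toy illustration with synthetic data, not the article’s LightGBM pipeline):

```
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
coef = np.array([3.0, -2.0, 0.5, 1.0])
pred_full = X @ coef  # predictions of the full linear model (no intercept)

# Per-feature contributions: for a linear model these play the role of
# SHAP values, and they are exactly additive.
contributions = X * coef
assert np.allclose(contributions.sum(axis=1), pred_full)

# Approximate the predictions of a model restricted to features 0 and 3
# by summing only those columns of the contribution matrix.
subset = [0, 3]
approx_pred = contributions[:, subset].sum(axis=1)

# In this idealized additive case, the proxy coincides with a linear model
# that uses only the subset features (same coefficients). With feature
# interactions, as in gradient boosting, it becomes an approximation.
exact_pred = X[:, subset] @ coef[subset]
assert np.allclose(approx_pred, exact_pred)
```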

But how are approximate-predictions useful for feature selection?

# Feature selection based on “approximate-predictions”

**Approximate-predictions allow us to simulate any possible candidate of features without having to train a new model**. All we need is the SHAP values of the model trained on all the features.

Can you see why this is a game-changer? **With exact-predictions, we needed to train a new model at each iteration. To obtain approximate-predictions, instead, we just need to sum some columns of a data frame!** This makes the process incredibly fast.

Graphically, this is what happens with approximate-predictions:

Translated in Python:

```
# Step 0: Train a model on the dataset with all the features and compute the
# SHAP values on the validation set (a data frame with one column per feature).
model = model.fit(X_train, y_train)
shap_valid = pd.DataFrame(
    TreeExplainer(model).shap_values(X_valid), columns=X_valid.columns
)

for iteration in range(n_iterations):
    # Step 1: Propose a candidate (i.e. a list of features).
    candidate = get_random_candidate(X_train.columns)
    # Step 2: Get the predictions on the validation set
    # (by summing the SHAP values of the respective features).
    pred_valid = shap_valid.loc[:, candidate].sum(axis=1)
    # Step 3: Compute the performance metric on the validation set.
    ap_valid = average_precision_score(y_valid, pred_valid)
```

As you can see, in this case, just one model is trained at the beginning. Then, at each iteration, we just perform a simple sum of columns, which is of course much faster than training a brand-new model.
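The column-sum trick can be exercised in isolation with synthetic SHAP values (a pandas DataFrame standing in for `shap_valid`; the feature names and numbers here are made up for illustration):

```
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
features = ["age", "water", "cement", "fly_ash"]

# Stand-in for the SHAP values of the full model on the validation set.
shap_valid = pd.DataFrame(rng.normal(size=(100, 4)), columns=features)
expected_value = 10.0

# Evaluating a candidate costs one column sum -- no model is trained.
candidate = ["water", "fly_ash"]
approx_pred = expected_value + shap_valid.loc[:, candidate].sum(axis=1)
assert approx_pred.shape == (100,)
```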

This seems amazing, but we must remember that summing the SHAP values of some features only gives us a *proxy* of the real predictions that we would get if we trained a model exclusively on those features.

So, as with any approximation, the question becomes: **is the approximation good enough for our purposes?**

# Is the proxy good enough?

To answer this question, let’s take another example dataset from Pycaret.

The dataset is called “heart” and contains 15 features:

- AGE_50
- MD_50
- SBP_50
- DBP_50
- HT_50
- WT_50
- CHOL_50
- SES
- CL_STATUS
- MD_62
- SBP_62
- DBP_62
- CHOL_62
- WT_62
- IHD_DX

Using these features, I randomly generated 50 different candidates (i.e. 50 different sets of features). As a performance metric, I used average precision. For each candidate, I tried both the Exact-Prediction and the Approximate-Prediction method and, for both of them, I computed:

- Actual AP: the average precision computed using the exact-predictions.
- Predicted AP: the average precision computed using the approximate-predictions.

We can say that the proxy is good if the Predicted AP is very similar to the Actual AP.

Let’s visualize the 50 candidates with their Predicted AP and Actual AP.

By way of example, I put some labels showing which features are included in that candidate.

Out of curiosity, let’s also visualize how many features every candidate contains.

Predicted AP and Actual AP seem highly correlated: this is good news!

Let’s measure it. I will use the Spearman correlation instead of the Pearson correlation because, in this case, we are more interested in the relative order of the candidates rather than in their linear relationship.
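With `scipy.stats.spearmanr` this is a one-liner (shown here on a handful of illustrative AP pairs, not the article’s actual 50 candidates):

```
import numpy as np
from scipy.stats import spearmanr

# Illustrative (made-up) Predicted AP / Actual AP pairs for five candidates.
predicted_ap = np.array([0.61, 0.72, 0.55, 0.80, 0.67])
actual_ap = np.array([0.63, 0.70, 0.69, 0.83, 0.52])

# Spearman correlates the *ranks*, i.e. the relative ordering of candidates.
rho, p_value = spearmanr(predicted_ap, actual_ap)
print(f"Spearman correlation: {rho:.2f}")
```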

The correlation in this case is 89%, so really high. This is good news for us because it means that **if we select the best candidate using approximate predictions, this is probably also the best candidate (or one of the best candidates) according to exact predictions.**

We can repeat the same procedure for some other datasets available in Pycaret. For each dataset, I randomly draw 50 feature-set candidates and measure both the Predicted AP and the Actual AP.

These are the results:

At a glance, it seems that all the datasets have a strong correlation. This again confirms our intuition.

But let’s do it more rigorously and compute the Spearman correlation between Predicted AP and Actual AP, for each dataset separately:

These numbers confirm our previous impression: the correlation coefficients are always very high, ranging from a minimum of 87% to a maximum of 100%.

So we can conclude that **approximate-predictions are actually a good proxy of exact predictions,** and we can use them to make feature selection much faster and, at the same time, reliable.

# Conclusions

Any feature selection method consists at least of two steps: proposing a new candidate set of features and evaluating that candidate.

In this article, we focused on the second step (evaluation), showing how to leverage SHAP values to obtain “approximate-predictions”. This approach gives us an estimate of the “exact-predictions” that we would obtain if we trained a different model specialized on each set of features.

The benefit is that approximate-predictions are obtained with a simple sum, which makes the evaluation step much faster and allows us to evaluate many more candidates.

We also showed that approximate-predictions are reliable enough, since the performance metric we obtain with them is highly correlated with the performance metric we would obtain using exact-predictions.

*You can reproduce all the code used for this article with this notebook.*

*Thank you for reading!*
