Measuring feature importance, removing correlated features
Linear models like linear regression or logistic regression: examine the fitted coefficients (β) for each input feature. The magnitude of a coefficient reflects the relative importance of its feature (assuming the features are on comparable scales); larger absolute values indicate stronger impacts.
Decision Tree Models: Use information gain/Gini importance
Pros and cons of using Gini importance: Because Gini impurity is used to train the decision tree itself, it is computationally inexpensive to calculate. However, Gini importance is somewhat biased toward numerical features (rather than categorical features). It also does not take the correlation between features into account. For example, if two highly correlated features are both equally important for predicting the outcome variable, one of those features may have low Gini-based importance because all of its explanatory power was ascribed to the other feature. This issue can be mitigated by removing redundant features before fitting the decision tree.
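A minimal sketch of reading Gini-based importances from a single fitted tree, assuming scikit-learn and its bundled breast cancer dataset:

    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

    # feature_importances_ is the normalized total decrease in Gini impurity per feature
    gini_importance = pd.Series(tree.feature_importances_, index=X.columns)
    print(gini_importance.sort_values(ascending=False).head(10))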
Other measures of feature importance:
Aggregate methods
Random forests are an ensemble-based machine learning algorithm that combines many decision trees (each trained on a random subset of the features and observations) to predict the outcome variable. Just as we can calculate Gini importance for a single tree, we can average Gini importance across an entire random forest to get a more robust estimate.
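A sketch of the same idea averaged over an ensemble, again assuming scikit-learn; the per-tree spread gives a rough sense of how stable the estimate is:

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # rf.feature_importances_ is already the Gini importance averaged over all trees
    mean_imp = pd.Series(rf.feature_importances_, index=X.columns)
    std_imp = pd.Series(np.std([t.feature_importances_ for t in rf.estimators_], axis=0),
                        index=X.columns)
    summary = pd.DataFrame({"mean": mean_imp, "std": std_imp})
    print(summary.sort_values("mean", ascending=False).head(10))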
Permutation-based methods
Another way to test the importance of a particular feature is to essentially remove it from the model (one feature at a time) and see how much predictive accuracy suffers. One way to “remove” a feature is to randomly permute its values and re-evaluate the model on the permuted data (a stricter variant drops the column entirely and refits the model). This can be implemented with any machine learning model, including non-tree-based methods. One potential drawback is computational cost: the model must be re-scored for every feature and every repeat (and refit, in the drop-column variant).
Tree-based models provide an alternative measure of feature importances based on the mean decrease in impurity (MDI). Impurity is quantified by the splitting criterion of the decision trees (Gini, Log Loss or Mean Squared Error). However, this method can give high importance to features that may not be predictive on unseen data when the model is overfitting. Permutation-based feature importance, on the other hand, avoids this issue, since it can be computed on unseen data.
Furthermore, impurity-based feature importance for trees is strongly biased: it favors high-cardinality features (typically numerical features) over low-cardinality features such as binary features or categorical variables with a small number of possible categories.
Permutation-based feature importance does not exhibit such a bias. Additionally, permutation feature importance may be computed with any performance metric on the model predictions and can be used to analyze any model class (not just tree-based models).
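A hedged sketch using scikit-learn's permutation_importance on a held-out test set (the model choice and n_repeats here are arbitrary):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

    # Each feature is shuffled n_repeats times; the drop in test score is its importance.
    result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
    ranked = sorted(zip(X.columns, result.importances_mean, result.importances_std),
                    key=lambda row: row[1], reverse=True)
    for name, mean, std in ranked[:10]:
        print(f"{name}: {mean:.3f} +/- {std:.3f}")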
When two features are correlated and one of them is permuted, the model can still access that information through the correlated feature. This results in lower importance values for both features, even though they might actually be important.
One way to handle this is to cluster features that are correlated and only keep one feature from each cluster.
Coefficients
When we fit a general(ized) linear model (for example, a linear or logistic regression), we estimate coefficients for each predictor. If the original features were standardized, these coefficients can be used to estimate relative feature importance; larger absolute value coefficients are more important. This method is computationally inexpensive because coefficients are calculated when we fit the model. It is also useful for both classification and regression problems (i.e., categorical and continuous outcomes). However, similar to the other methods described above, these coefficients do not take highly correlated features into account.
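A minimal sketch, assuming scikit-learn: standardize the features first so the coefficient magnitudes are comparable:

    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)).fit(X, y)

    # With standardized inputs, |coefficient| serves as a rough importance ranking
    coefs = pd.Series(pipe.named_steps["logisticregression"].coef_[0], index=X.columns)
    print(coefs.abs().sort_values(ascending=False).head(10))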
Feature importance is an important part of the machine learning workflow and is useful for feature engineering and model explanation, alike!
Removing Correlated features:
One way to handle multi-collinear features is by performing hierarchical clustering on the Spearman rank-order correlations, picking a threshold, and keeping a single feature from each cluster.
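A sketch of that procedure, loosely following scikit-learn's multicollinearity example; the distance threshold of 1.0 is an arbitrary choice you would normally tune:

    import numpy as np
    from scipy.cluster import hierarchy
    from scipy.spatial.distance import squareform
    from scipy.stats import spearmanr
    from sklearn.datasets import load_breast_cancer

    X, _ = load_breast_cancer(return_X_y=True, as_frame=True)

    corr, _ = spearmanr(X)                 # Spearman rank-order correlation matrix
    corr = (corr + corr.T) / 2             # enforce symmetry
    np.fill_diagonal(corr, 1.0)

    # Turn correlations into distances and cluster hierarchically
    distance = 1 - np.abs(corr)
    linkage = hierarchy.ward(squareform(distance))
    cluster_ids = hierarchy.fcluster(linkage, t=1.0, criterion="distance")

    # Keep the first feature encountered in each cluster
    selected = {}
    for feature, cid in zip(X.columns, cluster_ids):
        selected.setdefault(cid, feature)
    print(sorted(selected.values()))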
Another approach for linear models is to use L1 regularization and remove features with low coefficients.
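A sketch of the L1 route with scikit-learn; the regularization strength C is an arbitrary value you would normally pick by cross-validation:

    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
    ).fit(X, y)

    # L1 regularization drives coefficients of redundant features to exactly zero
    coefs = pd.Series(pipe.named_steps["logisticregression"].coef_[0], index=X.columns)
    print("Dropped features:", list(coefs[coefs == 0].index))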
What about Neural Nets / Deep learning models?
Approaches like permutation importance work for DL models as well.
For image/NLP models we can also compute gradient-based saliency: keep the model weights frozen, take the model's prediction (or a loss defined on it), and backpropagate the gradient all the way back to the model inputs. The magnitude of the input gradients then tells you which parts of the input are responsible for the prediction.
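A minimal PyTorch sketch of that idea (input-gradient saliency); the tiny model and random image below are stand-ins for a real trained classifier and real input:

    import torch
    import torch.nn as nn

    # Stand-in classifier; in practice this would be your trained model
    model = nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
    )
    model.eval()
    for p in model.parameters():           # freeze the weights
        p.requires_grad_(False)

    image = torch.randn(1, 3, 64, 64, requires_grad=True)   # stand-in input
    scores = model(image)
    scores[0, scores.argmax()].backward()  # backprop the top-class score to the input

    # Large input-gradient magnitudes mark the pixels driving the prediction
    saliency = image.grad.abs().max(dim=1).values
    print(saliency.shape)                  # torch.Size([1, 64, 64])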
SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions.
SHAP values are used when you have a complex model that takes some features as input and produces predictions as output. This could be a gradient boosting model, a neural network, or anything else.
The value represents the feature’s contribution to the prediction. Features with positive values have a positive impact on the prediction, while those with negative values have a negative impact. The magnitude of the value measures how strong the effect is.
SHAP values are calculated relative to a background sample of the data, so SHAP requires multiple observations, while LIME only requires the single observation being explained.
Paper: https://papers.nips.cc/paper_files/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf
Summary from the paper:
We present a unified framework for interpreting predictions, SHAP (SHapley Additive exPlanations). SHAP assigns each feature an importance value for a particular prediction. Its novel components include: (1) the identification of a new class of additive feature importance measures, and (2) theoretical results showing there is a unique solution in this class with a set of desirable properties. The new class unifies six existing methods, notable because several recent methods in the class lack the proposed desirable properties. Based on insights from this unification, we present new methods that show improved computational performance and/or better consistency with human intuition than previous approaches.
Let f be the original prediction model to be explained and g the explanation model. Here, we focus on local methods designed to explain a prediction f(x) based on a single input x.
Explanation models often use simplified inputs x′ that map to the original inputs through a mapping function x = hx(x′). Local methods try to ensure g(z′) ≈ f(hx(z′)) whenever z′ ≈ x′.
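In the paper, the explanation model g is restricted to the class of additive feature attribution methods, where the simplified input z′ ∈ {0, 1}^M indicates which of the M features are present:
g(z′) = φ0 + Σ φi z′i (sum over i = 1, …, M), where φi ∈ ℝ is the attribution (importance) assigned to feature i and φ0 is the base value when all features are absent.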
To get an overview of which features are most important for a model, we can plot the SHAP values of every feature for every sample. The SHAP summary (beeswarm) plot sorts features by the sum of SHAP value magnitudes over all samples and uses the SHAP values to show the distribution of the impact each feature has on the model output. The color represents the feature value (red high, blue low).
Examples on how to use the SHAP library here:
https://christophm.github.io/interpretable-ml-book/shapley.html#shapley
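A minimal sketch with the shap library's TreeExplainer, assuming a fitted tree-based regressor (for classifiers, shap_values returns one set of values per class):

    import shap
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)   # shape (n_samples, n_features)
    shap.summary_plot(shap_values, X)        # beeswarm: red = high feature value, blue = low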
LIME:
The idea is quite intuitive. First, forget about the training data and imagine you only have the black box model where you can input data points and get the predictions of the model. You can probe the box as often as you want. Your goal is to understand why the machine learning model made a certain prediction. LIME tests what happens to the predictions when you feed variations of your data into the machine learning model. LIME generates a new dataset consisting of perturbed samples and the corresponding predictions of the black box model. On this new dataset LIME then trains an interpretable model, which is weighted by the proximity of the sampled instances to the instance of interest. The learned model should be a good approximation of the machine learning model predictions locally, but it does not have to be a good global approximation.
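A sketch with the lime package's LimeTabularExplainer, explaining one prediction of a random forest; num_features is an arbitrary choice:

    from lime.lime_tabular import LimeTabularExplainer
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

    explainer = LimeTabularExplainer(
        X_train,
        feature_names=list(data.feature_names),
        class_names=list(data.target_names),
        mode="classification",
    )
    # Perturb the instance, get black-box predictions, fit a locally weighted linear model
    exp = explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
    print(exp.as_list())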