Finding and Visualising Interactions
Analysing interactions using feature importance, Friedman’s H-statistic and ICE Plots
The side effects of medication can depend on your gender. Inhaling asbestos increases the chance of lung cancer more for smokers than non-smokers. For moderates and liberals, acceptance of climate change tends to increase with higher levels of education, while the opposite is true for the most conservative. These are all examples of interactions in data. Identifying and incorporating these can drastically improve the accuracy and change the interpretation of your models.
In this article, we explore different ways of analysing interactions in your dataset. We discuss how to use scatter plots and ICE Plots to visualise them. We then move on to methods of finding and highlighting potential interactions. These include feature importance and Friedman’s H-statistic. You can find the R code used for this analysis on GitHub. Before we start, it’s worth explaining exactly what we mean by an interaction.
What are interactions?
We say that a feature is predictive when it has some sort of relationship with the target variable. For example, the price of a car may decrease as the car ages. Age (feature) can be used in a model to predict car price (target variable). In some cases, the relationship between the target variable and feature depends on the value of another feature. This is known as an interaction between features.
Take, for example, the relationship between age and car price in Figure 1. Here we have a second feature: car type. A car can either be a classic car (classic=1) or just a regular car (classic=0). For regular cars, price decreases with age but, for classic cars, price actually increases with age. The relationship between price and age depends on the car type. In other words, there is an interaction between age and car type.
Incorporating interactions like these can improve the accuracy of our models. Non-linear models, like Random Forests, can automatically model interactions. We can simply include age and car type as features and the model will incorporate interactions into its predictions. For linear models, like Linear Regression, we have to add explicit interaction terms. To do this we first need to know what interactions are present in our data.
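As a small illustration of what an explicit interaction term looks like, the R formula interface lets you add one directly to a linear model. The data frame and column names below (cars, price, age, classic) are hypothetical and just mirror the example above:

```r
# Hypothetical sketch: an explicit interaction term in a linear model.
# Assumes a data frame `cars` with columns price, age and classic (0/1).
model <- lm(price ~ age + classic + age:classic, data = cars)

# Equivalently, age*classic expands to both main effects plus the interaction.
model <- lm(price ~ age * classic, data = cars)
summary(model)
```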
Dataset
To explain the techniques, we have randomly generated a dataset with 1000 rows. The dataset includes the 5 features listed in Table 1. These are used to predict an employee’s annual bonus. We have designed the dataset so there is an interaction between experience and degree and between performance and sales. days_late does not interact with any of the other features.
The two interactions are different due to the nature of the relevant features. Degree is categorical and experience is continuous. So, we have an interaction between a categorical and continuous feature. For the other interaction, we have two continuous features. We will see that we still analyse these interactions in the same way.
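The actual generation code is in the GitHub repository linked above. As a rough sketch, a dataset with this structure could be simulated along the following lines; the distributions and coefficients here are purely illustrative assumptions, not the ones used for the article:

```r
set.seed(42)

n <- 1000
experience  <- runif(n, 0, 40)     # years of experience
degree      <- rbinom(n, 1, 0.5)   # 1 = has a degree, 0 = no degree
performance <- runif(n, 0, 10)     # performance rating
sales       <- runif(n, 0, 100)    # sales made
days_late   <- rpois(n, 3)         # number of days late

# Bonus increases with experience only via degree, and with sales via
# performance. days_late has a main effect but no interaction.
# Coefficients are illustrative only -- the real generation code is on GitHub.
bonus <- 100 * experience * degree +
         20 * sales * (performance / 10) -
         50 * days_late +
         rnorm(n, sd = 500)

data <- data.frame(experience, degree, performance, sales, days_late, bonus)
```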
Visualising interactions
We can start by visualising these interactions using simple scatter plots. In Figure 2, we see the interaction between experience and degree. If an employee has a degree their bonus tends to increase with experience. In comparison, when an employee does not have a degree, there is no relationship between these features. If this was a real dataset we would expect there to be an intuitive explanation for this. For example, educated workers may take on roles where experience is more valued.
Similarly, we can see the interaction between sales and performance in Figure 3. In this case, it may not be as clear. We now have a gradient colour scheme where darker points indicate lower performance ratings. In general, bonus tends to increase with sales. Taking a closer look you will notice that the lighter points have a steeper slope. For higher performance ratings, there will be a larger increase in bonus for an increase in sales.
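Scatter plots like Figures 2 and 3 could be produced with ggplot2, for example as below (assuming the simulated data frame sketched earlier):

```r
library(ggplot2)

# Figure 2: bonus vs experience, coloured by degree
ggplot(data, aes(x = experience, y = bonus, colour = factor(degree))) +
  geom_point()

# Figure 3: bonus vs sales, darker points = lower performance rating
ggplot(data, aes(x = sales, y = bonus, colour = performance)) +
  geom_point()
```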
Visualising interactions this way can be intuitive but it will not always work. We are visualising the relationship between the target variable and only two features. In reality, the target variable may have relationships with many features. This, and the presence of statistical variation, means the scatter plot points will be spread around the underlying trends. We can already see this in the charts above and, in a real dataset, this will be even worse. Ultimately, to clearly see interactions we need to strip out the effect of other features and statistical variation.
ICE Plots
This brings us to Individual Conditional Expectation (ICE) plots. To create an ICE plot we start by fitting a model to our data. In our case, we have used a Random Forest with 100 trees. In Table 2, we have two rows from the dataset used to train the model. In the last column, we can see the predicted bonus for each of the employees. That is the prediction made by the Random Forest given the feature values. To build the plot, we vary the value of one feature while holding the others constant and plot the resulting predictions.
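Fitting the forest itself is straightforward. A minimal sketch, assuming the simulated data frame from earlier (the actual code is on GitHub):

```r
library(randomForest)

# Random Forest with 100 trees, as used for the ICE plots
rf <- randomForest(bonus ~ ., data = data, ntree = 100)

# Predicted bonus for each employee (the last column of Table 2)
head(predict(rf, data))
```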
Looking at Figure 4, this may make more sense. Here we have taken the two employees in Table 2. We have plotted the predicted bonus for each possible value of days_late while keeping the original values of the other features (i.e. experience remains at 31 and 35 years for the first and second employee respectively). The two black points correspond to the actual predictions in Table 2 (i.e. for their real days_late values).
Finally, to obtain the ICE Plot we follow this process for every row in our dataset. We also center each line so that it starts at 0 on the y-axis. The bold line gives what is known as the partial dependence plot (PDP). This is the average of the centered predictions at each days_late value. Looking at the PDP, as days_late increases, the predicted bonus tends to decrease. We can also see that most of the individual predictions follow this trend. If days_late interacted with another feature we would not expect this; instead, we would see groups of predictions following different trends.
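One way to produce a centered ICE plot with the PDP overlaid is the iml package. A sketch, assuming the Random Forest fitted above:

```r
library(iml)

features <- c("experience", "degree", "performance", "sales", "days_late")

# Wrap the fitted forest and data for the iml package
predictor <- Predictor$new(rf, data = data[, features], y = data$bonus)

# Centered ICE curves plus the bold PDP line for days_late
ice_days_late <- FeatureEffect$new(predictor,
                                   feature   = "days_late",
                                   method    = "pdp+ice",
                                   center.at = min(data$days_late))
plot(ice_days_late)
```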
You can see what we mean by looking at the ICE Plot for experience in Figure 6. Here there are two distinct trends: employees for whom the predicted bonus increases with experience and those for whom it does not. By colouring the plot by degree (i.e. blue for degree and red otherwise), you can clearly see that this is due to the interaction between experience and degree.
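To colour each curve by a second feature, it can be easiest to build the ICE curves by hand. The sketch below is one possible way of doing this, assuming the model and feature names from earlier:

```r
library(ggplot2)

# Manual ICE curves for experience, coloured by degree
grid <- seq(min(data$experience), max(data$experience), length.out = 30)

ice_df <- do.call(rbind, lapply(seq_len(nrow(data)), function(i) {
  newdata <- data[rep(i, length(grid)), features]  # hold other features fixed
  newdata$experience <- grid                       # vary experience only
  data.frame(id = i, experience = grid,
             degree = data$degree[i],
             yhat = predict(rf, newdata))
}))

# Center each curve so every line starts at 0 on the y-axis
ice_df$yhat_c <- ave(ice_df$yhat, ice_df$id, FUN = function(y) y - y[1])

ggplot(ice_df, aes(experience, yhat_c, group = id, colour = factor(degree))) +
  geom_line(alpha = 0.3) +
  labs(colour = "degree", y = "centered prediction")
```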
We can create a similar plot for the sales-performance interaction. Here the line is coloured blue if the employee’s performance rating is above 5 and red if it is below 5. The predicted bonus tends to increase for all employees but at a slower rate for those with lower performance ratings. For both ICE Plots, the interactions are clearer than when using their corresponding scatter plots.
These plots are powerful as, by holding the other feature values constant, we can focus on the trends of one feature. That is, how predictions change due to changes in this feature. Additionally, the Random Forest models the underlying trends in the data and makes predictions using these trends. Hence, as we are plotting predictions, we are able to strip out the effect of statistical variation.
Getting the most out of ICE Plots
We have used a Random Forest but ICE Plots are actually a model-agnostic technique. This means we can use any model when creating them. However, the model should be non-linear (e.g. XGBoost, a neural network). Linear models cannot model interactions in the way necessary to create these plots. The exact choice of model is not critical but, depending on your dataset, different models may be better at capturing the underlying interactions.
The accuracy of the model you use is also not that important. The goal is to visualise interactions and not make accurate predictions. However, the better your model the more reliable your analysis will be. An underfitted model may not capture the interactions and an overfitted model may present interactions that are not actually there. Ultimately, it is important to test your model using k-fold cross-validation or a test set. For example, you can see the plot of predicted vs actual bonus values for our Random Forest in Figure 8. The model is not perfect but we are able to capture the underlying trends.
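A quick predicted-vs-actual check along the lines of Figure 8 could look like the sketch below, using the forest’s out-of-bag predictions so we are not evaluating it on its own training fit:

```r
# Sense-check the forest: predicted vs actual bonus (cf. Figure 8)
# Calling predict() on a randomForest with no newdata returns out-of-bag predictions
plot(data$bonus, predict(rf),
     xlab = "Actual bonus", ylab = "Predicted bonus (out-of-bag)")
abline(0, 1, col = "red")
```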
Just using ICE Plots may not be enough to find interactions. Depending on the size of your dataset, the number of possible interactions might be large. For example, if you have 20 features you will have 190 possible pairwise interactions. Visualising and trying to analyse all these ICE plots will be incredibly tedious. So we need a way of highlighting and narrowing down our search. In the rest of the article we discuss how we can do this using feature importance, Friedman’s H-statistic and domain knowledge.
Finding interactions
Feature importance
Feature importance is a score based on how much a particular feature has improved the accuracy of a model. If we include interaction terms in our dataset, we can calculate the feature importance for these terms. We do this by adding the pairwise product of each pair of features (e.g. experience*degree). We then train a model using both the original and interaction features and calculate the resulting feature importance.
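A sketch of this procedure, assuming the simulated data frame and feature names from earlier:

```r
# Add the 10 pairwise product terms alongside the 5 original features
X <- data[, features]
for (pair in combn(features, 2, simplify = FALSE)) {
  X[[paste(pair, collapse = ".")]] <- X[[pair[1]]] * X[[pair[2]]]
}
X$bonus <- data$bonus

# importance = TRUE enables the permutation importance (%IncMSE) calculation
rf_int <- randomForest(bonus ~ ., data = X, ntree = 100, importance = TRUE)
importance(rf_int, type = 1)   # %IncMSE for original and interaction terms
```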
In Figure 9, you can see the feature importance of the 10 interaction features and 5 original features. Here we have used a Random Forest as our model and the percentage increase in MSE as our feature importance score. We can see that the experience-degree and sales-performance interaction terms have the highest importance. This suggests there is an interaction within each of these pairs of features.
You may also notice that some of the other interaction terms are also important (e.g. experience.sales). We would not have expected this as, when generating the dataset, we did not include an interaction between these two features. Figure 10, below, helps to explain why we are getting this result. Notice that both experience and sales have a positive relationship with bonus. This means the product of these features also has a positive relationship with bonus.
This highlights a disadvantage of this method. The effect of a feature on a prediction can be broken down into two parts. The first is the effect it has directly on the prediction (i.e. main effect). The second is the effect it has through interactions with other features (i.e. interaction effect). The experience.sales interaction term has a high feature importance because of the main effect of the two individual features. So, we need a way of isolating the interaction effect from the main effect.
Friedman’s H-statistic
Friedman’s H-statistic does exactly that. To give an overview of how it is calculated: we start by fitting a model. In our case, we use the same Random Forest used to create the ICE Plots. We then compare the observed partial dependence function to the partial dependence function under the assumption that there are no interactions. Large differences between the two functions suggest that there are interactions.
There are two versions of the statistic. The first gives a measure of a feature’s effect through interactions with all other features. You can see the values for this statistic in Figure 11. A value of 1 means that a feature only has an effect on the prediction through interactions (i.e. there is no main effect). A value of 0 means there is no interaction (i.e. only main effect). For experience, we have an H-statistic of 0.28. We can interpret this as meaning 28% of the effect of experience comes through this feature’s interaction with the other features.
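The iml package provides an implementation of this statistic. A minimal sketch of the overall (one-vs-all) version, reusing the predictor object from the ICE plot section:

```r
# Overall H-statistic: how much of each feature's effect comes from interactions
overall_h <- Interaction$new(predictor)
plot(overall_h)
```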
The second version of the H-statistic gives a measure of the interaction between two specific features. The first chart in Figure 12 gives the H-statistic for experience paired with each of the other features. We can see that the interaction between degree and experience is the most significant. Similarly, the second chart gives the H-statistics for sales. Again, as expected, we can see that the interaction with performance is the most significant.
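The pairwise version takes a feature argument, so charts like those in Figure 12 could be produced along these lines:

```r
# Two-way H-statistics: how strongly experience and sales interact with each other feature
h_experience <- Interaction$new(predictor, feature = "experience")
plot(h_experience)

h_sales <- Interaction$new(predictor, feature = "sales")
plot(h_sales)
```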
The idea is to first use the overall H-statistic to get an idea of which features have interactions. We can then use the charts of the second H-statistic to identify the other features they are interacting with. No statistic is perfect and this process may not always work. You can see that the overall H-statistic for sales is quite low. It is only 0.11. This is close to the H-statistic of days_late, which has no interactions. So, following this process, we may have decided not to analyse sales further and missed the interaction.
Domain Knowledge
As we’ve seen above, relying solely on these methods we may miss some interactions or identify interactions that are not actually there. That is why it is important to incorporate any domain knowledge you have of the field into the process. You may already know of some interactions that you can confirm using these techniques. You should also sense-check any new interactions that are found. They should have an intuitive explanation for why they exist. Hopefully, by using a combination of domain knowledge and these statistical techniques you’ll be able to find some useful interactions.
Image Sources
All images are my own or obtained from www.flaticon.com. In the case of the latter, I have a “Full license” as defined under their Premium Plan.