Summary

The author has applied Sobol and SHAP sensitivity analysis methods to a California Housing Prices dataset model to determine feature importance, finding inconsistencies between the two methods' results, particularly in the ranking of less impactful features.

Abstract

The author recently employed both Sobol and SHAP sensitivity analysis techniques on a RandomForestRegressor model for predicting housing prices in California, aiming to validate the most influential features. Despite using a well-established dataset and a robust model, the author encountered discrepancies in the feature importance scores from the two methods. Sobol indicated that MedInc, Latitude, and Longitude were highly impactful, while SHAP confirmed MedInc's importance but highlighted AveOccup as the second most influential factor. The author has investigated the correct usage of both methods and sought theoretical explanations for the differences, referencing discussions on SHAP's GitHub and a relevant paper. Although the author acknowledges a preference for SHAP's explanations, the search for a definitive understanding of the discrepancies continues, with plans to re-evaluate the application of SHAP to the dataset.

Opinions

The author believes that using two different sensitivity analysis methods should yield more convincing results but was faced with inconsistencies that necessitated further investigation.
There is a clear expectation that the most impactful feature, MedInc, should be consistently identified by both methods, which was indeed the case.
The author questions the reasonableness of Sobol's results, which ranked Latitude and Longitude higher than HouseAge and AveRooms, features that intuitively seem more directly related to housing prices.
The author admits to not fully understanding the theoretical explanations provided in the SHAP GitHub discussions and the referenced paper, but leans towards SHAP's results being more convincing.
Acknowledging the complexity of the issue, the author plans to revisit the application of SHAP to their dataset, suggesting a need for a more nuanced understanding of how these methods should be applied and interpreted.

Differences between Sobol and SHAP Sensitivity Analysis on Housing Prices Predictions

I recently applied both Sobol and SHAP to my model to get the most impactful feature, because I thought using two methods could be more convincing. However, I got inconsistent importance scores from the two methods. Consequently, I have to figure out what is wrong; life is hard.

Test on a simple dataset(housing prices) to check if I use the methods correctly
Searching for theoretical explanation according to the result
Back to my problem/dataset/model to examine the differences

1. Test on a simple dataset

(1) A simple housing price prediction model

Dataset

I used the California Housing Prices dataset, since the classical Boston Housing dataset is unethical, and the California dataset is suggested in the below error notice.

This dataset was derived from the 1990 U.S. census, using one row per census group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

There are eight features, and the price is expressed in million.

# Import dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
data = pd.DataFrame(housing.data)
data.columns = housing.feature_names
data.head()
#Adding target variable to dataframe
data['PRICE'] = housing.target 
data.head()

The correlations show that the most impactful feature is MedInc, followed by AveRooms and HouseAge, which make sense.

Model (RandomForestRegressor)

# Spliting target variable and independent variables
X = data.drop(['PRICE'], axis = 1)
y = data['PRICE']

# Splitting to training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 4)

# Import Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

# Create a Random Forest Regressor
reg = RandomForestRegressor()

# Train the model using the training sets 
reg.fit(X_train, y_train)

# Model prediction on train data
y_pred = reg.predict(X_train)

# Model Evaluation
print('R^2:',metrics.r2_score(y_train, y_pred))
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_train, y_pred))
print('MSE:',metrics.mean_squared_error(y_train, y_pred))

# Predicting Test data with the model
y_test_pred = reg.predict(X_test)

# Model Evaluation - Test set
acc_rf = metrics.r2_score(y_test, y_test_pred)
print('R^2:', acc_rf)
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_test_pred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_test, y_test_pred))
print('MSE:',metrics.mean_squared_error(y_test, y_test_pred))

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

(2) Sobol — SALib

# pip install SALib

from SALib.sample import saltelli
from SALib.analyze import sobol
from SALib.test_functions import Ishigami
import numpy as np

# Define the model inputs
problem = {
    'num_vars': 8,
    'names': ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude'],
    'bounds': [[0.49990, 15.000100],
               [1.000000, 52.000000],
               [0.888889, 132.533333],
               [0.333333,34.066667],
               [3.000000,35682.000000  ],
               [0.692308,1243.333333 ],
               [32.550000,41.950000],
               [-124.350000,-114.470000],
               ]
}

# Generate samples
param_values = saltelli.sample(problem, 2**14)

# Run model (example)
Y = reg.predict(param_values)
# Perform analysis
Si = sobol.analyze(problem, Y)
total_Si, first_Si, second_Si = Si.to_df()

MedInc is impactful, Latitude and Longitude are more impactful than house age and AveRooms are not reasonable.

(3) Shapley Value — SHAP

import shap

shap.initjs()
model = reg
# explain the model
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[0:120])
# visualize the impact of each features
shap.summary_plot(shap_values, X[0:120])

SHAP gives similar results that MedInc is important, and Latitude and Longitude are also more impactful than HouseAge. The difference is AveOccup is the second impactful factor.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — -

(4) The observations

The most impactful factor is obvious and the method/libraries are used correctly. The results indicate MedInc and location are crucial.

There are differences among the rest factors. Latitude is more influential than longitude based on Sobol, and AveOccup is more impactful according to SHAP.

I overlook one preparation: I should find a dataset with proven feature importance. Thankfully, although I am not sure if the results are correct, the differences exist based on the above comparison.

I also wonder if there is a way to use the data from the dataset instead of randomly drawn from a range.

2. Searching for a theoretical explanation

I found the thread in the SHAP GitHub issue page. I googled the differences between Sobol and SHAP, and that paper mentioned in their discussion was also at the top of search results.

https://github.com/slundberg/shap/issues/1382

https://artowen.su.domains/reports/sobolshapley.pdf

To be honest, I don’t understand Slundberg’s comment and the paper. But it seems SHAP is more convincing.

3. Back to my problem/dataset/model

According to the housing price dataset analysis, the differences in the importance scores between these two methods seem small because the top impactful feature is consistent.

This is not my case; I’d like to check how I used SHAP on my dataset again tomorrow.

BECOME a WRITER at MLearning.ai

Mlearning.ai Submission Suggestions

How to become a writer on Mlearning.ai

medium.com