Shapley values and SHAP

Marc Deveaux

Background on Explanation models

  • The best explanation of a simple model is the model itself (white box models)
  • For complex models (black box models), we cannot use the original model as its own explanation because it is too hard to understand
  • Instead, we can use a simpler explanation model which is an interpretable approximation of the original model
  • Explanation models can focus on Local explanation (a single instance) or Global explanation (explain the overall model behavior)
  • There are multiple explainers, the most famous being LIME and Shapley values
  • Both are additive feature attribution methods: they attribute an effect to each feature (like a coefficient in a linear regression), and summing the effects of all feature attributions approximates the output of the original model
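
Formally, an additive feature attribution method uses an explanation model g that is a linear function of binary variables indicating which features are present:

```latex
g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i , \qquad z' \in \{0, 1\}^M
```

where M is the number of features, z'_i = 1 when feature i is present, and \phi_i is the attribution of feature i.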

Why use Shapley Values

Shapley Values come from coalitional game theory and tell us how to fairly distribute a payout (i.e. how much a single prediction differs from the average prediction) among the players (i.e. the feature values of that instance)

A fair payout is defined by the following properties:

  • Symmetry: give equal features equal payout. This means we give the same credit to two features that are completely interchangeable
  • Dummy: give zero credit to irrelevant features
  • Additivity: values can be added across games. If your model prediction is the sum of 2 component models, then the Shapley values of your model are the sum of the Shapley values of your component models. This is useful for analyzing ensemble models, because you can calculate the Shapley values of every component decision tree and sum them up to analyze the behavior of the overall model
  • Efficiency: contributions must add up to the payout. If you sum the Shapley values across all feature values of a single instance, you get the difference between that instance’s prediction and the average prediction. Therefore, Shapley values are in the same units as the prediction: if your prediction is a price in EUR, the Shapley values of all the variables are expressed in EUR (rather than in each variable’s own scale)
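
The efficiency property, written out for an instance x with prediction \hat{f}(x) and average prediction \mathbb{E}[\hat{f}(X)]:

```latex
\sum_{j=1}^{M} \phi_j = \hat{f}(x) - \mathbb{E}\bigl[\hat{f}(X)\bigr]
```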

Those properties characterize additive feature attribution methods, and it turns out that the Shapley value is the only attribution method that satisfies all of them. Therefore, it can be used as the “unification” of the other interpretable ML methods, which only possess some of those properties.

Shapley value concept

Shapley value: the average marginal contribution of a feature value across all possible coalitions (i.e. all possible combinations of the other features)

Analogy: the feature values enter a room in random order. All feature values in the room participate in the game (i.e. contribute to the prediction). The Shapley value of a feature value is the average change in the prediction that the coalition already in the room receives when that feature value joins them

Example: a model predicts apartment prices, and the average prediction is 310,000 EUR. I have an apartment on the 2nd floor, 50 m2, no cats allowed, next to a park. The prediction for this apartment was 300,000 EUR. Shapley values answer the question: how much did each feature value contribute to this prediction compared to the average prediction?

For example, in linear regression, if you want to know the impact of feature j for a single instance, you take the feature effect (the coefficient multiplied by the instance’s value of feature j) minus the average feature effect. The Shapley value does the same thing.
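
For a linear model \hat{f}(x) = \beta_0 + \beta_1 x_1 + ... + \beta_M x_M, this contribution of feature j to a single prediction can be written as:

```latex
\phi_j(\hat{f}, x) = \beta_j x_j - \beta_j \, \mathbb{E}[X_j]
```

where \beta_j is the coefficient of feature j, x_j the instance’s value of feature j, and \mathbb{E}[X_j] the mean of feature j in the data.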

Shapley Value formula
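
For a model with feature set F, the Shapley value of feature j for an instance x is the marginal contribution of j averaged over all subsets S of the remaining features, weighted by the number of orderings in which exactly the features in S enter before j:

```latex
\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{j\}}\!\left(x_{S \cup \{j\}}\right) - f_{S}\!\left(x_{S}\right) \right]
```

where f_S denotes a model trained using only the features in S.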

Calculating Shapley Values

A good article on how to calculate classic Shapley values is https://www.aidancooper.co.uk/how-shapley-values-work/

To summarize:

  • we create the set of all possible feature combinations
  • for each combination, we train a model using the selected features
  • the constant (i.e. base value) is the same for each model and is the average prediction value

Example of an instance named house1 from a house price prediction model with 3 features A, B and C

  • MC stands for Marginal Contribution and is the difference between the predictions of two models when feature A is added. For example, when feature A is added to model(B,C) to form model(A,B,C), the predicted price increases by $0.6K
  • To calculate the Shapley value of feature A for house1, we take the weighted sum of these marginal contributions. The weights come from the number of connections at each layer of the diagram, and they sum to 1
  • Do the same thing for the Shapley values of B and C; summing them with the base value recovers the prediction of house1 (see the sketch below)
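
A minimal sketch of this brute-force procedure, assuming a made-up dataset with three features A, B and C and a linear model (the data, model choice and feature names are illustrative only):

```python
# Exact Shapley value of one feature for one instance, computed by retraining
# a model on every feature subset. Hypothetical data and model for illustration.
from itertools import combinations
from math import factorial

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["A", "B", "C"])
y = 3 * X["A"] + 2 * X["B"] + X["C"] + rng.normal(scale=0.1, size=200)

def predict_with_subset(features, x_row):
    """Train a model restricted to `features` and predict for one instance.
    The empty subset falls back to the base value (the average prediction)."""
    if not features:
        return y.mean()
    model = LinearRegression().fit(X[list(features)], y)
    return model.predict(x_row[list(features)].to_frame().T)[0]

def exact_shapley(feature, x_row, all_features=("A", "B", "C")):
    others = [f for f in all_features if f != feature]
    n = len(all_features)
    value = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            # Weight of a coalition of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            marginal = (predict_with_subset(subset + (feature,), x_row)
                        - predict_with_subset(subset, x_row))
            value += weight * marginal
    return value

house1 = X.iloc[0]
shapley_values = {f: exact_shapley(f, house1) for f in ["A", "B", "C"]}
# Efficiency check: base value + sum of Shapley values ≈ full-model prediction
print(shapley_values, y.mean() + sum(shapley_values.values()))
```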

Computation issue and Approximation

In practice you need an approximation, because it would take too much time to retrain the model every single time. “All possible coalitions (sets) of feature values have to be evaluated with and without the j-th feature to calculate the exact Shapley value. For more than a few features, the exact solution to this problem becomes problematic as the number of possible coalitions exponentially increases as more features are added”. To calculate exact Shapley values this way, you must retrain your machine learning model 2^F times (where F is the number of features)!

So for each family of algorithms, you need tricks to replace the features that are supposed to be masked. One method is to approximate by extrapolating: each missing feature of the coalition is substituted with a random value drawn from the training dataset for that feature. This means that you potentially create some “Frankenstein” instances that don’t make sense (the example given was a baby earning a salary of 100,000 USD, because the salary was extrapolated).

Before SHAP, one of the popular methods for approximating Shapley values was Monte Carlo permutation sampling (Štrumbelj and Kononenko, “Explaining prediction models and individual predictions with feature contributions”, 2014). One drawback of this method is that you need a large number of samples, as sketched below.
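
A sketch of this permutation sampling idea. It assumes an already fitted `model` with a scikit-learn-style `predict` method, a pandas DataFrame `X` of training data, and a Series `x_row` for the instance to explain (all hypothetical names); no retraining is needed, only predictions:

```python
# Monte Carlo permutation sampling (Strumbelj & Kononenko, 2014) sketch:
# approximate the Shapley value of one feature by averaging marginal
# contributions over random feature orderings and random background instances.
import numpy as np

def shapley_monte_carlo(model, X, x_row, feature, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    features = list(X.columns)
    total = 0.0
    for _ in range(n_samples):
        z = X.iloc[rng.integers(len(X))]         # random background instance
        order = list(rng.permutation(features))  # random feature ordering
        pos = order.index(feature)
        # x_plus: `feature` and everything before it come from x_row, the rest
        # from z; x_minus: identical except `feature` also comes from z.
        x_plus, x_minus = x_row.copy(), x_row.copy()
        for f in order[pos + 1:]:
            x_plus[f] = z[f]
            x_minus[f] = z[f]
        x_minus[feature] = z[feature]
        total += (model.predict(x_plus.to_frame().T)[0]
                  - model.predict(x_minus.to_frame().T)[0])
    return total / n_samples
```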

SHapley Additive exPlanations (SHAP)

A good source to understand SHAP is https://christophm.github.io/interpretable-ml-book/shap.html

  • SHAP is a unification of additive feature attribution methods (LIME, DeepLIFT, classic Shapley values and Layer-Wise Relevance Propagation). Additive feature attribution methods attribute an effect to each feature and sum the effects of all feature attributions to approximate the output of the original model
  • It comes from the 2017 paper “A Unified Approach to Interpreting Model Predictions”
  • “SHAP encompasses a range of techniques for efficiently approximating Shapley values, by combining them with local interpretability methods such as LIME”. It does this notably with KernelSHAP (a kernel-based estimation approach inspired by local surrogate models) and TreeSHAP (an estimation method for tree-based models)

Quick note on Surrogate models and LIME:

Surrogate model: “interpretable models designed to “copy” the behavior of the ML model. The surrogate approach treats the ML model as a black-box and only requires the input and output data of the ML model to train a surrogate ML model. However, the interpretation is based on analyzing components of the interpretable surrogate model. […] LIME is an example of a local surrogate method that explains individual predictions by learning an interpretable model with data in proximity to the data point to be explained”

LIME: Local Interpretable Model-agnostic Explanations (LIME) is a paper in which the authors propose a concrete implementation of local surrogate models. […] Instead of training a global surrogate model, LIME focuses on training local surrogate models to explain individual predictions

KernelSHAP: a kernel-based approximation where you reframe the Shapley values as the parameters of a linear model. You sample coalitions, get the model prediction for each, and fit a weighted linear model on these samples; the resulting coefficients are the Shapley values. The concept is very similar to LIME, where we fit a simple model to approximate a complex one. One of the differences from LIME is that each sample is a binary mask telling you which features are included, and the weighting is different: coalitions with very few or very many features get the largest weights.

TreeSHAP: takes advantage of the structure of the individual trees of ensemble models. It is faster than KernelSHAP and gives you the exact solution
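
A minimal usage sketch with the shap library, assuming an already fitted tree ensemble `model` (for example a scikit-learn RandomForestRegressor) and a pandas DataFrame `X` of training data (both hypothetical names):

```python
import shap

# KernelSHAP: model-agnostic; only needs a prediction function and a
# background dataset (summarized with k-means to keep the sampling tractable).
background = shap.kmeans(X, 50)
kernel_explainer = shap.KernelExplainer(model.predict, background)
kernel_shap_values = kernel_explainer.shap_values(X.iloc[:10])  # slow, so only a few rows

# TreeSHAP: exploits the tree structure; exact and much faster.
tree_explainer = shap.TreeExplainer(model)
tree_shap_values = tree_explainer.shap_values(X)
```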

SHAP library’s main plots

Feature importance plot: “Features with large absolute Shapley values are important. Since we want the global importance, we average the absolute Shapley values per feature across the data”

Shap summary plot: “The summary plot combines feature importance with feature effects. Each point on the summary plot is a Shapley value for a feature and an instance. The position on the y-axis is determined by the feature and on the x-axis by the Shapley value. The color represents the value of the feature from low to high. Overlapping points are jittered in y-axis direction, so we get a sense of the distribution of the Shapley values per feature. The features are ordered according to their importance.”

Shap dependence plot: “1) Pick a feature. 2) For each data instance, plot a point with the feature value on the x-axis and the corresponding Shapley value on the y-axis”
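
A sketch of these three plots with the shap library, reusing the hypothetical `tree_shap_values` and `X` from the previous snippet (“A” stands for whichever feature you want to inspect):

```python
# Global feature importance: mean absolute Shapley value per feature
shap.summary_plot(tree_shap_values, X, plot_type="bar")

# Summary (beeswarm) plot: one point per Shapley value, colored by feature value
shap.summary_plot(tree_shap_values, X)

# Dependence plot: feature value on the x-axis, Shapley value on the y-axis
shap.dependence_plot("A", tree_shap_values, X)
```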

Interpretation

The Shapley value is the average contribution of a feature value to the prediction across different coalitions. The Shapley value is NOT the difference in prediction when we remove the feature from the model

Pitfalls

Feature dependencies: when two features are strongly correlated, SHAP (because of the symmetry property) will split the feature effect between them; this is not necessarily a bad thing, but it is worth keeping in mind when interpreting feature impact.

Causal Inference: SHAP is not a measure of how important a given feature is in the real world; it only measures how important a feature is to the model. A model is not always a good representation of reality, as it can use proxy variables. For example, country of origin can be used to predict the probability of getting skin cancer, when in fact it is a proxy for the amount of sunshine people receive

Bad model Generalization: “Under or overfitting models will result in misleading interpretations regarding true feature effects and importance score”. […] “Make sure your model is properly fit. Interpretations only as good as underlying model”

Unnecessary Use of Complex Models: “A common mistake is to use an opaque, complex ML model when an interpretable model would have been sufficient, i.e. when the performance of interpretable models is only negligibly worse”. […] “We recommend starting with simple, interpretable models such as (generalized) linear models, LASSO, generalized additive models, decision trees or decision rules and gradually increasing complexity in a controlled, step-wise manner”. Quick thoughts: I like the idea, but in a lot of contexts performance comes first, and then (if it is good enough) we talk about explainability.

Ignoring Feature Dependence:

  • Interpretation with Extrapolation: “When features are dependent, perturbation-based IML methods such as PFI (Permutation Feature Importance) and PDP (Partial Dependence Plot) extrapolate in areas where the model was trained with little or no training data, which can cause misleading interpretations. Perturbations produce artificial data points that are used for model predictions, which in turn are aggregated to produce global interpretations”
  • Confusing correlation with dependence: “Features with a Pearson correlation coefficient (PCC) close to zero can still be dependent and cause misleading model interpretations. While independence between two features implies that the PCC is zero, the converse is generally false. The PCC, which is often used to analyze dependence, only tracks linear correlations and has other shortcomings such as sensitivity to outliers. Any type of dependence between features can have a strong impact on the interpretation of the results of the IML method”
