1. Definition

The Shapley value is the average contribution of a feature value to the prediction, taken over all possible feature coalitions.

2. Explanation

2.1 Linear Model


$$\hat{f}(x)=\beta_0+\beta_1x_1+\dots+\beta_px_p,\quad j=1,\dots,p$$

where $x_j$ is the feature value and $\beta_j$ is the weight of the corresponding feature $j$. The contribution $\phi_j$ of feature $j$ to the prediction $\hat{f}(x)$ is:


$$\phi_j(\hat{f})=\beta_jx_j-E(\beta_jX_j)=\beta_jx_j-\beta_jE(X_j)$$

where $E(\beta_jX_j)$ is the estimated average effect of feature $j$; the contribution is the difference between the feature effect and the average effect. Summing the contributions of all features of one sample:


$$\sum_{j=1}^{p}\phi_j(\hat{f})=\sum_{j=1}^{p}\left(\beta_jx_j-E(\beta_jX_j)\right)=\left(\beta_0+\sum_{j=1}^{p}\beta_jx_j\right)-\left(\beta_0+\sum_{j=1}^{p}E(\beta_jX_j)\right)=\hat{f}(x)-E(\hat{f}(X))$$

So for a linear model the feature contributions sum to the prediction for the point $x$ minus the average predicted value.
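This decomposition is easy to check numerically. Below is a minimal sketch with a made-up linear model and background data (the coefficients, data, and instance are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up background data and a fixed linear model f(x) = beta_0 + beta @ x
X = rng.normal(size=(1000, 3))               # p = 3 features
beta_0, beta = 1.5, np.array([2.0, -1.0, 0.5])

def f_hat(X):
    return beta_0 + X @ beta

x = np.array([0.3, -0.2, 1.0])               # instance to explain

# Contribution of feature j: phi_j = beta_j * x_j - beta_j * E[X_j]
phi = beta * x - beta * X.mean(axis=0)

# The contributions sum to f(x) - E[f(X)]
assert np.isclose(phi.sum(), f_hat(x[None, :])[0] - f_hat(X).mean())
```

The equality is exact for a linear model because the intercept $\beta_0$ cancels in the difference.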

2.2 Shapley Value

In game theory, the Shapley value is a method for fairly distributing a payout among the players of a cooperative game. Applied to machine learning, it distributes a single prediction among the features and works for any model. Extending the original definition, the Shapley value of a feature value is its marginal contribution averaged over all possible coalitions of feature values, with the following weights:


$$\phi_j(val)=\sum_{S\subseteq\{x_1,\dots,x_p\}\setminus\{x_j\}}\frac{|S|!\,(p-|S|-1)!}{p!}\left(val(S\cup\{x_j\})-val(S)\right)$$

Here $S$ is a subset of the features used by the model, $x$ is the vector of feature values of the sample to be explained, $p$ is the number of features, and $val_x(S)$ is the prediction based on the feature values in the subset $S$.

The Shapley value has four properties:

  • Efficiency: the feature contributions sum to the difference between the prediction and the mean prediction.
  • Symmetry: if two feature values contribute equally to all possible coalitions, their Shapley values are equal.
  • Dummy: if a feature never changes the predicted value, its Shapley value is 0 in every coalition.
  • Additivity: for a game with combined payouts, the Shapley values of the combined game are the sums of the individual games' Shapley values.
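For a small number of features the weighted-sum formula can be evaluated exactly, which also lets the efficiency property be verified. A sketch, assuming a made-up linear model and a value function that fixes the features in $S$ to the explained instance's values and averages the model over background data for the rest:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(1)

# Made-up background data, linear model, and instance to explain
X = rng.normal(size=(500, 3))
beta_0, beta = 1.0, np.array([2.0, -1.0, 0.5])
x = np.array([0.4, 1.2, -0.7])
p = len(x)

def val(S):
    """Value of coalition S: expected prediction with features in S
    fixed to x's values and the rest drawn from the background data."""
    Xs = X.copy()
    Xs[:, list(S)] = x[list(S)]
    return beta_0 + (Xs @ beta).mean()

def shapley(j):
    """Exact Shapley value of feature j via the weighted-sum formula."""
    others = [k for k in range(p) if k != j]
    total = 0.0
    for r in range(p):
        for S in itertools.combinations(others, r):
            w = math.factorial(len(S)) * math.factorial(p - len(S) - 1) / math.factorial(p)
            total += w * (val(S + (j,)) - val(S))
    return total

phi = np.array([shapley(j) for j in range(p)])

# Efficiency: the Shapley values sum to f(x) - E[f(X)]
assert np.isclose(phi.sum(), val(tuple(range(p))) - val(()))
```

For a linear model each marginal contribution is the same, so the exact Shapley value reduces to $\beta_j(x_j - E[X_j])$, matching the linear-model contribution above.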

3. Application

3.1 Approximating the Shapley value of a single feature

A Monte Carlo sampling method is used to approximate the Shapley Value.


$$\hat{\phi_j}=\frac{1}{M}\sum_{m=1}^{M}\left(\hat{f}(x_{+j}^m)-\hat{f}(x_{-j}^m)\right)$$

Algorithm steps:

  1. For m = 1,…,M:
    1. Draw a random instance z from the data matrix X
    2. Choose a random permutation o of the features
    3. Order x: $x_o=(x_1,\dots,x_j,\dots,x_p)$
    4. Order z: $z_o=(z_1,\dots,z_j,\dots,z_p)$
    5. Construct two new instances:
      • with feature j: $x_{+j}=(x_1,\dots,x_{j-1},x_j,z_{j+1},\dots,z_p)$
      • without feature j: $x_{-j}=(x_1,\dots,x_{j-1},z_j,z_{j+1},\dots,z_p)$
    6. Compute the marginal contribution: $\phi_j^m=\hat{f}(x_{+j})-\hat{f}(x_{-j})$
  2. Average over all samples: $\phi_j(x)=\frac{1}{M}\sum_{m=1}^{M}\phi_j^m$
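The sampling procedure above can be sketched as follows. The linear model and background data are made up for illustration; for a linear model the estimate can be checked against the exact value $\beta_j(x_j - E[X_j])$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up background data and linear model
X = rng.normal(size=(2000, 4))
beta_0, beta = 0.5, np.array([1.0, -2.0, 0.0, 3.0])

def f_hat(x):
    return beta_0 + x @ beta

def mc_shapley(f_hat, x, X, j, M=2000, rng=rng):
    """Monte Carlo estimate of feature j's Shapley value (steps above)."""
    n, p = X.shape
    phi = 0.0
    for _ in range(M):
        z = X[rng.integers(n)]           # 1. draw a random instance z
        o = rng.permutation(p)           # 2. random feature order
        pos = np.where(o == j)[0][0]     # position of feature j in the order
        before = o[:pos]                 # features ordered before j take x's values
        x_plus, x_minus = z.copy(), z.copy()
        x_plus[before] = x[before]
        x_plus[j] = x[j]                 # x_{+j}: feature j from x
        x_minus[before] = x[before]      # x_{-j}: feature j keeps z's value
        phi += f_hat(x_plus) - f_hat(x_minus)   # marginal contribution
    return phi / M                       # average over all M samples

x = np.array([1.0, 0.5, -1.0, 2.0])
phi_1 = mc_shapley(f_hat, x, X, j=1)
```

With 2000 samples the estimate should lie close to the exact value for this model, `beta[1] * (x[1] - X[:, 1].mean())`, up to Monte Carlo noise.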

4. Advantages and disadvantages

4.1 Advantages

  • The difference between the prediction and the average prediction is fairly distributed among the feature values.
  • The Shapley value allows contrastive explanations: a prediction can be compared with the average prediction of the whole data set or of a subset.
  • The method rests on a solid theoretical foundation (the efficiency, symmetry, dummy, and additivity axioms).
  • The prediction can be interpreted as a game played by the feature values.

4.2 Disadvantages

  • Long computation time: the exact calculation is exponential in the number of features, so it usually has to be approximated by sampling.
  • The Shapley value can be misinterpreted: it is a feature value's contribution to the difference between the prediction and the mean prediction, not the change in prediction when the feature is removed.
    • The Shapley value does not give sparse explanations (ones using only a few features). Sparse explanations can instead be obtained with LIME, or with the SHAP package (which is also based on Shapley values but addresses the sparsity issue).
  • It returns a single value per feature rather than a prediction model, so it cannot be used to make statements about how the prediction would change if the input changed.
  • Access to the data is required, because replacement feature values must be sampled from it.
  • Sampling replacement values can produce unrealistic data instances when features are correlated, which biases the estimate.