Additional Feature Attribution Plots Documentation

The scikit-explain package includes scatter plots originally developed in the SHAP package. These include dependence-style scatter plots that demonstrate feature effects (similar to partial dependence) and summary plots that demonstrate feature rankings (similar to permutation importance).

[1]:
import sys, os
sys.path.insert(0, os.path.dirname(os.getcwd()))
[2]:
import skexplain
from skexplain.common.importance_utils import to_skexplain_importance
import plotting_config
import numpy as np
import shap

Loading the training data and pre-fit models

[3]:
estimators = skexplain.load_models()
X,y = skexplain.load_data()
[4]:
# X_subset is the set of examples we want to get attribution values for.
random_state = np.random.RandomState(42)
N=1000
ind = random_state.choice(len(X), size=N, replace=False)
X_subset = X.iloc[ind]
y_subset = y[ind]
X_subset.reset_index(inplace=True, drop=True)

explainer = skexplain.ExplainToolkit(estimators[0], X=X_subset,)

Compute the Feature Attributions

When computing SHAP values, the user has a couple of options (passed in as shap_kwargs). First, we need to declare the masker. In this example, we are using a built-in masker from the SHAP Python package. We could also pass in a background dataset rather than a masker (e.g., masker = shap.sample(X, 100).reset_index(drop=True)). Second, we can declare the SHAP algorithm we want to use. By default, we set algorithm = "auto" and the SHAP package will determine the best algorithm. If we use the Partition masker, I’ve found that it defaults to the Exact algorithm; otherwise, it tends to use Permutation (note there are other options, but for different model types).
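
For reference, here is a minimal sketch of the alternative configuration that passes a sampled background dataset instead of a Partition masker (the variable names and the sample size of 100 are illustrative, and the explicit algorithm choice is optional):

# Sketch: a sampled background dataset in place of a masker;
# 'auto' lets the SHAP package pick the algorithm.
background = shap.sample(X, 100).reset_index(drop=True)
alt_shap_kws = {'masker' : background, 'algorithm' : 'auto'}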

[5]:
# For LIME, we must provide the training dataset. We also denote any categorical features.
lime_kws = {'training_data' : X.values, 'categorical_names' : ['rural', 'urban']}

# As stated above, the masker handles the missing features. In this case, we use correlations
# in the dataset to determine the feature groupings. These groups of features are removed or
# added in sets together.
shap_kws={'masker' : shap.maskers.Partition(X, max_samples=100, clustering="correlation"),
           'algorithm' : 'permutation'}
[6]:
# We already have results saved, but you can set run=True to run the code yourself.
run=False
if run:
    method=['shap', 'lime', 'tree_interpreter']
    results = explainer.local_attributions(method=method,
                                           shap_kws=shap_kws,
                                           lime_kws=lime_kws,
                                           n_jobs=8,
                                          )
    explainer.save('../tutorial_data/attr_values.nc', results)
[7]:
results = explainer.load('../tutorial_data/attr_values.nc')
[8]:
results
[8]:
<xarray.Dataset>
Dimensions:                                 (n_examples: 100, n_features: 30)
Dimensions without coordinates: n_examples, n_features
Data variables:
    shap_values__Random Forest              (n_examples, n_features) float64 ...
    shap_bias__Random Forest                (n_examples) float64 34.28 ... 34.28
    lime_values__Random Forest              (n_examples, n_features) float64 ...
    lime_bias__Random Forest                (n_examples) float64 16.62 ... 47.31
    tree_interpreter_values__Random Forest  (n_examples, n_features) float64 ...
    tree_interpreter_bias__Random Forest    (n_examples) float64 39.18 ... 39.18
    X                                       (n_examples, n_features) float64 ...
Attributes:
    estimator_output:  probability
    estimators used:   Random Forest
    features:          ['dllwave_flux', 'dwpt2m', 'fric_vel', 'gflux', 'high_...
    method:            ['shap', 'lime', 'tree_interpreter']
[9]:
explain_obj = shap._explanation.Explanation(values = results['shap_values__Random Forest'].values,
                                            feature_names = results.attrs['features']
                                           )
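
The explain_obj above bundles the SHAP values and feature names into a SHAP Explanation object, which can also be handed to SHAP's native plotting utilities. A hedged sketch (assuming a recent shap version; value-colored plots such as the beeswarm would additionally need the raw feature values passed via the data argument):

# Sketch: SHAP's built-in bar plot of mean |SHAP value| per feature.
# This works from values and feature_names alone; other SHAP plots may need `data`.
shap.plots.bar(explain_obj)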

SHAP-Style Ranking Plot

Once we compute the feature attribution values for a large number of examples, we can evaluate different patterns. For example, in the plot below, feature attribution values are ranked by their absolute sum. Additionally, the feature attribution values are color-coded by their normalized magnitude, where red indicates a higher predictor value and blue indicates a lower predictor value. In this case, surface temperature (\(T_{sfc}\)) had the largest absolute sum, and lower values increase the probability of freezing road surface temperatures.

[10]:
explainer.scatter_plot(
                    plot_type = 'summary',
                    dataset=results,
                    method = 'tree_interpreter',
                    estimator_name = 'Random Forest',
                    display_feature_names=plotting_config.display_feature_names,
        )
[10]:
(<AxesSubplot:xlabel='SHAP value (impact on model output)'>,
 <matplotlib.colors.LinearSegmentedColormap at 0x154a25435ac0>)
../_images/notebooks_additional_feature_attribution_plots_13_1.png

Converting Feature Attributions to a Skexplain-style Ranking Plot

Instead of this plot, we can evaluate the attribution-based rankings with the bar-style plot used in skexplain.

[11]:
# Convert the shap values to importance scores for plotting.
# For SHAP values, we can use the 'shap_sum' or 'shap_std' methods available
# in to_skexplain_importance.
shap_rank = to_skexplain_importance(results['shap_values__Random Forest'].values,
                                    estimator_name='Random Forest',
                                    feature_names=X.columns, method='shap_sum')

lime_rank = to_skexplain_importance(results['lime_values__Random Forest'].values,
                                    estimator_name='Random Forest',
                                    feature_names=X.columns, method='lime')

ti_rank = to_skexplain_importance(results['tree_interpreter_values__Random Forest'].values,
                                  estimator_name='Random Forest',
                                  feature_names=X.columns, method='tree_interpreter')

explainer.plot_importance(data=[shap_rank, lime_rank, ti_rank],
                          panels = [('shap_sum', 'Random Forest'), ('lime', 'Random Forest'),
                                   ('tree_interpreter', 'Random Forest')],
                          display_feature_names=plotting_config.display_feature_names,
                          feature_colors=plotting_config.color_dict,
                          xlabels = ['SHAP', 'LIME', 'TI']
                          )
[11]:
(<Figure size 1800x750 with 5 Axes>,
 array([<AxesSubplot:xlabel='SHAP'>, <AxesSubplot:xlabel='LIME'>,
        <AxesSubplot:xlabel='TI'>], dtype=object))
../_images/notebooks_additional_feature_attribution_plots_15_1.png
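
As a rough guide to what the 'shap_sum' ranking represents, the score is based on aggregating the absolute attribution values per feature. A simplified sketch of that aggregation (illustrative only; to_skexplain_importance may differ in detail):

# Sketch: absolute-sum importance per feature, ordered most to least important.
attr = results['shap_values__Random Forest'].values   # (n_examples, n_features)
importance = np.nansum(np.abs(attr), axis=0)          # one score per feature
ranked_features = np.array(X.columns)[np.argsort(importance)[::-1]]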

Scatter Dependence Plot

Feature attributions can also be displayed similarly to an ALE/PD curve, where the attribution values are plotted as a function of the predictor value.

[12]:
features = ['sat_irbt','d_rad_d', 'temp2m', 'hrrr_dT']
methods = ['shap']

# Note: If the plots are looking wonky, you can change the figsize.
explainer.scatter_plot(features=features,
                    plot_type = 'dependence',
                    dataset=results,
                    method = methods,
                    estimator_name = 'Random Forest',
                    display_feature_names=plotting_config.display_feature_names,
                    display_units = plotting_config.display_units,
                    interaction_index=None,
)
[12]:
(<Figure size 2400x1500 with 6 Axes>,
 array([[<AxesSubplot:xlabel='IRBT ($^{\\circ}$C)'>,
         <AxesSubplot:xlabel='Diff2 (W m$^{-2}$)'>,
         <AxesSubplot:xlabel='$T_{2m}$ ($^{\\circ}$C)'>],
        [<AxesSubplot:xlabel='$T_{sfc}$ - $T_{2m}$ ($^{\\circ}$C)'>,
         <AxesSubplot:>, <AxesSubplot:>]], dtype=object))
../_images/notebooks_additional_feature_attribution_plots_17_1.png

Scatter Dependence Plot with Color-coding

We can color-code the scatter plot by the value of the feature it approximately interacts with. The approximate interaction is determined by finding the feature whose input values are associated with the largest variation in the first feature's attribution values. To determine the interaction automatically, set interaction_index='auto'. You can also manually set interaction_index to a feature column.
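
The actual heuristic is handled internally by the plotting code, but a toy sketch of the idea (illustrative only; not skexplain's or SHAP's implementation) is to pick, for the plotted feature, the column whose values correlate most strongly with that feature's attribution values:

# Toy heuristic for an interaction index (not the library's actual method):
# given the attribution matrix and the matching feature values, return the
# index of the feature most correlated with the plotted feature's attributions.
def approx_interaction_index(attr, X_vals, feature_idx):
    n_features = X_vals.shape[1]
    scores = [abs(np.corrcoef(X_vals[:, j], attr[:, feature_idx])[0, 1])
              if j != feature_idx else -np.inf
              for j in range(n_features)]
    return int(np.argmax(scores))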

[13]:
import matplotlib.pyplot as plt
features = ['sat_irbt', 'd_rad_d', 'hrrr_dT', 'temp2m', 'mid_cloud', 'swave_flux', 'd_ground']
methods = ['tree_interpreter']

explainer.scatter_plot(features=features,
                    plot_type = 'dependence',
                    dataset=results,
                    method = methods,
                    estimator_name = 'Random Forest',
                    display_feature_names=plotting_config.display_feature_names,
                    display_units = plotting_config.display_units,
                    target='target',
                    interaction_index='auto',
                    figsize=(8,8), colorbar_pad=0.2, wspace=0.7, hspace=0.6
)
#plt.tight_layout()
[13]:
(<Figure size 2400x2400 with 16 Axes>,
 array([[<AxesSubplot:xlabel='IRBT ($^{\\circ}$C)'>,
         <AxesSubplot:xlabel='Diff2 (W m$^{-2}$)'>,
         <AxesSubplot:xlabel='$T_{sfc}$ - $T_{2m}$ ($^{\\circ}$C)'>],
        [<AxesSubplot:xlabel='$T_{2m}$ ($^{\\circ}$C)'>,
         <AxesSubplot:xlabel='$Cloud_{mid}$ (%)'>,
         <AxesSubplot:xlabel='$I_{S}$ (W m$^{-2}$)'>],
        [<AxesSubplot:xlabel='Diff1 (W m$^{-2}$)'>, <AxesSubplot:>,
         <AxesSubplot:>]], dtype=object))
../_images/notebooks_additional_feature_attribution_plots_19_1.png
[14]:
features = ['sat_irbt', 'd_rad_d', 'temp2m', 'hrrr_dT']
methods = ['tree_interpreter']
explainer.scatter_plot(features=features,
                    plot_type = 'dependence',
                    dataset=results,
                    method = methods,
                    estimator_name = 'Random Forest',
                    display_feature_names=plotting_config.display_feature_names,
                    display_units = plotting_config.display_units,
                    target='target',
                    interaction_index='temp2m',
)
[14]:
(<Figure size 2400x1500 with 10 Axes>,
 array([[<AxesSubplot:xlabel='IRBT ($^{\\circ}$C)'>,
         <AxesSubplot:xlabel='Diff2 (W m$^{-2}$)'>,
         <AxesSubplot:xlabel='$T_{2m}$ ($^{\\circ}$C)'>],
        [<AxesSubplot:xlabel='$T_{sfc}$ - $T_{2m}$ ($^{\\circ}$C)'>,
         <AxesSubplot:>, <AxesSubplot:>]], dtype=object))
../_images/notebooks_additional_feature_attribution_plots_20_1.png

Scatter Dependence Plot with Color-coding, background histogram, and ALE curves

We can also include histdata, which is a combination of X and y. For classification problems, the user can provide the name of the target variable, and the background histogram will be color-coded by class. It is also possible to overlay the ALE curves with a few extra lines of code.

[15]:
features = ['sat_irbt', 'd_rad_d', 'temp2m', 'hrrr_dT']
explainer = skexplain.ExplainToolkit(estimators[0], X=X,)
ale_1d_ds = explainer.ale(features=features, n_bootstrap=1,
                          subsample=10000, n_jobs=1, n_bins=20)
ALE Numerical Features: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  5.11it/s]
[16]:
features = ['sat_irbt', 'd_rad_d', 'temp2m', 'hrrr_dT']

histdata=X.copy()
histdata['target'] = y
est_name = estimators[0][0]

fig, axes = explainer.scatter_plot(features=features,
                    plot_type = 'dependence',
                    dataset=results,
                    method = ['tree_interpreter'],
                    estimator_name = 'Random Forest',
                    display_feature_names=plotting_config.display_feature_names,
                    display_units = plotting_config.display_units,
                    histdata=histdata,
                    interaction_index='auto',
                    colorbar_pad=0.5, wspace=0.7, hspace=0.6

)

for ax, f in zip(axes.flat, features):
    x = ale_1d_ds[f'{f}__bin_values']
    y = 100*np.mean(ale_1d_ds[f'{f}__{est_name}__ale'], axis=0)
    ax.plot(x,y,color='k', lw=2.0 )

../_images/notebooks_additional_feature_attribution_plots_23_0.png

Scatter Dependence Plot with color-coding based on the target value

To color by the target values, set target_values to y and keep interaction_index=None.

[17]:
features = ['tmp2m_hrs_bl_frez', 'sat_irbt', 'sfcT_hrs_ab_frez', 'tmp2m_hrs_ab_frez', 'd_rad_d', 'temp2m']

explainer.scatter_plot(features=features,
                    plot_type = 'dependence',
                    dataset=results,
                    method = ['tree_interpreter'],
                    estimator_name = 'Random Forest',
                    display_feature_names=plotting_config.display_feature_names,
                    display_units = plotting_config.display_units,
                    histdata=histdata,
                    target_values=y_subset,
                    interaction_index=None,
)
[17]:
(<Figure size 2400x1500 with 14 Axes>,
 array([[<AxesSubplot:xlabel='Hours $T_{2m}$ $\\leq $ 0 (hrs)'>,
         <AxesSubplot:xlabel='IRBT ($^{\\circ}$C)'>,
         <AxesSubplot:xlabel='Hours $T_{sfc}$ $>$ 0 (hrs)'>],
        [<AxesSubplot:xlabel='Hours $T_{2m}$ $>$ 0 (hrs)'>,
         <AxesSubplot:xlabel='Diff2 (W m$^{-2}$)'>,
         <AxesSubplot:xlabel='$T_{2m}$ ($^{\\circ}$C)'>]], dtype=object))
../_images/notebooks_additional_feature_attribution_plots_25_1.png

Feature Attribution plots for a Regression Task

[18]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
[19]:
data = fetch_california_housing()
X = data['data']
y = data['target']
feature_names = data['feature_names']
model= RandomForestRegressor()
model.fit(X,y)
[19]:
RandomForestRegressor()
[20]:
# X_subset is the set of examples we want to get SHAP values for.
random_state = np.random.RandomState(42)

ind = random_state.choice(len(X), size=100, replace=False)
X_subset = X[ind]
y_subset = y[ind]
[21]:
explainer = skexplain.ExplainToolkit(('Random Forest', model), X=X_subset, feature_names=feature_names)
[22]:
results = explainer.local_attributions('shap', shap_kws={
    'masker' : shap.maskers.Partition(X, max_samples=100, clustering='correlation'),
    'algorithm' : 'auto'})
Exact explainer: 101it [00:12,  2.01it/s]
[23]:
explainer.scatter_plot(
                    plot_type = 'summary',
                    estimator_name = 'Random Forest',
                    dataset=results,
                    display_feature_names=plotting_config.display_feature_names,
)
[23]:
(<AxesSubplot:xlabel='SHAP value (impact on model output)'>,
 <matplotlib.colors.LinearSegmentedColormap at 0x154a25435ac0>)
../_images/notebooks_additional_feature_attribution_plots_32_1.png
[27]:
est_name = 'Random Forest'
explainer = skexplain.ExplainToolkit((est_name, model), X=X, feature_names=feature_names)
ale_1d_ds = explainer.ale(features='all', n_bootstrap=1,
                          subsample=10000, n_jobs=1, n_bins=20)

fig, axes = explainer.scatter_plot(
                    plot_type = 'dependence',
                    estimator_name = est_name,
                    dataset=results,
                    features = feature_names,
                    display_feature_names=plotting_config.display_feature_names,
)

for ax, f in zip(axes.flat, feature_names):
    x = ale_1d_ds[f'{f}__bin_values']
    y = np.mean(ale_1d_ds[f'{f}__{est_name}__ale'], axis=0)
    ax.plot(x,y,color='k', lw=2.0 )

ALE Numerical Features: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:03<00:00,  2.33it/s]
../_images/notebooks_additional_feature_attribution_plots_33_1.png