Additional Feature Attribution Plots Documentation

The scikit-explain package includes scatter plots originally developed in the SHAP package. These include dependence-style scatter plots that demonstrate feature effects (similar to partial dependence) and summary plots that demonstrate feature rankings (similar to permutation importance).

[1]:
import sys, os
sys.path.insert(0, os.path.dirname(os.getcwd()))
[2]:
import skexplain
from skexplain.common.importance_utils import to_skexplain_importance
import plotting_config
import numpy as np
import shap

Loading the training data and pre-fit models

[3]:
estimators = skexplain.load_models()
X,y = skexplain.load_data()
[4]:
# X_subset is the set of examples we want to get attribution values for.
random_state = np.random.RandomState(42)
N=1000
ind = random_state.choice(len(X), size=N, replace=False)
X_subset = X.iloc[ind]
y_subset = y[ind]
X_subset.reset_index(inplace=True, drop=True)

explainer = skexplain.ExplainToolkit(estimators[0], X=X_subset,)

Compute the Feature Attributions

When computing SHAP values, the user has a couple of options (passed in as shap_kwargs). First, we need to declare the masker. In this example, we are using a built-in masker from the SHAP Python package. We could also pass in a background dataset rather than a masker (e.g., masker = shap.sample(X, 100).reset_index(drop=True)). Second, we can declare the SHAP algorithm we want to use. By default, we set algorithm = "auto" and the SHAP package will determine the best algorithm. If we use the Partition masker, I’ve found that it defaults to the Exact algorithm; otherwise, it tends to use Permutation (note there are other options, but for different model types).
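
For reference, here is a minimal sketch of the alternative configuration that passes a sampled background dataset instead of a Partition masker (the variable names and the sample size of 100 are illustrative, and the explicit algorithm choice is optional):

# Sketch: a sampled background dataset in place of a masker;
# 'auto' lets the SHAP package pick the algorithm.
background = shap.sample(X, 100).reset_index(drop=True)
alt_shap_kws = {'masker' : background, 'algorithm' : 'auto'}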

[5]:
# For LIME, we must provide the training dataset. We also denote any categorical features.
lime_kws = {'training_data' : X.values, 'categorical_names' : ['rural', 'urban']}

# As stated above, the masker handles the missing features. In this case, we use correlations
# in the dataset to determine the feature groupings. These groups of features are removed or
# added in sets together.
shap_kws={'masker' : shap.maskers.Partition(X, max_samples=100, clustering="correlation"),
           'algorithm' : 'permutation'}
[6]:
# We already have results saved, but you can set run=True to run the code yourself.
run=False
if run:
    method=['shap', 'lime', 'tree_interpreter']
    results = explainer.local_attributions(method=method,
                                           shap_kws=shap_kws,
                                           lime_kws=lime_kws,
                                           n_jobs=8,
                                          )
    explainer.save('../tutorial_data/attr_values.nc', results)
[7]:
results = explainer.load('../tutorial_data/attr_values.nc')
[8]:
results
[8]:
<xarray.Dataset>
Dimensions:                                 (n_examples: 100, n_features: 30)
Dimensions without coordinates: n_examples, n_features
Data variables:
    shap_values__Random Forest              (n_examples, n_features) float64 ...
    shap_bias__Random Forest                (n_examples) float64 34.28 ... 34.28
    lime_values__Random Forest              (n_examples, n_features) float64 ...
    lime_bias__Random Forest                (n_examples) float64 16.62 ... 47.31
    tree_interpreter_values__Random Forest  (n_examples, n_features) float64 ...
    tree_interpreter_bias__Random Forest    (n_examples) float64 39.18 ... 39.18
    X                                       (n_examples, n_features) float64 ...
Attributes:
    estimator_output:  probability
    estimators used:   Random Forest
    features:          ['dllwave_flux', 'dwpt2m', 'fric_vel', 'gflux', 'high_...
    method:            ['shap', 'lime', 'tree_interpreter']
[9]:
explain_obj = shap._explanation.Explanation(values = results['shap_values__Random Forest'].values,
                                            feature_names = results.attrs['features']
                                           )
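
The explain_obj above bundles the SHAP values and feature names into a SHAP Explanation object, which can also be handed to SHAP's native plotting utilities. A hedged sketch (assuming a recent shap version; value-colored plots such as the beeswarm would additionally need the raw feature values passed via the data argument):

# Sketch: SHAP's built-in bar plot of mean |SHAP value| per feature.
# This works from values and feature_names alone; other SHAP plots may need `data`.
shap.plots.bar(explain_obj)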

SHAP-Style Ranking Plot

Once we compute the feature attribution values for a large number of examples, we can evaluate different patterns. For example, in the plot below, feature attribution values are ranked by their absolute sum. Additionally, the feature attribution values are color-coded by their normalized magnitude, where red indicates a higher predictor value and blue indicates a lower predictor value. In this case, surface temperature (\(T_{sfc}\)) had the largest absolute sum, and lower values increase the probability of freezing road surface temperatures.

[10]:
explainer.scatter_plot(
                    plot_type = 'summary',
                    dataset=results,
                    method = 'tree_interpreter',
                    estimator_name = 'Random Forest',
                    display_feature_names=plotting_config.display_feature_names,
        )
[10]:
(<AxesSubplot:xlabel='SHAP value (impact on model output)'>,
 <matplotlib.colors.LinearSegmentedColormap at 0x154a25435ac0>)
../_images/notebooks_additional_feature_attribution_plots_13_1.png

Converting Feature Attributions to a Skexplain-style Ranking Plot

Instead of this plot, we can evaluate the attribution-based rankings with the bar-style plot used in skexplain.

[11]:
# Convert the shap values to importance scores for plotting.
# For SHAP values, we can use the 'shap_sum' or 'shap_std' methods available
# in to_skexplain_importance.
shap_rank = to_skexplain_importance(results['shap_values__Random Forest'].values,
                                    estimator_name='Random Forest',
                                    feature_names=X.columns, method='shap_sum')

lime_rank = to_skexplain_importance(results['lime_values__Random Forest'].values,
                                    estimator_name='Random Forest',
                                    feature_names=X.columns, method='lime')

ti_rank = to_skexplain_importance(results['tree_interpreter_values__Random Forest'].values,
                                  estimator_name='Random Forest',
                                  feature_names=X.columns, method='tree_interpreter')

explainer.plot_importance(data=[shap_rank, lime_rank, ti_rank],
                          panels = [('shap_sum', 'Random Forest'), ('lime', 'Random Forest'),
                                   ('tree_interpreter', 'Random Forest')],
                          display_feature_names=plotting_config.display_feature_names,
                          feature_colors=plotting_config.color_dict,
                          xlabels = ['SHAP', 'LIME', 'TI']
                          )
[11]:
(<Figure size 1800x750 with 5 Axes>,
 array([<AxesSubplot:xlabel='SHAP'>, <AxesSubplot:xlabel='LIME'>,
        <AxesSubplot:xlabel='TI'>], dtype=object))
../_images/notebooks_additional_feature_attribution_plots_15_1.png
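
As a rough guide to what the 'shap_sum' ranking represents, the score is based on aggregating the absolute attribution values per feature. A simplified sketch of that aggregation (illustrative only; to_skexplain_importance may differ in detail):

# Sketch: absolute-sum importance per feature, ordered most to least important.
attr = results['shap_values__Random Forest'].values   # (n_examples, n_features)
importance = np.nansum(np.abs(attr), axis=0)          # one score per feature
ranked_features = np.array(X.columns)[np.argsort(importance)[::-1]]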

Scatter Dependence Plot

Feature attributions can also be displayed similarly to an ALE/PD curve, where the attribution values are plotted as a function of the predictor value.

[12]:
features = ['sat_irbt','d_rad_d', 'temp2m', 'hrrr_dT']
methods = ['shap']

# Note: If the plots are looking wonky, you can change the figsize.
explainer.scatter_plot(features=features,
                    plot_type = 'dependence',
                    dataset=results,
                    method = methods,
                    estimator_name = 'Random Forest',
                    display_feature_names=plotting_config.display_feature_names,
                    display_units = plotting_config.display_units,
                    interaction_index=None,
)
[12]:
(<Figure size 2400x1500 with 6 Axes>,
 array([[<AxesSubplot:xlabel='IRBT ($^{\\circ}$C)'>,
         <AxesSubplot:xlabel='Diff2 (W m$^{-2}$)'>,
         <AxesSubplot:xlabel='$T_{2m}$ ($^{\\circ}$C)'>],
        [<AxesSubplot:xlabel='$T_{sfc}$ - $T_{2m}$ ($^{\\circ}$C)'>,
         <AxesSubplot:>, <AxesSubplot:>]], dtype=object))
../_images/notebooks_additional_feature_attribution_plots_17_1.png

Scatter Dependence Plot with Color-coding

We can color-code the scatter plot by the value of the feature it approximately interacts with. The approximate interaction is determined by finding the feature whose input values are associated with the largest variation in the first feature's attribution values. To determine the interaction automatically, set interaction_index='auto'. You can also manually set interaction_index to a feature column.
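
The actual heuristic is handled internally by the plotting code, but a toy sketch of the idea (illustrative only; not skexplain's or SHAP's implementation) is to pick, for the plotted feature, the column whose values correlate most strongly with that feature's attribution values:

# Toy heuristic for an interaction index (not the library's actual method):
# given the attribution matrix and the matching feature values, return the
# index of the feature most correlated with the plotted feature's attributions.
def approx_interaction_index(attr, X_vals, feature_idx):
    n_features = X_vals.shape[1]
    scores = [abs(np.corrcoef(X_vals[:, j], attr[:, feature_idx])[0, 1])
              if j != feature_idx else -np.inf
              for j in range(n_features)]
    return int(np.argmax(scores))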

[13]:
import matplotlib.pyplot as plt
features = ['sat_irbt', 'd_rad_d', 'hrrr_dT', 'temp2m', 'mid_cloud', 'swave_flux', 'd_ground']
methods = ['tree_interpreter']

explainer.scatter_plot(features=features,
                    plot_type = 'dependence',
                    dataset=results,
                    method = methods,
                    estimator_name = 'Random Forest',
                    display_feature_names=plotting_config.display_feature_names,
                    display_units = plotting_config.display_units,
                    target='target',
                    interaction_index='auto',
                    figsize=(8,8), colorbar_pad=0.2, wspace=0.7, hspace=0.6
)
#plt.tight_layout()
[13]:
(<Figure size 2400x2400 with 16 Axes>,
 array([[<AxesSubplot:xlabel='IRBT ($^{\\circ}$C)'>,
         <AxesSubplot:xlabel='Diff2 (W m$^{-2}$)'>,
         <AxesSubplot:xlabel='$T_{sfc}$ - $T_{2m}$ ($^{\\circ}$C)'>],
        [<AxesSubplot:xlabel='$T_{2m}$ ($^{\\circ}$C)'>,
         <AxesSubplot:xlabel='$Cloud_{mid}$ (%)'>,
         <AxesSubplot:xlabel='$I_{S}$ (W m$^{-2}$)'>],
        [<AxesSubplot:xlabel='Diff1 (W m$^{-2}$)'>, <AxesSubplot:>,
         <AxesSubplot:>]], dtype=object))
../_images/notebooks_additional_feature_attribution_plots_19_1.png
[14]:
features = ['sat_irbt', 'd_rad_d', 'temp2m', 'hrrr_dT']
methods = ['tree_interpreter']
explainer.scatter_plot(features=features,
                    plot_type = 'dependence',
                    dataset=results,
                    method = methods,
                    estimator_name = 'Random Forest',
                    display_feature_names=plotting_config.display_feature_names,
                    display_units = plotting_config.display_units,
                    target='target',
                    interaction_index='temp2m',
)
[14]:
(<Figure size 2400x1500 with 10 Axes>,
 array([[<AxesSubplot:xlabel='IRBT ($^{\\circ}$C)'>,
         <AxesSubplot:xlabel='Diff2 (W m$^{-2}$)'>,
         <AxesSubplot:xlabel='$T_{2m}$ ($^{\\circ}$C)'>],
        [<AxesSubplot:xlabel='$T_{sfc}$ - $T_{2m}$ ($^{\\circ}$C)'>,
         <AxesSubplot:>, <AxesSubplot:>]], dtype=object))
../_images/notebooks_additional_feature_attribution_plots_20_1.png

Scatter Dependence Plot with Color-coding, background histogram, and ALE curves

We can also include histdata, which is a combination of X and y. For classification problems, the user can provide the name of the target variable, and the background histogram will be color-coded by class. It is also possible to overlay the ALE curves with a few extra lines of code.

[15]:
features = ['sat_irbt', 'd_rad_d', 'temp2m', 'hrrr_dT']
explainer = skexplain.ExplainToolkit(estimators[0], X=X,)
ale_1d_ds = explainer.ale(features=features, n_bootstrap=1,
                          subsample=10000, n_jobs=1, n_bins=20)
ALE Numerical Features: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  5.11it/s]
[16]:
features = ['sat_irbt', 'd_rad_d', 'temp2m', 'hrrr_dT']

histdata=X.copy()
histdata['target'] = y
est_name = estimators[0][0]

fig, axes = explainer.scatter_plot(features=features,
                    plot_type = 'dependence',
                    dataset=results,
                    method = ['tree_interpreter'],
                    estimator_name = 'Random Forest',
                    display_feature_names=plotting_config.display_feature_names,
                    display_units = plotting_config.display_units,
                    histdata=histdata,
                    interaction_index='auto',
                    colorbar_pad=0.5, wspace=0.7, hspace=0.6

)

for ax, f in zip(axes.flat, features):
    x = ale_1d_ds[f'{f}__bin_values']
    y = 100*np.mean(ale_1d_ds[f'{f}__{est_name}__ale'], axis=0)
    ax.plot(x,y,color='k', lw=2.0 )

../_images/notebooks_additional_feature_attribution_plots_23_0.png

Scatter Dependence Plot with color-coding based on the target value

To color by the target values, set target_values to y and keep interaction_index=None.

[17]:
features = ['tmp2m_hrs_bl_frez', 'sat_irbt', 'sfcT_hrs_ab_frez', 'tmp2m_hrs_ab_frez', 'd_rad_d', 'temp2m']

explainer.scatter_plot(features=features,
                    plot_type = 'dependence',
                    dataset=results,
                    method = ['tree_interpreter'],
                    estimator_name = 'Random Forest',
                    display_feature_names=plotting_config.display_feature_names,
                    display_units = plotting_config.display_units,
                    histdata=histdata,
                    target_values=y_subset,
                    interaction_index=None,
)
[17]:
(<Figure size 2400x1500 with 14 Axes>,
 array([[<AxesSubplot:xlabel='Hours $T_{2m}$ $\\leq $ 0 (hrs)'>,
         <AxesSubplot:xlabel='IRBT ($^{\\circ}$C)'>,
         <AxesSubplot:xlabel='Hours $T_{sfc}$ $>$ 0 (hrs)'>],
        [<AxesSubplot:xlabel='Hours $T_{2m}$ $>$ 0 (hrs)'>,
         <AxesSubplot:xlabel='Diff2 (W m$^{-2}$)'>,
         <AxesSubplot:xlabel='$T_{2m}$ ($^{\\circ}$C)'>]], dtype=object))
../_images/notebooks_additional_feature_attribution_plots_25_1.png

Feature Attribution plots for a Regression Task

[18]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
[19]:
data = fetch_california_housing()
X = data['data']
y = data['target']
feature_names = data['feature_names']
model= RandomForestRegressor()
model.fit(X,y)
[19]:
RandomForestRegressor()
[20]:
# X_subset is the set of examples we want to get SHAP values for.
random_state = np.random.RandomState(42)

ind = random_state.choice(len(X), size=100, replace=False)
X_subset = X[ind]
y_subset = y[ind]
[21]:
explainer = skexplain.ExplainToolkit(('Random Forest', model), X=X_subset, feature_names=feature_names)
[22]:
results = explainer.local_attributions('shap', shap_kws={
    'masker' : shap.maskers.Partition(X, max_samples=100, clustering='correlation'),
    'algorithm' : 'auto'})
Exact explainer: 101it [00:12,  2.01it/s]
[23]:
explainer.scatter_plot(
                    plot_type = 'summary',
                    estimator_name = 'Random Forest',
                    dataset=results,
                    display_feature_names=plotting_config.display_feature_names,
)
[23]:
(<AxesSubplot:xlabel='SHAP value (impact on model output)'>,
 <matplotlib.colors.LinearSegmentedColormap at 0x154a25435ac0>)
../_images/notebooks_additional_feature_attribution_plots_32_1.png
[27]:
est_name = 'Random Forest'
explainer = skexplain.ExplainToolkit((est_name, model), X=X, feature_names=feature_names)
ale_1d_ds = explainer.ale(features='all', n_bootstrap=1,
                          subsample=10000, n_jobs=1, n_bins=20)

fig, axes = explainer.scatter_plot(
                    plot_type = 'dependence',
                    estimator_name = est_name,
                    dataset=results,
                    features = feature_names,
                    display_feature_names=plotting_config.display_feature_names,
)

for ax, f in zip(axes.flat, feature_names):
    x = ale_1d_ds[f'{f}__bin_values']
    y = np.mean(ale_1d_ds[f'{f}__{est_name}__ale'], axis=0)
    ax.plot(x,y,color='k', lw=2.0 )

ALE Numerical Features: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:03<00:00,  2.33it/s]
../_images/notebooks_additional_feature_attribution_plots_33_1.png