{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Permutation Importance\n", "Single-pass and multi-pass permutation importance are model-agnostic methods for ranking features by their contribution to model performance. In the single-pass method, each feature is permuted individually and the resulting drop in performance is measured. The multi-pass method extends this by keeping the most important feature found so far permuted while the remaining features are assessed, which helps break inter-feature correlations.\n", "\n", "This notebook demonstrates how to compute and visualize both methods, compare forward and backward selection, and annotate correlated features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import skexplain\n", "import plotting_config" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimators = skexplain.load_models()\n", "X, y = skexplain.load_data()\n", "\n", "print(estimators)\n", "print(f'X Shape : {X.shape}')\n", "# Mean of y as a percentage (the positive-class rate for a binary target)\n", "print(f'y Skew : {y.mean()*100:.1f}%')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "explainer = skexplain.ExplainToolkit(estimators=estimators, X=X, y=y)\n", "\n", "explainer.set_plotting_config(\n", " display_feature_names=plotting_config.display_feature_names,\n", " display_units=plotting_config.display_units,\n", " feature_colors=plotting_config.color_dict,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing Permutation Importance\n", "We compute the backward multi-pass permutation importance for the top 10 features using the Normalized Area Under the Performance Diagram Curve (NAUPDC) as the evaluation metric. Setting `n_permute=5` repeats each permutation five times, and the spread across repeats is used to estimate confidence intervals." 
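, "\n", "\n", "As a rough sketch of the quantity involved, assume a score $S$ for which higher is better (as with NAUPDC) and let $K$ be the number of repeats (`n_permute`). The single-pass importance of feature $j$ can then be written as the mean drop in score; the symbols $S$, $f$, $X^{(j,k)}$, and $I_j$ are notation introduced here for illustration, not skexplain names:\n", "\n", "$$I_j = S(f, X, y) - \\frac{1}{K} \\sum_{k=1}^{K} S\\left(f, X^{(j,k)}, y\\right)$$\n", "\n", "Here $f$ is the fitted estimator and $X^{(j,k)}$ is $X$ with column $j$ shuffled on the $k$-th repeat, so a larger $I_j$ indicates a feature the model relies on more. The library may report the permuted scores themselves rather than this difference."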
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = explainer.permutation_importance(\n", " n_vars=10,  # rank the top 10 features\n", " evaluation_fn='norm_aupdc',  # NAUPDC\n", " n_permute=5,  # repeats per permutation\n", " subsample=0.1,  # evaluate on a 10% subsample\n", " n_jobs=8,  # parallel workers\n", " verbose=True,\n", " random_seed=42,\n", " direction='backward',\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting Single-Pass Importance\n", "The first iteration of the multi-pass method is the single-pass result and is saved by default. The `panels` argument selects what to display: each entry is a `(method, estimator_name)` tuple." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = explainer.plot_importance(\n", " data=results,\n", " panels=[('backward_singlepass', 'Random Forest')],\n", " num_vars_to_plot=15,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multi-Pass Importance\n", "Multi-pass keeps the features already identified as important permuted while the next one is assessed, breaking inter-feature correlations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = explainer.plot_importance(\n", " data=[results]*2,  # one results object per panel\n", " panels=[\n", " ('backward_multipass', 'Random Forest'),\n", " ('backward_multipass', 'Logistic Regression'),\n", " ],\n", " num_vars_to_plot=10,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing Single-Pass vs Multi-Pass\n", "Placing single-pass and multi-pass results side by side reveals how inter-feature correlations affect the rankings." 
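, "\n", "\n", "The relationship between the two rankings can be sketched as a selection rule. Assume a score $S$ for which higher is better, and let $P_{m-1}$ be the set of features left permuted after $m-1$ passes; $S$, $f$, $P_m$, and $j_m$ are notation introduced here for illustration, not skexplain names. Backward multi-pass then picks the $m$-th feature as the one whose additional permutation lowers the score most:\n", "\n", "$$j_m = \\arg\\min_{j \\notin P_{m-1}} S\\left(f, X^{\\pi(P_{m-1} \\cup \\lbrace j \\rbrace)}, y\\right), \\qquad P_m = P_{m-1} \\cup \\lbrace j_m \\rbrace, \\qquad P_0 = \\emptyset$$\n", "\n", "where $X^{\\pi(P)}$ denotes $X$ with the columns in $P$ shuffled. The first pass ($m = 1$) is the single-pass computation, so the two rankings share their top feature and can differ only from the second rank onward."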
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = explainer.plot_importance(\n", " data=[results]*2,\n", " panels=[\n", " ('backward_singlepass', 'Random Forest'),\n", " ('backward_multipass', 'Random Forest'),\n", " ],\n", " num_vars_to_plot=10,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Forward vs Backward Selection\n", "The backward method starts with unaltered features and progressively permutes them. The forward method starts with all features permuted and progressively un-permutes them. Comparing the two shows which features rank highly in both directions and which depend on the direction chosen." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forward_results = explainer.permutation_importance(\n", " n_vars=10,\n", " evaluation_fn='norm_aupdc',\n", " n_permute=5,\n", " subsample=0.1,\n", " n_jobs=8,\n", " verbose=True,\n", " random_seed=42,\n", " direction='forward',\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = explainer.plot_importance(\n", " data=[results]*3 + [forward_results]*3,\n", " panels=[\n", " ('backward_multipass', 'Random Forest'),\n", " ('backward_multipass', 'Gradient Boosting'),\n", " ('backward_multipass', 'Logistic Regression'),\n", " ('forward_multipass', 'Random Forest'),\n", " ('forward_multipass', 'Gradient Boosting'),\n", " ('forward_multipass', 'Logistic Regression'),\n", " ],\n", " ylabels=['Backward', 'Forward'],\n", " figsize=(8, 5),\n", " hspace=0.2,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Annotating Correlated Features\n", "Permutation importance implicitly treats features as independent: shuffling one column on its own breaks its relationship with any correlated columns, so the rankings can be distorted when strong correlations exist. Setting `plot_correlated_features=True` annotates correlated pairs on the plot." 
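, "\n", "\n", "As a hedged sketch of the flagging rule: assuming that `rho_threshold` refers to the Pearson correlation coefficient between two feature columns and that it is compared in absolute value (both are assumptions made here, not confirmed details of skexplain), a pair $(a, b)$ is annotated when\n", "\n", "$$\\lvert \\rho_{ab} \\rvert = \\left\\lvert \\frac{\\operatorname{cov}(X_a, X_b)}{\\sigma_a \\, \\sigma_b} \\right\\rvert \\ge 0.6$$\n", "\n", "where $X_a$ and $X_b$ are the two columns and $\\sigma_a$, $\\sigma_b$ are their standard deviations. The value 0.6 matches the `rho_threshold=0.6` used in the next cell."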
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = explainer.plot_importance(\n", " data=[results]*3,\n", " panels=[\n", " ('backward_multipass', 'Random Forest'),\n", " ('backward_multipass', 'Gradient Boosting'),\n", " ('backward_multipass', 'Logistic Regression'),\n", " ],\n", " plot_correlated_features=True,\n", " rho_threshold=0.6,\n", " figsize=(13, 4),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "- McGovern, A., R. Lagerquist, D. John Gagne, G. E. Jergensen, K. L. Elmore, C. R. Homeyer, and T. Smith, 2019: Making the Black Box More Transparent: Understanding the Physical Implications of Machine Learning. *Bull. Amer. Meteor. Soc.*, 100, 2175-2199.\n", "- Lakshmanan, V., C. Karstens, J. Krause, K. Elmore, A. Ryzhkov, and S. Berkseth, 2015: Which Polarimetric Variables Are Important for Weather/No-Weather Discrimination? *J. Atmos. Oceanic Technol.*, 32, 1209-1223.\n", "- Flora, M. L., C. K. Potvin, and A. McGovern, 2021: The Use of a Machine Learning Approach to Predict the Quality of Model Output Statistics. *Mon. Wea. Rev.*, 149, 1367-1385." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }