{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Permutation Importance\n",
    "Single-pass and multi-pass permutation importance are model-agnostic methods for ranking features by their contribution to model performance. In the single-pass method, each feature is permuted on its own and the resulting drop in performance is measured. The multi-pass method extends this: once the most important feature is found, it stays permuted while the remaining features are assessed, which reduces the influence of inter-feature correlations on the ranking.\n",
    "\n",
    "This notebook demonstrates how to compute and visualize both methods, compare forward and backward selection, and annotate correlated features."
   ]
  },
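  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import skexplain\n",
    "# plotting_config supplies the display names, units, and colors passed to the explainer below.\n",
    "import plotting_config"
   ]
  },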
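  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the example estimators and the accompanying dataset.\n",
    "estimators = skexplain.load_models()\n",
    "X, y = skexplain.load_data()\n",
    "\n",
    "print(estimators)\n",
    "print(f'X Shape : {X.shape}')\n",
    "# Mean of y as a percentage; for a binary target this is the share of positive examples.\n",
    "print(f'y Skew : {y.mean()*100:.1f}%')"
   ]
  },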
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "explainer = skexplain.ExplainToolkit(estimators=estimators, X=X, y=y)\n",
    "\n",
    "explainer.set_plotting_config(\n",
    "    display_feature_names=plotting_config.display_feature_names,\n",
    "    display_units=plotting_config.display_units,\n",
    "    feature_colors=plotting_config.color_dict,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing Permutation Importance\n",
    "We compute the backward multi-pass permutation importance for the top 10 features (`n_vars=10`), using the Normalized Area Under the Performance Diagram Curve (NAUPDC) as the evaluation metric. Setting `n_permute=5` repeats the permutation five times so that confidence intervals can be estimated for the scores, and `subsample=0.1` evaluates on a random 10% of the examples to reduce run time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = explainer.permutation_importance(\n",
    "    n_vars=10,\n",
    "    evaluation_fn='norm_aupdc',\n",
    "    n_permute=5,\n",
    "    subsample=0.1,\n",
    "    n_jobs=8,\n",
    "    verbose=True,\n",
    "    random_seed=42,\n",
    "    direction='backward',\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plotting Single-Pass Importance\n",
    "The first iteration of the multi-pass method is identical to the single-pass result, so it is saved by default. The `panels` argument takes a list of `(method, estimator_name)` tuples, one per panel to draw."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = explainer.plot_importance(\n",
    "    data=results,\n",
    "    panels=[('backward_singlepass', 'Random Forest')],\n",
    "    num_vars_to_plot=15,\n",
    ")"
   ]
  },
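  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The single-pass procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not how `skexplain` implements it: the data are synthetic, and `toy_predict`, `negative_mse`, and `single_pass_importance` are hypothetical names introduced only for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng_sp = np.random.default_rng(0)\n",
    "\n",
    "# Synthetic data (assumption): feature 0 matters most, feature 2 not at all.\n",
    "X_toy = rng_sp.normal(size=(500, 3))\n",
    "y_toy = 3.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1]\n",
    "\n",
    "def toy_predict(X):\n",
    "    # Hypothetical stand-in for a fitted estimator's predict method.\n",
    "    return 3.0 * X[:, 0] + 1.0 * X[:, 1]\n",
    "\n",
    "def negative_mse(y_true, y_pred):\n",
    "    # Higher is better, so a drop in this score means lost skill.\n",
    "    return -np.mean((y_true - y_pred) ** 2)\n",
    "\n",
    "def single_pass_importance(predict, X, y, score_fn, rng):\n",
    "    # Permute one column at a time and record the drop from the baseline score.\n",
    "    baseline = score_fn(y, predict(X))\n",
    "    drops = np.empty(X.shape[1])\n",
    "    for j in range(X.shape[1]):\n",
    "        X_perm = X.copy()\n",
    "        X_perm[:, j] = rng.permutation(X_perm[:, j])\n",
    "        drops[j] = baseline - score_fn(y, predict(X_perm))\n",
    "    return drops\n",
    "\n",
    "toy_drops = single_pass_importance(toy_predict, X_toy, y_toy, negative_mse, rng_sp)\n",
    "toy_ranking = np.argsort(toy_drops)[::-1]  # most important feature first\n",
    "print(toy_ranking, toy_drops.round(2))"
   ]
  },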
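  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multi-Pass Importance\n",
    "The multi-pass method keeps each previously identified important feature permuted while ranking the rest, so a correlated feature can no longer stand in for it. One results object is passed per panel, hence `data=[results]*2`."
   ]
  },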
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = explainer.plot_importance(\n",
    "    data=[results]*2,\n",
    "    panels=[\n",
    "        ('backward_multipass', 'Random Forest'),\n",
    "        ('backward_multipass', 'Logistic Regression'),\n",
    "    ],\n",
    "    num_vars_to_plot=10,\n",
    ")"
   ]
  },
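  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of the backward multi-pass idea, the next cell ranks features on synthetic data by repeatedly finding the feature whose permutation lowers the score most and then leaving it permuted. It is not the `skexplain` implementation; `predict_mp`, `score_mp`, and `backward_multipass` are hypothetical names, and the linear toy model is an assumption made for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng_mp = np.random.default_rng(1)\n",
    "\n",
    "# Synthetic data (assumption): three features with decreasing influence.\n",
    "weights_mp = np.array([3.0, 1.0, 0.2])\n",
    "X_mp = rng_mp.normal(size=(500, 3))\n",
    "y_mp = X_mp @ weights_mp\n",
    "\n",
    "def predict_mp(X):\n",
    "    # Hypothetical stand-in for a fitted estimator's predict method.\n",
    "    return X @ weights_mp\n",
    "\n",
    "def score_mp(y_true, y_pred):\n",
    "    # Negative mean squared error: higher is better.\n",
    "    return -np.mean((y_true - y_pred) ** 2)\n",
    "\n",
    "def backward_multipass(predict, X, y, score_fn, n_vars, rng):\n",
    "    X_work = X.copy()  # accumulates the columns that stay permuted\n",
    "    remaining = list(range(X.shape[1]))\n",
    "    ranking = []\n",
    "    for _ in range(n_vars):\n",
    "        scores, trials = {}, {}\n",
    "        for j in remaining:\n",
    "            X_try = X_work.copy()\n",
    "            X_try[:, j] = rng.permutation(X_try[:, j])\n",
    "            scores[j] = score_fn(y, predict(X_try))\n",
    "            trials[j] = X_try\n",
    "        # The most important feature is the one whose permutation hurts most.\n",
    "        worst = min(scores, key=scores.get)\n",
    "        X_work = trials[worst]  # keep that feature permuted in later passes\n",
    "        ranking.append(worst)\n",
    "        remaining.remove(worst)\n",
    "    return ranking\n",
    "\n",
    "backward_ranking = backward_multipass(predict_mp, X_mp, y_mp, score_mp, n_vars=3, rng=rng_mp)\n",
    "print(backward_ranking)"
   ]
  },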
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparing Single-Pass vs Multi-Pass\n",
    "Placing single-pass and multi-pass results side-by-side reveals how inter-feature correlations affect the rankings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = explainer.plot_importance(\n",
    "    data=[results]*2,\n",
    "    panels=[\n",
    "        ('backward_singlepass', 'Random Forest'),\n",
    "        ('backward_multipass', 'Random Forest'),\n",
    "    ],\n",
    "    num_vars_to_plot=10,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Forward vs Backward Selection\n",
    "The backward method starts with unaltered features and progressively permutes them. The forward method starts with all features permuted and progressively un-permutes them. Comparing the two rankings gives a more complete picture: a feature that ranks highly in both directions can be treated as important with more confidence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "forward_results = explainer.permutation_importance(\n",
    "    n_vars=10,\n",
    "    evaluation_fn='norm_aupdc',\n",
    "    n_permute=5,\n",
    "    subsample=0.1,\n",
    "    n_jobs=8,\n",
    "    verbose=True,\n",
    "    random_seed=42,\n",
    "    direction='forward',\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = explainer.plot_importance(\n",
    "    data=[results]*3 + [forward_results]*3,\n",
    "    panels=[\n",
    "        ('backward_multipass', 'Random Forest'),\n",
    "        ('backward_multipass', 'Gradient Boosting'),\n",
    "        ('backward_multipass', 'Logistic Regression'),\n",
    "        ('forward_multipass', 'Random Forest'),\n",
    "        ('forward_multipass', 'Gradient Boosting'),\n",
    "        ('forward_multipass', 'Logistic Regression'),\n",
    "    ],\n",
    "    ylabels=['Backward', 'Forward'],\n",
    "    figsize=(8, 5),\n",
    "    hspace=0.2,\n",
    ")"
   ]
  },
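  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The forward direction can be sketched on synthetic data as follows: every column starts out permuted, and each pass restores the feature whose original values raise the score most. This is an illustration under stated assumptions rather than the `skexplain` implementation; `predict_fw`, `score_fw`, and `forward_multipass` are hypothetical names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng_fw = np.random.default_rng(3)\n",
    "\n",
    "# Synthetic data (assumption): three features with decreasing influence.\n",
    "weights_fw = np.array([3.0, 1.0, 0.2])\n",
    "X_fw = rng_fw.normal(size=(500, 3))\n",
    "y_fw = X_fw @ weights_fw\n",
    "\n",
    "def predict_fw(X):\n",
    "    # Hypothetical stand-in for a fitted estimator's predict method.\n",
    "    return X @ weights_fw\n",
    "\n",
    "def score_fw(y_true, y_pred):\n",
    "    # Negative mean squared error: higher is better.\n",
    "    return -np.mean((y_true - y_pred) ** 2)\n",
    "\n",
    "def forward_multipass(predict, X, y, score_fn, n_vars, rng):\n",
    "    # Start with every column permuted independently.\n",
    "    X_work = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])\n",
    "    remaining = list(range(X.shape[1]))\n",
    "    ranking = []\n",
    "    for _ in range(n_vars):\n",
    "        scores = {}\n",
    "        for j in remaining:\n",
    "            X_try = X_work.copy()\n",
    "            X_try[:, j] = X[:, j]  # restore the original column\n",
    "            scores[j] = score_fn(y, predict(X_try))\n",
    "        # The most important feature is the one whose restoration helps most.\n",
    "        best = max(scores, key=scores.get)\n",
    "        X_work[:, best] = X[:, best]  # keep it restored in later passes\n",
    "        ranking.append(best)\n",
    "        remaining.remove(best)\n",
    "    return ranking\n",
    "\n",
    "forward_ranking = forward_multipass(predict_fw, X_fw, y_fw, score_fw, n_vars=3, rng=rng_fw)\n",
    "print(forward_ranking)"
   ]
  },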
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Annotating Correlated Features\n",
    "Permutation importance assumes independent features. When strong correlations exist, the rankings can be distorted. Setting `plot_correlated_features=True` annotates feature pairs whose correlation exceeds `rho_threshold` (0.6 here)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = explainer.plot_importance(\n",
    "    data=[results]*3,\n",
    "    panels=[\n",
    "        ('backward_multipass', 'Random Forest'),\n",
    "        ('backward_multipass', 'Gradient Boosting'),\n",
    "        ('backward_multipass', 'Logistic Regression'),\n",
    "    ],\n",
    "    plot_correlated_features=True,\n",
    "    rho_threshold=0.6,\n",
    "    figsize=(13, 4),\n",
    ")"
   ]
  },
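  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see which pairs a threshold such as `rho_threshold=0.6` would flag, the correlation matrix can be inspected directly. The sketch below uses pandas on synthetic features; using the absolute Pearson correlation is an assumption made for illustration and may differ from what `skexplain` computes internally. The names `X_corr` and `flagged_pairs` are hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "rng_corr = np.random.default_rng(2)\n",
    "\n",
    "# Synthetic features (assumption): 'b' is built to correlate strongly with 'a'.\n",
    "a = rng_corr.normal(size=1000)\n",
    "X_corr = pd.DataFrame({\n",
    "    'a': a,\n",
    "    'b': a + 0.3 * rng_corr.normal(size=1000),\n",
    "    'c': rng_corr.normal(size=1000),\n",
    "})\n",
    "\n",
    "threshold = 0.6\n",
    "corr = X_corr.corr().abs()  # absolute Pearson correlation matrix\n",
    "\n",
    "# Collect each unordered pair once (upper triangle, diagonal excluded).\n",
    "cols = corr.columns\n",
    "flagged_pairs = [\n",
    "    (cols[i], cols[j], round(float(corr.iloc[i, j]), 2))\n",
    "    for i in range(len(cols))\n",
    "    for j in range(i + 1, len(cols))\n",
    "    if corr.iloc[i, j] > threshold\n",
    "]\n",
    "print(flagged_pairs)"
   ]
  },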
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "- McGovern, A., R. Lagerquist, D. John Gagne, G. E. Jergensen, K. L. Elmore, C. R. Homeyer, and T. Smith, 2019: Making the Black Box More Transparent: Understanding the Physical Implications of Machine Learning. *Bull. Amer. Meteor. Soc.*, 100, 2175-2199.\n",
    "- Lakshmanan, V., C. Karstens, J. Krause, K. Elmore, A. Ryzhkov, and S. Berkseth, 2015: Which Polarimetric Variables Are Important for Weather/No-Weather Discrimination? *J. Atmos. Oceanic Technol.*, 32, 1209-1223.\n",
    "- Flora, M. L., C. K. Potvin, and A. McGovern, 2021: The Use of a Machine Learning Approach to Predict the Quality of Model Output Statistics. *Mon. Wea. Rev.*, 149, 1367-1385."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}