{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# End-to-end XGBoost regression\n",
    "\n",
    "This notebook combines the steps from the previous notebooks into a single pipeline. It is a good starting point for understanding how to use the pipeline end to end."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we train an `XGBRegressor` model on the Boston dataset to predict its continuous target (the median home value in the standard Boston housing data)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load data and train the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "\n",
    "sys.path.append(\"../\")\n",
    "\n",
    "import torch\n",
    "\n",
    "from beexai.dataset.dataset import Dataset\n",
    "from beexai.dataset.load_data import load_data\n",
    "from beexai.evaluate.metrics.get_results import get_all_metrics\n",
    "from beexai.explanation.explaining import CaptumExplainer\n",
    "from beexai.training.train import Trainer\n",
    "from beexai.utils.path import create_dir\n",
    "from beexai.utils.sampling import stratified_sampling\n",
    "from beexai.utils.time_seed import set_seed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we do not add any columns to the dataset, so we can call `load_data` directly without specifying the `add_list` or `values_to_delete` arguments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fix the random seed for reproducibility and pick the compute device\n",
    "seed = 42\n",
    "set_seed(seed)\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "\n",
    "DATA_NAME = \"boston\"\n",
    "MODEL_NAME = \"XGBRegressor\"\n",
    "\n",
    "create_dir(\"../output/data\")\n",
    "CONFIG_PATH = f\"config/{DATA_NAME}.yml\"\n",
    "data_test, target_col, task, dataCleaner = load_data(\n",
    "    from_cleaned=True, config_path=CONFIG_PATH, keep_corr_features=True\n",
    ")\n",
    "scale_params = {\"x_num_scaler_name\": \"quantile_normal\", \"y_scaler_name\": \"standard\"}\n",
    "data = Dataset(data_test, target_col)\n",
    "X_train, X_test, y_train, y_test = data.get_train_test(\n",
    "    test_size=0.2, scaler_params=scale_params\n",
    ")\n",
    "num_labels = data.get_classes_num(task)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For scikit-learn-style models such as `XGBRegressor`, no additional arguments are needed to train the model with its default hyperparameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer = Trainer(MODEL_NAME, task, device=device)\n",
    "trainer.train(torch.tensor(X_train.values), torch.tensor(y_train.values))\n",
    "metrics = trainer.get_metrics(X_test.values, y_test.values)\n",
    "for k, v in metrics.items():\n",
    "    print(f\"{k}: {v}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "create_dir(f\"../output/models/{DATA_NAME}\")\n",
    "trainer.save_model(f\"../output/models/{DATA_NAME}/{MODEL_NAME}.joblib\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To speed up the evaluation, we use the `stratified_sampling` function to draw a subsample of the test set (here with the argument `100`) while preserving the distribution of the target variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_test, y_test = stratified_sampling(X_test, y_test, 100, task)"
   ]
  },
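  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sampling step above can be pictured with a small, self-contained sketch. It assumes that stratifying a continuous target means binning it into quantiles and drawing the same fraction from every bin. `stratified_sample_sketch`, `n_bins` and the toy data are hypothetical names used for illustration only; the actual `stratified_sampling` in `beexai` may work differently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "def stratified_sample_sketch(y, n_samples, n_bins=10, seed=42):\n",
    "    # Hypothetical helper: returns sorted row indices of a stratified subsample.\n",
    "    rng = np.random.default_rng(seed)\n",
    "    y = np.asarray(y, dtype=float).ravel()\n",
    "    # Bin the continuous target into quantile bins; each bin acts as a stratum.\n",
    "    edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])\n",
    "    bins = np.digitize(y, edges)\n",
    "    frac = min(1.0, n_samples / len(y))\n",
    "    chosen = []\n",
    "    for b in np.unique(bins):\n",
    "        members = np.flatnonzero(bins == b)\n",
    "        # Keep the same fraction of every stratum (at least one row).\n",
    "        k = max(1, int(round(frac * len(members))))\n",
    "        chosen.append(rng.choice(members, size=k, replace=False))\n",
    "    return np.sort(np.concatenate(chosen))\n",
    "\n",
    "\n",
    "# Toy target: the subsample mean should stay close to the full mean.\n",
    "y_demo = np.random.default_rng(0).normal(size=1000)\n",
    "idx = stratified_sample_sketch(y_demo, n_samples=100)\n",
    "print(len(idx), y_demo.mean(), y_demo[idx].mean())"
   ]
  },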
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Captum explainers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Captum offers many explainers. We use `ShapleyValueSampling` in this example. `Lime` and `KernelShap` can also be used, but `DeepLift` and `IntegratedGradients` cannot: they rely on gradients, which tree-based models do not provide."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "explainer = CaptumExplainer(\n",
    "    trainer.model, task=task, method=\"ShapleyValueSampling\", sklearn=True, device=device\n",
    ")\n",
    "explainer.init_explainer()"
   ]
  },
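  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To give an intuition for what `ShapleyValueSampling` estimates, here is a minimal NumPy sketch of permutation-based Shapley sampling for a single row. `shapley_sampling_sketch`, the toy linear model and the zero baseline are assumptions made for illustration; this is not Captum's code. For a linear model the estimate should match `w * (x - baseline)`, which makes the sketch easy to check."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "def shapley_sampling_sketch(predict, x, baseline, n_perm=200, seed=0):\n",
    "    # Hypothetical helper: Monte Carlo Shapley estimate for one row x.\n",
    "    rng = np.random.default_rng(seed)\n",
    "    n_features = x.shape[0]\n",
    "    attributions = np.zeros(n_features)\n",
    "    for _ in range(n_perm):\n",
    "        current = baseline.astype(float).copy()\n",
    "        prev_out = predict(current[None, :])[0]\n",
    "        # Reveal the features of x one at a time, in a random order.\n",
    "        for j in rng.permutation(n_features):\n",
    "            current[j] = x[j]\n",
    "            out = predict(current[None, :])[0]\n",
    "            # Marginal contribution of feature j in this permutation.\n",
    "            attributions[j] += out - prev_out\n",
    "            prev_out = out\n",
    "    return attributions / n_perm\n",
    "\n",
    "\n",
    "# Toy linear model, used only to sanity-check the estimate.\n",
    "toy_w = np.array([2.0, -1.0, 0.5])\n",
    "\n",
    "\n",
    "def toy_predict(X):\n",
    "    return X @ toy_w\n",
    "\n",
    "\n",
    "x_row = np.array([1.0, 3.0, -2.0])\n",
    "phi = shapley_sampling_sketch(toy_predict, x_row, baseline=np.zeros(3))\n",
    "print(phi)"
   ]
  },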
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### XAI metrics for Shapley Value Sampling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Several quantitative metrics are implemented to evaluate the explanations. Two sanity checks are also available: a model trained on shuffled labels and a random explanation baseline."
   ]
  },
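  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an illustration of what a quantitative explanation metric can look like, the sketch below computes a generic deletion curve: features are replaced by a zero baseline in decreasing order of attribution magnitude, and the mean squared error between the original and the perturbed predictions is tracked. `deletion_auc_sketch` and the toy linear model are hypothetical, and the metrics computed by `get_all_metrics` may be defined differently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "def deletion_auc_sketch(predict, X, attributions, baseline_value=0.0):\n",
    "    # Hypothetical helper: area under a deletion curve measured with MSE.\n",
    "    X = np.asarray(X, dtype=float)\n",
    "    attributions = np.asarray(attributions, dtype=float)\n",
    "    original = predict(X)\n",
    "    # Per row, feature indices sorted from most to least important.\n",
    "    order = np.argsort(-np.abs(attributions), axis=1)\n",
    "    perturbed = X.copy()\n",
    "    rows = np.arange(X.shape[0])\n",
    "    errors = [0.0]  # nothing removed yet\n",
    "    for step in range(X.shape[1]):\n",
    "        perturbed[rows, order[:, step]] = baseline_value\n",
    "        errors.append(float(np.mean((predict(perturbed) - original) ** 2)))\n",
    "    errors = np.array(errors)\n",
    "    # Trapezoidal area under the curve, with the x-axis rescaled to [0, 1].\n",
    "    return float(np.sum((errors[1:] + errors[:-1]) / 2.0) / X.shape[1])\n",
    "\n",
    "\n",
    "lin_w = np.array([3.0, 0.1, -2.0])\n",
    "\n",
    "\n",
    "def lin_predict(X):\n",
    "    return X @ lin_w\n",
    "\n",
    "\n",
    "X_demo = np.random.default_rng(0).normal(size=(50, 3))\n",
    "informed_attr = X_demo * lin_w  # exact attributions for a linear model\n",
    "random_attr = np.random.default_rng(1).normal(size=X_demo.shape)\n",
    "auc_informed = deletion_auc_sketch(lin_predict, X_demo, informed_attr)\n",
    "auc_random = deletion_auc_sketch(lin_predict, X_demo, random_attr)\n",
    "print(auc_informed, auc_random)"
   ]
  },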
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_preds = trainer.model.predict(X_test.values)\n",
    "\n",
    "get_all_metrics(\n",
    "    X_test,\n",
    "    all_preds,\n",
    "    trainer.model,\n",
    "    explainer,\n",
    "    baseline=\"zero\",\n",
    "    auc_metric=\"mse\",\n",
    "    print_plot=False,\n",
    "    save_path=None,\n",
    "    device=device,\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}