{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# End to end XGBoost regression\n", "\n", "This notebooks synthesizes all the previous notebooks into a single pipeline. It is a good starting point to understand how to use the pipeline from end to end." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we will train an XGBoost Regressor model to classify the boston dataset to predict consommation credit value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load data and train the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "sys.path.append(\"../\")\n", "\n", "import torch\n", "\n", "from beexai.dataset.dataset import Dataset\n", "from beexai.dataset.load_data import load_data\n", "from beexai.evaluate.metrics.get_results import get_all_metrics\n", "from beexai.explanation.explaining import CaptumExplainer\n", "from beexai.training.train import Trainer\n", "from beexai.utils.path import create_dir\n", "from beexai.utils.sampling import stratified_sampling\n", "from beexai.utils.time_seed import set_seed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example, we don't add any other column to the dataset so we can use the `load_data` function directly without specifying `add_list` or `values_to_delete` arguments." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "seed = 42\n", "set_seed(seed)\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "\n", "DATA_NAME = \"boston\"\n", "MODEL_NAME = \"XGBRegressor\"\n", "\n", "create_dir(f\"../output/data\")\n", "CONFIG_PATH = f\"config/{DATA_NAME}.yml\"\n", "data_test, target_col, task, dataCleaner = load_data(\n", " from_cleaned=True, config_path=CONFIG_PATH, keep_corr_features=True\n", ")\n", "scale_params = {\"x_num_scaler_name\": \"quantile_normal\", \"y_scaler_name\": \"standard\"}\n", "data = Dataset(data_test, target_col)\n", "X_train, X_test, y_train, y_test = data.get_train_test(\n", " test_size=0.2, scaler_params=scale_params\n", ")\n", "num_labels = data.get_classes_num(task)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the case of sklearn models, no additional parameters are needed to train the model if we want to use the default parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trainer = Trainer(MODEL_NAME, task, device=device)\n", "trainer.train(torch.tensor(X_train.values), torch.tensor(y_train.values))\n", "metrics = trainer.get_metrics(X_test.values, y_test.values)\n", "for k, v in metrics.items():\n", " print(f\"{k}: {v}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_dir(f\"../output/models/{DATA_NAME}\")\n", "trainer.save_model(f\"../output/models/{DATA_NAME}/{MODEL_NAME}.joblib\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For faster testing, we use the function `stratified_sampling` that samples a fraction of the data while keeping the same distribution of the target variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_test, y_test = stratified_sampling(X_test, y_test, 100, task)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Captum Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many choices of explainers are available in Captum. We will use the `ShapleyValueSampling` explainer for this example but it is also possible to use `Lime` or `KernelShap` but not `DeepLift` or `IntegratedGradients` as they are not compatible with tree-based models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "explainer = CaptumExplainer(\n", " trainer.model, task=task, method=\"ShapleyValueSampling\", sklearn=True, device=device\n", ")\n", "explainer.init_explainer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### XAI metric for Shapley Value Sampling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Several quantitative metrics are also implemented to evaluate the explanations. It is also possible to have safety checks on the explanations with the training of a model on shuffled labels and also a random explainability baseline. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_preds = trainer.model.predict(X_test.values)\n", "\n", "get_all_metrics(\n", " X_test,\n", " all_preds,\n", " trainer.model,\n", " explainer,\n", " baseline=\"zero\",\n", " auc_metric=\"mse\",\n", " print_plot=False,\n", " save_path=None,\n", " device=device,\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 2 }