End-to-end XGBoost regression

This notebook combines all the previous notebooks into a single pipeline. It is a good starting point for understanding how to use the pipeline from end to end.

In this example, we train an XGBoost regressor on the boston dataset to predict the consumer credit value. Because the target is continuous, this is a regression task, not a classification task.

Load data and train the model

[ ]:

import sys

sys.path.append("../")

import torch

from beexai.dataset.dataset import Dataset
from beexai.dataset.load_data import load_data
from beexai.evaluate.metrics.get_results import get_all_metrics
from beexai.explanation.explaining import CaptumExplainer
from beexai.training.train import Trainer
from beexai.utils.path import create_dir
from beexai.utils.sampling import stratified_sampling
from beexai.utils.time_seed import set_seed

For this example, we do not add any columns to the dataset, so we can call the load_data function directly without specifying the add_list or values_to_delete arguments.

[ ]:

seed = 42
set_seed(seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

DATA_NAME = "boston"
MODEL_NAME = "XGBRegressor"

create_dir("../output/data")
CONFIG_PATH = f"config/{DATA_NAME}.yml"
data_test, target_col, task, dataCleaner = load_data(
    from_cleaned=True, config_path=CONFIG_PATH, keep_corr_features=True
)
scale_params = {"x_num_scaler_name": "quantile_normal", "y_scaler_name": "standard"}
data = Dataset(data_test, target_col)
X_train, X_test, y_train, y_test = data.get_train_test(
    test_size=0.2, scaler_params=scale_params
)
num_labels = data.get_classes_num(task)
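The scale_params dictionary above requests a "standard" scaler for the target. As a minimal sketch of what that usually means, assuming it follows the usual zero-mean, unit-variance convention (the helper names and the toy values below are hypothetical and not part of beexai):

```python
from statistics import fmean, pstdev


def fit_standard(values):
    """Return (mean, std) of the values; std falls back to 1.0 for constant data."""
    mean = fmean(values)
    std = pstdev(values) or 1.0  # population standard deviation
    return mean, std


def transform(values, mean, std):
    """Map values to zero mean and unit variance."""
    return [(v - mean) / std for v in values]


def inverse_transform(scaled, mean, std):
    """Map scaled values (for example model predictions) back to original units."""
    return [s * std + mean for s in scaled]


# Hypothetical target values, for illustration only.
y_train_demo = [10.0, 20.0, 30.0, 40.0]
mean, std = fit_standard(y_train_demo)
y_scaled = transform(y_train_demo, mean, std)
y_back = inverse_transform(y_scaled, mean, std)
```

When the target is scaled this way, predictions and error metrics are in scaled units unless they are mapped back with the inverse transform.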

For scikit-learn-style models such as XGBRegressor, no additional parameters are needed to train the model when the default hyperparameters are used.

[ ]:

trainer = Trainer(MODEL_NAME, task, device=device)
trainer.train(torch.tensor(X_train.values), torch.tensor(y_train.values))
metrics = trainer.get_metrics(X_test.values, y_test.values)
for k, v in metrics.items():
    print(f"{k}: {v}")

[ ]:

create_dir(f"../output/models/{DATA_NAME}")
trainer.save_model(f"../output/models/{DATA_NAME}/{MODEL_NAME}.joblib")

For faster testing, we use the stratified_sampling function, which draws a subsample of the test data while preserving the distribution of the target variable.

[ ]:

X_test, y_test = stratified_sampling(X_test, y_test, 100, task)
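For a continuous target there are no classes to stratify on, so one common approach is to stratify on quantile bins of the target. The sketch below illustrates that idea only; it is an assumption about how such sampling can be done, not the beexai implementation, and the function name, n_bins parameter, and toy data are hypothetical.

```python
import random


def stratified_sample_indices(y, n_samples, n_bins=5, seed=42):
    """Return sorted indices of a sample drawn proportionally from quantile bins of y.

    The sample size can differ slightly from n_samples because of rounding.
    """
    rng = random.Random(seed)
    n = len(y)
    # Sort indices by target value so that equal slices are quantile bins.
    order = sorted(range(n), key=lambda i: y[i])
    picked = []
    for k in range(n_bins):
        bin_idx = order[k * n // n_bins:(k + 1) * n // n_bins]
        # Proportional allocation: each bin contributes its share of the sample.
        take = min(len(bin_idx), round(n_samples * len(bin_idx) / n))
        picked.extend(rng.sample(bin_idx, take))
    return sorted(picked)


# Hypothetical skewed target, for illustration only.
y_demo = [float(i * i) for i in range(1000)]
sample_idx = stratified_sample_indices(y_demo, n_samples=100)
```

Each of the five bins contributes the same number of rows here, so the subsample covers both the low and high ends of the target range.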

Captum Models

Captum offers many explainers. We use ShapleyValueSampling for this example. Other perturbation-based methods such as Lime or KernelShap can also be used. Gradient-based methods such as DeepLift or IntegratedGradients cannot, because they require a differentiable model and are therefore not compatible with tree-based models.

[ ]:

explainer = CaptumExplainer(
    trainer.model, task=task, method="ShapleyValueSampling", sklearn=True, device=device
)
explainer.init_explainer()
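To see why Shapley value sampling works with tree-based models, note that it only needs model predictions, never gradients. The sketch below is a simplified, pure-Python illustration of the permutation-sampling idea, not Captum's implementation; the function name, the toy model, and the inputs are hypothetical.

```python
import random


def shapley_value_sampling(predict, x, baseline, n_perm=50, seed=0):
    """Monte Carlo estimate of Shapley values for a single input x.

    predict is any callable mapping a list of feature values to a float.
    """
    rng = random.Random(seed)
    n = len(x)
    attributions = [0.0] * n
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)
        current = list(baseline)
        prev_out = predict(current)
        for i in perm:
            current[i] = x[i]  # switch feature i from its baseline to its real value
            out = predict(current)
            attributions[i] += out - prev_out  # marginal contribution of feature i
            prev_out = out
    return [a / n_perm for a in attributions]


# Hypothetical black-box model with an interaction between features 1 and 2.
def toy_model(v):
    return 2.0 * v[0] + v[1] * v[2]


x_demo = [1.0, 3.0, 4.0]
zero_baseline = [0.0, 0.0, 0.0]
attr = shapley_value_sampling(toy_model, x_demo, zero_baseline)
```

In each permutation the marginal contributions telescope, so the attributions always sum to the difference between the prediction at x_demo and the prediction at the baseline. How the interaction term is split between features 1 and 2 depends on the sampled permutations.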

XAI metrics for Shapley Value Sampling

Several quantitative metrics are implemented to evaluate the explanations. Sanity checks are also available: the explanations can be compared against those of a model trained on shuffled labels and against a random explainability baseline.

[ ]:

all_preds = trainer.model.predict(X_test.values)

get_all_metrics(
    X_test,
    all_preds,
    trainer.model,
    explainer,
    baseline="zero",
    auc_metric="mse",
    print_plot=False,
    save_path=None,
    device=device,
)
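The baseline="zero" and auc_metric="mse" arguments suggest perturbation-based metrics. As a generic illustration of that family, and not a description of what get_all_metrics computes internally, the sketch below builds a deletion curve: features are replaced by a zero baseline in order of decreasing attribution magnitude, and the squared change in the prediction is recorded. All names and the toy model are hypothetical.

```python
def deletion_curve(predict, x, attributions, baseline_value=0.0):
    """Squared prediction change as features are replaced by the baseline,
    starting with the largest absolute attribution."""
    order = sorted(range(len(x)), key=lambda i: abs(attributions[i]), reverse=True)
    original = predict(x)
    perturbed = list(x)
    curve = [0.0]  # no feature removed yet
    for i in order:
        perturbed[i] = baseline_value
        curve.append((predict(perturbed) - original) ** 2)
    return curve


def area_under_curve(curve):
    """Trapezoidal area with unit spacing, divided by the number of steps."""
    steps = len(curve) - 1
    return sum((curve[k] + curve[k + 1]) / 2 for k in range(steps)) / steps


# Hypothetical linear model, for illustration only.
def toy_linear(v):
    return 3.0 * v[0] + 1.0 * v[1] + 0.0 * v[2]


x_point = [1.0, 1.0, 1.0]
faithful_attr = [3.0, 1.0, 0.0]    # matches the true feature effects
unfaithful_attr = [0.0, 1.0, 3.0]  # reversed ranking, stands in for a poor explainer

faithful_curve = deletion_curve(toy_linear, x_point, faithful_attr)
unfaithful_curve = deletion_curve(toy_linear, x_point, unfaithful_attr)
```

A faithful ranking removes the influential features first, so its curve rises faster and its area is larger than that of the reversed ranking. Comparing an explainer against a random ranking in the same way is the idea behind a random explainability baseline.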