How to train a model on any dataset

This notebook shows how to load your tabular data with the implemented pipeline and train a simple neural network for later use in the explainability notebooks.

Declare your data

Don’t forget to write a config file for your dataset in the config/ folder. Some basic examples are available for classification and regression. When loading the data, you can also add new columns computed from existing ones and delete rows that contain specific values. Other options (columns to drop, datetime columns) need to be declared directly in the config file.

We will use the config/kickstarter.yml file for this example.

[ ]:
import sys

sys.path.append("../")

import pandas as pd
import torch

from beexai.dataset.dataset import Dataset
from beexai.dataset.load_data import load_data
from beexai.training.train import Trainer
from beexai.utils.path import create_dir
from beexai.utils.time_seed import set_seed

seed = 42
set_seed(seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

DATA_NAME = "kickstarter"
MODEL_NAME = "NeuralNetwork"

add_list = [
    (
        "duration",
        lambda y: (pd.to_datetime(y["deadline"]) - pd.to_datetime(y["launched"])).apply(
            lambda x: x.days
        ),
        None,
    )
]
values_to_delete = [("country", 'N,0"'), ("state", "live")]

create_dir("../output/data")
CONFIG_PATH = f"config/{DATA_NAME}.yml"
data_test, target_col, task, dataCleaner = load_data(
    from_cleaned=False,
    config_path=CONFIG_PATH,
    keep_corr_features=True,
    values_to_delete=values_to_delete,
    add_list=add_list,
)

For this example, we add a duration column, which is the number of days between the launched and deadline columns. We also drop the rows whose country column has the value N,0" (as written in values_to_delete) and the rows whose state column has the value live.
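
As a rough illustration of what these two options do to the data, the sketch below applies the same expression as the lambda in add_list, and the same (column, value) pairs as values_to_delete, to a small made-up DataFrame using plain pandas. The toy rows are invented for the example, and how load_data applies these options internally is not shown here, so treat this as an approximation of the effect rather than the pipeline's own code.

```python
import pandas as pd

# Toy rows shaped like the Kickstarter columns used above (values are made up).
df = pd.DataFrame(
    {
        "launched": ["2017-01-01 10:00:00", "2017-03-15 08:30:00", "2017-05-01 00:00:00"],
        "deadline": ["2017-01-31", "2017-04-14", "2017-05-11"],
        "country": ["US", 'N,0"', "GB"],
        "state": ["successful", "failed", "live"],
    }
)

# Same expression as the lambda in add_list: whole days between launch and deadline.
df["duration"] = (pd.to_datetime(df["deadline"]) - pd.to_datetime(df["launched"])).apply(
    lambda x: x.days
)

# Drop every row matching one of the (column, value) pairs, mirroring values_to_delete.
values_to_delete = [("country", 'N,0"'), ("state", "live")]
for column, value in values_to_delete:
    df = df[df[column] != value]

print(df[["duration", "country", "state"]])
```

Only the first toy row survives: the second is removed by the country rule and the third by the state rule.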

The load_data function can also remove correlated features, with a default threshold of 70%. In this example we keep all columns (keep_corr_features=True).
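
To give an idea of what removing correlated features means, here is a minimal sketch of a common approach: compute pairwise absolute Pearson correlations and drop one column from each pair above the threshold. The helper drop_correlated is hypothetical and is not the library's implementation. The correlation measure and which column of a pair gets dropped are assumptions.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Drop the later column of each pair whose |Pearson r| exceeds threshold."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep only the strict upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# "b" is almost a multiple of "a"; "c" is only weakly related to both.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 11], "c": [5, 1, 4, 2, 3]})
reduced = drop_correlated(df, threshold=0.7)
print(list(reduced.columns))
```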

Scale data

Feature scaling and encoding are handled directly when the splits are created by the get_train_test method of the Dataset class. We use ordinal encoding for categorical features here to limit the dataset's dimensionality. With one-hot encoding, it is possible to exclude columns that have too many unique values, such as name here.

[ ]:
data = Dataset(data_test, target_col)
scale_params = {
    "x_num_scaler_name": "quantile_normal",
    "x_cat_encoder_name": "ordinalencoder",
    "y_scaler_name": "labelencoder",  # change to minmax or another float scaler for regression
    "cat_not_to_onehot": ["name"],
}
X_train, X_test, y_train, y_test = data.get_train_test(
    test_size=0.2, scaler_params=scale_params
)
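
The dimensionality argument above can be seen on a toy example. The sketch below uses plain pandas (not the Dataset class) to compare the number of columns produced by ordinal encoding and by one-hot encoding, with and without a high-cardinality column. Dropping the name column entirely is a simplification for the example: how cat_not_to_onehot treats the excluded columns is not stated here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Project A", "Project B", "Project C", "Project D"],  # unique per row
        "country": ["US", "GB", "US", "FR"],
    }
)

# Ordinal encoding: each category becomes an integer code, one column per feature.
ordinal = df.apply(lambda col: col.astype("category").cat.codes)

# One-hot encoding: one column per distinct value, so "name" alone adds 4 columns.
onehot_all = pd.get_dummies(df)

# One-hot encoding with the high-cardinality column left out.
onehot_without_name = pd.get_dummies(df.drop(columns=["name"]))

print(ordinal.shape, onehot_all.shape, onehot_without_name.shape)
```

On a real dataset where name is unique for almost every row, one-hot encoding that column would add roughly one feature per row.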

Train the model

In the case of a neural network, we need to specify the input and output shape of the model.

[ ]:
NUM_LABELS = data.get_classes_num(task)
NN_PARAMS = {"input_dim": X_train.shape[1], "output_dim": NUM_LABELS}
trainer = Trainer(MODEL_NAME, task, NN_PARAMS, device=device)
trainer.train(X_train, y_train, loss_file="../output/loss.png")
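
To make the role of input_dim and output_dim concrete, the following sketch uses NumPy (instead of PyTorch, to keep it dependency-light) to push a batch through a single linear layer. The data is random, and counting unique labels is only a stand-in for what get_classes_num might return for a classification task. Both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))             # 8 samples, 5 features
y = np.array([0, 1, 2, 1, 0, 2, 1, 0])  # 3 distinct classes

input_dim = X.shape[1]                  # one input unit per feature column
output_dim = len(np.unique(y))          # one output unit (logit) per class

# A single linear layer: the weight matrix maps input_dim features to output_dim logits.
W = rng.normal(size=(input_dim, output_dim))
b = np.zeros(output_dim)
logits = X @ W + b

print(logits.shape)
```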

Evaluation and saving

You can get the metrics on the test set for your model (accuracy/f1-score for classification, mse/rmse/r2-score/mape for regression).

[ ]:
trainer.model.eval()

metrics = trainer.get_metrics(X_test, y_test)
for k, v in metrics.items():
    print(k, v)
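
For reference, the metrics named above follow standard definitions, sketched below with NumPy. These helper functions are hypothetical and are not the code behind get_metrics. In particular, the F1 shown is the binary version, and how the library averages F1 over several classes is not covered here.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def f1_binary(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the chosen positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def regression_metrics(y_true, y_pred):
    """MSE, RMSE, R2 and MAPE (MAPE assumes no zero in y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = float(np.mean(err**2))
    r2 = float(1 - np.sum(err**2) / np.sum((y_true - y_true.mean()) ** 2))
    mape = float(np.mean(np.abs(err / y_true)))
    return {"mse": mse, "rmse": mse**0.5, "r2": r2, "mape": mape}


acc = accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
f1 = f1_binary([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
reg = regression_metrics([2.0, 4.0, 6.0], [2.5, 3.5, 6.0])
print(acc, f1, reg)
```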

Two formats are available for saving your model: pt and joblib. The pt format is made for PyTorch models and the joblib format is made for sklearn models.

[ ]:
create_dir(f"../output/models/{DATA_NAME}")
trainer.save_model(f"../output/models/{DATA_NAME}/{MODEL_NAME}.pt")
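
The choice between the two formats can be thought of as a dispatch on the file extension. The sketch below is one possible way to write such a dispatch. model_format and save_any are hypothetical helpers, not part of the library, and whether save_model stores a state_dict or the whole model object is an assumption. torch and joblib are imported lazily, so only the library matching the chosen format needs to be installed.

```python
from pathlib import Path


def model_format(path):
    """Return "pt" or "joblib" depending on the file extension."""
    suffix = Path(path).suffix
    if suffix == ".pt":
        return "pt"
    if suffix == ".joblib":
        return "joblib"
    raise ValueError(f"unsupported model format: {suffix!r}")


def save_any(model, path):
    """Save a PyTorch model as .pt or an sklearn-style model as .joblib."""
    fmt = model_format(path)
    if fmt == "pt":
        import torch  # lazy import: only needed for PyTorch models

        # Assumption: store the weights only, not the whole module object.
        torch.save(model.state_dict(), path)
    else:
        import joblib  # lazy import: only needed for sklearn models

        joblib.dump(model, path)


print(model_format("../output/models/kickstarter/NeuralNetwork.pt"))
```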

Next steps

  • Go to Explain to get explainability scores for the model you just trained.

  • Go to Metric to get explainability metrics for the method and the model of your choice.