{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# How to train a model on any dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebooks shows how to load your tabular data with the implemented pipeline and train a simple Neural Network for further usage in explainability notebooks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Declare your data\n", "\n", "Don't forget to write your config file for your dataset in `config/` folder. Some basic examples are available for classification and regression. You can also add specific values based on some columns operations and delete specific values at the beginning.\n", "Other options (colums to drop,datetime colums) need to be declared directly in the config file.\n", "\n", "We will use the `config/kickstarter.yml` file for this example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "sys.path.append(\"../\")\n", "\n", "import pandas as pd\n", "import torch\n", "\n", "from beexai.dataset.dataset import Dataset\n", "from beexai.dataset.load_data import load_data\n", "from beexai.training.train import Trainer\n", "from beexai.utils.path import create_dir\n", "from beexai.utils.time_seed import set_seed\n", "\n", "seed = 42\n", "set_seed(seed)\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "\n", "DATA_NAME = \"kickstarter\"\n", "MODEL_NAME = \"NeuralNetwork\"\n", "\n", "add_list = [\n", " (\n", " \"duration\",\n", " lambda y: (pd.to_datetime(y[\"deadline\"]) - pd.to_datetime(y[\"launched\"])).apply(\n", " lambda x: x.days\n", " ),\n", " None,\n", " )\n", "]\n", "values_to_delete = [(\"country\", 'N,0\"'), (\"state\", \"live\")]\n", "\n", "create_dir(f\"../output/data\")\n", "CONFIG_PATH = f\"config/{DATA_NAME}.yml\"\n", "data_test, target_col, task, dataCleaner = load_data(\n", " from_cleaned=False,\n", " config_path=CONFIG_PATH,\n", " keep_corr_features=True,\n", " values_to_delete=values_to_delete,\n", " add_list=add_list,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example, we will add a column `duration` which is the difference between the `deadline` and `launched` columns. We will also drop the entries with value `N,0` for the column `country` and values `live` for the column `state`.\n", "\n", "`load_data` function also allows to remove correlated features with a default threshold of 70%, in this example we will keep all columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scale data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Features scaling and encoding are directly handled the split creations of `get_train_test` method of `Dataset` class. We use ordinal encoding here for categorical features to reduce dataset dimensionality but in the case of one-hot encoding, it is possible to exclude columns with too much unique values like `name` here." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = Dataset(data_test, target_col)\n", "scale_params = {\n", " \"x_num_scaler_name\": \"quantile_normal\",\n", " \"x_cat_encoder_name\": \"ordinalencoder\",\n", " \"y_scaler_name\": \"labelencoder\", # change to minmax or another float scaler for regression\n", " \"cat_not_to_onehot\": [\"name\"],\n", "}\n", "X_train, X_test, y_train, y_test = data.get_train_test(\n", " test_size=0.2, scaler_params=scale_params\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train the model\n", "\n", "In the case of a neural network, we need to specify the input and output shape of the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NUM_LABELS = data.get_classes_num(task)\n", "NN_PARAMS = {\"input_dim\": X_train.shape[1], \"output_dim\": NUM_LABELS}\n", "trainer = Trainer(MODEL_NAME, task, NN_PARAMS, device=device)\n", "trainer.train(X_train, y_train, loss_file=\"../output/loss.png\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation and saving\n", "\n", "You can get the metrics on the test set for your model ( `accuracy/f1-score` for classification, `mse/rmse/r2-score/mape` for regression)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trainer.model.eval()\n", "\n", "metrics = trainer.get_metrics(X_test, y_test)\n", "for k, v in metrics.items():\n", " print(k, v)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Two formats are available for saving your model: `pt` and `joblib`. The `pt` format is made for PyTorch models and the `joblib` format is made for sklearn models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_dir(f\"output/models/{DATA_NAME}\")\n", "trainer.save_model(f\"../output/models/{DATA_NAME}/{MODEL_NAME}.pt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next steps\n", "\n", "- Go to [Explain](explain.ipynb) to get explainability scores for the model you just trained.\n", "- Go to [Metric](metrics.ipynb) to get explainability metrics for the method and the model of your choice." ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 2 }