beexai.dataset package

Submodules

beexai.dataset.dataset module

Creation of dataset splits and encoding/scaling of features

class beexai.dataset.dataset.Dataset(data: DataFrame, target_name: str)

Bases: object

Dataset class to create train/test splits.

target_name

name of the target column

Type:

str

data

input data

Type:

pd.DataFrame

target

target column

Type:

pd.Series

get_train_test()

create train/test splits

get_classes_num()

get the number of classes for the task

Parameters:
  • data (pd.DataFrame) – input data

  • target_name (str) – name of the target column

get_classes_num(task: str) -> int

Get the number of classes for the task. Return 1 for regression.

Parameters:

task (str) – task type

Returns:

number of classes

Return type:

int
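The rule stated above (1 for regression, otherwise the number of classes) can be sketched as follows. This is an illustration only, not the library's implementation: the helper name count_classes and the task string "regression" are assumptions, since the documentation does not list the accepted task values.

```python
def count_classes(target, task):
    """Return 1 for regression, else the number of distinct target values."""
    # "regression" as the task label is an assumption, not taken from beexai.
    if task == "regression":
        return 1
    return len(set(target))


labels = ["cat", "dog", "cat", "bird"]
n_classification = count_classes(labels, "classification")
n_regression = count_classes([0.5, 1.2, 3.4], "regression")
```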

get_train_test(test_size: float = 0.2, scaler_params: Dict[str, str] | None = None) -> Tuple[DataFrame, DataFrame, Series, Series]

Create train/test splits.

Parameters:
  • test_size (float, optional) – test size. Defaults to 0.2.

  • scaler_params (Optional[Dict[str, str]], optional) – scaling parameters. Defaults to None.

Returns:

x_train, x_test, y_train, y_test

Return type:

tuple
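To show what a test_size fraction means in practice, here is a minimal holdout split written with the standard library only. It is a sketch under assumptions: whether beexai shuffles the rows, and how it seeds the shuffle, is not documented here, so holdout_split and its seed parameter are hypothetical.

```python
import random


def holdout_split(rows, targets, test_size=0.2, seed=0):
    """Shuffle row indices and hold out a `test_size` fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seed is a hypothetical parameter
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [targets[i] for i in train_idx]
    y_test = [targets[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


rows = [[i, i * 2] for i in range(10)]
targets = [i % 2 for i in range(10)]
x_train, x_test, y_train, y_test = holdout_split(rows, targets, test_size=0.2)
```

The return order mirrors the documented one: x_train, x_test, y_train, y_test.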

get_train_val(x_train: DataFrame, y_train: Series, val_size: float = 0.2) -> Tuple[DataFrame, DataFrame, Series, Series]

Create train/val splits.

Parameters:
  • x_train (pd.DataFrame) – input data

  • y_train (pd.Series) – target column

  • val_size (float, optional) – validation size. Defaults to 0.2.

Returns:

x_train, x_val, y_train, y_val

Return type:

tuple
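Because get_train_val takes x_train and y_train as input, val_size is presumably a fraction of the training portion rather than of the full dataset. Under that assumption (it is not stated in the documentation), the short calculation below shows the overall proportions when both defaults are used; split_fractions is a hypothetical helper.

```python
def split_fractions(test_size=0.2, val_size=0.2):
    """Overall train/val/test fractions when val is cut from the train portion."""
    train_pool = 1.0 - test_size      # what is left after the test split
    val = train_pool * val_size       # validation share of the full dataset
    train = train_pool - val          # final training share
    return train, val, test_size


train_frac, val_frac, test_frac = split_fractions()
```

With the defaults this gives roughly 64% train, 16% validation and 20% test.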

class beexai.dataset.dataset.Scaler(df: DataFrame, target_col: str | None = None, x_num_scaler_name: str | None = None, x_cat_encoder_name: str | None = None, y_scaler_name: str | None = None, cat_not_to_onehot: List[str] | None = [])

Bases: object

Class for scaling the data.

df

input dataframe

Type:

pd.DataFrame

target_col

target column name

Type:

str

categorical_cols

list of categorical columns

Type:

list

x_num_scaler_name

scaler to use for x. Must be either None or one of standard, minmax, quantile_normal, quantile_uniform, maxabs, robust. Defaults to None.

Type:

str

x_cat_encoder_name

encoder to use for the categorical columns of x. Must be either None or one of labelencoder or onehotencoder. Defaults to None.

Type:

str

y_scaler_name

scaler to use for y. Must be either None or one of standard, minmax, quantile_normal, quantile_uniform, maxabs, robust or labelencoder. Defaults to None.

Type:

str

cat_not_to_onehot

list of categorical columns not to one hot encode. Defaults to [].

Type:

List[str]

scalers

dictionary of possible scalers

Type:

dict

encode_categorical()

encode categorical columns with one-hot or label encoding

do_scaling()

process the data: encode categorical features first, then scale numerical features

Parameters:
  • df (pd.DataFrame) – input dataframe

  • target_col (str, optional) – target column name. Defaults to None.

  • x_num_scaler_name (Optional[str], optional) – scaler to use for x. Must be either None or one of standard, minmax, quantile_normal, quantile_uniform, maxabs, robust. Defaults to None.

  • x_cat_encoder_name (Optional[str], optional) – encoder to use for the categorical columns of x. Must be either None or one of labelencoder or onehotencoder. Defaults to None.

  • y_scaler_name (Optional[str], optional) – scaler to use for y. Must be either None or one of standard, minmax, quantile_normal, quantile_uniform, maxabs, robust or labelencoder. Defaults to None.

  • cat_not_to_onehot (Optional[List[str]], optional) – list of categorical columns not to one hot encode. Defaults to [].

do_scaling(x_train: DataFrame, x_test: DataFrame | None, y_train: Series, y_test: Series | None) -> Tuple[DataFrame, DataFrame, Series, Series]

Encode the categorical features, then scale the numerical features and targets of existing train/test splits.

Parameters:
  • x_train (pd.DataFrame) – train dataframe

  • x_test (pd.DataFrame) – test dataframe

  • y_train (pd.Series) – train targets

  • y_test (pd.Series) – test targets

Returns:

x_train, x_test, y_train, y_test

Return type:

tuple
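As a rough illustration of scaling applied to separate train and test data, the sketch below standardizes a numeric column using statistics computed on the training values only and reuses them for the test values. Fitting on the training data alone is common practice but is an assumption here; the documentation does not say how beexai fits its scalers. The helper names are hypothetical, and the population standard deviation is used, as scikit-learn's StandardScaler does.

```python
from statistics import mean, pstdev


def fit_standard(values):
    """Return the mean and population standard deviation of the values."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # avoid dividing by zero for constant columns
    return mu, sigma


def standard_scale(values, mu, sigma):
    """Apply (v - mu) / sigma with already fitted statistics."""
    return [(v - mu) / sigma for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]
mu, sigma = fit_standard(train)            # fit on the training column only
train_scaled = standard_scale(train, mu, sigma)
test_scaled = standard_scale(test, mu, sigma)  # reuse the training statistics
```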

encode_categorical(x_train: DataFrame, x_test: DataFrame | None, x_cat_encoder: str | None = 'labelencoder', cat_not_to_onehot: List[str] | None = []) -> DataFrame

Encode categorical columns with LabelEncoder or OneHotEncoder

Parameters:
  • x_train (pd.DataFrame) – train dataframe with categorical columns

  • x_test (pd.DataFrame) – test dataframe with categorical columns

  • x_cat_encoder (Optional[str], optional) – encoding type for categorical columns

  • cat_not_to_onehot (Optional[List[str]], optional) – list of categorical columns not to one-hot encode, for example because one-hot encoding them would make the dimensionality too high. Defaults to [].

Returns:

dataframe with encoded categorical columns

Return type:

pd.DataFrame
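The difference between the two encodings, and why a column might be kept out of one-hot encoding, can be seen in a small standard-library sketch. It is not beexai's code: the helper names are hypothetical, and mapping unseen test categories to -1 is an assumption made only so the example runs.

```python
def label_encode(train_col, test_col):
    """Map each category seen in training to an integer code."""
    mapping = {cat: code for code, cat in enumerate(sorted(set(train_col)))}
    # Unseen test categories get -1 here; how beexai treats them is not documented.
    train_codes = [mapping[c] for c in train_col]
    test_codes = [mapping.get(c, -1) for c in test_col]
    return train_codes, test_codes


def one_hot_encode(col, categories):
    """Produce one 0/1 indicator per category for every value."""
    return [[1 if value == cat else 0 for cat in categories] for value in col]


train_col = ["red", "green", "red", "blue"]
test_col = ["green", "purple"]
train_codes, test_codes = label_encode(train_col, test_col)
one_hot = one_hot_encode(train_col, sorted(set(train_col)))
```

Label encoding keeps a single column, while one-hot encoding adds one column per category, so a column with many distinct values widens the data considerably.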

beexai.dataset.load_data module

Provides a fast way to load data and preprocess it

class beexai.dataset.load_data.DataCleaner(df, target_col: str, corr_threshold: float = 0.7)

Bases: object

Clean the data by removing correlated features

df

input dataframe

Type:

pd.DataFrame

target_col

target column name

Type:

str

corr_threshold

correlation threshold

Type:

float

compute_correlation_matrix()

compute the correlation matrix

plot_corr_matrix()

plot the correlation matrix

remove_correlated_features()

remove features from the dataframe whose correlation exceeds the threshold

clean_data()

clean the data

Parameters:
  • df (pd.DataFrame) – input dataframe

  • target_col (str) – target column name

  • corr_threshold (float, optional) – correlation threshold. Defaults to 0.7.

clean_data() -> DataFrame

Clean the data

Returns:

dataframe

Return type:

pd.DataFrame

compute_correlation_matrix(df: DataFrame) -> DataFrame

Compute the correlation matrix

Parameters:

df (pd.DataFrame) – dataframe

Returns:

correlation matrix

Return type:

pd.DataFrame

plot_corr_matrix(df: DataFrame) -> None

Plot the correlation matrix

Parameters:

df (pd.DataFrame) – dataframe

remove_correlated_features(df: DataFrame) -> DataFrame

Remove features from the dataframe whose correlation exceeds the threshold

Parameters:

df (pd.DataFrame) – dataframe

Returns:

dataframe without correlated features

Return type:

pd.DataFrame
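One common way to drop correlated features against a threshold is sketched below with the standard library. It is an assumption-laden illustration: the documentation does not say which feature of a correlated pair is removed, whether the absolute value of the correlation is used, or how the target column is handled. Here the first column encountered is kept, and the absolute Pearson correlation is compared with the documented default of 0.7.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.7):
    """Keep a column only if |r| with every already kept column is <= threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # near-duplicate of height_cm
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],           # only weakly related
}
kept = drop_correlated(columns)
```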

class beexai.dataset.load_data.LoadData(path: str)

Bases: object

Load data from a csv file and return a dataframe

path

path to the csv file

Type:

str

load_csv()

load the csv file

Parameters:

path (str) – path to the csv file

load_csv(keep_index: bool = False) -> DataFrame

Load the csv file

Parameters:

keep_index (bool) – whether to keep the index or not. Defaults to False.

Returns:

dataframe

Return type:

pd.DataFrame

class beexai.dataset.load_data.Preprocessor(df: DataFrame, target_col: str | None = None, values_to_delete: List[Tuple[str, str]] | None = None, datetime_cols: List[str] | None = None, add_cols: List[Tuple[str, str, str]] | None = None, cols_to_delete: List[str] | None = None)

Bases: object

Preprocess the data by deleting entries, adding new columns, converting columns to datetime and adding date information.

df

input dataframe

Type:

pd.DataFrame

encoder

encoder to use

Type:

object

target_col

target column name

Type:

str

datetime_cols

list of datetime columns

Type:

list

values_to_delete

list of tuples (col_name,value to delete)

Type:

list

add_cols

list of tuples (new_col_name,new_col_value,cast_to_type)

Type:

list

cols_to_delete

list of columns to delete

Type:

list

delete_entries()

delete entries from the dataframe

add_entries()

add new columns to the dataframe with new values. These values are combinations of existing columns.

convert_to_datetime()

convert columns to datetime

add_date_infos()

add year, month, day and hour to the dataframe

preprocess()

preprocess the data

save_cleaned_data()

save the cleaned data

Parameters:
  • df (pd.DataFrame) – input dataframe

  • target_col (str, optional) – target column name. Defaults to None.

  • values_to_delete (List[Tuple[str,str]], optional) – (col_name,value to delete) values to delete from the dataframe. Defaults to None.

  • datetime_cols (list, optional) – columns to convert to datetime. Defaults to None.

  • add_cols (list, optional) – (new_col_name,new_col_value,cast_to_type) columns to add to the dataframe. Defaults to None.

  • cols_to_delete (list, optional) – columns to delete from the dataframe. Defaults to None.

add_date_infos(df: DataFrame, col: str) -> DataFrame

Add year, month, day and hour to the dataframe

Parameters:
  • df (pd.DataFrame) – dataframe

  • col (str) – column name

Returns:

dataframe with new columns

Return type:

pd.DataFrame
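Extracting year, month, day and hour from a datetime column can be pictured with the standard library alone. The sketch below works on a list of dictionaries rather than a dataframe, and the "<col>_<part>" naming of the new fields is hypothetical, since the documentation does not give the names of the added columns.

```python
from datetime import datetime


def add_date_parts(rows, col):
    """Return copies of the rows with year, month, day and hour fields added."""
    out = []
    for row in rows:
        stamp = row[col]
        new_row = dict(row)  # copy so the input rows stay unchanged
        # The "<col>_<part>" naming is hypothetical, not taken from beexai.
        for part in ("year", "month", "day", "hour"):
            new_row[f"{col}_{part}"] = getattr(stamp, part)
        out.append(new_row)
    return out


rows = [{"pickup": datetime(2023, 5, 17, 14, 30)}]
enriched = add_date_parts(rows, "pickup")
```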

add_entries(df: DataFrame, add_cols: List[Tuple[str, str, str]]) -> DataFrame

Add new columns to the dataframe with new values. These values are combinations of existing columns.

Parameters:
  • df (pd.DataFrame) – input dataframe

  • add_cols (list) – list of tuples (new_col_name,new_col_value,cast_to_type)

Returns:

dataframe with new columns

Return type:

pd.DataFrame

convert_to_datetime(df: DataFrame) -> DataFrame

Convert columns to datetime

Parameters:

df (pd.DataFrame) – dataframe

Returns:

dataframe with datetime columns

Return type:

pd.DataFrame

delete_entries(df: DataFrame, values_to_delete: List[Tuple[str, str]]) -> DataFrame

Delete entries from the dataframe

Parameters:
  • df (pd.DataFrame) – input dataframe

  • values_to_delete (list) – list of tuples (col_name,value to delete)

Returns:

dataframe without the specified values

Return type:

pd.DataFrame
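The (col_name, value to delete) tuples can be read as row filters. The sketch below assumes that a matching row is dropped entirely, which is one plausible reading of "dataframe without the specified values" but is not confirmed by the documentation; drop_matching_rows is a hypothetical helper operating on a list of dictionaries.

```python
def drop_matching_rows(rows, values_to_delete):
    """Drop every row in which row[col_name] equals the paired value."""
    return [
        row for row in rows
        if not any(row.get(col) == value for col, value in values_to_delete)
    ]


rows = [
    {"city": "Paris", "status": "ok"},
    {"city": "unknown", "status": "ok"},
    {"city": "Lyon", "status": "error"},
]
cleaned = drop_matching_rows(rows, [("city", "unknown"), ("status", "error")])
```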

preprocess() -> DataFrame

Preprocess the data

Returns:

dataframe

Return type:

pd.DataFrame

save_cleaned_data(df: DataFrame, path: str) -> None

Save the cleaned data

Parameters:
  • df (pd.DataFrame) – dataframe

  • path (str) – path to save the dataframe

beexai.dataset.load_data.fast_load(config_path: str, values_to_delete: List[Tuple[str, str]] | None = None, adding_cols: List[Tuple[str, str, str]] | None = None, keep_corr_features: bool = True) -> List

Provides a fast way to load data and preprocess it

Parameters:
  • config_path (str) – path to the config file

  • values_to_delete (list, optional) – list of tuples (col_name,value to delete). Defaults to None.

  • adding_cols (list, optional) – list of tuples (col_name,fun_to_add,cast_to_type). Defaults to None.

  • keep_corr_features (bool, optional) – whether to keep correlated features or not. Defaults to True.

Returns:

a list containing the data, the target column name, the task and the data_cleaner object

Return type:

list

beexai.dataset.load_data.load_data(from_cleaned: bool, config_path: str, values_to_delete: List[Tuple[str, str]] | None = None, add_list: List[Tuple[str, str, str]] | None = None, keep_corr_features: bool = True) -> List

Load data from a config file

Parameters:
  • from_cleaned (bool) – whether to load the data directly from the cleaned data or not

  • config_path (str) – path to the config file

  • values_to_delete (list, optional) – list of tuples (col_name,value to delete). Defaults to None.

  • add_list (list, optional) – list of tuples (col_name,fun_to_add,cast_to_type). Defaults to None.

  • keep_corr_features (bool, optional) – whether to keep correlated features or not. Defaults to True.

Returns:

a list containing the data, the target column name, the task and the data_cleaner object

Return type:

list

Module contents