beexai.dataset package

Submodules

beexai.dataset.dataset module

Creation of dataset splits and encoding/scaling of features

class beexai.dataset.dataset.Dataset(data: DataFrame, target_name: str)[source]

Bases: object

Dataset class to create train/test splits.

target_name

name of the target column

Type:: str

data

input data

Type:: pd.DataFrame

target

target column

Type:: pd.Series

get_train_test()[source]: create train/test splits

get_classes_num()[source]: get the number of classes for the task

Parameters:

data (pd.DataFrame) – input data
target_name (str) – name of the target column

get_classes_num(task: str) → int[source]

Get the number of classes for the task. Return 1 for regression.

Parameters:: task (str) – task type
Returns:: number of classes
Return type:: int
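
Example (illustrative sketch, not the library's implementation): the documented contract is that regression yields 1. The sketch below assumes that the task string is "regression" or "classification" and that, for classification, the number of classes is the number of distinct target values. The function name classes_num_sketch is hypothetical.

```python
from typing import Hashable, Sequence


def classes_num_sketch(target: Sequence[Hashable], task: str) -> int:
    """Hypothetical stand-in for the documented behaviour of get_classes_num."""
    if task == "regression":
        # Documented: regression always yields a single output.
        return 1
    # Assumption: for other tasks, count the distinct target values.
    return len(set(target))


print(classes_num_sketch([0, 1, 1, 2], "classification"))  # 3
print(classes_num_sketch([0.5, 1.2, 3.4], "regression"))   # 1
```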

get_train_test(test_size: float = 0.2, scaler_params: Dict[str, str] | None = None) → Tuple[DataFrame, DataFrame, Series, Series][source]

Create train/test splits.

Parameters:

test_size (float, optional) – test size. Defaults to 0.2.
scaler_params (Optional[Dict[str, str]], optional) – scaling parameters. Defaults to None.

Returns:

x_train, x_test, y_train, y_test

Return type:

tuple
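
Example (illustrative sketch): the split returns four objects in the order x_train, x_test, y_train, y_test. The code below mimics that contract with plain lists in place of pandas objects. The shuffling strategy, the rounding rule, and the seed parameter are assumptions and may differ from what the library does.

```python
import random
from typing import List, Sequence, Tuple


def split_sketch(
    rows: Sequence, targets: Sequence, test_size: float = 0.2, seed: int = 0
) -> Tuple[List, List, List, List]:
    """Hypothetical shuffle-and-cut split returning x_train, x_test, y_train, y_test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_test = int(round(len(rows) * test_size))    # assumed rounding rule
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [targets[i] for i in train_idx]
    y_test = [targets[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


rows = [[i, i * 2] for i in range(10)]
targets = [i % 2 for i in range(10)]
x_train, x_test, y_train, y_test = split_sketch(rows, targets)
print(len(x_train), len(x_test))  # 8 2
```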

get_train_val(x_train: DataFrame, y_train: Series, val_size: float = 0.2) → Tuple[DataFrame, DataFrame, Series, Series][source]

Create train/val splits.

Parameters:

x_train (pd.DataFrame) – input data
y_train (pd.Series) – target column
val_size (float, optional) – validation size. Defaults to 0.2.

Returns:

x_train, x_val, y_train, y_val

Return type:

tuple
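
Example (arithmetic sketch): because get_train_val takes x_train and y_train as input, val_size is presumably a fraction of the training split rather than of the full dataset. Under that assumption, the default values of test_size and val_size give the overall proportions computed below. The helper name split_fractions is hypothetical.

```python
from typing import Tuple


def split_fractions(test_size: float = 0.2, val_size: float = 0.2) -> Tuple[float, float, float]:
    """Overall train/val/test fractions when val_size is applied to the training split."""
    train_val = 1.0 - test_size      # what remains after the test split
    val = train_val * val_size       # validation share of the full dataset
    train = train_val - val          # final training share
    return train, val, test_size


train, val, test = split_fractions()
print(f"{train:.2f} {val:.2f} {test:.2f}")  # 0.64 0.16 0.20
```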

class beexai.dataset.dataset.Scaler(df: DataFrame, target_col: str | None = None, x_num_scaler_name: str | None = None, x_cat_encoder_name: str | None = None, y_scaler_name: str | None = None, cat_not_to_onehot: List[str] | None = [])[source]

Bases: object

Class for scaling the data.

df

input dataframe

Type:: pd.DataFrame

target_col

target column name

Type:: str

categorical_cols

list of categorical columns

Type:: list

x_num_scaler_name

scaler to use for x. Must be either None or one of standard, minmax, quantile_normal, quantile_uniform, maxabs, robust. Defaults to None.

Type:: str

x_cat_encoder_name

encoder to use for the categorical features of x. Must be either None or one of labelencoder or onehotencoder. Defaults to None.

Type:: str

y_scaler_name

scaler to use for y. Must be either None or one of standard, minmax, quantile_normal, quantile_uniform, maxabs, robust or labelencoder. Defaults to None.

Type:: str

cat_not_to_onehot

list of categorical columns not to one hot encode. Defaults to [].

Type:: List[str]

scalers

dictionary of possible scalers

Type:: dict

encode_categorical()[source]: encode categorical columns in one hot or label encoding

do_scaling()[source]: process the data by encoding categorical features first, then scaling numerical features

Parameters:

df (pd.DataFrame) – input dataframe
target_col (str, optional) – target column name. Defaults to None.
x_num_scaler_name (Optional[str], optional) – scaler to use for x. Must be either None or one of standard, minmax, quantile_normal, quantile_uniform, maxabs, robust. Defaults to None.
x_cat_encoder_name (Optional[str], optional) – encoder to use for the categorical features of x. Must be either None or one of labelencoder or onehotencoder. Defaults to None.
y_scaler_name (Optional[str], optional) – scaler to use for y. Must be either None or one of standard, minmax, quantile_normal, quantile_uniform, maxabs, robust or labelencoder. Defaults to None.
cat_not_to_onehot (Optional[List[str]], optional) – list of categorical columns not to one hot encode. Defaults to [].

do_scaling(x_train: DataFrame, x_test: DataFrame | None, y_train: Series, y_test: Series | None) → Tuple[DataFrame, DataFrame, Series, Series][source]

Encode the categorical features, then scale the numerical features of the train and test data.

Parameters:

x_train (pd.DataFrame) – train dataframe
x_test (Optional[pd.DataFrame]) – test dataframe, or None
y_train (pd.Series) – train targets
y_test (Optional[pd.Series]) – test targets, or None

Returns:

x_train, x_test, y_train, y_test

Return type:

tuple
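
Example (illustrative sketch): the scaler names standard and minmax presumably refer to standardization and min-max scaling. The code below shows both transforms with plain lists, under the assumption (a common convention, not confirmed by this page) that statistics are computed on the training data and then reused on the test data. The helper names fit_standard and fit_minmax are hypothetical.

```python
from statistics import fmean, pstdev
from typing import Callable, Sequence


def fit_standard(values: Sequence[float]) -> Callable[[float], float]:
    """Return a transform (v - mean) / std using statistics of `values`."""
    mean = fmean(values)
    std = pstdev(values) or 1.0   # avoid division by zero on constant columns
    return lambda v: (v - mean) / std


def fit_minmax(values: Sequence[float]) -> Callable[[float], float]:
    """Return a transform (v - min) / (max - min) using statistics of `values`."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0    # avoid division by zero on constant columns
    return lambda v: (v - low) / span


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

standard = fit_standard(train)    # fitted on train only
minmax = fit_minmax(train)        # fitted on train only

print([minmax(v) for v in test])  # [0.5, 1.25]
print(standard(3.0))              # 0.0, since 3.0 is the train mean
```

Test values outside the training range can fall outside [0, 1] after min-max scaling, as the 1.25 above shows.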

encode_categorical(x_train: DataFrame, x_test: DataFrame | None, x_cat_encoder: str | None = 'labelencoder', cat_not_to_onehot: List[str] | None = []) → DataFrame[source]

Encode categorical columns with LabelEncoder or OneHotEncoder

Parameters:

x_train (pd.DataFrame) – train dataframe with categorical columns
x_test (pd.DataFrame) – test dataframe with categorical columns
x_cat_encoder (Optional[str], optional) – encoding type for categorical columns. Defaults to 'labelencoder'.
cat_not_to_onehot (Optional[List[str]], optional) – list of categorical columns not to one hot encode. For example if the dimensionality is too high. Defaults to [].

Returns:

dataframe with encoded categorical columns

Return type:

pd.DataFrame
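
Example (illustrative sketch): the two encodings differ in output shape. Label encoding maps each category to one integer, while one-hot encoding produces one indicator column per category. The code below shows both on a single column with plain lists. Building the category mapping from the training values only, and the sorted category order, are assumptions. The helper names are hypothetical.

```python
from typing import Dict, List, Sequence


def fit_categories(train_values: Sequence[str]) -> Dict[str, int]:
    """Map each category seen in the training column to an integer code."""
    return {cat: code for code, cat in enumerate(sorted(set(train_values)))}


def label_encode(values: Sequence[str], mapping: Dict[str, int]) -> List[int]:
    """One integer per value; assumes every value was seen during fitting."""
    return [mapping[v] for v in values]


def one_hot_encode(values: Sequence[str], mapping: Dict[str, int]) -> List[List[int]]:
    """One indicator column per category; assumes every value was seen during fitting."""
    width = len(mapping)
    return [[1 if mapping[v] == i else 0 for i in range(width)] for v in values]


train_col = ["red", "green", "blue", "green"]
test_col = ["blue", "red"]

mapping = fit_categories(train_col)
print(mapping)                            # {'blue': 0, 'green': 1, 'red': 2}
print(label_encode(train_col, mapping))   # [2, 1, 0, 1]
print(one_hot_encode(test_col, mapping))  # [[1, 0, 0], [0, 0, 1]]
```

One-hot encoding adds one column per distinct category, which is why high-cardinality columns are natural candidates for cat_not_to_onehot.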

beexai.dataset.load_data module

Provides a fast way to load data and preprocess it

class beexai.dataset.load_data.DataCleaner(df, target_col: str, corr_threshold: float = 0.7)[source]

Bases: object

Clean the data by removing correlated features

df

input dataframe

Type:: pd.DataFrame

target_col

target column name

Type:: str

corr_threshold

correlation threshold

Type:: float

compute_correlation_matrix()[source]: compute the correlation matrix

plot_corr_matrix()[source]: plot the correlation matrix

remove_correlated_features()[source]: remove correlated features from the dataframe with a threshold

clean_data()[source]: clean the data

Parameters:

df (pd.DataFrame) – input dataframe
target_col (str) – target column name
corr_threshold (float, optional) – correlation threshold. Defaults to 0.7.

clean_data() → DataFrame[source]

Clean the data

Returns:: cleaned dataframe
Return type:: pd.DataFrame

compute_correlation_matrix(df: DataFrame) → DataFrame[source]

Compute the correlation matrix

Parameters:: df (pd.DataFrame) – dataframe
Returns:: correlation matrix
Return type:: pd.DataFrame

plot_corr_matrix(df: DataFrame) → None[source]

Plot the correlation matrix

Parameters:: df (pd.DataFrame) – dataframe

remove_correlated_features(df: DataFrame) → DataFrame[source]

Remove correlated features from the dataframe with a threshold

Parameters:: df (pd.DataFrame) – dataframe
Returns:: dataframe without correlated features
Return type:: pd.DataFrame
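
Example (illustrative sketch): threshold-based removal can be pictured as keeping a feature only if it is not too strongly correlated with any feature already kept. The code below does this with plain lists. The use of Pearson correlation, the absolute value, and the rule for choosing which feature of a correlated pair to drop are all assumptions, and the helper names are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def drop_correlated(columns: Dict[str, List[float]], threshold: float = 0.7) -> Dict[str, List[float]]:
    """Keep a column only if |corr| with every previously kept column is <= threshold."""
    kept: List[str] = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return {name: columns[name] for name in kept}


features = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 6.0, 8.2, 10.0],  # nearly 2 * a, so strongly correlated with a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # correlation with a is -0.3
}
print(list(drop_correlated(features)))  # ['a', 'c']
```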

class beexai.dataset.load_data.LoadData(path: str)[source]

Bases: object

Load data from a csv file and return a dataframe

path

path to the csv file

Type:: str

load_csv()[source]: load the csv file

Parameters:: path (str) – path to the csv file

load_csv(keep_index: bool = False) → DataFrame[source]

Load the csv file

Parameters:: keep_index (bool) – whether to keep the index or not. Defaults to False.
Returns:: dataframe
Return type:: pd.DataFrame

class beexai.dataset.load_data.Preprocessor(df: DataFrame, target_col: str | None = None, values_to_delete: List[Tuple[str, str]] | None = None, datetime_cols: List[str] | None = None, add_cols: List[Tuple[str, str, str]] | None = None, cols_to_delete: List[str] | None = None)[source]

Bases: object

Preprocess the data by deleting entries, adding new columns, converting columns to datetime, and adding date information.

df

input dataframe

Type:: pd.DataFrame

encoder

encoder to use

Type:: object

target_col

target column name

Type:: str

datetime_cols

list of datetime columns

Type:: list

values_to_delete

list of tuples (col_name,value to delete)

Type:: list

add_cols

list of tuples (new_col_name,new_col_value, cast_to_type)

Type:: list

cols_to_delete

list of columns to delete

Type:: list

delete_entries()[source]: delete entries from the dataframe

add_entries()[source]: add new columns to the dataframe with new values. These values are combinations of existing columns.

convert_to_datetime()[source]: convert columns to datetime

add_date_infos()[source]: add year, month, day and hour to the dataframe

preprocess()[source]: preprocess the data

save_cleaned_data()[source]: save the cleaned data

Parameters:

df (pd.DataFrame) – input dataframe
target_col (str, optional) – target column name. Defaults to None.
values_to_delete (List[Tuple[str,str]], optional) – (col_name,value to delete) values to delete from the dataframe. Defaults to None.
datetime_cols (list, optional) – columns to convert to datetime. Defaults to None.
add_cols (list, optional) – (new_col_name,new_col_value,cast_to_type) columns to add to the dataframe. Defaults to None.
cols_to_delete (list, optional) – columns to delete from the dataframe. Defaults to None.

add_date_infos(df: DataFrame, col: str) → DataFrame[source]

Add year, month, day and hour to the dataframe

Parameters:

df (pd.DataFrame) – dataframe
col (str) – column name

Returns:

dataframe with new columns

Return type:

pd.DataFrame
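
Example (illustrative sketch): deriving year, month, day and hour from a datetime column amounts to reading the corresponding attributes of each timestamp. The code below does this on a list of dict rows standing in for a dataframe. The names of the new columns (col_year and so on) are an assumption, and the helper name add_date_parts is hypothetical.

```python
from datetime import datetime
from typing import Dict, List


def add_date_parts(rows: List[Dict], col: str) -> List[Dict]:
    """Return copies of the rows with year, month, day and hour derived from `col`."""
    enriched = []
    for row in rows:
        stamp = row[col]  # expected to be a datetime object
        enriched.append({
            **row,
            f"{col}_year": stamp.year,
            f"{col}_month": stamp.month,
            f"{col}_day": stamp.day,
            f"{col}_hour": stamp.hour,
        })
    return enriched


rows = [{"pickup": datetime(2023, 5, 17, 14, 30), "fare": 12.5}]
result = add_date_parts(rows, "pickup")
print(result[0]["pickup_year"], result[0]["pickup_month"],
      result[0]["pickup_day"], result[0]["pickup_hour"])  # 2023 5 17 14
```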

add_entries(df: DataFrame, add_cols: List[Tuple[str, str, str]]) → DataFrame[source]

Add new columns to the dataframe with new values. These values are combinations of existing columns.

Parameters:

df (pd.DataFrame) – input dataframe
add_cols (list) – list of tuples (new_col_name,new_col_value,cast_to_type)

Returns:

dataframe with new columns

Return type:

pd.DataFrame

convert_to_datetime(df: DataFrame) → DataFrame[source]

Convert columns to datetime

Parameters:: df (pd.DataFrame) – dataframe
Returns:: dataframe with datetime columns
Return type:: pd.DataFrame

delete_entries(df: DataFrame, values_to_delete: List[Tuple[str, str]]) → DataFrame[source]

Delete entries from the dataframe

Parameters:

df (pd.DataFrame) – input dataframe
values_to_delete (list) – list of tuples (col_name,value to delete)

Returns:

dataframe without the specified values

Return type:

pd.DataFrame
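
Example (illustrative sketch): values_to_delete is a list of (col_name, value) pairs. The code below assumes that any row whose column equals the listed value is dropped entirely, and uses a list of dict rows in place of a dataframe. The helper name delete_entries_sketch and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def delete_entries_sketch(rows: List[Dict], values_to_delete: List[Tuple[str, str]]) -> List[Dict]:
    """Drop every row in which any listed column holds its listed value."""
    return [
        row
        for row in rows
        if not any(row.get(col) == value for col, value in values_to_delete)
    ]


rows = [
    {"city": "Paris", "status": "ok"},
    {"city": "?", "status": "ok"},
    {"city": "Lyon", "status": "unknown"},
]
cleaned = delete_entries_sketch(rows, [("city", "?"), ("status", "unknown")])
print(cleaned)  # [{'city': 'Paris', 'status': 'ok'}]
```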

preprocess() → DataFrame[source]

Preprocess the data

Returns:: preprocessed dataframe
Return type:: pd.DataFrame

save_cleaned_data(df: DataFrame, path: str) → None[source]

Save the cleaned data

Parameters:

df (pd.DataFrame) – dataframe
path (str) – path to save the dataframe

beexai.dataset.load_data.fast_load(config_path: str, values_to_delete: List[Tuple[str, str]] | None = None, adding_cols: List[Tuple[str, str, str]] | None = None, keep_corr_features: bool = True) → List[source]

Provides a fast way to load data and preprocess it

Parameters:

config_path (str) – path to the config file
values_to_delete (list, optional) – list of tuples (col_name,value to delete). Defaults to None.
adding_cols (list, optional) – list of tuples (col_name,fun_to_add,cast_to_type). Defaults to None.
keep_corr_features (bool, optional) – whether to keep correlated features or not. Defaults to True.

Returns:

a list containing the data, the target column name, the task and the data_cleaner object

Return type:

list

beexai.dataset.load_data.load_data(from_cleaned: bool, config_path: str, values_to_delete: List[Tuple[str, str]] | None = None, add_list: List[Tuple[str, str, str]] | None = None, keep_corr_features: bool = True) → List[source]

Load data from a config file

Parameters:

from_cleaned (bool) – whether to load the data directly from the cleaned data or not
config_path (str) – path to the config file
values_to_delete (list, optional) – list of tuples (col_name,value to delete). Defaults to None.
add_list (list, optional) – list of tuples (col_name,fun_to_add,cast_to_type). Defaults to None.
keep_corr_features (bool, optional) – whether to keep correlated features or not. Defaults to True.

Returns:

a list containing the data, the target column name, the task and the data_cleaner object

Return type:

list
