beexai.dataset package
Submodules
beexai.dataset.dataset module
Creation of dataset splits and encoding/scaling of features
- class beexai.dataset.dataset.Dataset(data: DataFrame, target_name: str)[source]
Bases:
object
Dataset class to create train/test splits.
- data
input data
- Type:
pd.DataFrame
- target
target column
- Type:
pd.Series
- Parameters:
data (pd.DataFrame) – input data
target_name (str) – name of the target column
- get_classes_num(task: str) int[source]
Get the number of classes for the task. Return 1 for regression.
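As a rough illustration of the rule above, the sketch below counts the distinct target values for classification and returns 1 for regression. The helper name `count_classes` and the task labels `"classification"` and `"regression"` are assumptions made for this example; this is not the library's implementation.

```python
import pandas as pd


def count_classes(target: pd.Series, task: str) -> int:
    """Hypothetical helper: number of model outputs implied by the task."""
    if task == "regression":
        # A regression model predicts a single value.
        return 1
    # For classification, count the distinct labels in the target column.
    return int(target.nunique())


labels = pd.Series(["cat", "dog", "cat", "bird"])
print(count_classes(labels, "classification"))  # 3
print(count_classes(labels, "regression"))      # 1
```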
- get_train_test(test_size: float = 0.2, scaler_params: Dict[str, str] | None = None) Tuple[DataFrame, DataFrame, Series, Series][source]
Create train/test splits.
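A minimal sketch of what a split with test_size=0.2 produces, written with plain pandas. The helper `split_train_test` and its `seed` argument are hypothetical, and the library may split differently (for example with shuffling options or scaling applied afterwards); only the return order follows the signature above.

```python
import pandas as pd


def split_train_test(data: pd.DataFrame, target_name: str,
                     test_size: float = 0.2, seed: int = 0):
    """Hypothetical helper: hold out a random fraction of rows as the test set."""
    features = data.drop(columns=[target_name])
    target = data[target_name]
    # Sample the test rows at random; the remaining rows form the train set.
    test_idx = features.sample(frac=test_size, random_state=seed).index
    x_test, y_test = features.loc[test_idx], target.loc[test_idx]
    x_train, y_train = features.drop(index=test_idx), target.drop(index=test_idx)
    return x_train, x_test, y_train, y_test


df = pd.DataFrame({"a": range(10), "b": range(10, 20), "y": [0, 1] * 5})
x_train, x_test, y_train, y_test = split_train_test(df, "y", test_size=0.2)
print(len(x_train), len(x_test))  # 8 2
```

The sketch assumes the dataframe has a unique index, since rows are assigned to a split by index label.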
- class beexai.dataset.dataset.Scaler(df: DataFrame, target_col: str | None = None, x_num_scaler_name: str | None = None, x_cat_encoder_name: str | None = None, y_scaler_name: str | None = None, cat_not_to_onehot: List[str] | None = [])[source]
Bases:
object
Class for scaling the data.
- df
input dataframe
- Type:
pd.DataFrame
- x_num_scaler_name
scaler to use for the numerical columns of x. Must be either None or one of standard, minmax, quantile_normal, quantile_uniform, maxabs, robust. Defaults to None.
- Type:
Optional[str]
- x_cat_encoder_name
encoder to use for the categorical columns of x. Must be either None or one of labelencoder or onehotencoder. Defaults to None.
- Type:
Optional[str]
- y_scaler_name
scaler to use for y. Must be either None or one of standard, minmax, quantile_normal, quantile_uniform, maxabs, robust or labelencoder. Defaults to None.
- Type:
Optional[str]
- cat_not_to_onehot
list of categorical columns not to one hot encode. Defaults to [].
- Type:
List[str]
- Parameters:
df (pd.DataFrame) – input dataframe
target_col (str, optional) – target column name. Defaults to None.
x_num_scaler_name (Optional[str], optional) – scaler to use for the numerical columns of x. Must be either None or one of standard, minmax, quantile_normal, quantile_uniform, maxabs, robust. Defaults to None.
x_cat_encoder_name (Optional[str], optional) – encoder to use for the categorical columns of x. Must be either None or one of labelencoder or onehotencoder. Defaults to None.
y_scaler_name (Optional[str], optional) – scaler to use for y. Must be either None or one of standard, minmax, quantile_normal, quantile_uniform, maxabs, robust or labelencoder. Defaults to None.
cat_not_to_onehot (Optional[List[str]], optional) – list of categorical columns not to one hot encode. Defaults to [].
- do_scaling(x_train: DataFrame, x_test: DataFrame | None, y_train: Series, y_test: Series | None) Tuple[DataFrame, DataFrame, Series, Series][source]
Apply the configured scalers and encoders to the train/test splits.
- Parameters:
x_train (pd.DataFrame) – train dataframe
x_test (pd.DataFrame) – test dataframe
y_train (pd.Series) – train targets
y_test (pd.Series) – test targets
- Returns:
x_train, x_test, y_train, y_test
- Return type:
Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]
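The usual pattern behind scaling a train/test pair is to estimate the statistics on the train split only and reuse them on the test split. The sketch below shows this for standard (z-score) scaling in plain pandas; the helper `standardize` is hypothetical, and how the library implements each named scaler is not stated in this reference.

```python
import pandas as pd


def standardize(x_train: pd.DataFrame, x_test: pd.DataFrame):
    """Hypothetical helper: z-score both splits with train-set statistics."""
    mean = x_train.mean()
    std = x_train.std(ddof=0)  # population standard deviation
    # Reusing the train statistics keeps test information out of the fit.
    return (x_train - mean) / std, (x_test - mean) / std


x_train = pd.DataFrame({"f": [1.0, 2.0, 3.0]})
x_test = pd.DataFrame({"f": [2.0, 5.0]})
x_train_s, x_test_s = standardize(x_train, x_test)
print(x_train_s["f"].tolist())
print(x_test_s["f"].tolist())
```

After the transform the train column has zero mean and unit variance, while the test column generally does not, because it was scaled with the train statistics.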
- encode_categorical(x_train: DataFrame, x_test: DataFrame | None, x_cat_encoder: str | None = 'labelencoder', cat_not_to_onehot: List[str] | None = []) DataFrame[source]
Encode categorical columns with LabelEncoder or OneHotEncoder
- Parameters:
x_train (pd.DataFrame) – train dataframe with categorical columns
x_test (pd.DataFrame) – test dataframe with categorical columns
x_cat_encoder (Optional[str], optional) – encoding type for categorical columns
cat_not_to_onehot (Optional[List[str]], optional) – list of categorical columns not to one hot encode. For example if the dimensionality is too high. Defaults to [].
- Returns:
dataframe with encoded categorical columns
- Return type:
pd.DataFrame
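To make the difference between the two encoder options concrete, here is a small pandas-only sketch. The helper `encode` is hypothetical: it encodes train and test together so both share one category mapping, resets the row index, and returns the two splits separately, none of which is confirmed as the library's behavior.

```python
import pandas as pd


def encode(x_train, x_test, encoder="labelencoder", cat_not_to_onehot=None):
    """Hypothetical helper: encode non-numeric columns of both splits."""
    skip = set(cat_not_to_onehot or [])
    n_train = len(x_train)
    # Stack the splits (index is reset) so they share one category mapping.
    full = pd.concat([x_train, x_test], ignore_index=True)
    cat_cols = list(full.select_dtypes(exclude="number").columns)
    for col in cat_cols:
        if encoder == "onehotencoder" and col not in skip:
            # One 0/1 indicator column per category value.
            dummies = pd.get_dummies(full[col], prefix=col, dtype=int)
            full = pd.concat([full.drop(columns=[col]), dummies], axis=1)
        else:
            # Integer codes follow the sorted order of the category values.
            full[col] = full[col].astype("category").cat.codes
    return full.iloc[:n_train], full.iloc[n_train:]


x_train = pd.DataFrame({"color": ["red", "blue"], "city": ["Paris", "Rome"], "n": [1, 2]})
x_test = pd.DataFrame({"color": ["blue"], "city": ["Paris"], "n": [3]})
tr, te = encode(x_train, x_test, "onehotencoder", cat_not_to_onehot=["city"])
print(list(tr.columns))  # ['city', 'n', 'color_blue', 'color_red']
```

Here `city` is listed in cat_not_to_onehot, so it receives integer codes instead of indicator columns, which keeps the column count down for high-cardinality features.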
beexai.dataset.load_data module
Provides a fast way to load data and preprocess it
- class beexai.dataset.load_data.DataCleaner(df, target_col: str, corr_threshold: float = 0.7)[source]
Bases:
object
Clean the data by removing correlated features
- df
input dataframe
- Type:
pd.DataFrame
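Removes correlated features from the dataframe, using corr_threshold as the correlation threshold.
- Parameters:
df (pd.DataFrame) – input dataframe
target_col (str) – target column name
corr_threshold (float, optional) – correlation threshold above which features are treated as correlated. Defaults to 0.7.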
- compute_correlation_matrix(df: DataFrame) DataFrame[source]
Compute the correlation matrix
- Parameters:
df (pd.DataFrame) – dataframe
- Returns:
correlation matrix
- Return type:
pd.DataFrame
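One common way to remove correlated features with a threshold is to inspect the upper triangle of the absolute correlation matrix and drop one feature from each pair above the threshold. The sketch below follows that approach; the helper `drop_correlated` is hypothetical, and the library's own selection rule may differ.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, target_col: str,
                    corr_threshold: float = 0.7) -> pd.DataFrame:
    """Hypothetical helper: drop one feature from each highly correlated pair."""
    corr = df.drop(columns=[target_col]).corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return df.drop(columns=to_drop)


df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with "a"
    "c": [1.0, -1.0, 1.0, -1.0],  # weakly correlated with "a" and "b"
    "y": [0, 1, 0, 1],
})
print(list(drop_correlated(df, "y").columns))  # ['a', 'c', 'y']
```

The target column is excluded from the correlation check so that a feature is never dropped merely for being predictive of the target.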
- class beexai.dataset.load_data.LoadData(path: str)[source]
Bases:
object
Load data from a csv file and return a dataframe
- Parameters:
path (str) – path to the csv file
- class beexai.dataset.load_data.Preprocessor(df: DataFrame, target_col: str | None = None, values_to_delete: List[Tuple[str, str]] | None = None, datetime_cols: List[str] | None = None, add_cols: List[Tuple[str, str, str]] | None = None, cols_to_delete: List[str] | None = None)[source]
Bases:
object
Preprocess the data by deleting entries, adding new columns, converting columns to datetime and adding date information.
- df
input dataframe
- Type:
pd.DataFrame
- add_entries()[source]
Add new columns to the dataframe with new values. These values are combinations of existing columns.
- Parameters:
df (pd.DataFrame) – input dataframe
target_col (str, optional) – target column name. Defaults to None.
values_to_delete (List[Tuple[str,str]], optional) – (col_name,value to delete) values to delete from the dataframe. Defaults to None.
datetime_cols (list, optional) – columns to convert to datetime. Defaults to None.
add_cols (list, optional) – (new_col_name,new_col_value,cast_to_type) columns to add to the dataframe. Defaults to None.
cols_to_delete (list, optional) – columns to delete from the dataframe. Defaults to None.
- add_date_infos(df: DataFrame, col: str) DataFrame[source]
Add year, month, day and hour to the dataframe
- Parameters:
df (pd.DataFrame) – dataframe
col (str) – column name
- Returns:
dataframe with new columns
- Return type:
pd.DataFrame
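A sketch of deriving year, month, day and hour from a datetime column with the pandas `.dt` accessor. The helper `add_date_parts` and the `<col>_<part>` column naming are assumptions for illustration; the names the library actually gives the new columns are not documented here.

```python
import pandas as pd


def add_date_parts(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Hypothetical helper: derive year, month, day and hour from a column."""
    out = df.copy()
    stamps = pd.to_datetime(out[col])
    # The "<col>_<part>" naming is an assumption made for this example.
    for part in ("year", "month", "day", "hour"):
        out[f"{col}_{part}"] = getattr(stamps.dt, part)
    return out


df = pd.DataFrame({"ts": ["2023-05-17 14:30:00", "2024-01-02 08:00:00"]})
with_parts = add_date_parts(df, "ts")
print(with_parts[["ts_year", "ts_month", "ts_day", "ts_hour"]])
```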
- add_entries(df: DataFrame, add_cols: List[Tuple[str, str, str]]) DataFrame[source]
Add new columns to the dataframe with new values. These values are combinations of existing columns.
- Parameters:
df (pd.DataFrame) – input dataframe
add_cols (list) – list of tuples (new_col_name,new_col_value,cast_to_type)
- Returns:
dataframe with new columns
- Return type:
pd.DataFrame
- convert_to_datetime(df: DataFrame) DataFrame[source]
Convert columns to datetime
- Parameters:
df (pd.DataFrame) – dataframe
- Returns:
dataframe with datetime columns
- Return type:
pd.DataFrame
- delete_entries(df: DataFrame, values_to_delete: List[Tuple[str, str]]) DataFrame[source]
Delete entries from the dataframe
- Parameters:
df (pd.DataFrame) – input dataframe
values_to_delete (list) – list of tuples (col_name,value to delete)
- Returns:
dataframe without the specified values
- Return type:
pd.DataFrame
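The (col_name, value to delete) tuples can be read as row filters. The sketch below assumes that a row is removed whenever the named column equals the given value; the helper `drop_values` is hypothetical and only illustrates that reading.

```python
import pandas as pd


def drop_values(df: pd.DataFrame, values_to_delete) -> pd.DataFrame:
    """Hypothetical helper: drop rows whose column equals a listed value."""
    out = df
    for col_name, value in values_to_delete:
        # Keep only the rows where this column differs from the value.
        out = out[out[col_name] != value]
    return out


df = pd.DataFrame({"status": ["ok", "unknown", "ok"], "city": ["Paris", "Rome", "?"]})
cleaned = drop_values(df, [("status", "unknown"), ("city", "?")])
print(cleaned)
```

Typical uses are placeholder values such as "unknown" or "?" that stand in for missing data.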
- beexai.dataset.load_data.fast_load(config_path: str, values_to_delete: List[Tuple[str, str]] | None = None, adding_cols: List[Tuple[str, str, str]] | None = None, keep_corr_features: bool = True) List[source]
Provides a fast way to load data and preprocess it
- Parameters:
config_path (str) – path to the config file
values_to_delete (list, optional) – list of tuples (col_name,value to delete). Defaults to None.
adding_cols (list, optional) – list of tuples (col_name,fun_to_add,cast_to_type). Defaults to None.
keep_corr_features (bool, optional) – whether to keep correlated features or not. Defaults to True.
- Returns:
a list containing the data, the target column name, the task and the data_cleaner object
- Return type:
List
- beexai.dataset.load_data.load_data(from_cleaned: bool, config_path: str, values_to_delete: List[Tuple[str, str]] | None = None, add_list: List[Tuple[str, str, str]] | None = None, keep_corr_features: bool = True) List[source]
Load data from a config file
- Parameters:
from_cleaned (bool) – whether to load the data directly from the cleaned data or not
config_path (str) – path to the config file
values_to_delete (list, optional) – list of tuples (col_name,value to delete). Defaults to None.
add_list (list, optional) – list of tuples (col_name,fun_to_add,cast_to_type). Defaults to None.
keep_corr_features (bool, optional) – whether to keep correlated features or not. Defaults to True.
- Returns:
a list containing the data, the target column name, the task and the data_cleaner object
- Return type:
List