Abacus Preprocessing (prepro.py)
This module acts as a central point for orchestrating data preprocessing steps in Abacus. It includes functions for initial data validation and splitting, decorators for identifying preprocessing methods, and classes defining common scaling transformations for features (X) and the target variable (y).
Core Functions
validate_data
def validate_data(
    input_filename: str,
    config_filename: str,
    holidays_filename: Optional[str],
) -> tuple[dict, dict, pd.DataFrame, InputData, DataToFit, pd.DataFrame]:
Loads, parses, converts, and performs initial validation checks on the input data.
This function orchestrates several steps from other prepro modules:
1. Loads configuration using config.load_config.
2. Parses raw data using decode.parse_csv_generic.
3. Performs initial processing (filtering, Prophet, etc.) using decode._parse_csv_shared.
4. Transforms the parsed data into an InputData object using convert.transform_input_generic.
5. Converts the InputData into a DataToFit object using DataToFit.from_input_data.
6. Runs validation checks from abacus.diagnostics.input_validator (duplicate columns, NaNs, date format, column variance) on the processed data.
Parameters:
- input_filename (str): Path to the input data file (CSV or Excel).
- config_filename (str): Path to the YAML configuration file.
- holidays_filename (Optional[str]): Path to the optional holidays file (CSV or Excel).
Returns:
tuple: A tuple containing:
- config_raw (dict): Raw configuration loaded from file.
- config (dict): Parsed configuration dictionary.
- processed_data (pd.DataFrame): DataFrame after initial parsing and processing by _parse_csv_shared.
- input_data (InputData): The structured InputData object.
- data_to_fit (DataToFit): The DataToFit object ready for scaling/modelling.
- per_observation_df (pd.DataFrame): DataFrame representation of the unscaled DataToFit object.
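A minimal usage sketch follows. The import path (abacus.prepro) and the file paths are assumptions for illustration; the unpacking order matches the tuple documented above.
from abacus.prepro import validate_data  # assumed import path

(
    config_raw,
    config,
    processed_data,
    input_data,
    data_to_fit,
    per_observation_df,
) = validate_data(
    input_filename="data/input.csv",   # illustrative path
    config_filename="config.yaml",     # illustrative path
    holidays_filename=None,            # holidays file is optional
)

print(processed_data.shape)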
split_data
def split_data(
    input_data_processed: pd.DataFrame, config: dict
) -> tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
Splits the processed input DataFrame into training and testing sets for features (X) and target (y).
The split is performed based on the train_test_ratio specified in the config dictionary.
Parameters:
- input_data_processed (pd.DataFrame): The DataFrame containing the processed data (features and target).
- config (dict): The configuration dictionary, checked for target_col and train_test_ratio.
Returns:
tuple: A tuple containing:
- X (pd.DataFrame): All features (full dataset).
- y (pd.Series): Target variable (full dataset).
- X_train (pd.DataFrame): Training set features.
- y_train (pd.Series): Training set target.
- X_test (pd.DataFrame): Testing set features (empty if train_test_ratio is 1.0).
- y_test (pd.Series): Testing set target (empty if train_test_ratio is 1.0).
Raises:
ValueError: If config['train_test_ratio'] is outside the range [0, 1].
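A sketch of calling split_data on the frame returned by validate_data above. The config keys match those documented here; the target column name and ratio are illustrative values.
X, y, X_train, y_train, X_test, y_test = split_data(
    input_data_processed=processed_data,
    config={"target_col": "sales", "train_test_ratio": 0.8},  # illustrative values
)
# X_test / y_test would be empty if train_test_ratio were 1.0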
Preprocessing Decorators
These decorators are used to tag methods within model or preprocessing classes, indicating whether they apply to features (X) or the target (y). This allows for automated discovery and application of preprocessing steps.
preprocessing_method_X
def preprocessing_method_X(method):
Decorator to mark a method as a preprocessing step for feature data (X). Adds _tags['preprocessing_X'] = True to the decorated method.
preprocessing_method_y
def preprocessing_method_y(method):
Decorator to mark a method as a preprocessing step for the target variable (y). Adds _tags['preprocessing_y'] = True to the decorated method.
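For context, a minimal sketch of how such a tagging decorator and the corresponding discovery step could work. The find_preprocessing_methods helper is illustrative only and not part of the documented API.
import inspect

def preprocessing_method_X(method):
    # Attach a tag dict so the method can later be discovered as an X step.
    if not hasattr(method, "_tags"):
        method._tags = {}
    method._tags["preprocessing_X"] = True
    return method

def find_preprocessing_methods(obj, tag):
    # Collect bound methods on obj whose _tags mark them with the given tag.
    return [
        member
        for _, member in inspect.getmembers(obj, predicate=inspect.ismethod)
        if getattr(member, "_tags", {}).get(tag)
    ]

# e.g. apply every tagged X step in turn (illustrative usage):
# for step in find_preprocessing_methods(model, "preprocessing_X"):
#     X = step(X)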
Scaling Classes
These classes define specific scaling operations and store the fitted scaler object. They are likely intended to be used as mixins or base classes for models that require these scaling steps. They use custom implementations of scalers (MaxAbsScaler, StandardScaler) defined within this module.
MaxAbsScaleTarget
class MaxAbsScaleTarget:
    # ... implementation ...

    @preprocessing_method_y
    def max_abs_scale_target_data(self, data):
        # ... implementation ...
Provides a method max_abs_scale_target_data (tagged for y) that scales the target variable using MaxAbsScaler. The fitted scaler pipeline is stored in self.target_transformer.
MaxAbsScaleChannels
class MaxAbsScaleChannels:
    # ... implementation ...

    @preprocessing_method_X
    def max_abs_scale_channel_data(self, data):
        # ... implementation ...
Provides a method max_abs_scale_channel_data (tagged for X) that scales specified channel_columns using MaxAbsScaler. The fitted scaler pipeline is stored in self.channel_transformer. Requires self.channel_columns to be defined.
StandardizeControls
class StandardizeControls:
    # ... implementation ...

    @preprocessing_method_X
    def standardize_control_data(self, data):
        # ... implementation ...
Provides a method standardize_control_data (tagged for X) that standardises specified control_columns using StandardScaler. The fitted scaler pipeline is stored in self.control_transformer. Requires self.control_columns to be defined (typically during initialisation).
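As an illustration of the mixin pattern described above, a model class might combine these scaling classes and define the required column attributes. The class name, constructor, and column names below are hypothetical.
class MyMediaModel(MaxAbsScaleTarget, MaxAbsScaleChannels, StandardizeControls):
    def __init__(self, channel_columns, control_columns):
        # Attributes required by the channel and control scaling mixins.
        self.channel_columns = channel_columns
        self.control_columns = control_columns

model = MyMediaModel(
    channel_columns=["tv_spend", "search_spend"],  # illustrative column names
    control_columns=["price", "promo_flag"],
)
# Tagged methods such as max_abs_scale_channel_data(data) and
# standardize_control_data(data) can then be discovered and applied to the feature frame.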
Custom Scaler/Pipeline Implementations
(Note: This module includes custom, simplified implementations of Pipeline, MaxAbsScaler, and StandardScaler. These appear intended for internal use or testing, potentially to minimize direct dependencies on scikit-learn in certain contexts, although the custom scalers themselves import NumPy.)
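For orientation, a simplified MaxAbsScaler along these lines might look as follows. This is an illustrative sketch, not the module's actual code.
import numpy as np

class MaxAbsScaler:
    # Scale each column by its maximum absolute value.
    # Expects X of shape (n_samples, n_features).
    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.max_abs_ = np.abs(X).max(axis=0)
        self.max_abs_[self.max_abs_ == 0] = 1.0  # avoid division by zero
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) / self.max_abs_

    def fit_transform(self, X):
        return self.fit(X).transform(X)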