Abacus Preprocessing (prepro.py)

This module is the central orchestration point for data preprocessing in Abacus. It provides functions for initial data validation and splitting, decorators for tagging preprocessing methods, and classes defining common scaling transformations for the features (X) and the target variable (y).

Core Functions

validate_data

def validate_data(
    input_filename: str,
    config_filename: str,
    holidays_filename: Optional[str],
) -> tuple[dict, dict, pd.DataFrame, InputData, DataToFit, pd.DataFrame]:

Loads, parses, converts, and performs initial validation checks on the input data.

This function orchestrates several steps from other prepro modules:

  1. Loads configuration using config.load_config.

  2. Parses raw data using decode.parse_csv_generic.

  3. Performs initial processing (filtering, Prophet, etc.) using decode._parse_csv_shared.

  4. Transforms the parsed data into an InputData object using convert.transform_input_generic.

  5. Converts the InputData into a DataToFit object using DataToFit.from_input_data.

  6. Runs validation checks from abacus.diagnostics.input_validator (duplicate columns, NaNs, date format, column variance) on the processed data.

Parameters:

  • input_filename (str): Path to the input data file (CSV or Excel).

  • config_filename (str): Path to the YAML configuration file.

  • holidays_filename (Optional[str]): Path to the optional holidays file (CSV or Excel).

Returns:

  • tuple: A tuple containing:

    • config_raw (dict): Raw configuration loaded from file.

    • config (dict): Parsed configuration dictionary.

    • processed_data (pd.DataFrame): DataFrame after initial parsing and processing by _parse_csv_shared.

    • input_data (InputData): The structured InputData object.

    • data_to_fit (DataToFit): The DataToFit object ready for scaling/modelling.

    • per_observation_df (pd.DataFrame): DataFrame representation of the unscaled DataToFit object.
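
A minimal usage sketch is shown below. The import path and file names are illustrative assumptions, not values taken from the module itself:

# Illustrative only: the import path and the file paths are assumptions.
from abacus.prepro.prepro import validate_data

(
    config_raw,
    config,
    processed_data,
    input_data,
    data_to_fit,
    per_observation_df,
) = validate_data(
    input_filename="data/input.csv",        # CSV or Excel
    config_filename="config/model.yaml",    # YAML configuration
    holidays_filename=None,                 # the holidays file is optional
)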


split_data

def split_data(
    input_data_processed: pd.DataFrame, config: dict
) -> tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:

Splits the processed input DataFrame into training and testing sets for features (X) and target (y).

The split is performed based on the train_test_ratio specified in the config dictionary.

Parameters:

  • input_data_processed (pd.DataFrame): The DataFrame containing the processed data (features and target).

  • config (dict): The configuration dictionary, checked for target_col and train_test_ratio.

Returns:

  • tuple: A tuple containing:

    • X (pd.DataFrame): All features (full dataset).

    • y (pd.Series): Target variable (full dataset).

    • X_train (pd.DataFrame): Training set features.

    • y_train (pd.Series): Training set target.

    • X_test (pd.DataFrame): Testing set features (empty if train_test_ratio is 1.0).

    • y_test (pd.Series): Testing set target (empty if train_test_ratio is 1.0).

Raises:

  • ValueError: If config['train_test_ratio'] is outside the range [0, 1].
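
A rough usage sketch follows; the column names and config values are invented for illustration, and the real configuration typically carries additional keys:

import pandas as pd

# split_data is assumed to be imported from this module.
df = pd.DataFrame(
    {
        "tv_spend": [100, 120, 90, 110, 95],
        "search_spend": [40, 45, 38, 50, 42],
        "sales": [1000, 1100, 950, 1080, 990],
    }
)
config = {"target_col": "sales", "train_test_ratio": 0.8}

X, y, X_train, y_train, X_test, y_test = split_data(df, config)

# With train_test_ratio set to 1.0, X_test and y_test come back empty.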

Preprocessing Decorators

These decorators are used to tag methods within model or preprocessing classes, indicating whether they apply to features (X) or the target (y). This allows for automated discovery and application of preprocessing steps.

preprocessing_method_X

def preprocessing_method_X(method):

Decorator to mark a method as a preprocessing step for feature data (X). Adds _tags['preprocessing_X'] = True to the decorated method.


preprocessing_method_y

def preprocessing_method_y(method):

Decorator to mark a method as a preprocessing step for the target variable (y). Adds _tags['preprocessing_y'] = True to the decorated method.
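
A minimal sketch of what both decorators could look like, based on the tagging behaviour described above (the actual implementation in prepro.py may differ):

def preprocessing_method_X(method):
    # Tag the function so model classes can discover feature (X) steps.
    if not hasattr(method, "_tags"):
        method._tags = {}
    method._tags["preprocessing_X"] = True
    return method


def preprocessing_method_y(method):
    # Tag the function so model classes can discover target (y) steps.
    if not hasattr(method, "_tags"):
        method._tags = {}
    method._tags["preprocessing_y"] = True
    return method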

Scaling Classes

These classes define specific scaling operations and store the fitted scaler object. They are likely intended to be used as mixins or base classes for models that require these scaling steps. They use custom implementations of scalers (MaxAbsScaler, StandardScaler) defined within this module.

MaxAbsScaleTarget

class MaxAbsScaleTarget:
    # ... implementation ...
    @preprocessing_method_y
    def max_abs_scale_target_data(self, data):
        # ... implementation ...

Provides a method max_abs_scale_target_data (tagged for y) that scales the target variable using MaxAbsScaler. The fitted scaler pipeline is stored in self.target_transformer.
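
A hedged sketch of how such a method might be structured, assuming the module's Pipeline and MaxAbsScaler expose a scikit-learn-like constructor and fit_transform; the actual implementation may differ:

import numpy as np

class MaxAbsScaleTarget:
    @preprocessing_method_y
    def max_abs_scale_target_data(self, data):
        # Fit a max-abs scaler on the target and keep the fitted pipeline so
        # predictions can later be inverse-transformed to the original scale.
        # Assumes Pipeline/MaxAbsScaler mirror the scikit-learn API.
        y = np.asarray(data, dtype=float).reshape(-1, 1)
        self.target_transformer = Pipeline([("max_abs", MaxAbsScaler())])
        return self.target_transformer.fit_transform(y).ravel()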


MaxAbsScaleChannels

class MaxAbsScaleChannels:
    # ... implementation ...
    @preprocessing_method_X
    def max_abs_scale_channel_data(self, data):
        # ... implementation ...

Provides a method max_abs_scale_channel_data (tagged for X) that scales specified channel_columns using MaxAbsScaler. The fitted scaler pipeline is stored in self.channel_transformer. Requires self.channel_columns to be defined.


StandardizeControls

class StandardizeControls:
    # ... implementation ...
    @preprocessing_method_X
    def standardize_control_data(self, data):
        # ... implementation ...

Provides a method standardize_control_data (tagged for X) that standardises specified control_columns using StandardScaler. The fitted scaler pipeline is stored in self.control_transformer. Requires self.control_columns to be defined (typically during initialisation).
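
Taken together, these mixins and the decorators above let a model class discover and apply its preprocessing steps automatically. The sketch below is illustrative only: the class name, constructor, and the alphabetical application order implied by dir() are assumptions, not part of the module:

class DemoModel(MaxAbsScaleTarget, MaxAbsScaleChannels, StandardizeControls):
    def __init__(self, channel_columns, control_columns):
        self.channel_columns = channel_columns
        self.control_columns = control_columns

    def preprocess(self, X, y):
        # Apply every method tagged by the preprocessing decorators. dir()
        # yields names alphabetically; a real model may impose an explicit
        # ordering of its preprocessing steps instead.
        for name in dir(self):
            method = getattr(self, name, None)
            tags = getattr(method, "_tags", {}) if callable(method) else {}
            if tags.get("preprocessing_X"):
                X = method(X)
            elif tags.get("preprocessing_y"):
                y = method(y)
        return X, y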

Custom Scaler/Pipeline Implementations

(Note: This module includes custom, simplified implementations of Pipeline, MaxAbsScaler, and StandardScaler. These appear intended for internal use or testing, potentially to minimise direct dependencies on scikit-learn in certain contexts, although the custom scalers themselves import NumPy.)
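
For orientation, a simplified NumPy-only scaler along these lines might look like the following; this is a sketch of the general shape, not the module's actual code:

import numpy as np

class MaxAbsScaler:
    """Minimal max-abs scaler with a scikit-learn-like fit/transform interface."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.max_abs_ = np.max(np.abs(X), axis=0)
        # Guard against all-zero columns to avoid division by zero.
        self.max_abs_ = np.where(self.max_abs_ == 0, 1.0, self.max_abs_)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) / self.max_abs_

    def fit_transform(self, X):
        return self.fit(X).transform(X)

    def inverse_transform(self, X):
        return np.asarray(X, dtype=float) * self.max_abs_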