# Abacus Preprocessing: Preprocessing (`prepro.py`)

This module acts as a central point for orchestrating data preprocessing steps in Abacus. It includes functions for initial data validation and splitting, decorators for identifying preprocessing methods, and classes defining common scaling transformations for features (X) and the target variable (y).

::: abacus.prepro.prepro

## Core Functions

### `validate_data`

```python
def validate_data(
    input_filename: str,
    config_filename: str,
    holidays_filename: Optional[str],
) -> tuple[dict, dict, pd.DataFrame, InputData, DataToFit, pd.DataFrame]:
```

Loads, parses, converts, and performs initial validation checks on the input data.

This function orchestrates several steps from other `prepro` modules:

1. Loads configuration using `config.load_config`.
2. Parses raw data using `decode.parse_csv_generic`.
3. Performs initial processing (filtering, Prophet, etc.) using `decode._parse_csv_shared`.
4. Transforms the parsed data into an `InputData` object using `convert.transform_input_generic`.
5. Converts the `InputData` into a `DataToFit` object using `DataToFit.from_input_data`.
6. Runs validation checks from `abacus.diagnostics.input_validator` (duplicate columns, NaNs, date format, column variance) on the processed data.

**Parameters:**

- `input_filename` (`str`): Path to the input data file (CSV or Excel).
- `config_filename` (`str`): Path to the YAML configuration file.
- `holidays_filename` (`Optional[str]`): Path to the optional holidays file (CSV or Excel).

**Returns:**

- `tuple`: A tuple containing:
    - `config_raw` (`dict`): Raw configuration loaded from file.
    - `config` (`dict`): Parsed configuration dictionary.
    - `processed_data` (`pd.DataFrame`): DataFrame after initial parsing and processing by `_parse_csv_shared`.
    - `input_data` (`InputData`): The structured `InputData` object.
    - `data_to_fit` (`DataToFit`): The `DataToFit` object ready for scaling/modelling.
    - `per_observation_df` (`pd.DataFrame`): DataFrame representation of the unscaled `DataToFit` object.

---

### `split_data`

```python
def split_data(
    input_data_processed: pd.DataFrame, config: dict
) -> tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
```

Splits the processed input DataFrame into training and testing sets for features (X) and target (y). The split is performed based on the `train_test_ratio` specified in the `config` dictionary. A usage sketch combining `validate_data` and `split_data` follows this section.

**Parameters:**

- `input_data_processed` (`pd.DataFrame`): The DataFrame containing the processed data (features and target).
- `config` (`dict`): The configuration dictionary, checked for `target_col` and `train_test_ratio`.

**Returns:**

- `tuple`: A tuple containing:
    - `X` (`pd.DataFrame`): All features (full dataset).
    - `y` (`pd.Series`): Target variable (full dataset).
    - `X_train` (`pd.DataFrame`): Training set features.
    - `y_train` (`pd.Series`): Training set target.
    - `X_test` (`pd.DataFrame`): Testing set features (empty if `train_test_ratio` is 1.0).
    - `y_test` (`pd.Series`): Testing set target (empty if `train_test_ratio` is 1.0).

**Raises:**

- `ValueError`: If `config['train_test_ratio']` is outside the range [0, 1].
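The two functions are typically chained: the processed DataFrame returned by `validate_data` feeds straight into `split_data`. The sketch below illustrates that flow under the return orders documented above; the file paths are placeholders, and the `abacus.prepro.prepro` import path follows the module reference used on this page rather than a confirmed public API.

```python
# Minimal sketch of the validate -> split flow; paths are placeholders.
from abacus.prepro.prepro import validate_data, split_data

# validate_data loads, parses, converts, and validates the raw input.
config_raw, config, processed_data, input_data, data_to_fit, per_observation_df = validate_data(
    input_filename="data/input.csv",              # CSV or Excel input
    config_filename="config/model_config.yaml",   # YAML configuration
    holidays_filename=None,                       # holidays file is optional
)

# split_data divides the processed DataFrame according to config["train_test_ratio"].
X, y, X_train, y_train, X_test, y_test = split_data(processed_data, config)

print(f"Training rows: {len(X_train)}, test rows: {len(X_test)}")
```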
## Preprocessing Decorators

These decorators are used to tag methods within model or preprocessing classes, indicating whether they apply to features (X) or the target (y). This allows for automated discovery and application of preprocessing steps; a sketch of this pattern appears at the end of this page.

### `preprocessing_method_X`

```python
def preprocessing_method_X(method):
```

Decorator to mark a method as a preprocessing step for feature data (X). Adds `_tags['preprocessing_X'] = True` to the decorated method.

---

### `preprocessing_method_y`

```python
def preprocessing_method_y(method):
```

Decorator to mark a method as a preprocessing step for the target variable (y). Adds `_tags['preprocessing_y'] = True` to the decorated method.

## Scaling Classes

These classes define specific scaling operations and store the fitted scaler object. They are likely intended to be used as mixins or base classes for models that require these scaling steps. They use custom implementations of scalers (`MaxAbsScaler`, `StandardScaler`) defined within this module.

### `MaxAbsScaleTarget`

```python
class MaxAbsScaleTarget:
    # ... implementation ...

    @preprocessing_method_y
    def max_abs_scale_target_data(self, data):
        # ... implementation ...
```

Provides a method `max_abs_scale_target_data` (tagged for `y`) that scales the target variable using `MaxAbsScaler`. The fitted scaler pipeline is stored in `self.target_transformer`.

---

### `MaxAbsScaleChannels`

```python
class MaxAbsScaleChannels:
    # ... implementation ...

    @preprocessing_method_X
    def max_abs_scale_channel_data(self, data):
        # ... implementation ...
```

Provides a method `max_abs_scale_channel_data` (tagged for `X`) that scales specified `channel_columns` using `MaxAbsScaler`. The fitted scaler pipeline is stored in `self.channel_transformer`. Requires `self.channel_columns` to be defined.

---

### `StandardizeControls`

```python
class StandardizeControls:
    # ... implementation ...

    @preprocessing_method_X
    def standardize_control_data(self, data):
        # ... implementation ...
```

Provides a method `standardize_control_data` (tagged for `X`) that standardises specified `control_columns` using `StandardScaler`. The fitted scaler pipeline is stored in `self.control_transformer`. Requires `self.control_columns` to be defined (typically during initialisation).

## Custom Scaler/Pipeline Implementations

*(Note: This module includes custom, simplified implementations of `Pipeline`, `MaxAbsScaler`, and `StandardScaler`. These appear intended for internal use or testing, potentially to minimise direct dependencies on scikit-learn in certain contexts, although the custom scalers themselves import NumPy.)*
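To make the tagging-and-discovery pattern concrete, the following is a minimal, self-contained sketch consistent with the behaviour documented above: the decorator attaches a `_tags` dictionary to the method, and a mixin's tagged method scales the target. It is illustrative only and does not reproduce the module's actual code; `DemoModel`, `preprocess_y`, and `target_scale_` are invented names for the example, and the real `MaxAbsScaleTarget` stores a fitted scaler pipeline in `self.target_transformer`.

```python
# Illustrative sketch of the tagging pattern, not the module's actual code.
import numpy as np


def preprocessing_method_y(method):
    """Mark `method` as a preprocessing step for the target variable (y)."""
    tags = getattr(method, "_tags", {})
    tags["preprocessing_y"] = True
    method._tags = tags
    return method


class MaxAbsScaleTarget:
    """Mixin-style class whose tagged method max-abs-scales the target."""

    @preprocessing_method_y
    def max_abs_scale_target_data(self, data):
        # Keep only the scale factor here; the real class stores a fitted
        # scaler pipeline in `self.target_transformer`.
        self.target_scale_ = np.max(np.abs(data)) or 1.0
        return data / self.target_scale_


class DemoModel(MaxAbsScaleTarget):
    def preprocess_y(self, y):
        # Discover and apply every method tagged with `preprocessing_y`.
        for name in dir(self):
            method = getattr(self, name)
            if callable(method) and getattr(method, "_tags", {}).get("preprocessing_y"):
                y = method(y)
        return y


model = DemoModel()
scaled = model.preprocess_y(np.array([10.0, 25.0, 50.0]))
print(scaled)  # -> [0.2 0.5 1. ] after max-abs scaling
```

Storing the tags on the function object keeps discovery cheap: a model only needs to walk its attributes and check `_tags`, so new preprocessing steps can be added by decorating a method on any mixin in the class hierarchy.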