# Abacus Preprocessing: Outliers (`outliers.py`)

This module provides functions for identifying and handling potential outliers within the `InputData` object before model fitting. It allows for reporting potential outliers and creating a modified dataset where outliers are replaced according to specified methods.

::: abacus.prepro.outliers

## Functions

### `print_outliers`

```python
def print_outliers(input_data: InputData, output_dir: str, suffix: str) -> None:
```

Identifies and reports potential outliers for each media channel, extra feature, and the target variable within the provided `InputData`. Outliers are defined here simply as data points falling more than 2 standard deviations away from the mean of that specific variable.

The function writes a report to a text file (`outliers_{suffix}.txt`) in the specified `output_dir`. The report for each variable includes:

- The variable name.
- Mean and standard deviation.
- A list of potential outliers identified by the standard deviation method, showing their index, value, and number of standard deviations from the mean.
- The 10 smallest values and their indices.
- The 10 largest values and their indices.

**Parameters:**

- `input_data` (`InputData`): The `InputData` object containing the data to analyse.
- `output_dir` (`str`): The path to the directory where the output report file will be saved.
- `suffix` (`str`): A suffix to append to the output filename (e.g., `"raw"` or `"processed"`) to distinguish reports generated at different stages.

**Returns:**

- `None`: The function writes the report to a file.

---

### `remove_outliers_from_input`

```python
def remove_outliers_from_input(
    input_data: InputData,
    media_data_outliers: dict,
    extra_features_outliers: dict,
    target_outliers: list,
    removal_type: str
) -> InputData:
```

Creates a *new* `InputData` object where specified outliers in media, extra features, and target data are replaced with a calculated value based on the `removal_type`.

This function takes dictionaries mapping variable names to lists of outlier indices and a list of outlier indices for the target. It then uses the `InputData.clone_with_data_edits` mechanism, passing a specialized editor function (`_replace_outlier_editor_func`) to perform the replacements on copies of the data arrays.

**Parameters:**

- `input_data` (`InputData`): The original `InputData` instance containing the outliers.
- `media_data_outliers` (`dict`): A dictionary where keys are media channel names (from `input_data.media_names`) and values are lists of integer indices representing the time points identified as outliers for that channel.
- `extra_features_outliers` (`dict`): A dictionary where keys are extra feature names (from `input_data.extra_features_names`) and values are lists of integer indices representing outlier time points for that feature.
- `target_outliers` (`list`): A list of integer indices representing outlier time points for the target variable (`input_data.target_data`).
- `removal_type` (`str`): A constant specifying the method for calculating the replacement value. Must be one of:
    - `cnst.REMOVE_OUTLIERS_TYPE_REPLACE_WITH_TRIMMED_MEAN`: Replace outliers with the 10% trimmed mean of the non-outlier data for that variable.
    - `cnst.REMOVE_OUTLIERS_TYPE_REPLACE_WITH_P10_VALUE`: Replace outliers with the 10th percentile value of the non-outlier data for that variable.

**Returns:**

- `InputData`: A new `InputData` instance with the specified outliers replaced. The original `input_data` object remains unchanged.

**Raises:**

- `AssertionError`: If `removal_type` is not one of the recognized constants.

---

*(Private helper functions `_compute_outlier_replacement_value` and `_replace_outlier_editor_func` are used internally to calculate the replacement value and perform the data editing within the cloning process, respectively.)*
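
## Illustrative sketch

The two ideas documented above (flagging points more than 2 standard deviations from the mean, then replacing flagged points with a value computed from the non-outlier data) can be sketched on a plain Python list. This is not the module's implementation and does not use `InputData`: the helper names `find_outliers`, `trimmed_mean`, `percentile`, and `replace_outliers` are hypothetical. The sketch also assumes a population standard deviation, a trimmed mean that drops 10% of the points from each tail, and a percentile computed by linear interpolation. The actual helpers may make different choices.

```python
import math
import statistics


def find_outliers(values, threshold=2.0):
    """Return (index, value, z_score) for points more than `threshold`
    standard deviations from the mean (population std is an assumption)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # constant series: nothing can be flagged
    return [
        (i, v, (v - mean) / std)
        for i, v in enumerate(values)
        if abs(v - mean) > threshold * std
    ]


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the points from each tail
    (trimming both tails is an assumption)."""
    ordered = sorted(values)
    cut = int(proportion * len(ordered))
    return statistics.fmean(ordered[cut:len(ordered) - cut])


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def replace_outliers(values, outlier_indices, method="trimmed_mean"):
    """Return a new list with the flagged indices replaced.

    The replacement value is computed from the non-outlier points only,
    and the input list is left untouched.
    """
    flagged = set(outlier_indices)
    kept = [v for i, v in enumerate(values) if i not in flagged]
    if method == "trimmed_mean":
        replacement = trimmed_mean(kept)
    elif method == "p10":
        replacement = percentile(kept, 10)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return [replacement if i in flagged else v for i, v in enumerate(values)]


# Hypothetical weekly series with one obvious spike at index 9.
series = [10, 12, 11, 13, 12, 11, 10, 12, 11, 100, 12, 13]

outlier_indices = [i for i, _, _ in find_outliers(series)]
cleaned_mean = replace_outliers(series, outlier_indices, method="trimmed_mean")
cleaned_p10 = replace_outliers(series, outlier_indices, method="p10")
```

In the real workflow, the indices gathered this way (typically after reviewing the `print_outliers` report) would be grouped per variable into `media_data_outliers`, `extra_features_outliers`, and `target_outliers` before calling `remove_outliers_from_input`.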