Abacus Preprocessing: Outliers (outliers.py)

This module provides functions for identifying and handling potential outliers in an InputData object before model fitting. It can report potential outliers and create a modified copy of the dataset in which specified outliers are replaced using a chosen method.

Functions

print_outliers

def print_outliers(input_data: InputData, output_dir: str, suffix: str) -> None:

Identifies and reports potential outliers for each media channel, extra feature, and the target variable within the provided InputData.

Here an outlier is simply any data point more than 2 standard deviations from the mean of its variable. The function writes the report to a text file named outliers_{suffix}.txt in the specified output_dir.

The report for each variable includes:

  • The variable name.

  • Mean and standard deviation.

  • A list of potential outliers identified by the standard deviation method, showing their index, value, and number of standard deviations from the mean.

  • The 10 smallest values and their indices.

  • The 10 largest values and their indices.
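The standard deviation rule and the smallest/largest listings described above can be sketched as follows. This is a minimal illustration with NumPy, not the module's actual code: the helper names find_outliers_2sd and smallest_and_largest are hypothetical, and using the population standard deviation (NumPy's default, ddof=0) is an assumption, since the documentation does not say which estimator is used.

```python
import numpy as np


def find_outliers_2sd(values, threshold=2.0):
    """Return (index, value, n_std) for points beyond `threshold` std devs.

    Hypothetical helper for illustration; not part of outliers.py.
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # assumption: population std (ddof=0)
    if std == 0:
        return []  # a constant series has no outliers under this rule
    n_std = (values - mean) / std
    return [
        (int(i), float(values[i]), float(n_std[i]))
        for i in np.flatnonzero(np.abs(n_std) > threshold)
    ]


def smallest_and_largest(values, k=10):
    """Return the indices of the k smallest and the k largest values."""
    order = np.argsort(np.asarray(values, dtype=float), kind="stable")
    return order[:k].tolist(), order[::-1][:k].tolist()


# Made-up series with one obvious spike at index 7
weekly_spend = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = find_outliers_2sd(weekly_spend)
smallest, largest = smallest_and_largest(weekly_spend, k=3)
```

Note that a single large spike inflates both the mean and the standard deviation, so a 2 standard deviation rule can miss smaller anomalies in the same series. That may be one reason the report also lists the smallest and largest values for manual inspection.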

Parameters:

  • input_data (InputData): The InputData object containing the data to analyse.

  • output_dir (str): The path to the directory where the output report file will be saved.

  • suffix (str): A suffix to append to the output filename (e.g., "raw" or "processed") to distinguish reports generated at different stages.

Returns:

  • None: The function writes the report to a file.


remove_outliers_from_input

def remove_outliers_from_input(
    input_data: InputData,
    media_data_outliers: dict,
    extra_features_outliers: dict,
    target_outliers: list,
    removal_type: str
) -> InputData:

Creates a new InputData object in which the specified outliers in the media, extra feature, and target data are replaced with a value computed according to removal_type.

This function takes dictionaries mapping variable names to lists of outlier indices and a list of outlier indices for the target. It then uses the InputData.clone_with_data_edits mechanism, passing a specialized editor function (_replace_outlier_editor_func) to perform the replacements on copies of the data arrays.
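The clone-then-edit pattern described above might look roughly like the sketch below. Everything here is a stand-in for illustration: ToyInputData, clone_with_edits, and make_editor are hypothetical names, the array shapes are assumed, and the real signatures of InputData.clone_with_data_edits and _replace_outlier_editor_func are not shown in this documentation. A single fixed fill value is used to keep the sketch short, whereas the real function computes a replacement value per variable.

```python
from dataclasses import dataclass, replace

import numpy as np


@dataclass(frozen=True)
class ToyInputData:
    """Hypothetical stand-in for InputData; only a few fields are modelled."""
    media_names: tuple
    media_data: np.ndarray   # assumed shape: (n_time_points, n_channels)
    target_data: np.ndarray  # assumed shape: (n_time_points,)


def clone_with_edits(data, editor):
    """Copy the arrays, let `editor` mutate the copies, return a new object."""
    media = data.media_data.copy()
    target = data.target_data.copy()
    editor(media, target)
    return replace(data, media_data=media, target_data=target)


def make_editor(media_names, media_outliers, target_outliers, fill_value):
    """Build an editor that overwrites the listed time points."""
    def editor(media, target):
        for name, indices in media_outliers.items():
            column = media_names.index(name)  # channel name -> column index
            media[indices, column] = fill_value
        target[target_outliers] = fill_value
    return editor


original = ToyInputData(
    media_names=("tv", "search"),
    media_data=np.array([[1.0, 2.0], [900.0, 2.5], [1.2, 2.2]]),
    target_data=np.array([10.0, 11.0, 500.0]),
)
editor = make_editor(original.media_names, {"tv": [1]}, [2], fill_value=0.0)
cleaned = clone_with_edits(original, editor)
```

Because the editor only ever sees copies, the original object keeps its data, which matches the documented guarantee that input_data remains unchanged.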

Parameters:

  • input_data (InputData): The original InputData instance containing the outliers.

  • media_data_outliers (dict): A dictionary where keys are media channel names (from input_data.media_names) and values are lists of integer indices representing the time points identified as outliers for that channel.

  • extra_features_outliers (dict): A dictionary where keys are extra feature names (from input_data.extra_features_names) and values are lists of integer indices representing outlier time points for that feature.

  • target_outliers (list): A list of integer indices representing outlier time points for the target variable (input_data.target_data).

  • removal_type (str): A constant specifying the method for calculating the replacement value. Must be one of:

    • cnst.REMOVE_OUTLIERS_TYPE_REPLACE_WITH_TRIMMED_MEAN: Replace outliers with the 10% trimmed mean of the non-outlier data for that variable.

    • cnst.REMOVE_OUTLIERS_TYPE_REPLACE_WITH_P10_VALUE: Replace outliers with the 10th percentile value of the non-outlier data for that variable.
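As a rough illustration of the two replacement methods, the sketch below computes each value from the non-outlier points of a single variable. It is not the module's implementation: the string constants are hypothetical stand-ins for the cnst values, the function names are invented, and "10% trimmed mean" is assumed to mean dropping 10% of the points from each tail, which the documentation does not confirm.

```python
import numpy as np

# Hypothetical stand-ins for the cnst.REMOVE_OUTLIERS_TYPE_* constants
TRIMMED_MEAN = "trimmed_mean"
P10_VALUE = "p10"


def replacement_value(values, outlier_indices, removal_type):
    """Compute the replacement from the non-outlier points only."""
    assert removal_type in (TRIMMED_MEAN, P10_VALUE), removal_type
    values = np.asarray(values, dtype=float)
    keep = np.ones(values.shape[0], dtype=bool)
    keep[outlier_indices] = False
    clean = np.sort(values[keep])
    if removal_type == TRIMMED_MEAN:
        cut = int(0.1 * clean.size)  # assumption: trim 10% from each tail
        return float(clean[cut:clean.size - cut].mean())
    return float(np.percentile(clean, 10))


def replace_outliers(values, outlier_indices, removal_type):
    """Return a copy of `values` with the outliers overwritten."""
    result = np.array(values, dtype=float)  # np.array copies by default
    result[outlier_indices] = replacement_value(
        values, outlier_indices, removal_type
    )
    return result


# Made-up series: index 10 is the outlier
spend = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100], dtype=float)
trimmed = replacement_value(spend, [10], TRIMMED_MEAN)
p10 = replacement_value(spend, [10], P10_VALUE)
replaced = replace_outliers(spend, [10], TRIMMED_MEAN)
```

The two methods behave quite differently: the trimmed mean substitutes a typical central value, while the 10th percentile substitutes a low value, which may be the more conservative choice when outliers are spikes.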

Returns:

  • InputData: A new InputData instance with the specified outliers replaced. The original input_data object remains unchanged.

Raises:

  • AssertionError: If removal_type is not one of the recognized constants.


(Private helper functions _compute_outlier_replacement_value and _replace_outlier_editor_func are used internally to calculate the replacement value and perform the data editing within the cloning process, respectively.)