Abacus Preprocessing: Outliers (outliers.py)
This module provides functions for identifying and handling potential outliers within the InputData object before model fitting. It allows for reporting potential outliers and creating a modified dataset where outliers are replaced according to specified methods.
Functions
print_outliers
def print_outliers(input_data: InputData, output_dir: str, suffix: str) -> None:
Identifies and reports potential outliers for each media channel, extra feature, and the target variable within the provided InputData.
Outliers are defined here simply as data points falling more than 2 standard deviations away from the mean of that specific variable. The function writes a report to a text file (outliers_{suffix}.txt) in the specified output_dir.
The report for each variable includes:
The variable name.
Mean and standard deviation.
A list of potential outliers identified by the standard deviation method, showing their index, value, and number of standard deviations from the mean.
The 10 smallest values and their indices.
The 10 largest values and their indices.
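The detection rule and the extremes listed in the report can be sketched with NumPy as follows. This is a minimal illustration, not the module's actual code: the helper names find_outliers and extremes are hypothetical, and the use of the population standard deviation (ddof=0) is an assumption, since the documentation does not say which estimator is used.

```python
import numpy as np


def find_outliers(values, threshold=2.0):
    """Return (mean, std, outliers) for a 1-D series.

    Each outlier is a tuple (index, value, signed number of standard
    deviations from the mean).
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # assumption: population std (ddof=0)
    if std == 0:
        # A constant series has no spread, so nothing can be flagged.
        return mean, std, []
    z = (values - mean) / std
    flagged = np.flatnonzero(np.abs(z) > threshold)
    outliers = [(int(i), float(values[i]), float(z[i])) for i in flagged]
    return mean, std, outliers


def extremes(values, k=10):
    """Return the k smallest and k largest values as (index, value) pairs."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    smallest = [(int(i), float(values[i])) for i in order[:k]]
    largest = [(int(i), float(values[i])) for i in order[::-1][:k]]
    return smallest, largest


series = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
mean, std, outliers = find_outliers(series)
smallest, largest = extremes(series, k=3)
```

In this made-up series only the final point (value 50, index 9) lies more than 2 standard deviations from the mean, so it is the only entry in outliers.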
Parameters:
input_data (InputData): The InputData object containing the data to analyse.
output_dir (str): The path to the directory where the output report file will be saved.
suffix (str): A suffix to append to the output filename (e.g., "raw" or "processed") to distinguish reports generated at different stages.
Returns:
None: The function writes the report to a file.
remove_outliers_from_input
def remove_outliers_from_input(
input_data: InputData,
media_data_outliers: dict,
extra_features_outliers: dict,
target_outliers: list,
removal_type: str
) -> InputData:
Creates a new InputData object where specified outliers in media, extra features, and target data are replaced with a calculated value based on the removal_type.
This function takes dictionaries mapping variable names to lists of outlier indices and a list of outlier indices for the target. It then uses the InputData.clone_with_data_edits mechanism, passing a specialized editor function (_replace_outlier_editor_func) to perform the replacements on copies of the data arrays.
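To illustrate the shape of these arguments, the sketch below builds a name-to-indices dictionary from a 2-D array. Everything here is an assumption made for illustration: the helper outlier_indices_by_name is hypothetical, the data is assumed to be laid out as time points by channels, and the indices are chosen with the same 2-standard-deviation rule used by print_outliers, although they could equally come from a manual review of the report. The documentation does not say whether channels without outliers must appear as keys; this sketch includes them with empty lists.

```python
import numpy as np


def outlier_indices_by_name(data, names, threshold=2.0):
    """Map each column name to the time indices flagged as outliers."""
    data = np.asarray(data, dtype=float)
    result = {}
    for col, name in enumerate(names):
        series = data[:, col]
        std = series.std()  # assumption: population std (ddof=0)
        if std == 0:
            result[name] = []  # constant column: nothing to flag
            continue
        z = np.abs(series - series.mean()) / std
        result[name] = [int(i) for i in np.flatnonzero(z > threshold)]
    return result


# Made-up data: 8 time points, 2 hypothetical channels.
media = np.array([
    [5, 1], [5, 2], [5, 3], [5, 4],
    [5, 5], [5, 6], [5, 7], [100, 8],
])
media_data_outliers = outlier_indices_by_name(media, ["tv", "radio"])
```

The resulting dictionary has one list of integer time indices per channel name, which is the structure described for media_data_outliers and extra_features_outliers; target_outliers is a single list of the same kind.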
Parameters:
input_data (InputData): The original InputData instance containing the outliers.
media_data_outliers (dict): A dictionary where keys are media channel names (from input_data.media_names) and values are lists of integer indices representing the time points identified as outliers for that channel.
extra_features_outliers (dict): A dictionary where keys are extra feature names (from input_data.extra_features_names) and values are lists of integer indices representing outlier time points for that feature.
target_outliers (list): A list of integer indices representing outlier time points for the target variable (input_data.target_data).
removal_type (str): A constant specifying the method for calculating the replacement value. Must be one of:
cnst.REMOVE_OUTLIERS_TYPE_REPLACE_WITH_TRIMMED_MEAN: Replace outliers with the 10% trimmed mean of the non-outlier data for that variable.
cnst.REMOVE_OUTLIERS_TYPE_REPLACE_WITH_P10_VALUE: Replace outliers with the 10th percentile value of the non-outlier data for that variable.
Returns:
InputData: A new InputData instance with the specified outliers replaced. The original input_data object remains unchanged.
Raises:
AssertionError: If removal_type is not one of the recognized constants.
(Private helper functions _compute_outlier_replacement_value and _replace_outlier_editor_func are used internally to calculate the replacement value and perform the data editing within the cloning process, respectively.)
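The two replacement strategies can be sketched for a single variable as shown below. This is an illustration under stated assumptions rather than the module's implementation: replacement_value and replace_outliers are hypothetical names, the string constants are stand-ins for the cnst values (whose actual contents are not documented here), the 10% trimmed mean is assumed to cut 10% of the observations from each tail, and the percentile uses NumPy's default linear interpolation.

```python
import numpy as np

# Hypothetical stand-ins for the cnst.REMOVE_OUTLIERS_TYPE_* constants.
TRIMMED_MEAN = "replace_with_trimmed_mean"
P10_VALUE = "replace_with_p10_value"


def replacement_value(values, outlier_indices, removal_type):
    """Compute the replacement value from the non-outlier points only."""
    values = np.asarray(values, dtype=float)
    keep = np.ones(values.shape[0], dtype=bool)
    keep[np.asarray(outlier_indices, dtype=int)] = False
    clean = np.sort(values[keep])
    if removal_type == TRIMMED_MEAN:
        # Assumption: drop 10% of the observations from each tail.
        k = int(0.10 * clean.size)
        return float(clean[k:clean.size - k].mean())
    if removal_type == P10_VALUE:
        return float(np.percentile(clean, 10))
    raise AssertionError(f"unrecognized removal_type: {removal_type!r}")


def replace_outliers(values, outlier_indices, removal_type):
    """Return a copy of values with the outliers replaced; the input is untouched."""
    values = np.asarray(values, dtype=float)
    edited = values.copy()
    edited[np.asarray(outlier_indices, dtype=int)] = replacement_value(
        values, outlier_indices, removal_type
    )
    return edited


values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000], dtype=float)
by_trimmed_mean = replace_outliers(values, [10], TRIMMED_MEAN)
by_p10 = replace_outliers(values, [10], P10_VALUE)
```

Working on a copy mirrors the documented behaviour that the original input_data object remains unchanged, and the two strategies give noticeably different replacements: the trimmed mean lands near the centre of the clean data, while the 10th percentile lands near its low end.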