# Abacus Preprocessing: Outliers (`outliers.py`)

This module provides functions for identifying and handling potential outliers in an `InputData` object before model fitting. It can write a report of potential outliers and create a modified copy of the dataset in which chosen outliers are replaced using a specified method.

::: abacus.prepro.outliers

## Functions

### `print_outliers`

```python
def print_outliers(input_data: InputData, output_dir: str, suffix: str) -> None:
```

Identifies and reports potential outliers for each media channel, extra feature, and the target variable within the provided `InputData`.

A data point is flagged as a potential outlier when it lies more than 2 standard deviations from the mean of its variable. The function writes its report to a text file (`outliers_{suffix}.txt`) in the specified `output_dir`.

The report for each variable includes:

-   The variable name.
-   Mean and standard deviation.
-   A list of potential outliers identified by the standard deviation method, showing their index, value, and number of standard deviations from the mean.
-   The 10 smallest values and their indices.
-   The 10 largest values and their indices.
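
The 2-standard-deviation rule described above can be sketched in a few lines of NumPy. This is an illustration only, not the module's code: the helper name `find_outliers` and the sample data are hypothetical, and using the population standard deviation (`ddof=0`) is an assumption.

```python
import numpy as np


def find_outliers(values, threshold=2.0):
    """Return (index, value, z_score) for points beyond `threshold` std devs.

    Hypothetical helper; assumes the population standard deviation (ddof=0).
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()
    if std == 0:
        # A constant series has no spread, so nothing can be flagged.
        return []
    z_scores = (values - mean) / std
    flagged = np.flatnonzero(np.abs(z_scores) > threshold)
    return [(int(i), float(values[i]), float(z_scores[i])) for i in flagged]


# Made-up weekly spend with one obvious spike at index 6.
weekly_spend = [10.0, 12.0, 11.0, 9.0, 10.0, 11.0, 50.0, 10.0]
for index, value, z in find_outliers(weekly_spend):
    print(f"index={index} value={value} std_devs_from_mean={z:.2f}")
```

A fixed 2-sigma threshold is deliberately loose, so it may flag legitimate peaks; the listed points are candidates for review rather than confirmed errors.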

**Parameters:**

-   `input_data` (`InputData`): The `InputData` object containing the data to analyse.
-   `output_dir` (`str`): The path to the directory where the output report file will be saved.
-   `suffix` (`str`): A suffix to append to the output filename (e.g., `"raw"` or `"processed"`) to distinguish reports generated at different stages.

**Returns:**

-   `None`: The function writes the report to a file.

---

### `remove_outliers_from_input`

```python
def remove_outliers_from_input(
    input_data: InputData,
    media_data_outliers: dict,
    extra_features_outliers: dict,
    target_outliers: list,
    removal_type: str
) -> InputData:
```

Creates a *new* `InputData` object in which the specified outliers in the media, extra feature, and target data are replaced with a value computed according to `removal_type`. Despite the function's name, outlier time points are replaced rather than dropped.

This function takes dictionaries mapping variable names to lists of outlier indices and a list of outlier indices for the target. It then uses the `InputData.clone_with_data_edits` mechanism, passing a specialized editor function (`_replace_outlier_editor_func`) to perform the replacements on copies of the data arrays.
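
The general idea of editing copies of the data arrays from a name-to-indices dictionary can be sketched as follows. This does not use `InputData` or `clone_with_data_edits`; the function `replace_outliers`, the (time points x channels) array layout, and the sample values are assumptions made for illustration, and `np.mean` is only a stand-in for the real replacement calculation.

```python
import numpy as np


def replace_outliers(data, names, outliers_by_name, compute_replacement):
    """Return a copy of `data` with the listed rows replaced, per named column.

    Hypothetical helper; assumes `data` has shape (time points, variables)
    and that `names[i]` labels column i.
    """
    edited = data.copy()  # never modify the caller's array
    for name, indices in outliers_by_name.items():
        col = names.index(name)
        keep = np.ones(data.shape[0], dtype=bool)
        keep[indices] = False  # exclude outliers when computing the value
        edited[indices, col] = compute_replacement(data[keep, col])
    return edited


media_names = ["tv", "search"]
media_data = np.array(
    [[100.0, 20.0], [110.0, 22.0], [900.0, 21.0], [105.0, 19.0]]
)
media_data_outliers = {"tv": [2]}  # time point 2 of "tv" is the outlier

edited = replace_outliers(media_data, media_names, media_data_outliers, np.mean)
print(edited[:, 0])      # "tv" column with the spike replaced
print(media_data[2, 0])  # the original array still holds 900.0
```

Working on a copy mirrors the documented behaviour that the original `input_data` is left unchanged.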

**Parameters:**

-   `input_data` (`InputData`): The original `InputData` instance containing the outliers.
-   `media_data_outliers` (`dict`): A dictionary where keys are media channel names (from `input_data.media_names`) and values are lists of integer indices representing the time points identified as outliers for that channel.
-   `extra_features_outliers` (`dict`): A dictionary where keys are extra feature names (from `input_data.extra_features_names`) and values are lists of integer indices representing outlier time points for that feature.
-   `target_outliers` (`list`): A list of integer indices representing outlier time points for the target variable (`input_data.target_data`).
-   `removal_type` (`str`): A constant specifying the method for calculating the replacement value. Must be one of:
    -   `cnst.REMOVE_OUTLIERS_TYPE_REPLACE_WITH_TRIMMED_MEAN`: Replace outliers with the 10% trimmed mean of the non-outlier data for that variable.
    -   `cnst.REMOVE_OUTLIERS_TYPE_REPLACE_WITH_P10_VALUE`: Replace outliers with the 10th percentile value of the non-outlier data for that variable.
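
A minimal sketch of the two replacement calculations follows. It is not the library's `_compute_outlier_replacement_value`: the function names and the `"trimmed_mean"` / `"p10"` strings are hypothetical stand-ins for the `cnst` constants, and it assumes that a "10% trimmed mean" drops 10% of the values from each tail, which the source does not specify.

```python
import numpy as np


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the values from each tail (assumed)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    cut = int(proportion * ordered.size)
    return float(ordered[cut:ordered.size - cut].mean())


def replacement_value(values, outlier_indices, method):
    """Compute a replacement from the non-outlier points only.

    `method` strings are hypothetical stand-ins for the `cnst` constants.
    """
    values = np.asarray(values, dtype=float)
    keep = np.ones(values.size, dtype=bool)
    keep[list(outlier_indices)] = False
    clean = values[keep]
    if method == "trimmed_mean":
        return trimmed_mean(clean)
    if method == "p10":
        # NumPy's default percentile uses linear interpolation.
        return float(np.percentile(clean, 10))
    raise ValueError(f"unknown method: {method!r}")


series = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]  # index 10 is the outlier
print(replacement_value(series, [10], "trimmed_mean"))
print(replacement_value(series, [10], "p10"))
```

Excluding the outliers before computing either statistic keeps the extreme values from pulling the replacement toward themselves. The 10th percentile yields a deliberately low replacement, while the trimmed mean yields a typical central value.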

**Returns:**

-   `InputData`: A new `InputData` instance with the specified outliers replaced. The original `input_data` object remains unchanged.

**Raises:**

-   `AssertionError`: If `removal_type` is not one of the recognized constants.

---

*(Private helper functions `_compute_outlier_replacement_value` and `_replace_outlier_editor_func` are used internally to calculate the replacement value and perform the data editing within the cloning process, respectively.)*