Abacus Diagnostics: Input Validator (input_validator.py)
This module provides a collection of functions designed to validate input data (typically in a pandas DataFrame) before it is used for preprocessing or model fitting in Abacus. These checks help identify common data quality issues early in the workflow.
Functions
check_nans
def check_nans(
dataframe: pd.DataFrame, target_col: str, media_cols: list, control_cols: list
) -> None:
Checks for the presence of any NaN (Not a Number) values within the specified target, media, and control columns of the input DataFrame.
Parameters:
dataframe (pd.DataFrame): The DataFrame to check.
target_col (str): The name of the target variable column.
media_cols (list): A list of column names corresponding to media channels.
control_cols (list): A list of column names corresponding to control/extra features.
Returns:
None: Prints a success message if no NaNs are found.
Raises:
ValueError: If NaN values are detected in any of the specified columns, listing the affected columns.
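The behaviour described above can be sketched as follows. This is a minimal illustration, not the Abacus source: the function name sketch_check_nans, the message wording, and the example data are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def sketch_check_nans(dataframe, target_col, media_cols, control_cols):
    """Illustrative stand-in for check_nans (hypothetical, not the Abacus source)."""
    columns = [target_col, *media_cols, *control_cols]
    # Keep only the columns that contain at least one NaN.
    nan_cols = [col for col in columns if dataframe[col].isna().any()]
    if nan_cols:
        raise ValueError(f"NaN values found in columns: {nan_cols}")
    print("No NaN values found in the specified columns.")


# Hypothetical example data: 'tv' has one missing value.
df = pd.DataFrame(
    {"sales": [10.0, 12.0, 9.0], "tv": [1.0, np.nan, 3.0], "price": [5.0, 5.5, 5.2]}
)
try:
    sketch_check_nans(df, "sales", ["tv"], ["price"])
except ValueError as err:
    print(err)
```

Collecting every affected column before raising means a single run reports all problems at once, which matches the "listing the affected columns" behaviour described under Raises.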
check_duplicate_columns
def check_duplicate_columns(dataframe: pd.DataFrame) -> None:
Checks if the input DataFrame contains any duplicate column names.
Parameters:
dataframe (pd.DataFrame): The DataFrame to check.
Returns:
None: Prints a success message if no duplicate columns are found.
Raises:
ValueError: If duplicate column names are detected, listing the duplicates.
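One way such a check could be written is with pandas' Index.duplicated. The sketch below is an assumption about the approach, not the Abacus implementation; sketch_check_duplicate_columns and the example data are hypothetical.

```python
import pandas as pd


def sketch_check_duplicate_columns(dataframe):
    """Illustrative stand-in for check_duplicate_columns (hypothetical)."""
    # Index.duplicated() flags the second and later occurrences of a label.
    mask = dataframe.columns.duplicated()
    duplicates = dataframe.columns[mask].unique().tolist()
    if duplicates:
        raise ValueError(f"Duplicate column names found: {duplicates}")
    print("No duplicate column names found.")


# Hypothetical example data: the label 'tv' appears twice.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["sales", "tv", "tv"])
try:
    sketch_check_duplicate_columns(df)
except ValueError as err:
    print(err)
```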
check_date_column
def check_date_column(date_series: pd.Series, config: Dict) -> None:
Performs several validation checks on the date column Series.
Checks:
Type Conversion: Attempts to convert the series to datetime objects if not already in datetime format.
Sorting: Verifies that the dates are monotonically increasing (chronologically sorted).
Frequency: Tries to infer the time series frequency (e.g., daily ‘D’, weekly ‘W-MON’). Allows constant differences if frequency inference fails but warns otherwise.
Gaps: If a frequency is successfully inferred, checks for missing dates within the expected range.
Week Start Day (for weekly data): If a weekly frequency is inferred, checks if all dates fall on the same day of the week.
Parameters:
date_series (pd.Series): The Series representing the date column.
config (Dict): The configuration dictionary (currently unused in the function logic, but passed for potential future use, e.g., specifying expected start day for weekly data).
Returns:
None: Prints success messages for each passed check.
Raises:
ValueError: If any of the checks fail (e.g., not sorted, cannot infer frequency and differences are not constant, gaps detected, inconsistent week start day).
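The sequence of checks listed above might look roughly like the sketch below. It is not the Abacus source. It assumes pd.infer_freq is used for frequency inference (which needs at least three dates), omits the config argument and the warning path, and returns the inferred frequency for convenience, whereas the documented function returns None. The name sketch_check_date_column is hypothetical.

```python
import pandas as pd


def sketch_check_date_column(date_series):
    """Illustrative stand-in for check_date_column (hypothetical)."""
    # 1. Type conversion: no-op if the series is already datetime.
    dates = pd.to_datetime(date_series)

    # 2. Sorting: dates must be chronologically ordered.
    if not dates.is_monotonic_increasing:
        raise ValueError("Dates are not sorted in increasing order.")

    # 3. Frequency: infer_freq returns None when no regular frequency fits.
    freq = pd.infer_freq(pd.DatetimeIndex(dates))
    if freq is None:
        # Fall back to accepting constant spacing between consecutive dates.
        if dates.diff().dropna().nunique() > 1:
            raise ValueError(
                "Cannot infer frequency and date differences are not constant."
            )
        return None

    # 4. Gaps: compare against the full expected range at the inferred frequency.
    expected = pd.date_range(dates.min(), dates.max(), freq=freq)
    missing = expected.difference(pd.DatetimeIndex(dates))
    if len(missing) > 0:
        raise ValueError(f"Missing dates detected: {list(missing)}")

    # 5. Week start day: a weekly frequency from infer_freq already implies a
    # single weekday; the explicit check is kept to mirror the list above.
    if "W-" in freq and dates.dt.dayofweek.nunique() > 1:
        raise ValueError("Weekly dates do not all fall on the same weekday.")
    return freq


# Hypothetical example data: four consecutive Mondays.
weekly = pd.Series(["2024-01-01", "2024-01-08", "2024-01-15", "2024-01-22"])
print(sketch_check_date_column(weekly))
```

In this sketch, a series with a missing week fails at step 3 rather than step 4, because pd.infer_freq cannot infer a frequency from irregular spacing. The actual function may order or implement these checks differently.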
check_column_variance
def check_column_variance(
dataframe: pd.DataFrame, columns: list, check_zeros_only: bool = False
) -> None:
Checks specified numeric columns for zero variance (i.e., all values are constant) or, optionally, if all values are strictly zero.
This helps identify columns that provide no information for the model or might cause issues during scaling or modelling (e.g., division by zero standard deviation).
Parameters:
dataframe (pd.DataFrame): The DataFrame containing the columns to check.
columns (list): A list of column names to validate.
check_zeros_only (bool, optional):
If False (default): Raises an error if any specified column has zero variance (all values are the same, including all NaNs if not caught by check_nans earlier).
If True: Raises an error only if all values in a specified column are exactly zero.
Returns:
None: Prints a success message if all checked columns pass the variance criteria.
Raises:
ValueError: If any specified column fails the check (either zero variance or all zeros, depending on check_zeros_only), listing the problematic columns.
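The two modes can be illustrated with the sketch below. It is an assumption about how such a check could work, not the Abacus implementation: it treats "zero variance" as "at most one distinct value, counting NaN as a value". The name sketch_check_column_variance and the example data are hypothetical.

```python
import pandas as pd


def sketch_check_column_variance(dataframe, columns, check_zeros_only=False):
    """Illustrative stand-in for check_column_variance (hypothetical)."""
    failing = []
    for col in columns:
        values = dataframe[col]
        if check_zeros_only:
            # Fail only when every value is exactly zero.
            bad = bool((values == 0).all())
        else:
            # nunique(dropna=False) counts NaN as a value, so an all-NaN
            # column has exactly one distinct value and is flagged too.
            bad = values.nunique(dropna=False) <= 1
        if bad:
            failing.append(col)
    if failing:
        kind = "all-zero" if check_zeros_only else "zero-variance"
        raise ValueError(f"Columns failing the {kind} check: {failing}")
    print("All checked columns passed.")


# Hypothetical example data: 'promo' is constant, 'radio' is all zeros.
df = pd.DataFrame(
    {"tv": [1.0, 2.0, 3.0], "radio": [0.0, 0.0, 0.0], "promo": [1.0, 1.0, 1.0]}
)
try:
    sketch_check_column_variance(df, ["tv", "radio", "promo"])
except ValueError as err:
    print(err)
```

With check_zeros_only=True, the constant but non-zero 'promo' column would pass and only 'radio' would be reported.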