Abacus Preprocessing: Decode (decode.py)

This module provides functions for reading and parsing raw input data files (CSV or Excel format) containing time series data for marketing mix modelling. It handles date indexing, data filtering, optional seasonality decomposition using Prophet, and initial transformations based on the user’s configuration.

Functions

parse_csv_generic

def parse_csv_generic(
    data_fname: str, config: dict, holidays_filename: Optional[str]
) -> dict:

Parses a CSV or Excel file containing time series data and returns a structured dictionary suitable for input into abacus.prepro.convert.transform_input_generic.

Steps:

  1. Calls the internal _parse_csv_shared function to perform the shared preprocessing:

    • reads the data file (data_fname) and sets the date index using config['date_col'];

    • filters rows according to config['data_rows'] (total rows, start/end dates, or last N rows);

    • optionally drops columns listed in config['ignore_cols'];

    • reads optional holiday data (holidays_filename) and, if configured, applies Prophet decomposition (abacus.prepro.seas.prophet_decomp);

    • applies feature impact adjustments (e.g., multiplying by -1 for negative impact) based on config['extra_features_impact'].

  2. Extracts the date strings from the processed DataFrame’s index.

  3. Creates a dictionary (metric_dict) where keys are the remaining column names and values are NumPy arrays of the corresponding data.

  4. Constructs and returns the final dictionary containing:

    • cnst.KEY_GRANULARITY: Value from config['raw_data_granularity'].

    • cnst.KEY_OBSERVATIONS: Number of rows in the processed DataFrame.

    • cnst.KEY_DATE_STRS: NumPy array of date strings.

    • cnst.KEY_METRICS: The metric_dict created in step 3.
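Steps 2–4 can be sketched as follows. This is an illustrative reconstruction, not the module's actual code: the KEY_* string values stand in for the real cnst.KEY_* constants, and build_parsed_dict is a hypothetical name for the assembly logic.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the cnst.KEY_* constants (actual values unknown).
KEY_GRANULARITY = "granularity"
KEY_OBSERVATIONS = "observations"
KEY_DATE_STRS = "date_strs"
KEY_METRICS = "metrics"

def build_parsed_dict(df: pd.DataFrame, config: dict) -> dict:
    """Sketch of steps 2-4: extract dates, build metric arrays, assemble output."""
    # Step 2: date strings from the processed DataFrame's index.
    date_strs = np.array([str(d.date()) for d in df.index])
    # Step 3: one NumPy array per remaining column.
    metric_dict = {col: df[col].to_numpy() for col in df.columns}
    # Step 4: final dictionary.
    return {
        KEY_GRANULARITY: config["raw_data_granularity"],
        KEY_OBSERVATIONS: len(df),
        KEY_DATE_STRS: date_strs,
        KEY_METRICS: metric_dict,
    }
```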

Parameters:

  • data_fname (str): Full path to the raw data file (must be .csv or .xlsx).

  • config (dict): The configuration dictionary loaded from the YAML file. Used for date column name, row filtering, columns to ignore, Prophet settings, feature impacts, etc.

  • holidays_filename (Optional[str]): Full path to an optional file containing holiday data (CSV or Excel), used by Prophet decomposition.

Returns:

  • dict: A dictionary containing the parsed and initially processed data, structured for use by abacus.prepro.convert.transform_input_generic.


csv_to_df_generic

def csv_to_df_generic(
    data_fname: str, config: dict, keep_ignore_cols: bool = False
) -> pd.DataFrame:

Parses a CSV or Excel file and returns the data directly as a pandas DataFrame, primarily for exploratory purposes or workflows not using the standard InputData conversion.

Steps:

  1. Calls the internal _parse_csv_shared function as in parse_csv_generic, but passes holidays_filename=None (Prophet decomposition, if needed, is typically applied later when working with a DataFrame directly) and forwards the keep_ignore_cols parameter.

  2. Sets the DataFrame’s index to a pd.DatetimeIndex with a daily frequency (freq="D"). This enforces that the input data must be daily and contiguous; pandas will raise an error if dates are missing.
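The frequency enforcement in step 2 can be sketched as below; enforce_daily_index is a hypothetical name for this step, not a function in the module. Passing freq="D" to pd.DatetimeIndex makes pandas validate that the dates form a contiguous daily sequence and raise a ValueError otherwise.

```python
import pandas as pd

def enforce_daily_index(df: pd.DataFrame) -> pd.DataFrame:
    # Re-create the index with an explicit daily frequency; pandas raises
    # ValueError if the existing dates do not conform to daily frequency
    # (e.g., a missing day breaks contiguity).
    df.index = pd.DatetimeIndex(df.index, freq="D")
    return df
```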

Parameters:

  • data_fname (str): Full path to the raw data file (must be .csv or .xlsx).

  • config (dict): The configuration dictionary loaded from the YAML file. Used for date column name, row filtering, and columns to ignore.

  • keep_ignore_cols (bool, optional): If True, columns listed in config['ignore_cols'] are kept in the returned DataFrame. If False (default), they are dropped.

Returns:

  • pd.DataFrame: A pandas DataFrame containing the parsed and filtered data, indexed by date with daily frequency.


(Private helper function _parse_csv_shared contains the common logic for reading files, filtering rows/columns, handling holidays, applying Prophet decomposition, and adjusting feature impacts based on the configuration.)