Abacus Preprocessing: Decode (decode.py)
This module provides functions for reading and parsing raw input data files (CSV or Excel format) containing time series data for marketing mix modelling. It handles date indexing, data filtering, optional seasonality decomposition using Prophet, and initial transformations based on the user’s configuration.
Functions
parse_csv_generic
def parse_csv_generic(
data_fname: str, config: dict, holidays_filename: Optional[str]
) -> dict:
Parses a CSV or Excel file containing time series data and returns a structured dictionary suitable for input into abacus.prepro.convert.transform_input_generic.
Steps:
1. Calls the internal _parse_csv_shared function to read the data file (data_fname), handle date indexing (using config['date_col']), filter rows based on the config['data_rows'] settings (total rows, start/end dates, or last N rows), optionally drop columns listed in config['ignore_cols'], read optional holiday data (holidays_filename), and, if configured, apply Prophet decomposition (abacus.prepro.seas.prophet_decomp). It also applies feature impact adjustments (e.g., multiplying by -1 for negative impact) based on config['extra_features_impact'].
2. Extracts the date strings from the processed DataFrame's index.
3. Creates a dictionary (metric_dict) where keys are the remaining column names and values are NumPy arrays of the corresponding data.
4. Constructs and returns the final dictionary (illustrated below) containing:
   - cnst.KEY_GRANULARITY: the value of config['raw_data_granularity'].
   - cnst.KEY_OBSERVATIONS: the number of rows in the processed DataFrame.
   - cnst.KEY_DATE_STRS: a NumPy array of date strings.
   - cnst.KEY_METRICS: the metric_dict created in step 3.
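For orientation, the sketch below shows roughly what the returned dictionary looks like. The literal key strings and example values are assumptions for readability; the real keys are the cnst.KEY_* constants referenced above.

import numpy as np

# Illustrative only: key strings and values are assumed, not taken from the cnst module.
parsed = {
    "granularity": "daily",        # cnst.KEY_GRANULARITY, from config['raw_data_granularity']
    "observations": 2,             # cnst.KEY_OBSERVATIONS, rows remaining after filtering
    "date_strs": np.array(["2023-01-01", "2023-01-02"]),  # cnst.KEY_DATE_STRS
    "metrics": {                   # cnst.KEY_METRICS, one NumPy array per remaining column
        "sales": np.array([120.0, 135.5]),
        "tv_spend": np.array([10.0, 0.0]),
    },
}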
Parameters:
- data_fname (str): Full path to the raw data file (must be .csv or .xlsx).
- config (dict): The configuration dictionary loaded from the YAML file. Used for the date column name, row filtering, columns to ignore, Prophet settings, feature impacts, etc.
- holidays_filename (Optional[str]): Full path to an optional file containing holiday data (CSV or Excel), used by Prophet decomposition.
Returns:
dict: A dictionary containing the parsed and initially processed data, structured for use by abacus.prepro.convert.transform_input_generic.
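A minimal usage sketch follows. The config fragment shows only the keys mentioned on this page, and the nested shapes of data_rows and extra_features_impact are assumptions, not the documented YAML schema.

from abacus.prepro import decode  # import path assumed from this page

config = {
    "date_col": "date",
    "raw_data_granularity": "daily",
    "data_rows": {"start_date": "2023-01-01", "end_date": "2023-12-31"},  # shape assumed
    "ignore_cols": ["notes"],
    "extra_features_impact": {"price": "negative"},  # shape assumed
}

parsed = decode.parse_csv_generic(
    data_fname="data/raw_input.csv",
    config=config,
    holidays_filename=None,  # or a path to a CSV/Excel holidays file for Prophet
)
# parsed can now be passed to abacus.prepro.convert.transform_input_generic.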
csv_to_df_generic
def csv_to_df_generic(
data_fname: str, config: dict, keep_ignore_cols: bool = False
) -> pd.DataFrame:
Parses a CSV or Excel file and returns the data directly as a pandas DataFrame, primarily for exploratory work or for workflows that do not use the standard InputData conversion.
Steps:
1. Calls the internal _parse_csv_shared function in the same way as parse_csv_generic, but passes holidays_filename=None (Prophet decomposition is typically handled later, if needed, for direct DataFrame usage) and forwards the keep_ignore_cols parameter.
2. Sets the DataFrame's index to a pd.DatetimeIndex with daily frequency (freq="D"). This enforces that the input data must be daily and contiguous; pandas will raise an error if dates are missing.
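The daily-frequency requirement in step 2 is standard pandas behaviour rather than anything specific to this module; a standalone illustration:

import pandas as pd

# A contiguous daily index satisfies freq="D"...
pd.DatetimeIndex(["2023-01-01", "2023-01-02", "2023-01-03"], freq="D")

# ...but a gap makes pandas raise, because the inferred frequency
# no longer conforms to the requested daily frequency.
try:
    pd.DatetimeIndex(["2023-01-01", "2023-01-03"], freq="D")
except ValueError as err:
    print(err)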
Parameters:
- data_fname (str): Full path to the raw data file (must be .csv or .xlsx).
- config (dict): The configuration dictionary loaded from the YAML file. Used for the date column name, row filtering, and columns to ignore.
- keep_ignore_cols (bool, optional): If True, columns listed in config['ignore_cols'] are kept in the returned DataFrame. If False (default), they are dropped.
Returns:
pd.DataFrame: A pandas DataFrame containing the parsed and filtered data, indexed by date with daily frequency.
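A usage sketch for exploratory work, reusing the illustrative config from the parse_csv_generic example above:

from abacus.prepro import decode  # import path assumed from this page

df = decode.csv_to_df_generic(
    data_fname="data/raw_input.csv",
    config=config,            # same illustrative config as above
    keep_ignore_cols=True,    # keep the config['ignore_cols'] columns for inspection
)
print(df.index.freq)          # daily DatetimeIndex enforced by the function
print(df.describe())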
(Private helper function _parse_csv_shared contains the common logic for reading files, filtering rows/columns, handling holidays, applying Prophet decomposition, and adjusting feature impacts based on the configuration.)
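The helper itself is not documented here; purely as a rough approximation of the responsibilities listed above, a sketch follows in which every name and config shape is assumed, and holiday loading and Prophet decomposition are left as a placeholder comment.

from typing import Optional
import pandas as pd

def _parse_csv_shared_sketch(
    data_fname: str, config: dict, keep_ignore_cols: bool = False
) -> pd.DataFrame:
    # Read CSV or Excel depending on the file extension.
    if data_fname.endswith(".csv"):
        df = pd.read_csv(data_fname)
    else:
        df = pd.read_excel(data_fname)

    # Index by the configured date column.
    df[config["date_col"]] = pd.to_datetime(df[config["date_col"]])
    df = df.set_index(config["date_col"]).sort_index()

    # Row filtering per config['data_rows'] (shape assumed): e.g. keep the last N rows.
    last_n = config.get("data_rows", {}).get("last_n")
    if last_n:
        df = df.tail(last_n)

    # Optionally drop ignored columns.
    if not keep_ignore_cols:
        df = df.drop(columns=config.get("ignore_cols", []), errors="ignore")

    # Flip the sign of features configured with a negative impact.
    for col, impact in config.get("extra_features_impact", {}).items():
        if impact == "negative" and col in df.columns:
            df[col] = df[col] * -1

    # (Holiday loading and Prophet decomposition would happen here.)
    return df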