# Abacus Preprocessing: Decode (`decode.py`)

This module provides functions for reading and parsing raw input data files (CSV or Excel format) containing time series data for marketing mix modelling. It handles date indexing, data filtering, optional seasonality decomposition using Prophet, and initial transformations based on the user's configuration.

::: abacus.prepro.decode

## Functions

### `parse_csv_generic`

```python
def parse_csv_generic(
    data_fname: str,
    config: dict,
    holidays_filename: Optional[str]
) -> dict:
```

Parses a CSV or Excel file containing time series data and returns a structured dictionary suitable for input into `abacus.prepro.convert.transform_input_generic`.

**Steps:**

1. Calls the internal `_parse_csv_shared` function to read the data file (`data_fname`), handle date indexing (using `config['date_col']`), filter rows based on `config['data_rows']` settings (total rows, start/end dates, or last N rows), optionally drop columns specified in `config['ignore_cols']`, read optional holiday data (`holidays_filename`), and potentially apply Prophet decomposition (`abacus.prepro.seas.prophet_decomp`) if configured. It also applies feature impact adjustments (e.g., multiplying by -1 for negative impact) based on `config['extra_features_impact']`.
2. Extracts the date strings from the processed DataFrame's index.
3. Creates a dictionary (`metric_dict`) where keys are the remaining column names and values are NumPy arrays of the corresponding data.
4. Constructs and returns the final dictionary containing:
    - `cnst.KEY_GRANULARITY`: Value from `config['raw_data_granularity']`.
    - `cnst.KEY_OBSERVATIONS`: Number of rows in the processed DataFrame.
    - `cnst.KEY_DATE_STRS`: NumPy array of date strings.
    - `cnst.KEY_METRICS`: The `metric_dict` created in step 3.

**Parameters:**

- `data_fname` (`str`): Full path to the raw data file (must be `.csv` or `.xlsx`).
- `config` (`dict`): The configuration dictionary loaded from the YAML file. Used for the date column name, row filtering, columns to ignore, Prophet settings, feature impacts, etc.
- `holidays_filename` (`Optional[str]`): Full path to an optional file containing holiday data (CSV or Excel), used by Prophet decomposition.

**Returns:**

- `dict`: A dictionary containing the parsed and initially processed data, structured for use by `abacus.prepro.convert.transform_input_generic`.

---

### `csv_to_df_generic`

```python
def csv_to_df_generic(
    data_fname: str,
    config: dict,
    keep_ignore_cols: bool = False
) -> pd.DataFrame:
```

Parses a CSV or Excel file and returns the data directly as a pandas DataFrame, primarily for exploratory purposes or for workflows that do not use the standard `InputData` conversion.

**Steps:**

1. Calls the internal `_parse_csv_shared` function as `parse_csv_generic` does, but passes `holidays_filename=None` (Prophet decomposition is typically handled later, if needed, for direct DataFrame usage) and forwards the `keep_ignore_cols` parameter.
2. Sets the DataFrame's index to a `pd.DatetimeIndex` with daily frequency (`freq="D"`). This enforces that the input data is daily and contiguous; pandas raises an error if dates are missing.

**Parameters:**

- `data_fname` (`str`): Full path to the raw data file (must be `.csv` or `.xlsx`).
- `config` (`dict`): The configuration dictionary loaded from the YAML file. Used for the date column name, row filtering, and columns to ignore.
- `keep_ignore_cols` (`bool`, optional): If `True`, columns listed in `config['ignore_cols']` are kept in the returned DataFrame. If `False` (default), they are dropped.

**Returns:**

- `pd.DataFrame`: A pandas DataFrame containing the parsed and filtered data, indexed by date with daily frequency.

---

*(The private helper `_parse_csv_shared` contains the common logic for reading files, filtering rows/columns, handling holidays, applying Prophet decomposition, and adjusting feature impacts based on the configuration.)*
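The shared parse flow described above can be sketched in miniature. This is an illustrative standalone sketch, not the actual abacus code: the helper names (`read_and_filter`, `parse_to_dict`, `parse_to_df`) are invented, the real output dictionary uses the `cnst.KEY_*` constants rather than the plain-string keys shown here, and Excel reading, row filtering, holiday handling, and Prophet decomposition are omitted.

```python
import io

import numpy as np
import pandas as pd

# Two days of toy data; "noise" stands in for a column listed in ignore_cols.
RAW = (
    "date,spend,promo,noise\n"
    "2024-01-01,100,1,9\n"
    "2024-01-02,120,0,9\n"
)
CONFIG = {
    "date_col": "date",
    "raw_data_granularity": "daily",
    "ignore_cols": ["noise"],
    "extra_features_impact": {"promo": -1},  # e.g. -1 marks a negative impact
}


def read_and_filter(buf, config):
    """Read raw data, index by the date column, drop ignored columns,
    and apply feature impact adjustments (illustrative stand-in for
    the shared parsing logic)."""
    df = pd.read_csv(buf).set_index(config["date_col"])
    df = df.drop(columns=config.get("ignore_cols", []), errors="ignore")
    for col, impact in config.get("extra_features_impact", {}).items():
        if col in df.columns:
            df[col] = df[col] * impact
    return df


def parse_to_dict(buf, config):
    """Build a dictionary shaped like parse_csv_generic's output
    (plain-string keys here instead of cnst.KEY_* constants)."""
    df = read_and_filter(buf, config)
    return {
        "granularity": config["raw_data_granularity"],
        "observations": len(df),
        "date_strs": np.asarray(df.index, dtype=str),
        "metrics": {col: df[col].to_numpy() for col in df.columns},
    }


def parse_to_df(buf, config):
    """Mirror csv_to_df_generic's contract: a DataFrame with a daily,
    contiguous DatetimeIndex."""
    df = read_and_filter(buf, config)
    # Raises ValueError if the dates are not daily and gap-free.
    df.index = pd.DatetimeIndex(df.index, freq="D")
    return df


result = parse_to_dict(io.StringIO(RAW), CONFIG)
daily = parse_to_df(io.StringIO(RAW), CONFIG)
```

With the toy data above, `result["metrics"]` holds NumPy arrays for `spend` and the sign-flipped `promo`, while `daily` would raise at construction if a date were missing from the range.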