Abacus Preprocessing: Convert (convert.py)

This module provides the core function for transforming raw input data, typically parsed from a CSV file by functions in abacus.prepro.decode, into the standardised abacus.prepro.input_data.InputData object required by the Abacus modelling components.

Functions

transform_input_generic

def transform_input_generic(data_dict: dict, config: dict) -> InputData:

Transforms a dictionary of raw input data into a structured InputData object based on a configuration dictionary.

This function takes the raw data (usually representing columns read from a file) and the user’s configuration settings, then maps and organises this data into the various NumPy arrays expected by the InputData class.

Steps:

  1. Initialises the NumPy arrays media_data, media_costs, media_cost_priors, learned_media_priors, media_costs_by_row, and extra_features_data. Their shapes are derived from the number of observations (data_dict[cnst.KEY_OBSERVATIONS]) and from the number of media channels and extra features defined in the config.

  2. Iterates through the media channels defined in config['media']:

    • Extracts the corresponding impression and spend columns from data_dict[cnst.KEY_METRICS].

    • Copies impression values into the correct channel column of media_data (using _copy_metric_values_to_media_data).

    • Copies spend values into media_costs_by_row and calculates the total media_costs per channel (using _copy_cost_values_to_media_costs).

    • Sets learned_media_priors and media_cost_priors based on values specified in the config for the channel, defaulting media_cost_priors to the total calculated cost if no fixed prior is given.

    • Collects display names for channels.

    • Keeps track of processed columns.

  3. Iterates through the remaining columns in data_dict[cnst.KEY_METRICS]:

    • Identifies the target column (config['target_col']) and copies its data into target_data.

    • Identifies extra feature columns (config['extra_features_cols']) and copies their data into the corresponding columns of extra_features_data.

  4. Checks whether any columns from the input data were left unaccounted for (i.e., not defined as media, target, or extra features in the config) and logs an error if any are found.

  5. Constructs and returns an InputData object using the populated NumPy arrays, names, date strings (ensured to be a NumPy array), time granularity, and target settings from the config.
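The steps above can be sketched with plain NumPy. This is a minimal illustration, not the module's actual code: the function name sketch_transform, the per-channel config keys (impressions_col, spend_col, learned_prior, fixed_cost_prior, display_name), the array shapes, and the plain-dict return value are all assumptions made for the example. Only config['media'], config['target_col'], and config['extra_features_cols'] come from the description.

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)


def sketch_transform(metrics, n_obs, config):
    """Illustrative stand-in for steps 1-4; returns a plain dict, not InputData."""
    channels = config["media"]
    extra_cols = list(config.get("extra_features_cols", []))
    n_media, n_extra = len(channels), len(extra_cols)

    # Step 1: allocate arrays sized by observations, channels and extra features.
    # (Shapes are assumed; allocating target_data here is also an assumption.)
    media_data = np.zeros((n_obs, n_media))
    media_costs_by_row = np.zeros((n_obs, n_media))
    media_costs = np.zeros(n_media)
    media_cost_priors = np.zeros(n_media)
    learned_media_priors = np.zeros(n_media)
    extra_features_data = np.zeros((n_obs, n_extra))
    target_data = np.zeros(n_obs)

    # Step 2: one pass over the configured media channels.
    media_names, used = [], set()
    for i, channel in enumerate(channels):
        impressions_col = channel["impressions_col"]  # hypothetical key
        spend_col = channel["spend_col"]  # hypothetical key
        media_data[:, i] = np.asarray(metrics[impressions_col], dtype=float)
        media_costs_by_row[:, i] = np.asarray(metrics[spend_col], dtype=float)
        media_costs[i] = media_costs_by_row[:, i].sum()
        learned_media_priors[i] = channel.get("learned_prior", 0.0)  # hypothetical key
        # Fall back to the total calculated cost when no fixed prior is given.
        fixed = channel.get("fixed_cost_prior")  # hypothetical key
        media_cost_priors[i] = media_costs[i] if fixed is None else fixed
        media_names.append(channel.get("display_name", impressions_col))  # hypothetical key
        used.update({impressions_col, spend_col})

    # Step 3: pick the target and extra features out of the remaining columns.
    for name, values in metrics.items():
        if name in used:
            continue
        if name == config["target_col"]:
            target_data[:] = np.asarray(values, dtype=float)
            used.add(name)
        elif name in extra_cols:
            extra_features_data[:, extra_cols.index(name)] = np.asarray(values, dtype=float)
            used.add(name)

    # Step 4: report columns that the config never mentioned.
    unaccounted = sorted(set(metrics) - used)
    if unaccounted:
        logger.error("Columns not referenced in config: %s", unaccounted)

    return {
        "media_data": media_data,
        "media_costs": media_costs,
        "media_costs_by_row": media_costs_by_row,
        "media_cost_priors": media_cost_priors,
        "learned_media_priors": learned_media_priors,
        "extra_features_data": extra_features_data,
        "target_data": target_data,
        "media_names": media_names,
        "unaccounted": unaccounted,
    }
```

In this sketch an unaccounted column is logged and reported rather than raising, which mirrors the "logs an error" wording of step 4.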

Parameters:

  • data_dict (dict): A dictionary containing the raw input data, typically the output of abacus.prepro.decode.parse_csv_generic. Expected keys include cnst.KEY_METRICS (a dictionary of column names to data arrays/Series), cnst.KEY_OBSERVATIONS (number of rows), cnst.KEY_DATE_STRS (list or array of date strings), and cnst.KEY_GRANULARITY.

  • config (dict): The configuration dictionary loaded from the user’s YAML file. Used to identify media channels, target column, extra features, priors, etc.
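To make the expected shape of data_dict concrete, the following sketch builds a small example and checks it for consistency. The string values assigned to the KEY_* names are stand-ins, since the real values of the cnst constants are not shown here. The helper find_data_dict_problems, the column names, and the "weekly" granularity value are likewise hypothetical.

```python
import numpy as np

# Stand-ins for the cnst.KEY_* constants; the real string values may differ.
KEY_METRICS = "metrics"
KEY_OBSERVATIONS = "observations"
KEY_DATE_STRS = "date_strs"
KEY_GRANULARITY = "granularity"

REQUIRED_KEYS = (KEY_METRICS, KEY_OBSERVATIONS, KEY_DATE_STRS, KEY_GRANULARITY)


def find_data_dict_problems(data_dict):
    """Return a list of problems; an empty list means the shape looks consistent."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data_dict]
    if problems:
        return problems
    n_obs = data_dict[KEY_OBSERVATIONS]
    if len(data_dict[KEY_DATE_STRS]) != n_obs:
        problems.append("date strings do not match the observation count")
    # Every metric column should have one value per observation.
    for name, values in data_dict[KEY_METRICS].items():
        if len(values) != n_obs:
            problems.append(f"column {name!r} has {len(values)} rows, expected {n_obs}")
    return problems


# Hypothetical example: three observations, one media channel, one target.
example_data_dict = {
    KEY_METRICS: {
        "tv_impressions": np.array([1000.0, 1200.0, 900.0]),
        "tv_spend": np.array([50.0, 60.0, 45.0]),
        "sales": np.array([10.0, 12.0, 9.0]),
    },
    KEY_OBSERVATIONS: 3,
    KEY_DATE_STRS: ["2023-01-02", "2023-01-09", "2023-01-16"],
    KEY_GRANULARITY: "weekly",  # assumed value, for illustration only
}
```

A check of this kind is not part of the documented function; it only illustrates the relationship between the observation count, the date strings, and the metric columns.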

Returns:

  • InputData: An instance of the InputData class containing the structured and typed data ready for further preprocessing and modelling.


(The private helper functions _copy_metric_values_to_media_data and _copy_cost_values_to_media_costs are used internally to populate the NumPy arrays.)
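The signatures of these helpers are not documented here. As a rough sketch of what such helpers might look like, assuming they write into preallocated arrays in place and take a channel index, consider the following. The function names are deliberately unprefixed and the parameter lists are hypothetical.

```python
import numpy as np


def copy_metric_values_to_media_data(media_data, channel_index, values):
    """Write one channel's impressions into its column of media_data, in place."""
    media_data[:, channel_index] = np.asarray(values, dtype=media_data.dtype)


def copy_cost_values_to_media_costs(media_costs_by_row, media_costs, channel_index, values):
    """Write per-row spend into its column and record the channel's total spend."""
    column = np.asarray(values, dtype=media_costs_by_row.dtype)
    media_costs_by_row[:, channel_index] = column
    media_costs[channel_index] = column.sum()
```

Mutating the arrays in place lets the caller allocate every array once and fill it channel by channel, which matches the initialise-then-populate flow described in the steps.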