Abacus Preprocessing: Convert (convert.py)
This module provides the core function for transforming raw input data, typically parsed from a CSV file by functions in abacus.prepro.decode, into the standardised abacus.prepro.input_data.InputData object required by the Abacus modelling components.
Functions
transform_input_generic
def transform_input_generic(data_dict: dict, config: dict) -> InputData:
Transforms a dictionary of raw input data into a structured InputData object based on a configuration dictionary.
This function takes the raw data (usually representing columns read from a file) and the user’s configuration settings, then maps and organises this data into the various NumPy arrays expected by the InputData class.
Steps:
1. Initialises NumPy arrays with the correct shapes for media_data, media_costs, media_cost_priors, learned_media_priors, media_costs_by_row, and extra_features_data, based on the number of observations (data_dict[cnst.KEY_OBSERVATIONS]) and the number of media channels and extra features defined in the config.
2. Iterates through the media channels defined in config['media']:
   - Extracts the corresponding impression and spend columns from data_dict[cnst.KEY_METRICS].
   - Copies impression values into the correct channel column of media_data (using _copy_metric_values_to_media_data).
   - Copies spend values into media_costs_by_row and calculates the total media_costs per channel (using _copy_cost_values_to_media_costs).
   - Sets learned_media_priors and media_cost_priors based on values specified in the config for the channel, defaulting media_cost_priors to the total calculated cost if no fixed prior is given.
   - Collects display names for channels.
   - Keeps track of processed columns.
3. Iterates through the remaining columns in data_dict[cnst.KEY_METRICS]:
   - Identifies the target column (config['target_col']) and copies its data into target_data.
   - Identifies extra feature columns (config['extra_features_cols']) and copies their data into the corresponding columns of extra_features_data.
4. Checks if any columns from the input data were not accounted for (i.e., not defined as media, target, or extra features in the config) and logs an error if found.
5. Constructs and returns an InputData object using the populated NumPy arrays, names, date strings (ensured to be a NumPy array), time granularity, and target settings from the config.
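The steps above can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the module's actual code: the KEY_* values stand in for the cnst constants, the per-channel keys (impressions_col, spend_col, learned_prior, fixed_cost_prior, display_name) and the 0.0 default for the learned prior are hypothetical, and the function returns a plain dictionary rather than an InputData object.

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)

# Hypothetical stand-ins for cnst.KEY_METRICS / cnst.KEY_OBSERVATIONS.
KEY_METRICS = "metrics"
KEY_OBSERVATIONS = "observations"


def sketch_transform(data_dict: dict, config: dict) -> dict:
    """Simplified illustration of the steps; returns arrays, not InputData."""
    n_obs = data_dict[KEY_OBSERVATIONS]
    metrics = data_dict[KEY_METRICS]
    channels = config["media"]
    extra_cols = list(config.get("extra_features_cols", []))

    # Step 1: allocate arrays from the observation, channel and feature counts.
    media_data = np.zeros((n_obs, len(channels)))
    media_costs_by_row = np.zeros((n_obs, len(channels)))
    media_costs = np.zeros(len(channels))
    media_cost_priors = np.zeros(len(channels))
    learned_media_priors = np.zeros(len(channels))
    extra_features_data = np.zeros((n_obs, len(extra_cols)))
    target_data = np.zeros(n_obs)

    processed = set()
    media_names = []

    # Step 2: one pass per media channel (per-channel key names are assumed).
    for i, channel in enumerate(channels):
        imp_col = channel["impressions_col"]
        spend_col = channel["spend_col"]
        media_data[:, i] = np.asarray(metrics[imp_col], dtype=float)
        media_costs_by_row[:, i] = np.asarray(metrics[spend_col], dtype=float)
        media_costs[i] = media_costs_by_row[:, i].sum()
        learned_media_priors[i] = channel.get("learned_prior", 0.0)
        # Fall back to the total calculated cost when no fixed prior is given.
        fixed = channel.get("fixed_cost_prior")
        media_cost_priors[i] = media_costs[i] if fixed is None else fixed
        media_names.append(channel.get("display_name", imp_col))
        processed.update((imp_col, spend_col))

    # Step 3: classify the remaining columns as target or extra feature.
    unaccounted = []
    for col, values in metrics.items():
        if col in processed:
            continue
        if col == config["target_col"]:
            target_data[:] = np.asarray(values, dtype=float)
        elif col in extra_cols:
            j = extra_cols.index(col)
            extra_features_data[:, j] = np.asarray(values, dtype=float)
        else:
            unaccounted.append(col)

    # Step 4: report columns the config does not mention.
    if unaccounted:
        logger.error("Columns not accounted for in config: %s", unaccounted)

    # Step 5 (simplified): hand back the populated arrays.
    return {
        "media_data": media_data,
        "media_costs": media_costs,
        "media_costs_by_row": media_costs_by_row,
        "media_cost_priors": media_cost_priors,
        "learned_media_priors": learned_media_priors,
        "extra_features_data": extra_features_data,
        "target_data": target_data,
        "media_names": media_names,
        "unaccounted": unaccounted,
    }
```

Returning the list of unaccounted columns alongside the arrays is a choice made for this sketch so the check is easy to inspect; the documented function only logs the error.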
Parameters:
data_dict (dict): A dictionary containing the raw input data, typically the output of abacus.prepro.decode.parse_csv_generic. Expected keys include cnst.KEY_METRICS (a dictionary of column names to data arrays/Series), cnst.KEY_OBSERVATIONS (number of rows), cnst.KEY_DATE_STRS (list or array of date strings), and cnst.KEY_GRANULARITY.
config (dict): The configuration dictionary loaded from the user’s YAML file. Used to identify media channels, target column, extra features, priors, etc.
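To make the expected shape of data_dict concrete, here is a small hand-built example with a consistency check. The string values of the KEY_* constants, the column names, the "weekly" granularity value and the check_lengths helper are all assumptions made for illustration; the real constants live in cnst and may differ.

```python
import numpy as np

# Hypothetical stand-ins for the cnst.KEY_* constants.
KEY_METRICS = "metrics"
KEY_OBSERVATIONS = "observations"
KEY_DATE_STRS = "date_strs"
KEY_GRANULARITY = "granularity"

data_dict = {
    # Column name -> one value per observation.
    KEY_METRICS: {
        "tv_impressions": np.array([100.0, 120.0, 90.0]),
        "tv_spend": np.array([10.0, 12.0, 9.0]),
        "sales": np.array([1000.0, 1100.0, 950.0]),
    },
    KEY_OBSERVATIONS: 3,
    KEY_DATE_STRS: ["2024-01-01", "2024-01-08", "2024-01-15"],
    KEY_GRANULARITY: "weekly",  # assumed value
}


def check_lengths(d: dict) -> list:
    """Return the names of entries whose length differs from the row count."""
    n_obs = d[KEY_OBSERVATIONS]
    bad = [name for name, values in d[KEY_METRICS].items() if len(values) != n_obs]
    if len(d[KEY_DATE_STRS]) != n_obs:
        bad.append(KEY_DATE_STRS)
    return bad
```

A check like this is not described in the source; it is shown only because every array the function allocates is sized from the observation count, so a column of a different length would fail when copied.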
Returns:
InputData: An instance of the InputData class containing the structured and typed data ready for further preprocessing and modelling.
(Private helper functions _copy_metric_values_to_media_data and _copy_cost_values_to_media_costs are used internally to populate the NumPy arrays correctly.)