Input Data (input_data.py)

This module defines the InputData class, which encapsulates the data structures required by the ABACUS Marketing Mix Model (MMM).

InputData

A class designed to hold and manage the various data components fed into the MMM, including media metrics, cost data, control features, and target sales/conversion data.

class InputData:
    """
    Encapsulation of data fed into the marketing mix model - both the marketing metrics and the sales metrics.

    All 2-dimensional arrays have time (day or week number) as the first index and channel as the second index.
    All numbers are numpy.float64, and all arrays of numbers are numpy arrays or ndarrays.

    All values are true values (i.e., not scaled down for feeding into the MMM).
    """

Initialization

    def __init__(
        self,
        date_strs: np.ndarray,
        time_granularity: str,
        media_data: np.ndarray,
        media_costs: np.ndarray,
        media_costs_by_row: np.ndarray,
        media_cost_priors: np.ndarray,
        learned_media_priors: np.ndarray,
        media_names: List[str],
        extra_features_data: np.ndarray,
        extra_features_names: List[str],
        target_data: np.ndarray,
        target_is_log_scale: bool,
        target_name: str,
    ) -> None:

Initializes the InputData object. It calls _validate internally to check that the inputs are consistent and correctly shaped.

Parameters:

  • date_strs (np.ndarray): 1-D NumPy array of labels (typically date strings) for each time series data point.

  • time_granularity (str): String constant describing the time granularity (e.g., cnst.GRANULARITY_DAILY, cnst.GRANULARITY_WEEKLY).

  • media_data (np.ndarray): 2-D NumPy array of float64 media data values (e.g., impressions, clicks) with shape [time, channel].

  • media_costs (np.ndarray): 1-D NumPy array of float64 total media costs per channel with shape [channel].

  • media_costs_by_row (np.ndarray): 2-D NumPy array of float64 media costs per observation (e.g., daily/weekly spend) with shape [time, channel].

  • media_cost_priors (np.ndarray): 1-D NumPy array of float64 priors for media cost per unit (e.g., cost per mille or cost per click) with shape [channel]. Each value must be > 0 unless a value is provided for that channel in learned_media_priors.

  • learned_media_priors (np.ndarray): 1-D NumPy array of float64 learned media priors with shape [channel]. A value greater than 0 overrides the corresponding entry in media_cost_priors.

  • media_names (List[str]): List of names corresponding to the media channels.

  • extra_features_data (np.ndarray): 2-D NumPy array of float64 values for additional control features (e.g., competitor activity, promotions) with shape [time, feature]. Can be empty if no extra features are used.

  • extra_features_names (List[str]): List of names corresponding to the extra features.

  • target_data (np.ndarray): 1-D NumPy array of float64 target metric values (e.g., sales, conversions) with shape [time].

  • target_is_log_scale (bool): Flag indicating if the target_data is already log-transformed.

  • target_name (str): Name of the target metric.

Attributes Initialized:

All parameters listed above are stored as instance attributes with the same names.
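
As a rough illustration of the shapes described above, the following sketch builds arrays for a hypothetical dataset with 14 daily observations, 2 media channels, and 1 extra feature. All names and values are made up for illustration, and the relationship between media_costs and media_costs_by_row (a column sum) is an assumption rather than something this reference states. The granularity constants from cnst are not reproduced here.

```python
import numpy as np

n_obs, n_channels, n_features = 14, 2, 1

# One label per observation; plain date strings are assumed here.
date_strs = np.array([f"2024-01-{day:02d}" for day in range(1, n_obs + 1)])

# [time, channel] arrays of float64.
media_data = np.full((n_obs, n_channels), 1000.0)        # e.g. impressions
media_costs_by_row = np.full((n_obs, n_channels), 25.0)  # spend per observation

# [channel] arrays. Taking the total cost as the column sum of the
# per-row costs is an assumption about how the two arrays relate.
media_costs = media_costs_by_row.sum(axis=0)
media_cost_priors = np.array([0.025, 0.025])
learned_media_priors = np.zeros(n_channels)              # 0 means no override

media_names = ["Search (Brand)", "Display"]              # hypothetical names
extra_features_data = np.ones((n_obs, n_features))       # [time, feature]
extra_features_names = ["promo_flag"]
target_data = np.full(n_obs, 500.0)                      # [time]
```

These arrays, together with a granularity constant, a target name, and the target_is_log_scale flag, are what the constructor expects.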

Static Methods

_validate

    @staticmethod
    def _validate(
        date_strs: np.ndarray,
        time_granularity: str,
        media_data: np.ndarray,
        media_costs: np.ndarray,
        media_cost_priors: np.ndarray,
        learned_media_priors: np.ndarray,
        media_names: List[str],
        extra_features_data: np.ndarray,
        extra_features_names: List[str],
        target_data: np.ndarray,
        target_name: str,
    ) -> None:

Internal static method used to validate the dimensions, types, and consistency of the data provided during __init__. Raises AssertionError if validation fails.
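
The kind of checks involved might look like the sketch below. check_shapes is a hypothetical helper, not the actual _validate implementation: the shape conditions are inferred from the parameter descriptions, and the prior condition follows the rule that each channel needs a positive cost prior or a positive learned prior.

```python
import numpy as np

def check_shapes(date_strs, media_data, media_costs, media_cost_priors,
                 learned_media_priors, media_names, target_data):
    """Hypothetical consistency checks; the real _validate may differ."""
    n_obs, n_channels = media_data.shape
    assert date_strs.shape == (n_obs,), "one label per observation"
    assert target_data.shape == (n_obs,), "one target value per observation"
    assert media_costs.shape == (n_channels,), "one total cost per channel"
    assert len(media_names) == n_channels, "one name per channel"
    # Each channel needs a positive cost prior or a positive learned prior.
    has_prior = (media_cost_priors > 0) | (learned_media_priors > 0)
    assert has_prior.all(), "missing prior for at least one channel"

media = np.ones((3, 2))
args = dict(
    date_strs=np.array(["d1", "d2", "d3"]),
    media_data=media,
    media_costs=media.sum(axis=0),
    media_cost_priors=np.array([0.5, 0.0]),
    learned_media_priors=np.array([0.0, 0.2]),
    media_names=["tv", "radio"],
    target_data=np.array([10.0, 12.0, 11.0]),
)
check_shapes(**args)  # consistent inputs pass silently

# Dropping a channel name makes the check fail with AssertionError.
try:
    check_shapes(**{**args, "media_names": ["tv"]})
    failed = False
except AssertionError:
    failed = True
```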

clone_with_data_edits

    @staticmethod
    def clone_with_data_edits(
        input_data: InputData,
        editor_func: Callable,
        context: Any
    ) -> InputData:

Creates a deep copy (clone) of an existing InputData instance, allowing modifications to the data arrays via a provided function.

Parameters:

  • input_data (InputData): The original InputData instance to clone.

  • editor_func (Callable): A function that accepts (context, date_strs, media_data, extra_features_data, target_data) and modifies the data arrays in place.

  • context (Any): Client context passed through to the editor_func.

Returns:

  • InputData: A new InputData instance containing the modified data.
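
A minimal sketch of an editor function follows. The function name scale_channel and the use of a dict as context are assumptions made for illustration. The sketch calls the editor directly on stand-in arrays; in real use, clone_with_data_edits would be expected to call it on the cloned arrays.

```python
import numpy as np

def scale_channel(context, date_strs, media_data,
                  extra_features_data, target_data):
    """Hypothetical editor: scale one media channel in place."""
    media_data[:, context["channel_idx"]] *= context["scale"]

# Stand-ins for the arrays held by a cloned InputData.
date_strs = np.array(["2024-01-01", "2024-01-02"])
media_data = np.array([[100.0, 10.0], [200.0, 20.0]])
extra_features_data = np.zeros((2, 0))   # no extra features
target_data = np.array([5.0, 6.0])

original = media_data.copy()             # kept only for comparison

# Halve the second channel; the other arrays are left untouched.
scale_channel({"channel_idx": 1, "scale": 0.5},
              date_strs, media_data, extra_features_data, target_data)
```

Because the editor mutates its arguments instead of returning new arrays, it should avoid rebinding the parameter names (for example, media_data = media_data * 2 would have no effect on the caller's array).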

_sanitize_name

    @staticmethod
    def _sanitize_name(media_name: str) -> str:

Internal static utility method to sanitize channel or feature names for use in filenames (e.g., during dump). Converts to lowercase, replaces spaces with underscores, and removes parentheses.
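
The transformation described above can be sketched as follows. This is an illustrative reimplementation based only on the description; the actual method may handle additional characters.

```python
def sanitize_name(name: str) -> str:
    """Lowercase, replace spaces with underscores, remove parentheses."""
    return name.lower().replace(" ", "_").replace("(", "").replace(")", "")

sanitized = sanitize_name("Search (Brand)")
print(sanitized)
```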

_group_by_week

    @staticmethod
    def _group_by_week(idx: int) -> int:

Internal static helper method used by clone_as_weekly to determine the weekly group index for a given daily observation index (based on groups of 7).
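
Assuming "groups of 7" means integer division of a zero-based index by 7, the helper can be sketched as below. This is an interpretation of the description, not a copy of the source.

```python
def group_by_week(idx: int) -> int:
    """Map a zero-based daily index to its zero-based week index."""
    return idx // 7

# Days 0-6 fall in week 0, days 7-13 in week 1, day 14 starts week 2.
week_ids = [group_by_week(i) for i in range(15)]
```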

Instance Methods

dump

    def dump(
        self,
        output_dir: str,
        suffix: str,
        verbose: bool = True
    ) -> None:

Writes the contents of the InputData instance to several text files within the specified output directory. Useful for debugging and inspection.

Parameters:

  • output_dir (str): Path to the directory where files will be saved.

  • suffix (str): A suffix to append to the generated filenames (e.g., “raw”, “processed”).

  • verbose (bool, optional): If True, writes detailed per-observation files for dates, media, costs, features, and target. If False, only writes a summary file. Defaults to True.

Output Files (Examples):

  • data_{suffix}_summary.txt: Overview of names, costs, priors, target info.

  • data_{suffix}_dates.txt (if verbose): List of date strings.

  • data_{suffix}_{media_name}.txt (if verbose): Media data for a specific channel.

  • data_{suffix}_{media_name}_costs.txt (if verbose): Media costs per row for a channel.

  • data_{suffix}_{extra_feature_name}.txt (if verbose): Data for an extra feature.

  • data_{suffix}_target.txt (if verbose): Target data values.
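
To make the naming patterns concrete, the sketch below lists the filenames they suggest for a given suffix. expected_dump_files is a hypothetical helper that does not exist in the module, and it assumes names are sanitized as described for _sanitize_name (lowercase, underscores for spaces, no parentheses). It writes nothing to disk.

```python
def expected_dump_files(suffix, media_names, extra_features_names, verbose=True):
    """List filenames implied by the documented patterns (hypothetical helper)."""
    def clean(name):
        # Assumed to mirror the documented sanitizing rules.
        return name.lower().replace(" ", "_").replace("(", "").replace(")", "")

    files = [f"data_{suffix}_summary.txt"]
    if verbose:
        files.append(f"data_{suffix}_dates.txt")
        for name in media_names:
            files.append(f"data_{suffix}_{clean(name)}.txt")
            files.append(f"data_{suffix}_{clean(name)}_costs.txt")
        for name in extra_features_names:
            files.append(f"data_{suffix}_{clean(name)}.txt")
        files.append(f"data_{suffix}_target.txt")
    return files

files = expected_dump_files("raw", ["Paid Social"], ["Promo"])
summary_only = expected_dump_files("raw", ["Paid Social"], ["Promo"], verbose=False)
```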

clone_and_add_extra_features

    def clone_and_add_extra_features(
        self,
        feature_names: List[str],
        feature_data: np.ndarray
    ) -> InputData:

Creates a new InputData instance by copying the current one and appending new extra features.

Parameters:

  • feature_names (List[str]): A list of names for the new features to add.

  • feature_data (np.ndarray): A 2-D NumPy array containing the data for the new features, with shape [time, new_features]. Must have the same number of time observations as the existing data.

Returns:

  • InputData: A new InputData instance with the combined original and new extra features.
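
The core of this operation is a column-wise join of two [time, feature] arrays. The sketch below shows that step on stand-in arrays with made-up feature names; placing the new features after the existing ones is an assumption based on the word "appending".

```python
import numpy as np

existing_names = ["promo_flag"]
existing_data = np.array([[1.0], [0.0], [1.0]])                # shape [3, 1]

new_names = ["holiday", "price_index"]
new_data = np.array([[0.0, 1.02], [1.0, 1.00], [0.0, 0.98]])   # shape [3, 2]

# Row counts must match before the columns can be joined.
assert new_data.shape[0] == existing_data.shape[0]

combined_names = existing_names + new_names
combined_data = np.concatenate([existing_data, new_data], axis=1)  # shape [3, 3]
```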

clone_as_weekly

    def clone_as_weekly(self) -> InputData:

Creates a new InputData instance by aggregating the current (assumed daily) data into weekly sums. It groups data into blocks of 7 days, summing media_data, media_costs_by_row, extra_features_data, and target_data. The first date of each 7-day block is used as the weekly date string. Any trailing partial week at the end of the data is discarded.

Requires:

  • The original InputData instance must have time_granularity set to cnst.GRANULARITY_DAILY.

Returns:

  • InputData: A new InputData instance with weekly aggregated data and time_granularity set to cnst.GRANULARITY_WEEKLY.
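
The aggregation can be sketched with a reshape-and-sum over stand-in arrays, as below. This is one possible way to produce the documented result (7-day sums, first date per block, trailing partial week dropped) and may differ from how the method does it internally.

```python
import numpy as np

n_days, n_channels = 16, 2                 # 2 full weeks plus 2 extra days
date_strs = np.array([f"day-{i:02d}" for i in range(n_days)])
media_data = np.ones((n_days, n_channels))
target_data = np.arange(n_days, dtype=np.float64)

n_weeks = n_days // 7                      # the trailing partial week is dropped
used = n_weeks * 7

weekly_dates = date_strs[0:used:7]         # first day of each 7-day block
weekly_media = media_data[:used].reshape(n_weeks, 7, n_channels).sum(axis=1)
weekly_target = target_data[:used].reshape(n_weeks, 7).sum(axis=1)
```

The same reshape applies to media_costs_by_row and extra_features_data, since both share the [time, column] layout.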

clone_and_log_transform_target_data

    def clone_and_log_transform_target_data(self) -> InputData:

Creates a new InputData instance by copying the current one and applying a natural logarithm transformation (jnp.log) to the target_data. Updates target_is_log_scale to True and appends “(log-transformed)” to the target_name.

Returns:

  • InputData: A new InputData instance with log-transformed target data.
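
The transform itself is small. The sketch below uses np.log in place of jnp.log so that it runs without JAX; both compute the natural logarithm. The space before the appended suffix is an assumption.

```python
import numpy as np

target_data = np.array([1.0, np.e, 100.0])
target_name = "Sales"

log_target = np.log(target_data)                   # natural logarithm
log_name = target_name + " (log-transformed)"      # separator is assumed
target_is_log_scale = True

# np.exp undoes the transform when values are needed on the original scale.
recovered = np.exp(log_target)
```

Since the logarithm is undefined at 0 and for negative values, the target data should be strictly positive before this method is called.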

clone_and_split_media_data

    def clone_and_split_media_data(
        self,
        channel_idx: int,
        split_obs_idx: int,
        media_before_name: str,
        media_after_name: str
    ) -> InputData:

Creates a new InputData instance where a specified media channel is split into two separate channels at a given time point. The data and costs for the original channel are divided between the new “before” and “after” channels.

Parameters:

  • channel_idx (int): The index of the existing media channel to split.

  • split_obs_idx (int): The observation index at which the split occurs. Data before this index goes into media_before_name; data at and after this index goes into media_after_name.

  • media_before_name (str): The name for the new channel representing data before the split point.

  • media_after_name (str): The name for the new channel representing data at and after the split point.

Returns:

  • InputData: A new InputData instance with the specified channel split into two, and corresponding adjustments made to names, data, costs, and priors. Note: Cost priors are scaled proportionally based on the split point, which is an approximation.
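
The data split can be sketched on a stand-in media array as below. Filling the rows outside each new channel's range with zeros and placing the two new columns first are assumptions made for illustration. The prior scaling is not shown because its exact formula is not given here.

```python
import numpy as np

media_data = np.array([[10.0, 1.0], [20.0, 2.0], [30.0, 3.0], [40.0, 4.0]])
media_names = ["tv", "radio"]
channel_idx, split_obs_idx = 0, 3

column = media_data[:, channel_idx]
row_idx = np.arange(len(column))

# "before" keeps rows ahead of the split; "after" keeps the split row onward.
before = np.where(row_idx < split_obs_idx, column, 0.0)
after = column - before

# Replace the original column with the two new ones (ordering is assumed).
others = np.delete(media_data, channel_idx, axis=1)
split_media = np.column_stack([before, after, others])
split_names = ["tv_before", "tv_after"] + [
    name for i, name in enumerate(media_names) if i != channel_idx
]
```

The two new columns always sum back to the original channel, so total media volume is preserved. The same split would apply to the matching column of media_costs_by_row.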