Input Data (input_data.py)¶
This module defines the InputData class, which encapsulates all the necessary data structures required for the ABACUS Marketing Mix Model.
InputData¶
A class designed to hold and manage the various data components fed into the MMM, including media metrics, cost data, control features, and target sales/conversion data.
class InputData:
"""
Encapsulation of data fed into the marketing mix model - both the marketing metrics and the sales metrics.
All 2-dimensional arrays have time (day or week number) as the first index and channel as the second index.
All numbers are numpy.float64, and all arrays of numbers are NumPy ndarrays.
All values are true values (i.e., not scaled down for feeding into the MMM).
"""
Initialization¶
def __init__(
self,
date_strs: np.ndarray,
time_granularity: str,
media_data: np.ndarray,
media_costs: np.ndarray,
media_costs_by_row: np.ndarray,
media_cost_priors: np.ndarray,
learned_media_priors: np.ndarray,
media_names: List[str],
extra_features_data: np.ndarray,
extra_features_names: List[str],
target_data: np.ndarray,
target_is_log_scale: bool,
target_name: str,
) -> None:
Initialises the InputData object. It internally calls _validate to ensure data consistency and correctness.
Parameters:
- date_strs (np.ndarray): 1-D NumPy array of labels (typically date strings) for each time series data point.
- time_granularity (str): String constant describing the time granularity (e.g., cnst.GRANULARITY_DAILY, cnst.GRANULARITY_WEEKLY).
- media_data (np.ndarray): 2-D NumPy array of float64 media data values (e.g., impressions, clicks) with shape [time, channel].
- media_costs (np.ndarray): 1-D NumPy array of float64 total media costs per channel with shape [channel].
- media_costs_by_row (np.ndarray): 2-D NumPy array of float64 media costs per observation (e.g., daily/weekly spend) with shape [time, channel].
- media_cost_priors (np.ndarray): 1-D NumPy array of float64 priors for media cost per unit (e.g., Cost Per Mille/Click) with shape [channel]. Must be > 0 unless a corresponding learned media prior is provided.
- learned_media_priors (np.ndarray): 1-D NumPy array of float64 learned media priors with shape [channel]. These override media_cost_priors if greater than 0.
- media_names (List[str]): List of names corresponding to the media channels.
- extra_features_data (np.ndarray): 2-D NumPy array of float64 values for additional control features (e.g., competitor activity, promotions) with shape [time, feature]. Can be empty if no extra features are used.
- extra_features_names (List[str]): List of names corresponding to the extra features.
- target_data (np.ndarray): 1-D NumPy array of float64 target metric values (e.g., sales, conversions) with shape [time].
- target_is_log_scale (bool): Flag indicating whether target_data is already log-transformed.
- target_name (str): Name of the target metric.
Attributes Initialised:
All parameters listed above are stored as instance attributes with the same names.
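As a concrete illustration of the shapes and dtypes described above, here is a minimal set of arrays for four daily observations, two media channels, and one extra feature. The values are invented; constructing the actual InputData instance (which also takes the name lists and granularity constant) is indicated only in a comment.

```python
import numpy as np

# Hypothetical values: 4 daily observations, 2 media channels, 1 extra feature.
date_strs = np.array(["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"])
media_data = np.array([[100.0, 50.0],
                       [120.0, 55.0],
                       [90.0, 60.0],
                       [110.0, 45.0]])              # shape [time, channel]
media_costs_by_row = media_data * 0.01              # per-observation spend, [time, channel]
media_costs = media_costs_by_row.sum(axis=0)        # total spend per channel, [channel]
extra_features_data = np.array([[0.0], [1.0], [0.0], [1.0]])  # [time, feature]
target_data = np.array([500.0, 520.0, 480.0, 510.0])          # [time]

# These arrays would then be passed to InputData(...) together with
# media_names, extra_features_names, target_name, and time_granularity.
```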
Static Methods¶
_validate¶
@staticmethod
def _validate(
date_strs: np.ndarray,
time_granularity: str,
media_data: np.ndarray,
media_costs: np.ndarray,
media_cost_priors: np.ndarray,
learned_media_priors: np.ndarray,
media_names: List[str],
extra_features_data: np.ndarray,
extra_features_names: List[str],
target_data: np.ndarray,
target_name: str,
) -> None:
Internal static method used to validate the dimensions, types, and consistency of the data provided during __init__. Raises AssertionError if validation fails.
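The exact checks are internal, but the kind of consistency _validate enforces can be sketched as plain shape assertions. This is an illustrative sketch, not the library's implementation; the function name and messages are invented.

```python
import numpy as np

def validate_shapes(media_data, media_costs, media_names, target_data):
    # Sketch of the dimensional consistency a validator would enforce:
    # media arrays, name lists, and target must all line up.
    assert media_data.ndim == 2, "media_data must be 2-D [time, channel]"
    n_time, n_channels = media_data.shape
    assert media_costs.shape == (n_channels,), "one total cost per channel"
    assert len(media_names) == n_channels, "one name per channel"
    assert target_data.shape == (n_time,), "one target value per observation"

# Consistent data passes silently; mismatched shapes raise AssertionError.
validate_shapes(
    media_data=np.ones((10, 3)),
    media_costs=np.ones(3),
    media_names=["tv", "search", "social"],
    target_data=np.ones(10),
)
```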
clone_with_data_edits¶
@staticmethod
def clone_with_data_edits(
input_data: InputData,
editor_func: Callable,
context: Any
) -> InputData:
Creates a deep copy (clone) of an existing InputData instance, allowing modifications to the data arrays via a provided function.
Parameters:
- input_data (InputData): The original InputData instance to clone.
- editor_func (Callable): A function that accepts (context, date_strs, media_data, extra_features_data, target_data) and modifies the data arrays in place.
- context (Any): Client context passed through to the editor_func.
Returns:
InputData: A new InputData instance containing the modified data.
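For instance, a hypothetical editor_func that zeroes out the first observation of one channel (the channel index arriving via context) could look like the following. The call through InputData.clone_with_data_edits is shown only as a comment, since it requires a populated instance; the demonstration runs the editor on standalone arrays.

```python
import numpy as np

def zero_first_observation(context, date_strs, media_data,
                           extra_features_data, target_data):
    # Hypothetical editor: mutate the cloned arrays in place, using the
    # channel index supplied through the client `context`.
    media_data[0, context["channel_idx"]] = 0.0

# Usage would be roughly:
#   edited = InputData.clone_with_data_edits(
#       input_data, zero_first_observation, {"channel_idx": 0})

# Demonstration on standalone arrays:
m = np.array([[10.0, 20.0], [30.0, 40.0]])
zero_first_observation({"channel_idx": 0}, None, m, None, None)
```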
_sanitize_name¶
@staticmethod
def _sanitize_name(media_name: str) -> str:
Internal static utility method to sanitize channel or feature names for use in filenames (e.g., during dump). Converts to lowercase, replaces spaces with underscores, and removes parentheses.
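The documented behaviour can be reproduced with a one-liner; this is a sketch mirroring the description above, not necessarily the library's exact code.

```python
def sanitize_name(media_name: str) -> str:
    # Lowercase, spaces -> underscores, parentheses stripped,
    # per the behaviour described above (illustrative sketch).
    return media_name.lower().replace(" ", "_").replace("(", "").replace(")", "")
```

So a channel named "Paid Search (Brand)" would yield a filename-safe "paid_search_brand".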
_group_by_week¶
@staticmethod
def _group_by_week(idx: int) -> int:
Internal static helper method used by clone_as_weekly to determine the weekly group index for a given daily observation index (based on groups of 7).
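Assuming the grouping described above, the mapping is simple integer division:

```python
def group_by_week(idx: int) -> int:
    # Daily observation index -> index of its 7-day block.
    return idx // 7
```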
Instance Methods¶
dump¶
def dump(
self,
output_dir: str,
suffix: str,
verbose: bool = True
) -> None:
Writes the contents of the InputData instance to several text files within the specified output directory. Useful for debugging and inspection.
Parameters:
- output_dir (str): Path to the directory where files will be saved.
- suffix (str): A suffix to append to the generated filenames (e.g., "raw", "processed").
- verbose (bool, optional): If True, writes detailed per-observation files for dates, media, costs, features, and target. If False, only writes a summary file. Defaults to True.
Output Files (Examples):
- data_{suffix}_summary.txt: Overview of names, costs, priors, and target info.
- data_{suffix}_dates.txt (if verbose): List of date strings.
- data_{suffix}_{media_name}.txt (if verbose): Media data for a specific channel.
- data_{suffix}_{media_name}_costs.txt (if verbose): Media costs per row for a channel.
- data_{suffix}_{extra_feature_name}.txt (if verbose): Data for an extra feature.
- data_{suffix}_target.txt (if verbose): Target data values.
clone_and_add_extra_features¶
def clone_and_add_extra_features(
self,
feature_names: List[str],
feature_data: np.ndarray
) -> InputData:
Creates a new InputData instance by copying the current one and appending new extra features.
Parameters:
- feature_names (List[str]): A list of names for the new features to add.
- feature_data (np.ndarray): A 2-D NumPy array containing the data for the new features, with shape [time, new_features]. Must have the same number of time observations as the existing data.
Returns:
InputData: A new InputData instance with the combined original and new extra features.
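The array bookkeeping involved, appending new feature columns alongside the existing ones, amounts to a horizontal stack. A sketch with invented values:

```python
import numpy as np

existing_features = np.array([[0.0], [1.0], [0.0]])             # [time, feature]
new_features = np.array([[5.0, 1.0], [6.0, 0.0], [7.0, 1.0]])   # [time, new_features]

# Same number of time observations is required; columns are appended.
combined = np.hstack([existing_features, new_features])          # [time, feature + new]
```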
clone_as_weekly¶
def clone_as_weekly(self) -> InputData:
Creates a new InputData instance by aggregating the current (assumed daily) data into weekly sums. It groups data into blocks of 7 days, summing media_data, media_costs_by_row, extra_features_data, and target_data. The first date of each 7-day block is used as the weekly date string. Any trailing partial week at the end of the data is discarded.
Requires:
The original InputData instance must have time_granularity set to cnst.GRANULARITY_DAILY.
Returns:
InputData: A new InputData instance with weekly aggregated data and time_granularity set to cnst.GRANULARITY_WEEKLY.
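The aggregation rule described above (sum blocks of 7 days, drop the trailing partial week) can be sketched for a single 1-D series:

```python
import numpy as np

daily = np.arange(17, dtype=np.float64)   # 17 daily values: 2 full weeks + 3 extra days
n_weeks = len(daily) // 7                 # trailing partial week is discarded
weekly = daily[: n_weeks * 7].reshape(n_weeks, 7).sum(axis=1)
```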
clone_and_log_transform_target_data¶
def clone_and_log_transform_target_data(self) -> InputData:
Creates a new InputData instance by copying the current one and applying a natural logarithm transformation (jnp.log) to the target_data. Updates target_is_log_scale to True and appends “(log-transformed)” to the target_name.
Returns:
InputData: A new InputData instance with log-transformed target data.
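The transformation itself is an elementwise natural logarithm; with NumPy standing in for jnp:

```python
import numpy as np

target = np.array([100.0, 150.0, 200.0])
log_target = np.log(target)   # natural log, as jnp.log would compute
```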
clone_and_split_media_data¶
def clone_and_split_media_data(
self,
channel_idx: int,
split_obs_idx: int,
media_before_name: str,
media_after_name: str
) -> InputData:
Creates a new InputData instance where a specified media channel is split into two separate channels at a given time point. The data and costs for the original channel are divided between the new “before” and “after” channels.
Parameters:
- channel_idx (int): The index of the existing media channel to split.
- split_obs_idx (int): The observation index at which the split occurs. Data before this index goes into media_before_name; data at and after this index goes into media_after_name.
- media_before_name (str): The name for the new channel representing data before the split point.
- media_after_name (str): The name for the new channel representing data at and after the split point.
Returns:
InputData: A new InputData instance with the specified channel split into two, and corresponding adjustments made to names, data, costs, and priors. Note: Cost priors are scaled proportionally based on the split point, which is an approximation.
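The core of the split, dividing one channel's column into a "before" column and an "after" column at the split index, can be sketched as follows. This is an illustrative sketch with invented values, not the method's implementation, which also handles names, costs, and priors.

```python
import numpy as np

media = np.array([10.0, 20.0, 30.0, 40.0])   # one channel, 4 observations
split_obs_idx = 2

# "Before" channel keeps data prior to the split point; "after" keeps the rest.
before = np.where(np.arange(media.size) < split_obs_idx, media, 0.0)
after = np.where(np.arange(media.size) >= split_obs_idx, media, 0.0)
split_columns = np.column_stack([before, after])   # [time, 2]
```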