Guide: Configuring the Model Run

This guide explains how to set up the YAML configuration file required to run the Abacus Media Mix Model.

Overview

The configuration file tells Abacus how to handle your data, which model parameters to use, and what components (like seasonality or control variables) to include. It uses the YAML format for readability.

Quick Start

  1. Copy an Example: Start with one of the Example Configurations below.

  2. Prepare Data: Ensure your input CSV file is ready, following the Data Preparation Guide. The path to this CSV is typically provided when running the model driver script, not within this configuration file.

  3. Set Core Options:

    • Define raw_data_granularity (daily or weekly).

    • Specify the target_col (your main outcome metric, e.g., sales, conversions).

    • List your media channels, providing display_name, impressions_col (for model input), and spend_col (for ROI calculation) for each.

  4. Specify Data Range: Use train_test_ratio or start_date/end_date under data_rows to select the data period.

  5. Add Controls/Seasonality (Optional): If needed, list control_columns or set yearly_seasonality.

  6. Run the Model: Execute the model using the driver script (e.g., demo/runme.py), pointing it to your configuration file.

  7. Iterate: Review the results and adjust configuration parameters (like priors, adstock lag, sampler settings) as needed based on model diagnostics.
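Before launching a long sampling run, it can help to sanity-check the core options from step 3. The following is a minimal sketch of such a check, assuming the YAML file has already been parsed into a Python dictionary. The function name check_core_options is hypothetical and is not part of Abacus; only the key names come from this guide.

```python
# Hypothetical pre-flight check; not part of Abacus itself.
REQUIRED_MEDIA_KEYS = ("display_name", "impressions_col", "spend_col")


def check_core_options(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []
    if config.get("raw_data_granularity") not in ("daily", "weekly"):
        problems.append("raw_data_granularity must be 'daily' or 'weekly'")
    if not config.get("target_col"):
        problems.append("target_col is missing")
    media = config.get("media") or []
    if not media:
        problems.append("media must list at least one channel")
    for i, channel in enumerate(media):
        for key in REQUIRED_MEDIA_KEYS:
            if not channel.get(key):
                problems.append(f"media[{i}] is missing {key}")
    return problems


example = {
    "raw_data_granularity": "weekly",
    "target_col": "sales",
    "media": [
        {"display_name": "TV", "impressions_col": "tv_impressions", "spend_col": "tv_spend"},
        {"display_name": "Radio", "impressions_col": "radio_impressions"},  # spend_col forgotten
    ],
}
print(check_core_options(example))  # ['media[1] is missing spend_col']
```

An empty list means the core options named in this guide are present. It does not guarantee that the referenced columns exist in your CSV.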

Configuration File Structure

The YAML file typically includes these main sections:

  • Data Handling Options (data_rows): Controls how the input data is read and split.

  • Column Definitions: Maps column names in your CSV to their roles in the model (target_col, media, control_columns, ignore_cols).

  • Model Parameters: Sets parameters influencing model fitting and performance (sampler settings, adstock, seasonality).

  • Custom Priors (custom_priors): (Optional) Allows specifying custom Bayesian priors for model parameters.

(For a detailed explanation of every parameter, see the Configuration Reference.)

Key Sections Explained

Data Handling Options (data_rows)

  • Use train_test_ratio (e.g., 0.8 for 80% train, 20% test) for simple time-based splits. A value of 1.0 uses all data for training (no hold-out set).

  • Alternatively, use start_date and end_date (YYYY-MM-DD format) to select a specific time window.
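To make the two options concrete, here is a rough sketch of a chronological split and a date-window filter. It assumes rows are already sorted by date, that the split point is rounded down, and that start_date and end_date are both inclusive. How Abacus itself rounds or treats the endpoints is not stated in this guide, and the function names are illustrative only.

```python
from datetime import date


def split_rows(rows, train_test_ratio):
    """Split chronologically ordered rows into (train, test) without shuffling."""
    if not 0.0 < train_test_ratio <= 1.0:
        raise ValueError("train_test_ratio must be in (0, 1]")
    cut = int(len(rows) * train_test_ratio)  # assumption: round down
    return rows[:cut], rows[cut:]


def select_window(rows, start_date, end_date):
    """Keep rows whose 'date' lies in [start_date, end_date] (inclusive, assumed)."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    return [r for r in rows if start <= date.fromisoformat(r["date"]) <= end]


weeks = list(range(1, 11))            # ten weekly observations
train, test = split_rows(weeks, 0.8)  # first 8 rows train, last 2 rows test

rows = [{"date": "2021-12-31"}, {"date": "2022-01-01"}, {"date": "2022-06-15"}]
window = select_window(rows, "2022-01-01", "2023-12-31")
```

With train_test_ratio set to 1.0 the test portion is empty, which matches the "no hold-out set" behaviour described above.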

Column Definitions

  • date_col: The name of the date column in your input CSV. Defaults to “date” if omitted, so set it explicitly whenever your date column has a different name.

  • target_col: Crucial. The main metric you want the model to predict/explain.

  • media: Define each marketing channel here. impressions_col (volume/exposure) is used in the core model transformation, while spend_col is needed later to calculate ROI. display_name is for reporting.

  • control_columns: Include external factors (e.g., competitor spend, promotions, economic indicators) that might influence the target variable but aren’t marketing channels being modelled directly. These must be numeric (encode categorical variables beforehand). Prophet-generated components are also added here automatically if enabled (see Prophet Integration (prophet)).

  • extra_features_impact: (Optional) A dictionary mapping control column names to 'negative'. If specified, the values in these columns will be multiplied by -1 during data loading. Useful if a feature inherently has an inverse relationship with the target (e.g., competitor spend). Example: extra_features_impact: {competitor_spend: 'negative'}.

  • ignore_cols: List any columns from your CSV that should be completely ignored by the model.
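The sign flip described for extra_features_impact can be pictured with the short sketch below. It is an illustration of the stated behaviour (multiply the listed columns by -1), not the loader Abacus actually uses. The function name and the row-of-dicts data layout are assumptions made for the example.

```python
def apply_feature_impact(rows, extra_features_impact):
    """Return copies of rows with every 'negative' column multiplied by -1."""
    negative_cols = [
        col for col, direction in extra_features_impact.items()
        if direction == "negative"
    ]
    flipped = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for col in negative_cols:
            new_row[col] = -1 * new_row[col]
        flipped.append(new_row)
    return flipped


rows = [
    {"sales": 120, "competitor_spend": 30},
    {"sales": 95, "competitor_spend": 55},
]
flipped = apply_feature_impact(rows, {"competitor_spend": "negative"})
print(flipped[0]["competitor_spend"])  # -30
```

After the flip, a positive coefficient on competitor_spend corresponds to a negative effect of the original variable on the target.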

Model Parameters

  • Sampler Settings (tune, draws, chains, target_accept): These control the MCMC sampling process and trade runtime against the reliability of the results. See Performance Guidelines for recommendations based on data size, and check convergence diagnostics after every run.

  • adstock_max_lag: Defines the maximum carry-over period for marketing effects (in units of your raw_data_granularity). Choose based on expected duration of advertising impact (e.g., 4-12 weeks for weekly data, 7-28 days for daily data).

  • yearly_seasonality: (Optional) Integer specifying the number of Fourier modes to include directly in the PyMC model for modelling yearly seasonality (e.g., 3 to 6). If omitted or null, no explicit Fourier seasonality component is added (though seasonality might be captured by control variables or Prophet components). See also Prophet Integration (prophet).
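To build intuition for adstock_max_lag, the sketch below applies a simple geometric carry-over to one burst of impressions. It assumes a geometric form in which each earlier period is weighted by alpha raised to the lag, counts the current period as lag 0, and applies no normalisation. The exact transformation and truncation convention in Abacus may differ, and the function name is illustrative.

```python
def geometric_adstock(series, alpha, max_lag):
    """Carry over past values with weight alpha**lag, for lag = 0..max_lag."""
    out = []
    for t in range(len(series)):
        total = 0.0
        for lag in range(max_lag + 1):
            if t - lag >= 0:  # ignore lags that reach before the first row
                total += (alpha ** lag) * series[t - lag]
        out.append(total)
    return out


impressions = [100.0, 0.0, 0.0, 0.0, 0.0]
print(geometric_adstock(impressions, alpha=0.5, max_lag=2))
# [100.0, 50.0, 25.0, 0.0, 0.0]
```

In this sketch the effect of the initial burst is cut off entirely after max_lag periods, which is why the lag should cover the full period over which you expect advertising to keep working.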

Prophet Integration (prophet)

(Optional) This section configures the use of Meta’s Prophet library for automatic seasonality and holiday decomposition. If this section is included, Prophet runs first, and its generated components are added as features for the main PyMC model.

  • holiday_country: Specify the country code (e.g., ‘US’, ‘GB’) to automatically include built-in holidays for that country.

  • include_holidays: Set to true if you provide a separate holiday file (path specified when running the model driver) and want Prophet to use it. The country setting still applies for formatting.

  • daily_seasonality, weekly_seasonality, yearly_seasonality: Set these to true within the prophet block to enable Prophet’s decomposition for these specific seasonalities.

  • Automatic Feature Addition: If any Prophet components are enabled (trend is implicitly always generated if Prophet runs), corresponding columns (‘trend’, ‘daily’, ‘weekly’, ‘yearly’, ‘holidays’) are automatically added to the control_columns list passed to the PyMC model.

Note: The top-level yearly_seasonality parameter controls Fourier terms added directly to the PyMC model, independent of Prophet’s yearly seasonality decomposition. You can use either, both, or neither.
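For readers unfamiliar with Fourier terms, the sketch below shows one common way to build them: each mode contributes a sine and a cosine column with a yearly period. This is a generic construction shown under stated assumptions (day-of-year input, a period of 365.25 days, one sine/cosine pair per mode). The features Abacus generates internally may be scaled or ordered differently.

```python
import math


def fourier_features(day_of_year, n_modes, period=365.25):
    """Return [sin_1, cos_1, ..., sin_n, cos_n] for a single day of the year."""
    features = []
    for k in range(1, n_modes + 1):
        angle = 2.0 * math.pi * k * day_of_year / period
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


# yearly_seasonality: 3 would correspond to 3 modes, i.e. 6 columns here
row = fourier_features(day_of_year=45, n_modes=3)
print(len(row))  # 6
```

More modes let the seasonal curve bend more sharply, at the cost of extra coefficients for the sampler to estimate.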

Custom Priors (custom_priors)

This advanced section allows you to override the default Bayesian priors. Use this if you have strong prior beliefs about parameter values based on domain knowledge or previous studies. See the Prior Selection Guidelines below and the Configuration Reference for distribution options and formatting. Always check the impact of custom priors on model diagnostics.

Performance Guidelines

Adjust sampler settings based on your dataset size:

  • Small (< 100 rows): tune: 1000, draws: 1000, chains: 2, target_accept: 0.8

  • Medium (100-500 rows): tune: 2000, draws: 2000, chains: 4, target_accept: 0.95

  • Large (> 500 rows): tune: 4000, draws: 4000, chains: 4 (or more depending on cores), target_accept: 0.95

Be mindful of memory usage, especially when increasing draws and chains.
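The tiers above can be expressed as a small lookup. This helper is hypothetical and simply restates the guideline values. Treat the output as a starting point to adjust after reviewing diagnostics.

```python
def suggest_sampler_settings(n_rows):
    """Map a dataset's row count to the starting sampler settings listed above."""
    if n_rows < 100:       # small
        return {"tune": 1000, "draws": 1000, "chains": 2, "target_accept": 0.8}
    if n_rows <= 500:      # medium
        return {"tune": 2000, "draws": 2000, "chains": 4, "target_accept": 0.95}
    # large: 4 chains as a baseline; more may be sensible if cores allow
    return {"tune": 4000, "draws": 4000, "chains": 4, "target_accept": 0.95}


settings = suggest_sampler_settings(104)  # e.g. two years of weekly data
stored_draws = settings["draws"] * settings["chains"]
print(settings, stored_draws)
```

The product draws * chains is the number of posterior samples kept per parameter, which is the main driver of the memory usage mentioned above.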

Prior Selection Guidelines

  • intercept: Often LogNormal if target > 0, Normal otherwise. Scale based on target magnitude.

  • beta_channel (Effectiveness): HalfNormal is common (constrains the effect to be non-negative). Adjust sigma based on expected effect range.

  • alpha (Adstock Decay): Beta distribution (constrains between 0-1). alpha acts as a retention rate: higher alpha -> slower decay (longer carry-over), lower alpha -> faster decay.

  • lam (Saturation): Gamma or HalfNormal. Controls how quickly saturation occurs.

  • likelihood['kwargs']['sigma'] (Error Term): HalfNormal is common. Represents unexplained variance.

  • gamma_control (coefficients for control_columns) / gamma_fourier (coefficients for the Fourier modes from yearly_seasonality): Often Normal or Laplace (allows positive/negative effects). Adjust scale (sigma or b) based on expected impact size.

(Refer to PyMC documentation for details on specific distributions and their parameters.)
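When choosing scale parameters, it can help to translate them into the mean they imply. The sketch below uses the standard closed-form means for three of the distributions mentioned above, assuming the shape/rate parameterisation of Gamma. The helper names are illustrative and nothing here is specific to Abacus.

```python
import math


def half_normal_mean(sigma):
    """Mean of HalfNormal(sigma): sigma * sqrt(2 / pi)."""
    return sigma * math.sqrt(2.0 / math.pi)


def beta_mean(a, b):
    """Mean of Beta(a, b): a / (a + b)."""
    return a / (a + b)


def gamma_mean(shape, rate):
    """Mean of Gamma with the given shape and rate: shape / rate."""
    return shape / rate


print(round(half_normal_mean(2.0), 3))  # 1.596
print(beta_mean(1.0, 3.0))              # 0.25
print(gamma_mean(3.0, 1.0))             # 3.0
```

For example, a Beta(1, 3) prior on alpha centres the prior belief around 0.25, while a HalfNormal with sigma 2 on beta_channel implies an average effect of roughly 1.6 on the model's scale.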

Troubleshooting

  • Convergence Issues: Increase tune/draws, adjust target_accept, check data quality, review priors.

  • Memory Errors: Reduce draws/chains.

  • Poor Fit: Check data quality, add relevant control_columns, adjust priors, verify yearly_seasonality setting.
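One widely used convergence check compares the variation between chains with the variation within them. The sketch below implements the classic Gelman-Rubin statistic (R-hat) for a single parameter using only the standard library. It is a simplified teaching version. Diagnostic libraries such as ArviZ report a more robust rank-normalised split variant, so prefer those in practice.

```python
import math
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic (non-split) R-hat for one parameter; chains are equal-length lists."""
    n = len(chains[0])
    within = mean(variance(chain) for chain in chains)       # W
    between = n * variance([mean(chain) for chain in chains])  # B
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


mixed = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]      # chains agree
stuck = [[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]]  # chains disagree
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

Values close to 1 suggest the chains are sampling the same distribution. Values well above 1, as for the second example, are a sign to increase tune/draws or revisit priors as described above.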

Example Configurations

Basic Weekly Model

# --- Data Handling ---
raw_data_granularity: weekly
data_rows:
  train_test_ratio: 0.9 # 90% train, 10% test

# --- Column Definitions ---
target_col: sales
media:
  - display_name: TV
    impressions_col: tv_impressions
    spend_col: tv_spend
  - display_name: Radio
    impressions_col: radio_impressions
    spend_col: radio_spend
# control_columns: null # Explicitly none
# ignore_cols: null

# --- Model Parameters ---
# adstock_max_lag: 4 # Example: 4 weeks lag
# yearly_seasonality: null # Explicitly none
# tune: 1000 # Example sampler settings
# draws: 1000
# chains: 2
# target_accept: 0.8

# --- Custom Priors ---
# custom_priors: null # Using default priors

Daily Model with Controls and Seasonality

# --- Data Handling ---
raw_data_granularity: daily
data_rows:
  start_date: 2022-01-01
  end_date: 2023-12-31

# --- Column Definitions ---
target_col: conversions
media:
  - display_name: Facebook Ads
    impressions_col: fb_impressions
    spend_col: fb_cost
  - display_name: Google Search
    impressions_col: search_clicks # Using clicks as proxy
    spend_col: search_cost
control_columns:
  - competitor_promo_active
  - is_holiday # Manually created holiday flag
# ignore_cols: null

# --- Model Parameters ---
adstock_max_lag: 14 # 2 weeks for daily data
yearly_seasonality: 3 # Include 3 Fourier modes for yearly seasonality
tune: 2000
draws: 2000
chains: 4
target_accept: 0.9

# --- Custom Priors ---
# custom_priors: null # Using default priors