Data Preparation Guide

Correctly formatting your input data is crucial for Abacus to run successfully and produce meaningful results. This guide explains the expected structure and content for the input CSV file.

Required Columns

  • Date Column:

    • Contains dates or timestamps for each observation.

    • The column name is specified by date_col in the configuration file (defaults to “date”).

    • The format must be parsable by pandas; ISO 8601 dates (YYYY-MM-DD) are preferred because they are unambiguous.

  • Target Column:

    • The metric the model aims to predict (e.g., sales, conversions, sign-ups).

    • The column name is specified by target_col in the configuration.

    • Must be numeric (integer or float).

  • Media Columns: For each marketing channel defined in the media section of the configuration:

    • Volume/Impressions Column: Represents the exposure metric (e.g., impressions, clicks, GRPs). This column’s values are transformed by adstock and saturation in the core model. The name is specified by impressions_col for each channel in the configuration. Must be numeric.

    • Spend/Cost Column: Represents the cost associated with the volume/impressions. This is primarily used for calculating Return on Investment (ROI) after the model is fitted. The name is specified by spend_col for each channel in the configuration. Must be numeric.
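The requirements above can be checked before handing the file to Abacus. The following is a minimal sketch, not part of Abacus itself: the function name check_required_columns and its arguments are hypothetical, and the column names passed in are assumed to mirror whatever date_col, target_col, impressions_col and spend_col are set to in your configuration.

```python
import pandas as pd


def check_required_columns(df, date_col, target_col, media_cols):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper for illustration only (not an Abacus function).
    """
    needed = [date_col, target_col, *media_cols]
    missing = [c for c in needed if c not in df.columns]
    if missing:
        # Without the columns, the remaining checks cannot run.
        return [f"missing column(s): {missing}"]

    problems = []

    # The date column must be parsable by pandas; unparsable entries become NaT.
    parsed = pd.to_datetime(df[date_col], errors="coerce")
    if parsed.isna().any():
        problems.append(f"'{date_col}' contains unparsable dates")

    # The target and every media column must be numeric (integer or float).
    for col in [target_col, *media_cols]:
        if not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"'{col}' is not numeric")

    return problems
```

A call such as check_required_columns(df, "date", "KPI", ["tv_impressions", "tv_spend"]) returns an empty list when the columns are present and correctly typed, and a list of readable messages otherwise.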

Optional Columns

  • Control Variable Columns:

    • Any additional factors that might influence the target variable (e.g., competitor activity, promotions, economic indicators, seasonality indicators not covered by Fourier modes).

    • Column names are listed in the control_columns list in the configuration.

    • Must be numeric. Categorical variables need to be numerically encoded (e.g., one-hot encoding, dummy variables) before being included in the input CSV.

  • Ignored Columns: Columns listed in ignore_cols in the configuration are not used by the model.
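As an illustration of the numeric-encoding requirement for control variables, here is one possible way to one-hot encode a categorical column with pandas before writing the CSV. The promo_type column and its categories are invented for this sketch; the resulting column names (promo_discount, promo_bundle) are what you would then list in control_columns.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "promo_type": ["none", "discount", "bundle"],  # hypothetical categorical control
})

# Fix the category order so that "none" is the first level and therefore
# the baseline that drop_first removes.
df["promo_type"] = pd.Categorical(
    df["promo_type"], categories=["none", "discount", "bundle"]
)

# One 0/1 column per remaining level; dropping the baseline avoids
# perfectly collinear dummy columns.
encoded = pd.get_dummies(
    df, columns=["promo_type"], prefix="promo", drop_first=True, dtype=int
)

print(encoded.columns.tolist())
```

Dropping one level is a modelling choice rather than an Abacus requirement; keep all levels if that suits your setup better.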

Data Format Notes

  • Granularity: Each row must represent exactly one period at the granularity specified by raw_data_granularity in the configuration (e.g., “daily” or “weekly”). Every column must be aligned to those same periods.

  • Missing Values: Abacus does not perform automatic imputation. You must handle missing values (e.g., using imputation techniques appropriate for your data, or removing affected periods) before providing the data to the model. Ensure all required columns are free of NaNs or nulls.

  • Time Series: Data must be sorted chronologically by the date column. Ensure there are no gaps in the time sequence for the chosen granularity; for example, a daily dataset needs one row for every calendar day between its first and last date.
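The format notes above (chronological order, no gaps, no missing values) can be verified with a short pandas check. This is a sketch under stated assumptions: the helper find_time_series_problems and the STEP mapping are hypothetical, and the mapping simply assumes that "daily" means a 1-day step and "weekly" a 7-day step counted from the first date in the file.

```python
import pandas as pd

# Hypothetical mapping from a granularity label to a pandas frequency string.
STEP = {"daily": "1D", "weekly": "7D"}


def find_time_series_problems(df, date_col="date", granularity="daily",
                              required_cols=()):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    dates = pd.to_datetime(df[date_col])

    # Rows must already be in chronological order.
    if not dates.is_monotonic_increasing:
        problems.append("rows are not sorted chronologically")

    # Each period should appear once.
    if dates.duplicated().any():
        problems.append("duplicate dates found")

    # Build the full expected sequence and see which periods are absent.
    expected = pd.date_range(dates.min(), dates.max(), freq=STEP[granularity])
    missing = expected.difference(pd.DatetimeIndex(dates))
    if len(missing) > 0:
        problems.append(
            f"{len(missing)} missing period(s), first: {missing[0].date()}"
        )

    # Required columns must not contain NaNs or nulls.
    for col in required_cols:
        n_null = int(df[col].isna().sum())
        if n_null:
            problems.append(f"column '{col}' has {n_null} missing value(s)")

    return problems
```

The function only reports problems. How to fix them (imputing values, or removing affected periods) remains a decision for whoever prepares the data.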

Example CSV Structure

Here’s a simplified example of how your CSV might look for a daily model with TV and Radio channels, plus a ‘promotion’ control variable:

date,KPI,tv_impressions,tv_spend,radio_impressions,radio_spend,promotion
2023-01-01,1500,50000,250,100000,100,0
2023-01-02,1550,52000,260,105000,105,0
2023-01-03,1600,51000,255,102000,102,1
2023-01-04,1580,49000,245,98000,98,1
2023-01-05,1520,53000,265,110000,110,0
...

Note: The actual column names (KPI, tv_impressions, tv_spend, etc.) must match those specified in your configuration file.
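To inspect a file like the example above before a run, you can load it with pandas and look at the inferred column types. The snippet below is a sketch that embeds the five sample rows as a string so it runs on its own; with a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# The five sample rows from the example, embedded so the snippet is runnable.
csv_text = """date,KPI,tv_impressions,tv_spend,radio_impressions,radio_spend,promotion
2023-01-01,1500,50000,250,100000,100,0
2023-01-02,1550,52000,260,105000,105,0
2023-01-03,1600,51000,255,102000,102,1
2023-01-04,1580,49000,245,98000,98,1
2023-01-05,1520,53000,265,110000,110,0
"""

# parse_dates turns the date column into datetime values on load.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Every column other than the date should come back with a numeric dtype.
non_numeric = [
    col for col in df.columns.drop("date")
    if not pd.api.types.is_numeric_dtype(df[col])
]

print(df.dtypes)
print("non-numeric columns:", non_numeric)
```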