Explanation: Leave-One-Out Cross-Validation (LOO-CV)

This document explains the concept of Leave-One-Out Cross-Validation (LOO-CV) and the key metrics used for model evaluation and comparison in Bayesian analysis, particularly as implemented by ArviZ.

What is Cross-Validation?

Cross-validation is a set of techniques used to estimate how well a model predicts unseen data. It works by repeatedly fitting the model to subsets of the data and using each fit to predict the held-out remainder, as sketched below.
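
As a concrete, purely illustrative sketch, the loop below leaves out one observation at a time, "fits" an intentionally trivial model (the mean of the remaining points), and scores the prediction on the held-out point. The data and the squared-error score are assumptions chosen only for illustration, not part of any Bayesian workflow.

```python
import numpy as np

# Toy data; purely illustrative.
y = np.array([2.1, 1.9, 3.4, 2.8, 2.2, 3.1])

squared_errors = []
for i in range(len(y)):
    train = np.delete(y, i)        # leave observation i out
    prediction = train.mean()      # "fit" a deliberately trivial model
    squared_errors.append((y[i] - prediction) ** 2)

# Leave-one-out estimate of out-of-sample squared error.
loo_mse = np.mean(squared_errors)
print(f"LOO mean squared error: {loo_mse:.3f}")
```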

Uses of Cross-Validation:

  • Assess the predictive performance of a model.

  • Detect model misspecification or calibration issues.

  • Compare multiple models.

  • Select the best model from multiple candidates.

  • Combine predictions from multiple models.

Using Cross-Validation for a Single Model

Primary Reasons:

  1. To evaluate a model’s ability to make future predictions.

  2. To check whether the model adequately describes the observed data, even when making future predictions is not the goal.

Detailed Cases:

  • Predictive Assessment: Evaluating how well the model might perform on new, similar data.

  • Model Assessment: Evaluating whether a model can generalise from known data to predict unknown data accurately.

Using Cross-Validation for Multiple Models

Primary Reasons:

  1. To identify the model with the best predictive accuracy.

  2. To select a model that has learned the most from the data, offering the best generalisation.

  3. To combine predictions from multiple models based on their estimated performance (e.g., Bayesian Model Averaging).

When Not to Use Cross-Validation?

Cross-validation isn’t always necessary, especially when no model selection is required. It is not suitable for testing whether a specific effect is non-zero (hypothesis testing about parameters), because it focuses on predictive accuracy rather than directly on parameter inference.

Key Components of Cross-Validation

  • Data Division: How data is split (e.g., LOO - Leave-One-Out, LOGO - Leave-One-Group-Out, LFO - Leave-Future-Out). LOO is commonly used in Bayesian contexts.

  • Utility/Loss Functions: Metrics used to evaluate predictions (e.g., RMSE, log predictive density). LOO typically uses the log score, summed over observations to give the expected log pointwise predictive density (ELPD); a sketch of the underlying computation follows this list.

  • Computational Methods: Techniques used to compute the required predictive distributions efficiently (e.g., K-fold refitting, or PSIS - Pareto Smoothed Importance Sampling - which approximates exact LOO without refitting the model for every observation).
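
For the log score, the pointwise quantity being summed is the log predictive density of each observation. The sketch below shows how a pointwise log predictive density is computed from a matrix of per-draw log-likelihood values; the log_lik array here is simulated stand-in data, not the output of any particular model, and exact LOO would additionally remove each observation's influence on the posterior (which PSIS approximates).

```python
import numpy as np
from scipy.special import logsumexp

# Stand-in log-likelihood matrix: rows are posterior draws, columns are observations.
rng = np.random.default_rng(0)
log_lik = rng.normal(loc=-1.0, scale=0.3, size=(2000, 50))

# Log pointwise predictive density: average the predictive density over draws
# for each observation, computed stably on the log scale.
lppd_pointwise = logsumexp(log_lik, axis=0) - np.log(log_lik.shape[0])
lppd = lppd_pointwise.sum()
print(f"in-sample lppd: {lppd:.1f}")
```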


Explainer for LOO Metrics (from ArviZ)

ArviZ calculates LOO-CV using PSIS and provides three key metrics:

1. LOO ELPD (Expected Log Pointwise Predictive Density)

  • What it tells you: An estimate of the model’s out-of-sample predictive accuracy on a log scale. Higher values indicate better expected predictive performance on new data.

  • Think of it as: A score for your model’s predictive accuracy. It’s the primary metric for comparing models based on prediction.

2. LOO p_loo (Effective Number of Parameters)

  • What it tells you: An estimate of the effective number of parameters in the model; it acts as the overfitting penalty in the ELPD calculation and reflects how flexible the model is in fitting the data.

  • Think of it as: A measure of model complexity. A lower p_loo suggests a simpler model relative to its predictive performance. Very high p_loo relative to the actual number of parameters can indicate model misspecification or influential observations.

3. LOO Standard Error (SE)

  • What it tells you: The uncertainty associated with the LOO ELPD estimate. A smaller standard error indicates more confidence in the ELPD score.

  • Think of it as: A margin of error for your ELPD. When comparing two models, if the difference in their ELPD scores is small relative to the standard error of the difference, the models might be considered to have similar predictive performance.
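
A minimal sketch of obtaining all three numbers with ArviZ, using the centered_eight example dataset bundled with the library (your own InferenceData would need a log_likelihood group); the attribute names below match recent ArviZ releases.

```python
import arviz as az

# Example InferenceData shipped with ArviZ; it already contains a log_likelihood group.
idata = az.load_arviz_data("centered_eight")

loo_result = az.loo(idata, pointwise=True)
print(loo_result)            # summary with elpd_loo, p_loo, SE and Pareto k counts
print(loo_result.elpd_loo)   # ELPD estimate
print(loo_result.p_loo)      # effective number of parameters
print(loo_result.se)         # standard error of the ELPD estimate
```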

Important Considerations:

  • Consider all three metrics together. Don’t rely solely on ELPD. A model with high ELPD but also very high p_loo or a large SE might be overfitting or unstable.

  • Compare models. Calculate these metrics for different models fitted to the same data. The difference in ELPD between models, considered alongside the standard error of the difference, helps determine which model predicts better. ArviZ’s az.compare function facilitates this (see the sketch after this list).

  • Check Pareto k diagnostics. PSIS-LOO relies on importance sampling. ArviZ provides Pareto k diagnostic values for each observation. High k values (e.g., > 0.7) indicate unreliable LOO estimates for those points, suggesting the model might be misspecified or the observation is highly influential.
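
A sketch of comparing two models and inspecting the Pareto k diagnostics. The two InferenceData objects are both example datasets bundled with ArviZ (centered and non-centered parameterisations of the eight-schools model), standing in for two candidate models fitted to the same data; function and attribute names follow recent ArviZ releases.

```python
import arviz as az

# Two parameterisations of the same eight-schools model, bundled with ArviZ.
models = {
    "centered": az.load_arviz_data("centered_eight"),
    "non_centered": az.load_arviz_data("non_centered_eight"),
}

# Ranks models by ELPD (LOO by default) and reports the standard error
# of each pairwise difference.
comparison = az.compare(models)
print(comparison)

# Pareto k diagnostics for one model: values above ~0.7 flag observations
# whose LOO estimate is unreliable.
loo_centered = az.loo(models["centered"], pointwise=True)
print(loo_centered.pareto_k)
az.plot_khat(loo_centered)
```

The comparison table also includes model weights (stacking by default), which is what the model-combination use mentioned earlier relies on.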

