Model Bake-off
How the prediction model was chosen — seven forecasting approaches, scored head-to-head under cross-validation on 1,667 World-Cup-team matches since 2012.
METHODOLOGY
Every candidate is trained and tested on the same data with the same protocol, so differences reflect the model, not the split. We evaluate on held-out World-Cup-vs-World-Cup matches (the prediction target) under two cross-validation schemes, each repeated 100× with bootstrap resampling for confidence intervals:
- Forward — train on the past, predict the future (test = most recent matches, from 2023-07-12). This mirrors the real tournament task and is the headline scheme.
- Random — stratified-by-date holdout (interpolation). A sanity check against the forward result.
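The forward scheme can be illustrated with a minimal sketch. It assumes each match is a record with a `date` field and that the bootstrap resamples the test set with replacement; the names `forward_split` and `bootstrap_resamples` and the toy data are hypothetical, not taken from the project code.

```python
import random
from datetime import date

def forward_split(matches, cutoff):
    """Train on matches strictly before the cutoff date, test on the rest."""
    train = [m for m in matches if m["date"] < cutoff]
    test = [m for m in matches if m["date"] >= cutoff]
    return train, test

def bootstrap_resamples(items, n_resamples=100, seed=0):
    """Yield n_resamples samples drawn with replacement, each len(items) long."""
    rng = random.Random(seed)
    for _ in range(n_resamples):
        yield [rng.choice(items) for _ in items]

# Hypothetical toy data: one match per year, 2012 to 2025.
matches = [{"date": date(2012 + i, 7, 1), "id": i} for i in range(14)]

# The cutoff date is the one quoted for the forward scheme above.
train, test = forward_split(matches, cutoff=date(2023, 7, 12))
```

Splitting by date rather than at random keeps every test match later than every training match, which is what makes the forward scheme mirror the real prediction task.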
Each forecast is graded on two proper scoring rules and a calibration decomposition:
- RPS (Ranked Probability Score) — accuracy of the Win/Draw/Loss probabilities; the standard football metric. Lower is better.
- Log-score — accuracy of the full scoreline distribution (only for models that produce one). Lower is better.
- Calibration (CORP / Murphy decomposition) — splits the score into MCB (miscalibration — are the probabilities honest? lower is better) and DSC (discrimination — do they separate outcomes? higher is better). This is why two models with similar RPS can be very different.
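As an illustration of the two scoring rules, here is a minimal sketch. It assumes outcome probabilities are ordered (win, draw, loss), that the log-score is the negative natural log of the probability given to the observed scoreline, and that a scoreline distribution is a dict keyed by `(home_goals, away_goals)`. The function names are hypothetical.

```python
import math

def rps(probs, outcome):
    """Ranked Probability Score for one match.

    probs: probabilities in ordered categories, e.g. (win, draw, loss).
    outcome: index of the observed category.
    """
    cum_p = cum_o = total = 0.0
    for i in range(len(probs) - 1):
        cum_p += probs[i]                       # cumulative forecast
        cum_o += 1.0 if i == outcome else 0.0   # cumulative observation
        total += (cum_p - cum_o) ** 2
    return total / (len(probs) - 1)

def log_score(score_probs, observed):
    """Negative log probability assigned to the observed scoreline."""
    return -math.log(score_probs[observed])
```

Because RPS compares cumulative probabilities, a forecast that puts its mass on a draw is penalised less for an observed win than one that puts it on a loss.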
Significance is read from the bootstrap 95% intervals, the fraction of resamples each model beats the baseline, and a Diebold–Mariano test on the forward test set.
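A rough sketch of how these three quantities could be computed from per-match RPS values. It assumes the bootstrap resamples matches of a fixed test set, uses a simple percentile interval, and uses the basic Diebold–Mariano statistic with a normal approximation and no autocorrelation correction. The function names are hypothetical and the project's exact procedure may differ.

```python
import math
import random
from statistics import mean, variance

def bootstrap_compare(loss_model, loss_base, n_resamples=100, seed=0):
    """Return (mean loss, rough 95% percentile interval, share beating base)."""
    rng = random.Random(seed)
    n = len(loss_model)
    model_means, wins = [], 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # same matches for both models
        m = mean(loss_model[i] for i in idx)
        b = mean(loss_base[i] for i in idx)
        model_means.append(m)
        wins += m < b
    model_means.sort()
    k = int(0.025 * n_resamples)
    interval = (model_means[k], model_means[n_resamples - 1 - k])
    return mean(model_means), interval, wins / n_resamples

def diebold_mariano(loss_a, loss_b):
    """Two-sided DM test on paired losses (assumes non-constant differences)."""
    d = [a - b for a, b in zip(loss_a, loss_b)]
    stat = mean(d) / math.sqrt(variance(d) / len(d))
    p_value = math.erfc(abs(stat) / math.sqrt(2))  # two-sided normal p-value
    return stat, p_value
```

Resampling the same match indices for both models keeps the comparison paired, so the "beats base" share reflects differences on identical matches.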
THE CONTENDERS
RESULTS
| Model | RPS ↓ | log-score ↓ | MCB ↓ | DSC ↑ | beats base |
|---|---|---|---|---|---|
| DC + squad prior (BEST) | 0.2075 [0.202, 0.227] | 2.945 | 0.0047 | 0.0282 | 95% (DM p=0.41) |
| Base DC (REFERENCE) | 0.2093 [0.203, 0.230] | 2.960 | 0.0053 | 0.0273 | n/a |
| Davidson (recal.) | 0.2125 [0.205, 0.218] | n/a | 0.0091 | 0.0287 | 11% (DM p=0.08) |
| Davidson | 0.2126 [0.206, 0.219] | n/a | 0.0092 | 0.0287 | 11% (DM p=0.07) |
| Dynamic DC | 0.2188 [0.210, 0.230] | n/a | 0.0105 | 0.0244 | 11% (DM p=0.64) |
| Margin Elo | 0.2202 [0.211, 0.228] | n/a | 0.0117 | 0.0243 | 10% (DM p=0.04) |
| GBM | 0.2241 [0.215, 0.237] | n/a | 0.0147 | 0.0241 | 2% (DM p=0.08) |
RPS and log-score are means over 100 resamples; the bracketed range after each RPS value is its 95% interval. “beats base” is the share of resamples in which the model has a lower RPS than Base DC; for the forward scheme the Diebold–Mariano p-value is shown alongside it.
CALIBRATION (RELIABILITY DIAGRAM)
Forecast probability vs observed frequency for home wins (forward test set). Points on the dashed diagonal are perfectly calibrated; points below it are over-confident, meaning home wins occurred less often than forecast. Base DC and the squad-prior model hug the diagonal; the tree model (GBM) strays furthest.
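The points of such a diagram can be produced by binning forecasts, as in the sketch below. It assumes equal-width bins and binary home-win outcomes. The second function is the classical binned Murphy decomposition of the Brier score, offered only as an analogue of MCB and DSC; it is not the isotonic-regression (CORP) procedure named in the methodology. All names are hypothetical.

```python
def reliability_bins(probs, outcomes, n_bins=10):
    """Return (mean forecast, observed frequency, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[k].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            freq = sum(y for _, y in b) / len(b)
            rows.append((mean_p, freq, len(b)))
    return rows

def binned_murphy_terms(rows):
    """Binned miscalibration and discrimination terms (Brier-score analogue)."""
    n = sum(c for _, _, c in rows)
    base_rate = sum(freq * c for _, freq, c in rows) / n
    mcb = sum(c * (mean_p - freq) ** 2 for mean_p, freq, c in rows) / n
    dsc = sum(c * (freq - base_rate) ** 2 for _, freq, c in rows) / n
    return mcb, dsc
```

Each row is one point on the diagram: the mean forecast on the x-axis against the observed frequency on the y-axis. The miscalibration term is the count-weighted squared distance of those points from the diagonal.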
WHAT WE LEARNED
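The parsimonious scoreline model is hard to beat. The Win/Draw/Loss model (Davidson), the pure rating system (Margin Elo), and the flexible tree model (GBM) all score worse than Base DC on RPS and show higher miscalibration (MCB), although only Margin Elo's gap is significant at the 5% level on the Diebold–Mariano test (p=0.04). Modeling the full scoreline appears to yield better-calibrated outcome probabilities.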
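To make concrete how a scoreline model produces Win/Draw/Loss probabilities, here is a minimal sketch. It assumes "DC" refers to a Dixon-Coles style model: independent Poisson goal counts with a low-score dependence correction `rho`. The scoring rates, the `rho` value, and the truncation at `max_goals` are illustrative assumptions, not the fitted model.

```python
import math

def poisson_pmf(k, rate):
    return math.exp(-rate) * rate ** k / math.factorial(k)

def dc_tau(x, y, lam, mu, rho):
    """Dixon-Coles adjustment for the four low-scoring results."""
    if x == 0 and y == 0:
        return 1 - lam * mu * rho
    if x == 0 and y == 1:
        return 1 + lam * rho
    if x == 1 and y == 0:
        return 1 + mu * rho
    if x == 1 and y == 1:
        return 1 - rho
    return 1.0

def scoreline_grid(lam, mu, rho=0.0, max_goals=10):
    """Probability of each (home, away) scoreline, renormalised after truncation."""
    grid = {
        (x, y): dc_tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
        for x in range(max_goals + 1)
        for y in range(max_goals + 1)
    }
    total = sum(grid.values())
    return {score: p / total for score, p in grid.items()}

def outcome_probs(grid):
    """Collapse the scoreline grid into (home win, draw, home loss)."""
    win = sum(p for (x, y), p in grid.items() if x > y)
    draw = sum(p for (x, y), p in grid.items() if x == y)
    loss = sum(p for (x, y), p in grid.items() if x < y)
    return win, draw, loss

# Hypothetical scoring rates for the two teams.
win, draw, loss = outcome_probs(scoreline_grid(lam=1.6, mu=1.1, rho=-0.1))
```

The outcome probabilities are sums over the same grid that the log-score evaluates, so a scoreline model can be graded on both metrics, while an outcome-only model such as Davidson can be graded only on RPS.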
Only orthogonal information moved the needle. The one model that beats Base DC, DC + squad prior, adds squad market value and reputation, a signal independent of past results, as a structured prior. It has the lower RPS in 95% of resamples, though the gap is small (0.2075 vs 0.2093) and not significant on the Diebold–Mariano test (p=0.41). Tellingly, the same squad data overfit when fed to the flexible GBM; folding it in as a prior rather than as free parameters is what worked.
Recency is weak here. The time-varying model (Dynamic DC) does not improve on Base DC on the forward task (RPS 0.2188 vs 0.2093, DM p=0.64) and is worse calibrated (MCB 0.0105 vs 0.0053), consistent with international team strength drifting slowly.