Model Bake-off

How the prediction model was chosen — seven forecasting approaches, scored head-to-head under cross-validation on 1,667 World-Cup-team matches since 2012.

METHODOLOGY

Every candidate is trained and tested on the same data with the same protocol, so differences reflect the model, not the split. We evaluate on held-out World-Cup-vs-World-Cup matches (the prediction target) under two cross-validation schemes, each repeated 100× with bootstrap resampling for confidence intervals.

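Each forecast is graded on two proper scoring rules, the ranked probability score (RPS) and the log-score, and on a calibration decomposition into miscalibration (MCB) and discrimination (DSC).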
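As a minimal sketch of the two scoring rules, the functions below score a single three-way forecast. The function names and the (home win, draw, away win) ordering are assumptions made for illustration. The page does not state which event its log-score is computed over; the sketch uses the three-way outcome.

```python
import math

def rps(probs, outcome):
    """Ranked probability score for one forecast over ordered outcomes.

    probs   -- probabilities in outcome order, e.g. (home win, draw, away win)
    outcome -- index of the outcome that actually occurred
    Lower is better; 0 means all probability was on the observed outcome.
    """
    cum_forecast = 0.0
    cum_observed = 0.0
    total = 0.0
    # Compare cumulative forecast and cumulative outcome at the first K-1 thresholds.
    for k in range(len(probs) - 1):
        cum_forecast += probs[k]
        cum_observed += 1.0 if k == outcome else 0.0
        total += (cum_forecast - cum_observed) ** 2
    return total / (len(probs) - 1)

def log_score(probs, outcome):
    """Negative log of the probability given to the observed outcome (lower is better)."""
    return -math.log(probs[outcome])
```

For the forecast (0.5, 0.3, 0.2), a home win scores an RPS of about 0.145 and an away win about 0.445. Because RPS works on cumulative probabilities, a miss by two ordered categories costs more than a miss by one.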

Significance is read from the bootstrap 95% intervals, the fraction of resamples each model beats the baseline, and a Diebold–Mariano test on the forward test set.
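The bootstrap summary can be sketched as follows. This is an illustration under stated assumptions, not the page's actual protocol: it resamples per-match scores of a fixed test set with replacement, and the function name, the percentile rule and the default of 100 resamples are choices made for the example.

```python
import random

def bootstrap_compare(model_scores, base_scores, n_resamples=100, seed=0):
    """Paired bootstrap over per-match scores (lower score = better).

    Returns the mean of the resampled model means, a 95% percentile interval,
    and the share of resamples in which the model beats the baseline.
    """
    rng = random.Random(seed)
    n = len(model_scores)
    model_means = []
    wins = 0
    for _ in range(n_resamples):
        # Draw the same match indices for both models so the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        model_mean = sum(model_scores[i] for i in idx) / n
        base_mean = sum(base_scores[i] for i in idx) / n
        model_means.append(model_mean)
        if model_mean < base_mean:
            wins += 1
    model_means.sort()
    # Simple nearest-rank percentiles (an assumption; other rules exist).
    lo = model_means[int(0.025 * (n_resamples - 1))]
    hi = model_means[int(0.975 * (n_resamples - 1))]
    return {
        "mean": sum(model_means) / n_resamples,
        "ci95": (lo, hi),
        "beats_base": wins / n_resamples,
    }
```

Using the same resampled indices for both models is what makes "beats base" a paired comparison: each resample asks which model did better on the same matches.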

THE CONTENDERS

Base DC: Static Dixon–Coles bivariate Poisson, the reference model.
DC + squad prior: Dixon–Coles with team strengths shrunk toward squad market-value & overall (empirical-Bayes prior from orthogonal squad data).
GBM: Gradient-boosted trees on Dixon–Coles, Elo and squad-difference features.
Davidson: Bradley–Terry–Davidson, which models the W/D/L outcome directly (no scoreline).
Davidson (recal.): Davidson with post-hoc temperature recalibration.
Margin Elo: Margin-aware Elo (date-ordered) with a fitted ordinal W/D/L map.
Dynamic DC: Glicko-style time-varying ratings that are recency- and uncertainty-aware.
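To make the two Davidson entries concrete, here is a minimal sketch of the Bradley–Terry–Davidson outcome probabilities and of one common form of temperature recalibration. The parameter names are assumptions, home advantage is left out, and the page does not say exactly how its temperature is fitted or applied.

```python
import math

def davidson_probs(strength_home, strength_away, nu):
    """W/D/L probabilities under the Davidson extension of Bradley-Terry.

    strength_home, strength_away -- positive team strengths
    nu                           -- non-negative draw parameter (0 means no draws)
    """
    draw_term = nu * math.sqrt(strength_home * strength_away)
    total = strength_home + strength_away + draw_term
    return (strength_home / total, draw_term / total, strength_away / total)

def temperature_recalibrate(probs, temperature):
    """Raise each probability to 1/T and renormalize.

    T > 1 flattens the forecast, T < 1 sharpens it, T = 1 leaves it unchanged.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    normalizer = sum(powered)
    return tuple(p / normalizer for p in powered)
```

With equal strengths and nu = 1 the three outcomes are equally likely. Temperature recalibration changes only how confident the forecast is, never the ranking of the outcomes.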

RESULTS

Model                   | RPS ↓ [95% interval]    | log-score ↓ | MCB ↓  | DSC ↑  | beats base | DM p
DC + squad prior (BEST) | 0.2075 [0.202, 0.227]   | 2.945       | 0.0047 | 0.0282 | 95%        | 0.41
Base DC (REFERENCE)     | 0.2093 [0.203, 0.230]   | 2.960       | 0.0053 | 0.0273 | -          | -
Davidson (recal.)       | 0.2125 [0.205, 0.218]   | -           | 0.0091 | 0.0287 | 11%        | 0.08
Davidson                | 0.2126 [0.206, 0.219]   | -           | 0.0092 | 0.0287 | 11%        | 0.07
Dynamic DC              | 0.2188 [0.210, 0.230]   | -           | 0.0105 | 0.0244 | 11%        | 0.64
Margin Elo              | 0.2202 [0.211, 0.228]   | -           | 0.0117 | 0.0243 | 10%        | 0.04
GBM                     | 0.2241 [0.215, 0.237]   | -           | 0.0147 | 0.0241 | 2%         | 0.08

RPS and log-score are means over 100 resamples, with a 95% interval. “beats base” is the share of resamples with lower RPS than Base DC; the forward scheme also shows the Diebold–Mariano p-value.
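For readers unfamiliar with the Diebold–Mariano p-value, a minimal sketch follows. It assumes per-match losses (for example RPS) for both models on the same matches and no autocorrelation in the loss differences, so the plain sample variance is used. The function name is hypothetical and the page's exact variant of the test is not stated.

```python
import math

def diebold_mariano(loss_model, loss_base):
    """Diebold-Mariano statistic and two-sided p-value for equal predictive accuracy.

    loss_model, loss_base -- per-match losses on the same matches, in the same order
    A negative statistic means the model has the lower average loss.
    """
    diffs = [a - b for a, b in zip(loss_model, loss_base)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the loss differences (assumes no serial correlation).
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    stat = mean_diff / math.sqrt(var_diff / n)
    # Two-sided p-value under the standard normal approximation.
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value
```

A large p-value means the test cannot distinguish the two models on that test set, even if one has the lower mean score.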

CALIBRATION (RELIABILITY DIAGRAM)

Forecast probability vs observed frequency for home wins (forward test set). Points on the dashed diagonal are perfectly calibrated; points below it are over-confident (the forecast probability exceeds the observed frequency). Base DC and the squad-prior model hug the diagonal; the tree model strays furthest.

[Figure: reliability diagram. x-axis: forecast P(home win), 0 to 1; y-axis: observed frequency, 0 to 1.]
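The points of a reliability diagram like this one can be computed with a simple binning step. The sketch below is illustrative only: equal-width bins and the default of four bins are assumptions, since the page does not describe its binning.

```python
def reliability_points(forecasts, outcomes, n_bins=4):
    """Group forecasts into equal-width probability bins.

    forecasts -- forecast probabilities of a home win, each in [0, 1]
    outcomes  -- 1 if the home team won, else 0
    Returns one (mean forecast, observed frequency, count) tuple per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(forecasts, outcomes):
        # Clamp so that a forecast of exactly 1.0 lands in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    points = []
    for members in bins:
        if not members:
            continue
        mean_forecast = sum(p for p, _ in members) / len(members)
        observed_freq = sum(y for _, y in members) / len(members)
        points.append((mean_forecast, observed_freq, len(members)))
    return points
```

A bin whose observed frequency is lower than its mean forecast plots below the diagonal, which is the over-confident case described in the caption.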

WHAT WE LEARNED

The parsimonious scoreline model is hard to beat. A Win/Draw/Loss-only model (Davidson), a pure rating system (Margin Elo) and a flexible tree ensemble (GBM) all score worse than Base DC on both RPS and miscalibration (MCB); modeling the full scoreline yields better-calibrated outcome probabilities.
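To show how a scoreline model yields outcome probabilities, the sketch below sums a Dixon–Coles-style score grid into W/D/L. It uses the standard Dixon–Coles low-score adjustment on independent Poisson goal counts; the goal rates, the rho value and the 10-goal truncation are illustrative assumptions, not the fitted values behind the results above.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k goals under a Poisson with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def dc_tau(x, y, lam, mu, rho):
    """Dixon-Coles adjustment for the four lowest scorelines (1.0 elsewhere)."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0

def outcome_probs(lam, mu, rho=0.0, max_goals=10):
    """Sum the scoreline grid into (home win, draw, away win) probabilities.

    lam, mu -- expected goals for the home and away team
    rho     -- low-score dependence parameter (0 gives independent Poissons)
    """
    home = draw = away = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = dc_tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    # Renormalize to absorb the truncation at max_goals.
    total = home + draw + away
    return (home / total, draw / total, away / total)
```

The draw probability falls out of the goal distribution here, rather than needing its own parameter as in a direct W/D/L model. A negative rho shifts mass toward 0-0 and 1-1 and so raises the draw probability.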

Only orthogonal information moved the needle. The one model that beats Base DC, DC + squad prior, adds squad market value & reputation (signal independent of past results) as a structured prior. It has the lower RPS in 95% of resamples (0.2075 vs 0.2093), although the forward Diebold–Mariano test does not separate the two (p=0.41). Tellingly, the same squad data fed to the flexible GBM overfit; folding it in as a prior rather than as free parameters is what worked.
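The idea of folding squad data in as a prior can be illustrated with the simplest shrinkage rule, a normal-normal update. This is a sketch of the general technique only: the function name, the variances and the single-number "strength" are hypothetical, and the page does not describe how its empirical-Bayes prior is actually built.

```python
def shrink_toward_prior(estimate, estimate_var, prior_mean, prior_var):
    """Precision-weighted blend of a results-based strength and a squad-based prior.

    estimate, estimate_var -- strength fitted from match results and its variance
    prior_mean, prior_var  -- strength implied by squad data and its variance
    A noisy estimate (large estimate_var) is pulled more strongly toward the prior.
    """
    weight_on_estimate = prior_var / (prior_var + estimate_var)
    return weight_on_estimate * estimate + (1.0 - weight_on_estimate) * prior_mean
```

Teams with few matches have noisy results-based estimates and move most toward the squad prior, while well-observed teams barely move. This adds a handful of prior parameters instead of one free parameter per feature, which is the contrast with the GBM drawn above.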

Recency is weak here. The time-varying model (Dynamic DC) only ties Base DC on the forward task (DM p=0.64) and is worse calibrated (MCB 0.0105 vs 0.0053), consistent with international team strength drifting slowly.