2026 FIFA World Cup

Prediction model — design and methodology

Premise and approach

I wanted to do a clean statistical build-out of a real-world situation during this Utah chapter, but my original plan to predict snowfall in LCC proved to be much more complex than expected. As such, given the upcoming World Cup and my personal interest in soccer, I decided to think through and build out a model for the tournament, assisted, as usual, by the beast Claude Code.

The work contained here has strong contributions from Claude; I would not have been able to get anywhere near this far without its help. At the same time, I've contributed key insights and advancements where Claude would otherwise have fallen into a loop, lost track of the overall objective, or simply given up with data-limitation excuses. It goes to show what the current state of Human-AI collaboration looks like.

Philosophy

Sports analytics is a rich field with some very smart people involved, and I don't necessarily see this beating out commercial models or highly-specialized pundits. What I do see this as is a clean bottom-up statistical framework for thinking about a team-sport tournament, and I hope that the statistical methodology framed below does the Yale Department of Statistics justice.

In our model, we're not just picking a "winner" for each matchup. Instead, we're modeling the underlying expected number of goals each team will score, then using that to feed a prediction for each matchup. We'll build this model off the [Dixon-Coles model](https://www.ajbuckeconbikesail.net/wkpapers/Airports/MVPoisson/soccer_betting.pdf), a well-established model for soccer that modifies an underlying Poisson distribution for goals scored. The work below goes beyond simply fitting Dixon-Coles parameters to historical international matches by adding a few extra pieces, summarized in the schematic below.

Schematic: from fitted team parameters to match outcome probabilities

Fitted parameters (one set per team, shown for teams i and j)

α_i, α_j: attack strength (scalar)

β_i, β_j: defensive weakness (scalar)

u_i, u_j: attack style (ℝ⁴ vector)

v_i, v_j: defence style (ℝ⁴ vector)

Expected goals (deterministic; neutral venue, γ = 1, bilinear style interaction)

log μ_ij = log α_i + log β_j + u_i·v_j

log ν_ij = log α_j + log β_i + u_j·v_i

u_i·v_j is the style residual: zero when styles are neutral, positive or negative otherwise.

Goals scored (random variables; independent Poisson assumption)

X ~ Pois(μ_ij)

Y ~ Pois(ν_ij)

Joint distribution (Dixon-Coles τ(x, y; ρ) correction on low-scoring cells)

P̃(x, y) = τ(x, y; ρ) · Pois(x; μ) · Pois(y; ν)

ρ ≤ 0 inflates the 0-0 probability; ρ is re-estimated jointly in stage 2.

Outcome probabilities (sum over the 10×10 scoreline grid)

P(win i) = Σ_{x>y} P̃

P(draw) = Σ_{x=y} P̃

P(win j) = Σ_{x<y} P̃

Estimation

Stage 1: α, β, ρ by MLE, with match weights w_k = e^(−φd) · s_k · δⁿ

Stage 2: u, v, ρ by MLE + L2, with log μ_ij = log α_i + log β_j + u_i·v_j

The Poisson foundation

All of our modeling starts from the Poisson distribution, which models counts of rare, independent events (already an approximation, since goals, by their very nature, depend heavily on the goals that came before). Suspending our disbelief, if a specific team $i$ is expected to score $\mu$ goals against team $j$, then the probability of it scoring exactly $x$ goals is:

P(X = x) = \frac{e^{-\mu}\,\mu^x}{x!}

[Figure: Poisson distribution of goals scored for μ = 1.5. Mode = 1, P(0) = 0.223, P(1) = 0.335.]
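As a quick check on the formula, here is a minimal Python sketch of the Poisson probability mass function. The function name `poisson_pmf` and the rate of 1.5 are illustrative choices, not fitted values from the model.

```python
import math

def poisson_pmf(x: int, mu: float) -> float:
    """P(X = x) for X ~ Poisson(mu)."""
    return math.exp(-mu) * mu**x / math.factorial(x)

mu = 1.5  # illustrative expected goals, not a fitted value
for x in range(4):
    print(f"P(X = {x}) = {poisson_pmf(x, mu):.3f}")
```

With a rate of 1.5 the first two values come out at about 0.223 and 0.335, and the probabilities over all goal counts sum to one.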

The parameter $\mu$ is what we are trying to estimate. It is not a fixed constant — it depends on who is playing and their relative strengths. In our naive model, we define $\mu$ as a product of three factors:

\mu_{ij} = \alpha_i \cdot \beta_j \cdot \gamma

$\alpha_i$: attack strength of the scoring team

$\beta_j$: defensive weakness of the conceding team (higher = leakier)

$\gamma$: home advantage multiplier (~1.2 at club level; 1.0 at neutral sites)

The product form models how these factors compound: if $\alpha_i = 1.5$ (a strong attack, 50% above average) and $\beta_j = 1.4$ (a very leaky defense, 40% above average), the expected goals rate is $1.5 \times 1.4 = 2.1$ (110% above average), not $1.5 + 1.4 - 1 = 1.9$ (90% above average). Previous work with football data shows that the multiplicative model works better than additive alternatives.

The expected goals for the other team, written $\nu$ from here on, follow by symmetry:

\nu_{ij} = \mu_{ji} = \alpha_j \cdot \beta_i

With these two rates, $\mu$ and $\nu$, we can apply the Poisson model to each team and combine the two into a joint distribution under independence. Again, this is a simplification: obviously, a team that goes a goal up will often play more conservatively, changing the math, but we're coming up with an approximate model, not a perfect description.

Under independence, the joint probability of a specific scoreline $(x, y)$ is just the product of the two marginal Poisson probabilities:

P(X = x,\; Y = y) = \frac{e^{-\mu}\,\mu^x}{x!} \cdot \frac{e^{-\nu}\,\nu^y}{y!}

Worked example

Team i: α_i = 1.50, β_i = 1.00, so μ = α_i · β_j = 1.50 xG

Team j: α_j = 1.00, β_j = 1.00, so ν = α_j · β_i = 1.00 xG

Scoreline probabilities (columns: team i goals x; rows: team j goals y)

y \ x	0	1	2	3	4	5+
0	8.2%	12.3%	9.2%	4.6%	1.7%	0.7%
1	8.2%	12.3%	9.2%	4.6%	1.7%	0.7%
2	4.1%	6.2%	4.6%	2.3%	0.9%	0.3%
3	1.4%	2.1%	1.5%	0.8%	0.3%	0.1%
4	0.3%	0.5%	0.4%	0.2%	0.1%	<.1%
5+	0.1%	0.1%	0.1%	<.1%	<.1%	<.1%

P(i wins) = 48.8%, P(draw) = 26.0%, P(j wins) = 25.2%

To get outcome probabilities, we sum over all scorelines in the appropriate region. For all practical purposes we can truncate at ten goals per team (a 10×10 grid), which captures over 99.9% of the probability mass when both rates are at or below about 2.5, and still over 98% in the extreme case $\mu = \nu = 4$:

\begin{aligned} P(\text{win}_i) &= \sum_{x > y} P(X=x, Y=y) \\[4pt] P(\text{draw}) &= \sum_{x = y} P(X=x, Y=y) \\[4pt] P(\text{win}_j) &= \sum_{x < y} P(X=x, Y=y) \end{aligned}
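The three sums above can be sketched in a few lines of Python. This is a minimal illustration under the independence assumption, using the $\alpha$, $\beta$ values from the worked example; the function names and the `max_goals` argument are our own labels, not part of the fitted pipeline.

```python
import math

def poisson_pmf(x, rate):
    return math.exp(-rate) * rate**x / math.factorial(x)

def outcome_probs(mu, nu, max_goals=10):
    """Return (P(i wins), P(draw), P(j wins)) on a max_goals x max_goals grid."""
    win_i = draw = win_j = 0.0
    for x in range(max_goals):          # goals for team i
        for y in range(max_goals):      # goals for team j
            p = poisson_pmf(x, mu) * poisson_pmf(y, nu)
            if x > y:
                win_i += p
            elif x == y:
                draw += p
            else:
                win_j += p
    return win_i, draw, win_j

# Values from the worked example: a strong attack against average opposition.
alpha_i, beta_i = 1.5, 1.0
alpha_j, beta_j = 1.0, 1.0
mu = alpha_i * beta_j   # expected goals for team i
nu = alpha_j * beta_i   # expected goals for team j

p_i, p_d, p_j = outcome_probs(mu, nu)
print(f"{p_i:.1%} {p_d:.1%} {p_j:.1%}")   # prints: 48.8% 26.0% 25.2%
```

The grid is not renormalized after truncation; at these rates the mass beyond nine goals is negligible.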

For World Cup matches played at neutral sites, which we take to describe the entire 2026 tournament (ignoring any edge for the three host nations), we set $\gamma = 1$.

The Dixon-Coles correction

The independent Poisson model has a known failure mode: it systematically underestimates the frequency of 0-0 and 1-1 draws and overestimates 1-0 and 0-1 results. The reason is that real matches have game states. When a match is 0-0, both teams are often pushing harder to break the deadlock: the game opens up, but also tightens tactically. When one team goes ahead, they may sit back and protect, suppressing further scoring. These dynamics create a mild dependence between the two teams' goals precisely at low-scoring outcomes, which the independence assumption cannot capture.

Dixon and Coles addressed this with an elegant correction. Rather than replacing the Poisson model entirely, they multiply the joint probability by a correction factor $\tau$ that modifies only the four cells where $x \leq 1$ and $y \leq 1$:

\tilde{P}(x, y) = \tau(x, y,\; \mu, \nu,\; \rho) \cdot P(x, y)

\tau(x,y) = \begin{cases} 1 - \rho\mu\nu & (x,y) = (0,0) \\[2pt] 1 + \rho\mu & (x,y) = (0,1) \\[2pt] 1 + \rho\nu & (x,y) = (1,0) \\[2pt] 1 - \rho & (x,y) = (1,1) \\[2pt] 1 & \text{otherwise} \end{cases}

The single parameter $\rho \leq 0$ controls the magnitude of the adjustment. With $\rho < 0$, the factors $\tau(0,0) = 1 - \rho\mu\nu > 1$ and $\tau(1,1) = 1 - \rho > 1$ inflate the 0-0 and 1-1 probabilities, and exactly that much probability mass is taken from the 1-0 and 0-1 cells, whose factors fall below one. The correction is valid, in the sense that the corrected probabilities still sum to one across all scorelines, by construction.
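To make the correction concrete, here is a small Python sketch of $\tau$ exactly as defined above, applied to the four low-scoring cells. The rates and the value $\rho = -0.11$ are illustrative inputs, and the helper names are ours.

```python
import math

def poisson_pmf(x, rate):
    return math.exp(-rate) * rate**x / math.factorial(x)

def tau(x, y, mu, nu, rho):
    """Dixon-Coles factor; x is scored at rate mu, y at rate nu."""
    if (x, y) == (0, 0):
        return 1.0 - rho * mu * nu
    if (x, y) == (0, 1):
        return 1.0 + rho * mu
    if (x, y) == (1, 0):
        return 1.0 + rho * nu
    if (x, y) == (1, 1):
        return 1.0 - rho
    return 1.0

def independent_prob(x, y, mu, nu):
    return poisson_pmf(x, mu) * poisson_pmf(y, nu)

def dc_prob(x, y, mu, nu, rho):
    return tau(x, y, mu, nu, rho) * independent_prob(x, y, mu, nu)

mu, nu, rho = 1.5, 1.0, -0.11   # illustrative values
low_cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
for x, y in low_cells:
    before = independent_prob(x, y, mu, nu)
    after = dc_prob(x, y, mu, nu, rho)
    print(f"{x}-{y}: {before:.4f} -> {after:.4f}")

# The four adjustments cancel, so total probability mass is unchanged.
shift = sum(dc_prob(x, y, mu, nu, rho) - independent_prob(x, y, mu, nu)
            for x, y in low_cells)
print(f"net shift: {shift:.2e}")
```

Under this definition and a negative $\rho$, the two low-scoring draws gain probability and the two one-goal results lose the same amount, so no renormalization is needed.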

Empirically, fits on European league data give $\rho \approx -0.13$. International football, which features fewer high-scoring matches and more conservative defensive tactics, tends to produce slightly more negative values. Our Stage 1 fit on 1,908 WC-team matches yielded $\rho \approx -0.11$, slightly less negative than the league figure but in the same neighborhood. We treat $\rho$ as a free parameter and estimate it jointly with the team strength parameters.

Composite data decay

This is one of the main ways we refine the naive model, which simply trains the $\alpha$ and $\beta$ parameters of the Dixon-Coles setup on historical data. The most naive approach would treat every game as contributing the same amount of signal. That doesn't make sense, so the slightly less naive approach weights each match's contribution to the fit by a time-decay factor. Our model goes one step further and combines calendar decay with two additional signals into a composite weight:

w_k = \underbrace{e^{-\phi \, d_k}}_{\text{calendar decay}} \;\cdot\; \underbrace{s_k}_{\text{competition quality}} \;\cdot\; \underbrace{\delta^{\,n_k}}_{\text{manager epochs}}

Each factor captures a distinct mechanism by which a historical result becomes more or less representative of the current team.

Calendar decay

The first factor, $e^{-\phi d_k}$, is standard exponential decay over elapsed days. Even with the same manager and squad, teams evolve: tactics develop, individual form rises and falls, set-piece routines change. We set the half-life to 600 days (~20 months), which corresponds to $\phi = \ln 2 / 600 \approx 0.00116$ per day and means a result from 600 days ago receives half the weight of a result from today. This is calibrated for international football, where squad turnover is slower and tactical systems are more stable than at club level.

Competition quality

The second factor, $s_k$, reflects that not all matches are equally diagnostic. A win over a top-10 opponent at a major tournament tells you much more about a team's current ceiling than a friendly against a mid-table qualifier. We weight matches by competition tier:

Competition	Weight
FIFA World Cup	1.0
Continental championships (Euro, Copa América, AFCON, Asian Cup)	0.9
Nations Leagues (UEFA, CONCACAF)	0.65
World Cup qualifiers	0.6
International friendlies	0.3

For the major footballing nations with abundant high-quality data, friendly results end up contributing very little total weight. For smaller nations whose few World Cup appearances are separated by decades, qualifier and friendly matches remain important training signal.

Manager epoch discount

The third factor, $\delta^{n_k}$, penalizes results from previous managerial regimes. Here $n_k$ is the number of manager changes that have occurred on either team between match $k$ and today, and $\delta \in (0, 1)$ is a per-transition discount.

A new manager represents a partial reset of tactical identity — the same eleven players will press differently, organize differently, and execute set pieces differently. Results from a prior regime are still informative because individual player quality persists, but they are weaker guides to what this team will do today. We use $\delta = 0.85$ , meaning each manager change reduces a match's effective weight by ~15%. This is intentionally moderate: the calendar and competition-quality components already do most of the down-weighting work, so the epoch factor handles the residual discontinuity that smooth decay misses without over-penalizing teams with recent transitions.

As an example: a 2-1 win under a manager who left three seasons ago, after two subsequent regime changes, receives $0.85^2 = 0.72$ of the weight it would have had under the current manager — on top of whatever calendar and competition discounts already apply.
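Putting the three active factors together, a minimal Python sketch of the composite weight might look like this. The half-life, tier weights, and $\delta$ come from the text above; the dictionary keys and the function name `match_weight` are hypothetical labels chosen for the example.

```python
import math

HALF_LIFE_DAYS = 600
PHI = math.log(2) / HALF_LIFE_DAYS      # decay rate per day
DELTA = 0.85                            # discount per manager change

# Tier weights from the table above; the keys are illustrative labels.
COMPETITION_WEIGHT = {
    "world_cup": 1.0,
    "continental": 0.9,
    "nations_league": 0.65,
    "wc_qualifier": 0.6,
    "friendly": 0.3,
}

def match_weight(days_ago: float, competition: str, manager_changes: int) -> float:
    """w_k = exp(-phi * d_k) * s_k * delta ** n_k."""
    calendar = math.exp(-PHI * days_ago)
    quality = COMPETITION_WEIGHT[competition]
    epochs = DELTA ** manager_changes
    return calendar * quality * epochs

# A World Cup match exactly one half-life ago, same manager: weight 0.5.
print(match_weight(600, "world_cup", 0))
# A friendly played today but two regimes ago: 0.3 * 0.85**2.
print(match_weight(0, "friendly", 2))
```

Because the factors multiply, an old friendly under a previous manager is discounted three times over, which is why such matches barely move the fit for data-rich teams.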

Player continuity (designed, pending data)

The full specification includes a fourth factor, $c_k^{\,\alpha}$ (the exponent $\alpha$ here is separate from the attack strength $\alpha_i$), measuring the importance-weighted overlap between the squad that played in match $k$ and the current national squad:

c_k = \frac{\displaystyle\sum_{i \;\in\; \text{squad}_k} \text{imp}(i) \cdot \mathbf{1}[\text{active now}]}{\displaystyle\sum_{i \;\in\; \text{squad}_k} \text{imp}(i)}

$\text{imp}(i)$: importance score for player $i$: caps × position weight

$\mathbf{1}[\text{active now}]$: 1 if player $i$ is still in the current squad, 0 otherwise

A match where nine of eleven starters are still active has $c_k \approx 0.9$; a result from six years ago featuring players who have since retired might have $c_k \approx 0.2$. This factor is designed and its weighting is defined, but it is not yet incorporated into the current fit, pending historical squad roster data collection. The current model runs on calendar decay, competition quality, and manager epochs only.
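Since this factor is not yet in the fit, the following is only a sketch of how $c_k$ could be computed from the definition above. The data layout (a list of player id and importance pairs, plus a set of current squad ids) is an assumption made for illustration.

```python
def continuity(match_squad, current_squad_ids):
    """Importance-weighted share of a past squad that is still active.

    match_squad: list of (player_id, importance) pairs for match k.
    current_squad_ids: set of player ids in today's squad.
    """
    total = sum(imp for _, imp in match_squad)
    if total == 0:
        return 0.0
    kept = sum(imp for pid, imp in match_squad if pid in current_squad_ids)
    return kept / total

# Hypothetical line-up: eleven starters with equal importance, nine still active.
starters = [(f"player_{n}", 1.0) for n in range(11)]
still_active = {f"player_{n}" for n in range(9)}
print(round(continuity(starters, still_active), 3))
```

With equal importance scores the overlap is simply 9/11; unequal scores push the value up when the departed players were peripheral and down when they were central.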

Model architecture

The estimation pipeline runs in three sequential stages, each a residual layer on top of the prior.

Stage 1 fits one scalar attack and one scalar defence parameter per team from historical match outcomes alone — the classic Dixon-Coles MLE. It captures how many goals each team typically scores and concedes across all opponents.

Stage 2 adds a bilinear style interaction between every team pair, also trained from match outcomes. It captures matchup-specific edges: whether the way team $i$ attacks creates a systematic advantage against the way team $j$ defends, beyond what raw scalar strength predicts.

Stage 3 introduces a player attribute correction derived from current squad ratings. It captures a complementary signal: given who is actually in the squad today, does the current roster composition suggest an adjustment to what historical match data alone predicts? Stages 1 and 2 are retrospective — they answer how a team has performed. Stage 3 is prospective — it answers how strong this particular group of players looks right now.

Each stage is fit with prior-stage parameters frozen, learning only the residual that earlier stages cannot explain. Stages 1 and 2 are fully estimated and active. Stage 3 is designed and implemented in parallel — its integration into the live simulation is pending confirmed 23-man squad announcements ahead of the tournament.

Schematic: model pipeline

Input: historical match data

World Cup: competition weight × 1.0

Continental championships: competition weight × 0.9

World Cup qualifiers: competition weight × 0.6

International friendlies: competition weight × 0.3

Weighting: composite time decay (key innovation)

e^(−φd), calendar drift: older results receive less weight as time passes

δ^n, manager epochs: discount applied per regime change since match

c^α, player continuity: importance-weighted overlap of squad with today's

Stage 1: DC base (MLE)

αᵢ: attack strength, one scalar per team

βᵢ: defensive weakness, one scalar per team

ρ: low-score correlation (Dixon-Coles)

Frozen: parameters fixed before stage 2

Stage 2: style layer, bilinear residual (ResNet-style correction)

uᵢ ∈ ℝ⁴, attack style: latent vector per team, trained from match outcomes

vⱼ ∈ ℝ⁴, defence style: latent vector per team, trained from match outcomes

uᵢ · vⱼ, interaction: signed scalar residual on log expected goals

Prediction: per-match scoreline distribution

log μᵢⱼ = log αᵢ + log βⱼ + uᵢ·vⱼ: bilinear expected goals

Dixon-Coles P̃(x, y): corrected joint scoreline distribution

P(win / draw / loss): summed over the 10×10 scoreline grid

Simulation: Monte Carlo tournament

N = 100,000: independent full-bracket draws

Group stage: sample scorelines → points → tiebreakers

Knockouts: extra-time draw + penalty coin flip if level

Output: P(champion / finalist / upset) per team

Stage 1 — DC base layer

The first stage fits one scalar attack parameter $\alpha_i$ and one scalar defence parameter $\beta_i$ per team. These are estimated from historical match scorelines via maximum likelihood — no shot-level data required. The composite decay weights ensure that stale, low-quality, or pre-regime-change matches contribute minimally to the final values.

The fit runs on 1,908 matches between the 48 WC-qualified teams from January 2010 onward. Fitted globals: $\hat{\rho} \approx -0.11$, home advantage $\hat{\gamma} \approx 1.17$ (neutralized at the 2026 tournament). The scalar parameters are then frozen before Stage 2 begins.

Stage 2 — Bilinear style layer

Scalar $\alpha_i, \beta_j$ summarize team quality as single numbers. They cannot capture matchup-specific dynamics: whether France's defensive block is particularly effective against teams that prefer wide diagonal balls, or less effective against high-tempo pressing sides. That information is invisible to a scalar model.

Each team is assigned a $K$-dimensional attack style vector $\mathbf{u}_i$ and defence style vector $\mathbf{v}_i$. Their dot product adds a signed scalar residual on top of the DC log-expected-goals:

\log \mu_{ij} = \underbrace{\log \alpha_i + \log \beta_j}_{\text{Stage 1}} \;+\; \underbrace{\mathbf{u}_i \cdot \mathbf{v}_j}_{\text{style residual}}

$\mathbf{u}_i \in \mathbb{R}^K$: attack style vector for team $i$

$\mathbf{v}_j \in \mathbb{R}^K$: defence style vector for team $j$

$K = 4$: latent style dimensions

The dot product is positive when team $i$'s attacking style creates a systematic advantage against team $j$'s defensive shape, and negative when it creates a systematic disadvantage. With $K = 4$ dimensions the model can represent up to four independent axes of stylistic variation. Because the residual sits on the log scale, it multiplies expected goals by $e^{\mathbf{u}_i \cdot \mathbf{v}_j}$. In practice, the average matchup adjustment is about 2%; the largest is approximately +57% (Spain attacking France), a pattern visible across multiple recent encounters.

Style vectors are initialized with small Gaussian noise ($\sigma = 0.05$) to break the zero-gradient saddle at the origin, then estimated by maximizing the DC likelihood with L2 regularization ($\lambda = 2.0$) on $\mathbf{u}_i, \mathbf{v}_j$. Stage 1 parameters remain frozen throughout. Stage 2 converged in 116 iterations; the refined correlation parameter is $\hat{\rho} \approx -0.14$.
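Here is a short Python sketch of how the style residual enters expected goals. The vectors and scalar strengths below are made-up numbers for illustration, not fitted values, and the function names are ours.

```python
import math

def style_residual(u_i, v_j):
    """Dot product of team i's attack style and team j's defence style."""
    return sum(a * b for a, b in zip(u_i, v_j))

def log_mu(alpha_i, beta_j, u_i, v_j):
    """log mu_ij = log alpha_i + log beta_j + u_i . v_j"""
    return math.log(alpha_i) + math.log(beta_j) + style_residual(u_i, v_j)

# Hypothetical K = 4 style vectors.
u_i = [0.30, -0.10, 0.05, 0.20]
v_j = [0.40, 0.20, -0.10, 0.10]

base = 1.5 * 1.0                                  # Stage 1 expected goals
adjusted = math.exp(log_mu(1.5, 1.0, u_i, v_j))   # with the style layer
print(f"residual = {style_residual(u_i, v_j):.3f}")
print(f"adjustment = {adjusted / base - 1:+.1%}")
```

Reading the numbers this way, an adjustment of +57% corresponds to a dot product of about $\ln 1.57 \approx 0.45$, while a zero vector on either side leaves the Stage 1 prediction untouched.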

Stage 3 — Player attribute correction

Stages 1 and 2 draw on a single data source: historical match outcomes. They answer how a team has performed. Player attribute data answers a different question: how strong is this team right now, given who is actually in the squad?

These two signals are complementary but not redundant. A team that surged in form two years ago but has since lost key players will have inflated DC parameters relative to its current ability. A team that quietly upgraded its squad during a flat qualifying campaign will be underrated. A young nation with sparse international history but a generation of technically developed players currently dominating European club football has almost no training signal in the DC fit — its $\alpha, \beta$ will be conservative by default. Neither the scalar parameters nor the style vectors are designed to catch any of this: their training signal is entirely retrospective.

Stage 3 introduces a correction term $\gamma_{ij}$ (unrelated to the home-advantage multiplier $\gamma$ used earlier) derived from a separate linear model trained on current player ratings, then centered to remove systematic bias before being applied to the DC output.

Squad embeddings

Player ratings are sourced from the EA FC 26 database, which covers 18,400+ players and includes all 48 WC-qualified nations. For each team, the squad is divided into four positional groups — goalkeeper, defenders, midfielders, forwards — and the mean attribute vector is computed within each group. The full team embedding concatenates these four group means:

\mathbf{e}_i = \bigl[\,\bar{\mathbf{x}}_{\text{GK}},\;\bar{\mathbf{x}}_{\text{DEF}},\;\bar{\mathbf{x}}_{\text{MID}},\;\bar{\mathbf{x}}_{\text{FWD}}\,\bigr] \;\in\; \mathbb{R}^{4M}

where $M$ is the number of attributes per group. This preserves positional structure: a team with elite defenders and average midfielders produces a different embedding from one with the reverse imbalance, even if the squad-wide averages are identical. Data-sparse nations (Kenya, Honduras, Jordan) where DC parameters are least reliable are fully covered, since every FIFA-registered player has ratings regardless of league visibility.
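The embedding construction can be sketched in plain Python as follows. The two attributes per player and the sample ratings are invented for the example; the real attribute set and data loading are not shown in this write-up.

```python
GROUPS = ("GK", "DEF", "MID", "FWD")

def team_embedding(players):
    """Concatenate per-group mean attribute vectors into one vector of length 4M.

    players: list of (group, attributes) pairs, where attributes is a list
    of M numeric ratings and group is one of GROUPS.
    """
    embedding = []
    for group in GROUPS:
        rows = [attrs for g, attrs in players if g == group]
        if not rows:
            raise ValueError(f"no players in group {group}")
        # Column-wise mean over the players in this positional group.
        embedding.extend(sum(col) / len(col) for col in zip(*rows))
    return embedding

# Hypothetical squad with M = 2 attributes per player.
squad = [
    ("GK", [80.0, 60.0]),
    ("DEF", [70.0, 65.0]),
    ("DEF", [74.0, 75.0]),
    ("MID", [78.0, 82.0]),
    ("FWD", [85.0, 70.0]),
]
print(team_embedding(squad))
```

Keeping the group order fixed matters: the linear model downstream relies on each slice of the vector always referring to the same positional group.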

The linear model

For each historical match $k$, the attribute model independently predicts log goals for team $i$ as a linear function of both teams' positional embeddings:

\hat{y}_{ij} = \mathbf{w}^{\top} \bigl[\mathbf{e}_{i}^{\,\text{atk}};\;\mathbf{e}_{j}^{\,\text{def}}\bigr] + b

where $\mathbf{e}_{i}^{\,\text{atk}}$ selects team $i$'s midfield and forward embeddings and $\mathbf{e}_{j}^{\,\text{def}}$ selects team $j$'s goalkeeper and defender embeddings. The training target is $\log(X_k + 0.5)$, where $X_k$ is the observed goals scored. The 0.5 offset maps zero-goal outcomes to a finite value and keeps the target on the same scale as the DC log-expected-goals outputs. The model is fit by ridge regression using the same composite decay weights as Stage 1, meaning recent, high-quality matches dominate the fit here too.

Centering and application

The linear model is not trusted for its absolute level: it lacks Poisson structure, has no temporal decay within matches, and is trained using current ratings projected back onto historical matches where squads may have differed substantially. What it is trusted for is its relative signal — which matchups look better or worse given current squad composition, compared to the average.

For each training match $k$, compute the residual between the linear prediction and the Stage 1+2 DC output:

\delta_k = \hat{y}_k - \log \mu_k^{\scriptscriptstyle\mathrm{DC}}

The mean residual $\bar{\delta}$ is the systematic offset between the two models, reflecting both scale differences and the fact that current ratings are an imperfect proxy for historical squad quality. Subtracting it from each residual leaves only the matchup-relative signal:

\gamma_{ij} = \hat{y}_{ij} - \log \mu_{ij}^{\scriptscriptstyle\mathrm{DC}} - \bar{\delta}

By construction, $\gamma$ has zero mean across training data. Applying it does not inflate or deflate aggregate expected goals; it only shifts individual matchups relative to the DC baseline. The final expected goals for any WC 2026 match are:

\log \mu_{ij}^{\text{final}} = \log \mu_{ij}^{\scriptscriptstyle\mathrm{DC}} + \lambda \cdot \gamma_{ij}

The shrinkage factor $\lambda \in (0,\,1]$ (a different quantity from the Stage 2 L2 penalty, which is also written $\lambda$) governs how much weight to place on the attribute correction relative to historical outcomes. It is calibrated on held-out matches. A smaller $\lambda$ defers more to the historical record; a larger one gives more weight to current roster composition. The Dixon-Coles $\tau$ correction is applied to the final $\mu_{ij}^{\text{final}}$, not the intermediate DC output, so the low-score cell adjustment remains consistent with the corrected expected goals.
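The centering and shrinkage steps can be sketched in Python as below. One assumption is worth flagging loudly: the sketch takes the correction to be the centered residual (linear prediction minus DC log expected goals, minus the mean residual), which is the reading under which the zero-mean property holds. All numbers and names are illustrative.

```python
import math

def centered_correction(y_hat, log_mu_dc):
    """Return (gamma, delta_bar) for paired training predictions.

    y_hat: linear-model predictions of log goals.
    log_mu_dc: Stage 1+2 log expected goals for the same matches.
    Assumption: gamma_k = (y_hat_k - log_mu_dc_k) - mean residual.
    """
    residuals = [yh - lm for yh, lm in zip(y_hat, log_mu_dc)]
    delta_bar = sum(residuals) / len(residuals)
    gamma = [r - delta_bar for r in residuals]
    return gamma, delta_bar

def final_mu(log_mu_dc, gamma, shrink):
    """mu_final = exp(log mu_DC + shrink * gamma), with shrink in (0, 1]."""
    return math.exp(log_mu_dc + shrink * gamma)

# Hypothetical training predictions for three matches.
y_hat = [0.9, 0.4, 0.1]
log_mu_dc = [0.3, 0.2, -0.1]

gamma, delta_bar = centered_correction(y_hat, log_mu_dc)
print(f"mean residual = {delta_bar:.3f}")
print(f"mean of gamma = {sum(gamma) / len(gamma):.1e}")
print(f"final mu, match 0 = {final_mu(log_mu_dc[0], gamma[0], 0.5):.3f}")
```

A shrinkage value near zero returns the DC prediction almost unchanged, which is a useful sanity check when wiring this stage into the simulation.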

Tournament simulation

With the full scoreline distribution $\tilde{P}(x, y)$ available for any match between any two teams, we simulate the entire tournament structure by Monte Carlo. Each iteration proceeds as follows:

Group stage. For each of the 72 group-stage matches (12 groups of four teams, six matches per group), sample a scoreline from $\tilde{P}(x, y)$. Accumulate group points (3 for a win, 1 for a draw, 0 for a loss) and rank teams by points, then goal difference, then goals scored, then head-to-head result.
Knockout rounds. Draws are resolved by extra time (modeled as a second independent Poisson draw at half the regular rate) and, if still level, a penalty shootout modeled as a coin flip.
Repeat. Run $N = 100{,}000$ full simulations. The reported probability for any event is simply its frequency across all simulations. One hundred thousand draws is sufficient for stable estimates down to roughly 0.1% probability, where the binomial standard error $\sqrt{p(1-p)/N}$ is about 0.01 percentage points.

Before the simulation loop, the full $48 \times 48$ matrix of expected goals is precomputed once and reused across all draws, which keeps the per-simulation cost low.
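The per-match building blocks of the simulation can be sketched with the Python standard library. This is a simplified illustration rather than the production simulator: it rebuilds the scoreline grid on every call instead of precomputing it, replaces the head-to-head tiebreaker with a random one, and uses made-up rates. All function names are ours.

```python
import math
import random
from itertools import combinations

def poisson_pmf(x, rate):
    return math.exp(-rate) * rate**x / math.factorial(x)

def tau(x, y, mu, nu, rho):
    if (x, y) == (0, 0):
        return 1.0 - rho * mu * nu
    if (x, y) == (0, 1):
        return 1.0 + rho * mu
    if (x, y) == (1, 0):
        return 1.0 + rho * nu
    if (x, y) == (1, 1):
        return 1.0 - rho
    return 1.0

def sample_score(mu, nu, rho, rng, max_goals=10):
    """Draw one (x, y) scoreline from the corrected 10x10 grid."""
    cells = [(x, y) for x in range(max_goals) for y in range(max_goals)]
    weights = [tau(x, y, mu, nu, rho) * poisson_pmf(x, mu) * poisson_pmf(y, nu)
               for x, y in cells]
    return rng.choices(cells, weights=weights, k=1)[0]

def play_knockout(mu, nu, rho, rng):
    """Return 'i' or 'j' for the side that advances."""
    x, y = sample_score(mu, nu, rho, rng)
    if x == y:
        # Extra time: independent Poisson draw (rho = 0) at half the rate.
        extra_x, extra_y = sample_score(mu / 2, nu / 2, 0.0, rng)
        x, y = x + extra_x, y + extra_y
    if x == y:
        return "i" if rng.random() < 0.5 else "j"   # shootout as a coin flip
    return "i" if x > y else "j"

def simulate_group(teams, rates, rho, rng):
    """rates[(a, b)] is the expected goals of a against b. Returns teams ranked."""
    table = {t: [0, 0, 0] for t in teams}   # points, goal difference, goals scored
    for a, b in combinations(teams, 2):
        x, y = sample_score(rates[(a, b)], rates[(b, a)], rho, rng)
        table[a][1] += x - y
        table[b][1] += y - x
        table[a][2] += x
        table[b][2] += y
        if x > y:
            table[a][0] += 3
        elif x < y:
            table[b][0] += 3
        else:
            table[a][0] += 1
            table[b][0] += 1
    # Simplification: a random number stands in for the head-to-head tiebreaker.
    return sorted(teams, key=lambda t: (*table[t], rng.random()), reverse=True)

rng = random.Random(2026)
n = 2000
advances = sum(play_knockout(1.5, 1.0, -0.11, rng) == "i" for _ in range(n))
print(f"team i advances in {advances / n:.1%} of simulated ties")
```

A full tournament run chains these pieces: simulate every group, seed the bracket from the rankings, then call the knockout function round by round and tally how often each team reaches each stage.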

Parameter estimation

Stage 1 — DC base. Attack and defence parameters $\alpha_i, \beta_j$ and correlation $\rho$ are estimated by maximizing the composite-weighted DC log-likelihood:

\mathcal{L}_1 = \sum_k w_k \cdot \log \tilde{P}(x_k, y_k \mid \alpha_{i_k}, \beta_{j_k}, \rho)

Log-link reparameterization ($\alpha_i = e^{a_i}$, $\beta_i = e^{b_i}$) ensures positivity without box constraints. Scale is fixed by constraining $\mathrm{mean}(\{a_i\}) = 0$, so attack ratings are relative to the 48-team geometric mean.
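For concreteness, the Stage 1 objective can be written as a plain Python function. This sketch omits the home-advantage term (it treats every match as neutral) and leaves the optimizer out entirely; the match tuple layout and the function names are assumptions made for the example.

```python
import math

def tau(x, y, mu, nu, rho):
    if (x, y) == (0, 0):
        return 1.0 - rho * mu * nu
    if (x, y) == (0, 1):
        return 1.0 + rho * mu
    if (x, y) == (1, 0):
        return 1.0 + rho * nu
    if (x, y) == (1, 1):
        return 1.0 - rho
    return 1.0

def weighted_log_likelihood(matches, a, b, rho):
    """Composite-weighted DC log-likelihood.

    matches: list of (team_i, team_j, goals_i, goals_j, weight).
    a, b: dicts of log-attack and log-defence parameters per team.
    """
    total = 0.0
    for i, j, x, y, w in matches:
        mu = math.exp(a[i] + b[j])   # expected goals for team i
        nu = math.exp(a[j] + b[i])   # expected goals for team j
        log_p = (math.log(tau(x, y, mu, nu, rho))
                 - mu + x * math.log(mu) - math.lgamma(x + 1)
                 - nu + y * math.log(nu) - math.lgamma(y + 1))
        total += w * log_p
    return total

def center(a):
    """Shift log-attack parameters so that their mean is zero."""
    mean = sum(a.values()) / len(a)
    return {team: value - mean for team, value in a.items()}

# Two hypothetical matches between teams A and B.
matches = [("A", "B", 2, 1, 1.0), ("B", "A", 0, 0, 0.5)]
a = center({"A": 0.3, "B": -0.1})
b = {"A": 0.0, "B": 0.1}
print(weighted_log_likelihood(matches, a, b, rho=-0.11))
```

In a real fit this function, negated, is what would be handed to a numerical optimizer, with the centering applied to the log-attack parameters so the scale stays identified.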

Stage 2 — Style vectors. With $\alpha_i, \beta_j$ frozen, style vectors and a refined $\rho$ are estimated by maximizing the same DC likelihood with a bilinear expected goals term, plus an L2 penalty:

\mathcal{L}_2 = \sum_k w_k \cdot \log \tilde{P}(x_k, y_k \mid \mathbf{u}, \mathbf{v}, \rho) \;-\; \frac{\lambda}{2}\!\left(\|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2\right)

Analytical gradients for both the Poisson terms and the DC correction cells are passed directly to L-BFGS-B, avoiding finite-difference approximation over the 385-parameter space ($48 \times 4$ attack-style entries, $48 \times 4$ defence-style entries, and $\rho$). Initialization is the small Gaussian noise described in the Stage 2 section.

Stage 3 — Attribute correction. The ridge coefficient vector $\mathbf{w}$ is fit by minimizing composite-weighted squared error between predicted and observed log goals. The ridge penalty is cross-validated on held-out matches. The centering constant $\bar{\delta}$ is computed on the training split, and the shrinkage factor $\lambda$ is calibrated on held-out matches.

Limitations

Temporal alignment in Stage 3. The current implementation trains the attribute correction using today's FC 26 ratings projected back onto historical matches, where the actual squads may have differed significantly. The correct approach aligns each match with the FIFA ratings edition closest to its date. Historical editions (FIFA 15–23) are available; this alignment is in progress. The centering step absorbs the systematic component of the mismatch, but matchup-specific noise remains.

Player continuity not yet active. The $c_k^{\,\alpha}$ squad-overlap component of the composite weight is designed but pending historical squad roster data. It would specifically correct cases where a team's historical record is dominated by a generation of players since retired — calendar decay handles this only loosely.

Style vectors are team-level, not player-level. The bilinear layer captures aggregate stylistic matchup effects but cannot represent individual player interactions: whether a particular striker's movement exploits a specific centre-back's positional tendencies, or whether a midfield pair creates unusual press resistance against a specific pressing shape.

No within-match dynamics. The Poisson model assumes a constant goal rate throughout the match. In reality, the rate shifts with game state — a team trailing by one in the 75th minute plays very differently from a team level at half-time. The Dixon-Robinson extension addresses this but requires modelling the full score path rather than just the final result.

Penalty shootouts as coin flips. The model treats shootouts as 50-50. Historical shootout records, specialist penalty takers, and goalkeeper save rates are all partially predictive and worth incorporating as the knockout rounds approach.