A Hierarchical model for Rugby prediction#

In this example, we’re going to reproduce the first model described in Baio and Blangiardo [2010] using PyMC. We then show how to sample from the posterior predictive distribution to simulate championship outcomes from the scored goals, which are the modeled quantities.

We apply the results of the paper to the Six Nations Championship, which is a competition between Italy, Ireland, Scotland, England, France and Wales.

Motivation#

Your estimate of the strength of a team depends on your estimates of the strengths of the other teams.

Ireland are a stronger team than Italy, for example, but by how much?

The 2014 results come from Wikipedia. I’ve added the subsequent years (2015, 2016 and 2017), also pulled manually from Wikipedia.

We want to infer a latent parameter, the ‘strength’ of a team, based only on its scoring intensity. All we have are scores and results; we can’t measure the ‘strength’ of a team directly.
Probabilistic programming is a brilliant paradigm for modeling these latent parameters.
The aim is to build a model for the upcoming Six Nations in 2018.

Attention

This notebook uses libraries that are not PyMC dependencies and therefore need to be installed specifically to run this notebook.

!date

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import pytensor.tensor as pt
import seaborn as sns

from matplotlib.ticker import StrMethodFormatter

Sat 25 Apr 2026 14:19:24 EEST

az.style.use("arviz-variat")
plt.rcParams["figure.constrained_layout.use"] = False

This is a rugby prediction exercise, so we first load some data, which we have taken from Wikipedia and BBC Sport.

try:
    df_all = pd.read_csv("../data/rugby.csv", index_col=0)
except FileNotFoundError:
    df_all = pd.read_csv(pm.get_data("rugby.csv"), index_col=0)

What do we want to infer?#

We want to infer the latent parameters (every team’s strength) that are generating the data we observe (the scorelines).
Moreover, we know that the scorelines are a noisy measurement of team strength, so ideally, we want a model that makes it easy to quantify our uncertainty about the underlying strengths.
Often we can’t write down the posterior distribution of a Bayesian model explicitly, so we have to ‘estimate’ it.
If we can’t solve something, approximate it.
Markov chain Monte Carlo (MCMC) does this by drawing samples from the posterior instead of computing it analytically.
Fortunately, this algorithm can be applied to almost any model.
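
As a rough illustration of what drawing samples from a posterior means, here is a minimal random-walk Metropolis sketch for a single Poisson scoring rate. It is not the sampler PyMC uses in this notebook (that is NUTS), and the scores, the Normal(3, 1) prior on the log rate and the proposal step of 0.2 are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores for one team (illustrative values, not taken from rugby.csv)
scores = np.array([16, 20, 27, 18, 23])


def log_posterior(log_rate):
    """Unnormalized log posterior of log(rate): Poisson likelihood, Normal(3, 1) prior."""
    rate = np.exp(log_rate)
    log_lik = np.sum(scores * log_rate - rate)  # Poisson log-likelihood up to a constant
    log_prior = -0.5 * (log_rate - 3.0) ** 2
    return log_lik + log_prior


n_draws = 5000
draws = np.empty(n_draws)
current = 3.0
for i in range(n_draws):
    proposal = current + rng.normal(scale=0.2)
    # Accept the proposal with probability min(1, posterior ratio)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(current):
        current = proposal
    draws[i] = current

rate_draws = np.exp(draws[1000:])  # discard the first 1000 draws as warm-up
print(rate_draws.mean(), np.quantile(rate_draws, [0.03, 0.97]))
```

The result is a collection of plausible values for the rate rather than a single number, which is the kind of answer we are after.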

What do we want?#

We want to quantify our uncertainty
We want to also use this to generate a model
We want the answers as distributions not point estimates

Visualization/EDA#

We should do some exploratory data analysis of this dataset.

The plots should be fairly self-explanatory; we’ll look at things like the difference between teams in terms of their scores.

df_all.describe()

	home_score	away_score	year
count	60.000000	60.000000	60.000000
mean	23.500000	19.983333	2015.500000
std	14.019962	12.911028	1.127469
min	0.000000	0.000000	2014.000000
25%	16.000000	10.000000	2014.750000
50%	20.500000	18.000000	2015.500000
75%	27.250000	23.250000	2016.250000
max	67.000000	63.000000	2017.000000

# Let's look at the tail end of this dataframe
df_all.tail()

	home_team	away_team	home_score	away_score	year
55	Italy	France	18	40	2017
56	England	Scotland	61	21	2017
57	Scotland	Italy	29	0	2017
58	France	Wales	20	18	2017
59	Ireland	England	13	9	2017

There are a few things here that we don’t need. We don’t need the year for our model. But that is something that could improve a future model.

Firstly let us look at differences in scores by year.

df_all["difference"] = np.abs(df_all["home_score"] - df_all["away_score"])

(
    df_all.groupby("year")["difference"]
    .mean()
    .plot(
        kind="bar",
        title="Average magnitude of scores difference Six Nations",
        yerr=df_all.groupby("year")["difference"].std(),
        figsize=(10, 4),
    )
    .set_ylabel("Point difference (abs)")
);

../_images/2c07a88b5d94120d6daeada914bebc3be691ed88b48fa5c481ffcaef77436150.png

We can see that the standard deviation is large, so we can’t say anything about differences between years. Let’s look country by country.
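
The error bars in the plot above come from `.std()`, so they show the match-to-match spread within a year. The uncertainty of the yearly mean itself is the standard error, which is smaller by a factor of the square root of the number of matches. A small sketch on made-up numbers (the `toy` frame and its values are hypothetical, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical absolute point differences for two seasons (illustrative values only)
toy = pd.DataFrame(
    {
        "year": [2014] * 4 + [2015] * 4,
        "difference": [3, 10, 25, 6, 8, 40, 2, 14],
    }
)

grouped = toy.groupby("year")["difference"]
summary = pd.DataFrame(
    {
        "mean": grouped.mean(),
        "std": grouped.std(),  # spread of individual matches
        "sem": grouped.std() / np.sqrt(grouped.count()),  # uncertainty of the mean
    }
)
print(summary)
```

With 15 matches per season, the standard error would be roughly a quarter of the standard deviation.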

df_all["difference_non_abs"] = df_all["home_score"] - df_all["away_score"]

Let us first look at a pivot table with the mean of this difference, broken down by year.

df_all.pivot_table("difference_non_abs", "home_team", "year")

year	2014	2015	2016	2017
home_team
England	7.000000	20.666667	7.500000	21.333333
France	6.666667	0.000000	-2.333333	4.000000
Ireland	28.000000	8.500000	17.666667	7.000000
Italy	-21.000000	-31.000000	-23.500000	-33.666667
Scotland	-11.000000	-12.000000	2.500000	16.666667
Wales	25.666667	1.000000	22.000000	4.000000

Now let’s first plot this by home team without year.

(
    df_all.pivot_table("difference_non_abs", "home_team")
    .rename_axis("Home_Team")
    .plot(kind="bar", rot=0, legend=False, figsize=(10, 4))
    .set_ylabel("Score difference\nHome team and away team")
);

../_images/a9f15339cf7adaeac72999edc4aeff6670957817c874321314aa5535d8807764.png

You can see that Italy and Scotland have negative score differences on average. You can also see that England, Ireland and Wales have been the strongest teams lately at home.

(
    df_all.pivot_table("difference_non_abs", "away_team")
    .rename_axis("Away_Team")
    .plot(kind="bar", rot=0, legend=False, figsize=(10, 4))
    .set_ylabel("Score difference\nHome team and away team")
);

../_images/c25db89863f8354bd1e6a2a4c53c920992cb66b5c0f3f1214516ac15053f5faf.png

This indicates that Italy, Scotland and France all have poor away form (the difference is home score minus away score, so a positive value means the away team lost). England suffers the least when playing away from home. This aggregate view doesn’t take into account the strength of the teams.

Let us look a bit more closely at how the score difference varies by year for each team.

We see some changes in team behaviour, and we also see that Italy is a poor team.

g = sns.FacetGrid(df_all, col="home_team", col_wrap=2, height=3.5)
g.map(sns.scatterplot, "year", "difference_non_abs")
g.fig.autofmt_xdate()

../_images/cc36541d03887c9425196fc981d895a18c3ff198cf3bc81b0280b8f88a31f98c.png

g = sns.FacetGrid(df_all, col="away_team", col_wrap=2, height=3.5)
g = g.map(plt.scatter, "year", "difference_non_abs").set_axis_labels("Year", "Score Difference")
g.fig.autofmt_xdate()

../_images/f2ac12d68672dd52768581781617297ca5ab674ef2b6e68895d7095fc6576c4a.png

You can see some interesting things here, such as Wales being good away from home in 2015. In that year they won three games away from home, including a win by 40 points or so away to Italy.

So now we’ve got a feel for the data, we can proceed on with describing the model.

What assumptions do we know for our ‘generative story’?#

We know that the Six Nations in Rugby only has 6 teams - they each play each other once
We have data from the last few years
We also know that scoring in sports is commonly modelled with a Poisson distribution
We consider home advantage to be a strong effect in sports

The model.#

The league is made up of a total of T = 6 teams, playing each other once in a season. We indicate the number of points scored by the home and the away team in the g-th game of the season (15 games) as \(y_{g1}\) and \(y_{g2}\) respectively.

The vector of observed counts \(\mathbf{y} = (y_{g1}, y_{g2})\) is modelled as independent Poisson: \(y_{gj} \mid \theta_{gj} \sim \text{Poisson}(\theta_{gj})\), where the theta parameters represent the scoring intensity in the g-th game for the team playing at home (j=1) and away (j=2), respectively.

We model these parameters according to a formulation that has been used widely in the statistical literature, assuming a log-linear random effect model:

\[\log \theta_{g1} = home + att_{h(g)} + def_{a(g)}\]

\[\log \theta_{g2} = att_{a(g)} + def_{h(g)}\]

The parameter home represents the advantage for the team hosting the game and we assume that this effect is constant for all the teams and throughout the season
The scoring intensity is determined jointly by the attack and defense ability of the two teams involved, represented by the parameters att and def, respectively
Conversely, for each t = 1, …, T, the team-specific effects are modelled as exchangeable from a common distribution:
\(att_{t} \sim \text{Normal}(\mu_{att},\tau_{att})\) and \(def_{t} \sim \text{Normal}(\mu_{def},\tau_{def})\)
We did some munging above and adjustments of the data to make it tidier for our model.
Modelling the log of the home and away scoring intensities (a log link) is a standard trick in the sports analytics literature
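
To make the two log-linear equations concrete, the sketch below plugs in made-up parameter values for a single fixture. The numbers for `home`, `att` and `defense` are hypothetical and chosen only to show the arithmetic; they are not estimates from the data.

```python
import numpy as np

# Hypothetical parameter values, chosen only to illustrate the arithmetic
home = 0.2
att = {"Ireland": 0.1, "Italy": -0.3}
defense = {"Ireland": -0.4, "Italy": 0.6}

# Ireland at home against Italy, following the two log-linear equations
log_theta_home = home + att["Ireland"] + defense["Italy"]
log_theta_away = att["Italy"] + defense["Ireland"]

theta_home = np.exp(log_theta_home)
theta_away = np.exp(log_theta_away)

# Expected scoring rates and their ratio
print(theta_home, theta_away, theta_home / theta_away)
```

Because the effects add on the log scale, they multiply on the scoring scale: here the home side is expected to score about exp(1.6), roughly five times, as many points as the away side. The absolute rates are small because these equations contain no baseline term; the PyMC model in the next section adds an `intercept` that sets the overall scoring level.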

Building of the model#

We now build the model in PyMC, specifying the global parameters, the team-specific parameters and the likelihood function
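
The model code indexes the team effects with integer codes produced by `pd.factorize`. The sketch below shows what those codes look like on a tiny, made-up fixture list (the `toy` frame is hypothetical). Note that the home and away codes are only comparable because both columns contain the same set of teams and `sort=True` is used.

```python
import pandas as pd

# Hypothetical fixtures (illustrative only, not rows from rugby.csv)
toy = pd.DataFrame(
    {
        "home_team": ["Wales", "France", "Italy", "Wales"],
        "away_team": ["Italy", "Wales", "France", "France"],
    }
)

home_idx, teams = pd.factorize(toy["home_team"], sort=True)
away_idx, away_teams = pd.factorize(toy["away_team"], sort=True)

print(list(teams))  # ['France', 'Italy', 'Wales']
print(home_idx)  # [2 0 1 2]
print(away_idx)  # [1 2 0 0]

# The codes are only comparable if both columns contain the same set of teams
assert list(teams) == list(away_teams)
```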

plt.rcParams["figure.constrained_layout.use"] = True
home_idx, teams = pd.factorize(df_all["home_team"], sort=True)
away_idx, _ = pd.factorize(df_all["away_team"], sort=True)
coords = {"team": teams}

with pm.Model(coords=coords) as model:
    # constant data
    home_team = pm.Data("home_team", home_idx, dims="match")
    away_team = pm.Data("away_team", away_idx, dims="match")

    # global model parameters
    home = pm.Normal("home", mu=0, sigma=1)
    sd_att = pm.HalfNormal("sd_att", sigma=2)
    sd_def = pm.HalfNormal("sd_def", sigma=2)
    intercept = pm.Normal("intercept", mu=3, sigma=1)

    # team-specific model parameters
    atts_star = pm.Normal("atts_star", mu=0, sigma=sd_att, dims="team")
    defs_star = pm.Normal("defs_star", mu=0, sigma=sd_def, dims="team")

    atts = pm.Deterministic("atts", atts_star - pt.mean(atts_star), dims="team")
    defs = pm.Deterministic("defs", defs_star - pt.mean(defs_star), dims="team")
    home_theta = pt.exp(intercept + home + atts[home_idx] + defs[away_idx])
    away_theta = pt.exp(intercept + atts[away_idx] + defs[home_idx])

    # likelihood of observed data
    home_points = pm.Poisson(
        "home_points",
        mu=home_theta,
        observed=df_all["home_score"],
        dims=("match"),
    )
    away_points = pm.Poisson(
        "away_points",
        mu=away_theta,
        observed=df_all["away_score"],
        dims=("match"),
    )
    trace = pm.sample(1000, tune=1500, cores=4)

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [home, sd_att, sd_def, intercept, atts_star, defs_star]

Sampling 4 chains for 1_500 tune and 1_000 draw iterations (6_000 + 4_000 draws total) took 17 seconds.

We specified the model and the likelihood function
All this runs on a PyTensor graph under the hood
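
One detail of the model above is that `atts` and `defs` are the raw effects minus their mean, which amounts to a sum-to-zero constraint: each team is measured relative to an average team, and the `intercept` carries the overall scoring level. A small numpy sketch with made-up raw effects shows that the centered values do not change when a constant is added to every team:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical raw attack effects for six teams
atts_star = rng.normal(size=6)
shifted = atts_star + 5.0  # add the same constant to every team

atts = atts_star - atts_star.mean()
atts_shifted = shifted - shifted.mean()

print(np.allclose(atts, atts_shifted))  # True
print(np.isclose(atts.sum(), 0.0))  # True
```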

az.plot_trace_dist(trace, var_names=["intercept", "home", "sd_att", "sd_def"], compact=False);

../_images/eb25870cacb420c97b3dfe179290d08d7603663a70945dfdd77380e1e0e4da26.png

Let us apply good statistical workflow practices and look at the various evaluation metrics to see if our NUTS sampler converged.

az.plot_energy(trace);

../_images/10c40d23acee42aa13b931a375519c49a65800a4790fbc874c61a91032fbc45e.png

az.summary(trace, kind="diagnostics")

	ess_bulk	ess_tail	r_hat	mcse_mean	mcse_sd
home	2646	2688	1.01	0.00075	0.00053
intercept	2543	2609	1.00	0.00059	0.00041
atts_star[England]	1308	1316	1.00	0.0042	0.0048
atts_star[France]	1339	1304	1.00	0.0041	0.0047
atts_star[Ireland]	1300	1285	1.00	0.0042	0.0047
atts_star[Italy]	1424	1470	1.00	0.0042	0.0046
atts_star[Scotland]	1335	1486	1.00	0.0042	0.0047
atts_star[Wales]	1273	1272	1.00	0.0042	0.0048
defs_star[England]	1307	1277	1.00	0.0062	0.0071
defs_star[France]	1283	1229	1.00	0.0062	0.0068
defs_star[Ireland]	1304	1191	1.00	0.0062	0.0068
defs_star[Italy]	1258	1134	1.00	0.0062	0.007
defs_star[Scotland]	1281	1149	1.00	0.0062	0.007
defs_star[Wales]	1257	1194	1.00	0.0063	0.007
sd_att	1848	1710	1.00	0.004	0.0065
sd_def	2329	1704	1.00	0.0054	0.01
atts[England]	4940	3252	1.00	0.00059	0.00041
atts[France]	4776	2451	1.00	0.00065	0.00047
atts[Ireland]	4497	3048	1.00	0.00063	0.00044
atts[Italy]	4605	3356	1.00	0.00079	0.00056
atts[Scotland]	4939	3090	1.00	0.00068	0.00048
atts[Wales]	4341	2861	1.00	0.00064	0.00046
defs[England]	4427	3211	1.00	0.00078	0.00054
defs[France]	4275	3387	1.00	0.0007	0.00049
defs[Ireland]	4571	3222	1.00	0.00081	0.00055
defs[Italy]	3869	3057	1.00	0.0006	0.00042
defs[Scotland]	4284	3496	1.00	0.00066	0.00047
defs[Wales]	4612	3399	1.00	0.00071	0.00051

Our model has converged well and \(\hat{R}\) looks good.
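
For intuition about what \(\hat{R}\) measures, here is a simplified sketch of the classic between-chain versus within-chain comparison. It is an assumption-laden toy: ArviZ reports a more robust rank-normalized split variant, so `basic_rhat` below is a hypothetical helper for illustration and will not reproduce the table exactly.

```python
import numpy as np


def basic_rhat(chains):
    """Classic (non-split, non-rank-normalized) R-hat for an array of shape (chain, draw)."""
    n_draws = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()
    between = n_draws * chains.mean(axis=1).var(ddof=1)
    var_hat = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(var_hat / within)


rng = np.random.default_rng(2)
good = rng.normal(size=(4, 1000))  # four chains exploring the same distribution
bad = good + np.array([[0.0], [0.0], [0.0], [3.0]])  # one chain stuck elsewhere

print(basic_rhat(good), basic_rhat(bad))
```

Values close to 1 indicate that the chains agree with each other; the shifted chain in `bad` pushes the statistic well above 1.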

Let us look at some of the stats, just to verify that our model has returned the correct attributes. We can see that some teams are stronger than others, which is what we would expect for the attack effects.

trace_hdi = az.hdi(trace)
trace_hdi["atts"]

<xarray.DataArray 'atts' (team: 6, ci_bound: 2)> Size: 96B
array([[ 0.19360729,  0.32635809],
       [-0.15493215, -0.01367357],
       [ 0.03895445,  0.17310826],
       [-0.42224107, -0.25196746],
       [-0.18742169, -0.03475285],
       [ 0.10229217,  0.23516078]])
Coordinates:
  * team      (team) <U8 192B 'England' 'France' ... 'Scotland' 'Wales'
  * ci_bound  (ci_bound) <U5 40B 'lower' 'upper'

trace.posterior["atts"].median(("chain", "draw"))

<xarray.DataArray 'atts' (team: 6)> Size: 48B
array([ 0.2552428 , -0.08314481,  0.10774339, -0.33459167, -0.11661618,
        0.1719406 ])
Coordinates:
  * team     (team) <U8 192B 'England' 'France' 'Ireland' ... 'Scotland' 'Wales'

Results#

From the above we can start to understand the different distributions of attacking strength and defensive strength. These are probabilistic estimates and help us better understand the uncertainty in sports analytics

az.plot_forest(trace, var_names=["atts"], combined=True);

../_images/3db3712e2598f25a2fc7c46f626307c3937c2d474748716d428ed7f3f42be0ca.png

This is one of the powerful things about Bayesian modelling: we can quantify the uncertainty of our estimates. We’ve got a Bayesian credible interval for the attack strength of different countries.

We can see an overlap between Ireland, Wales and England which is what you’d expect since these teams have won in recent years.

Italy is well behind everyone else - which is what we’d expect and there’s an overlap between Scotland and France which seems about right.

There are probably some effects we’d like to add in here, like weighting more recent results more strongly. However that’d be a much more complicated model.

az.plot_forest(trace, var_names=["defs"], combined=True);

../_images/1a1711b05ce78ee5d65a1d232186ebc866730177b08875b3d1ea64bb2a4a5940.png

Good teams like Ireland and England have a strong negative defense effect, which is what we expect, since a negative defense effect lowers the opponent’s scoring rate. We expect our strong teams to have strong positive effects in attack and strong negative effects in defense.

This approach of looking at parameters and examining them is part of a good statistical workflow. We also think that perhaps our priors could be better specified. However, this is beyond the scope of this article. For a good discussion of ‘statistical workflow’, we recommend Robust Statistical Workflow with RStan.

Let’s do some other plots, so we can see the range of our defensive effects.

az.plot_dist(trace, var_names=["defs"]);

../_images/5153414d9d2299bbc5bf6650a214867346d62bb0c2ace2566219b648bfa03945.png

We can see that Ireland’s mean is -0.39, which means we expect Ireland to have a strong defense. This is what we’d expect: even in games it loses, Ireland generally doesn’t lose by, say, 50 points. The 94% HDI is between -0.491 and -0.28.

In comparison, for Italy we see a strong positive effect, with a mean of 0.58 and an HDI between 0.51 and 0.65. This means that we’d expect Italy to concede a lot of points, compared to what it scores. Given that Italy often loses by 30 to 60 points, this seems correct.

We see here also that this informs what other priors we could bring into this. We could bring some sort of world ranking as a prior.

As of December 2017 the rugby rankings indicate that England is 2nd in the world, Ireland 3rd, Scotland 5th, Wales 7th, France 9th and Italy 14th. We could bring that into a model and it can explain some of the fact that Italy is apart from a lot of the other teams.

Now let’s simulate who wins over a total of 4000 simulations, one per sample in the posterior.

with model:
    pm.sample_posterior_predictive(trace, extend_inferencedata=True)
pp = trace.posterior_predictive
const = trace.constant_data
team_da = trace.posterior.team

Sampling: [away_points, home_points]

The posterior predictive samples contain the goals scored by each team in each match. We modeled, and therefore simulated, according to scoring and defensive powers, using goals as the observed variable.

Our goal now is to see who wins the competition, so we can estimate the probability each team has of winning the whole competition. To do that, we need to convert the scored goals to points:

# fmt: off
pp["home_win"] = (
    (pp["home_points"] > pp["away_points"]) * 3     # home team wins and gets 3 points
    + (pp["home_points"] == pp["away_points"]) * 2  # tie -> home team gets 2 points
)
pp["away_win"] = (
    (pp["home_points"] < pp["away_points"]) * 3
    + (pp["home_points"] == pp["away_points"]) * 2
)
# fmt: on

Then add the points each team has collected throughout all matches:

groupby_sum_home = pp.home_win.groupby(team_da[const.home_team]).sum()
groupby_sum_away = pp.away_win.groupby(team_da[const.away_team]).sum()

pp["teamscores"] = groupby_sum_home + groupby_sum_away
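
The same two steps, converting scores to table points and summing them per team, can be sketched in plain numpy for one simulated mini-season. The three teams, fixtures and scores below are made up for illustration, and `np.bincount` stands in for the xarray `groupby` used above.

```python
import numpy as np

teams = np.array(["England", "Ireland", "Italy"])

# One hypothetical simulated mini-season (illustrative values only)
home_idx = np.array([0, 1, 2])  # England, Ireland, Italy play at home
away_idx = np.array([1, 2, 0])  # against Ireland, Italy, England
home_score = np.array([20, 30, 10])
away_score = np.array([20, 12, 25])

# Same scoring rule as above: 3 table points for a win, 2 for a tie, 0 for a loss
home_pts = (home_score > away_score) * 3 + (home_score == away_score) * 2
away_pts = (home_score < away_score) * 3 + (home_score == away_score) * 2

# Sum home and away table points per team
table = np.bincount(home_idx, weights=home_pts, minlength=len(teams)) + np.bincount(
    away_idx, weights=away_pts, minlength=len(teams)
)
print(dict(zip(teams.tolist(), table.tolist())))
# {'England': 5.0, 'Ireland': 5.0, 'Italy': 0.0}
```

In this toy season England and Ireland finish level on points, which is exactly the kind of tie the ranking step has to handle.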

And finally generate the ranks of all teams for each of the 4000 simulations. As our data is stored in xarray objects inside the InferenceData class, we will use xarray-einstats:

from xarray_einstats.stats import rankdata

pp["rank"] = rankdata(-pp["teamscores"], dims="team", method="min")
pp["rank"].sel(team="England")

<xarray.DataArray 'rank' (chain: 4, draw: 1000)> Size: 32kB
array([[2, 2, 2, ..., 3, 2, 3],
       [3, 1, 1, ..., 2, 1, 2],
       [2, 2, 3, ..., 3, 2, 1],
       [3, 1, 2, ..., 3, 3, 1]], shape=(4, 1000))
Coordinates:
  * chain    (chain) int64 32B 0 1 2 3
  * draw     (draw) int64 8kB 0 1 2 3 4 5 6 7 ... 993 994 995 996 997 998 999
    team     <U7 28B 'England'

As you can see, we now have a collection of 4000 integers between 1 and 6 for each team, 1 meaning they win the competition. We can use a histogram with bin edges at half integers to count and normalize how many times each team finishes in each position:

from xarray_einstats.numba import histogram

bin_edges = np.arange(7) + 0.5
data_sim = (
    histogram(pp["rank"], dims=("chain", "draw"), bins=bin_edges, density=True)
    .rename({"bin": "rank"})
    .assign_coords(rank=np.arange(6) + 1)
)
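
The ranking and histogram steps can be checked by hand on a tiny example. The sketch below uses plain numpy instead of xarray-einstats, with made-up table points for three teams over four simulated seasons; with bins of width one, `density=True` returns the share of simulations in each position.

```python
import numpy as np

# Hypothetical table points for three teams in four simulated seasons
# (rows are simulations, columns are teams; values are illustrative only)
teamscores = np.array(
    [
        [5, 5, 0],
        [9, 3, 0],
        [3, 6, 3],
        [6, 2, 4],
    ]
)

# "min" ranking of the negated scores: 1 + number of teams with strictly more points
rank = 1 + (teamscores[:, None, :] > teamscores[:, :, None]).sum(axis=-1)
print(rank)

# Share of simulations in which each team finishes in each position
bin_edges = np.arange(4) + 0.5  # 0.5, 1.5, 2.5, 3.5
shares = np.array(
    [np.histogram(rank[:, t], bins=bin_edges, density=True)[0] for t in range(3)]
)
print(shares)
```

Each row of `shares` sums to one, but the first-place column sums to 1.25 here because two teams share first place in the first simulation.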

Now that we have reduced the data to a 2 dimensional array, we will convert it to a pandas DataFrame which is now a more adequate choice to work with our data:

idx_dim, col_dim = data_sim.dims
sim_table = pd.DataFrame(data_sim, index=data_sim[idx_dim], columns=data_sim[col_dim])

fig, ax = plt.subplots(figsize=(8, 4))
ax = sim_table.T.plot(kind="barh", ax=ax)
ax.xaxis.set_major_formatter(StrMethodFormatter("{x:.1%}"))
ax.set_xlabel("Rank-wise Probability of results for all six teams")
ax.set_yticklabels(np.arange(1, 7))
ax.set_ylabel("Ranks")
ax.invert_yaxis()
ax.legend(loc="best", fontsize="medium");

../_images/8173b12f6d6630eb4eac4a72db16bab273762f4cbee854143772f1467c13fea3.png

We see according to this model that Ireland finishes with the most points about 60% of the time, and England finishes with the most points 45% of the time and Wales finishes with the most points about 10% of the time. (Note that these probabilities do not sum to 100% since there is a non-zero chance of a tie atop the table.)

As an Irish rugby fan, I like this model. However, it indicates some problems with shrinkage and bias, since recent form suggests England will win.

Nevertheless the point of this model was to illustrate how a Hierarchical model could be applied to a sports analytics problem, and illustrate the power of PyMC.

Covariates#

We should do some exploration of the variables

az.plot_pair(
    trace,
    var_names=["atts"],
    visuals={"divergence": True},
);

../_images/beabd495a9e12a1d1f4d9100b4f5a775a973c7a059e08cbfb090963dd1f7ca1d.png

We observe that there isn’t a lot of correlation between these covariates, other than the weaker teams like Italy have a more negative distribution of these variables. Nevertheless this is a good method to get some insight into how the variables are behaving.

Authors#

Adapted from Daniel Weitzenfeld’s blog post by Peadar Coyle. The original blog post was based on the work of [Baio and Blangiardo, 2010]
Updated by Meenal Jhajharia to use ArviZ and xarray
Updated by Oriol Abril-Pla to use PyMC v4 and xarray-einstats
Updated by Osvaldo Martin Dec 2025
Updated by Osvaldo Martin Apr 2026

References#

[1] Gianluca Baio and Marta Blangiardo. Bayesian hierarchical model for the prediction of football results. Journal of Applied Statistics, 37(2):253–264, 2010.

Watermark#

%load_ext watermark
%watermark -n -u -v -iv -w -p xarray,aeppl,numba,xarray_einstats

Last updated: Sat, 25 Apr 2026

Python implementation: CPython
Python version       : 3.14.4
IPython version      : 9.12.0

xarray         : 2026.4.0
aeppl          : unknown
numba          : 0.65.0
xarray_einstats: 0.10.0

arviz          : 1.1.0
matplotlib     : 3.10.8
numpy          : 2.4.4
pandas         : 3.0.2
pymc           : 5.28.0+58.gf58491a3
pytensor       : 2.38.0+133.g80cc113b5
seaborn        : 0.13.2
xarray_einstats: 0.10.0

Watermark: 2.6.0

License notice#

All the notebooks in this example gallery are provided under the MIT License which allows modification, and redistribution for any use provided the copyright and license notices are preserved.

Citing PyMC examples#

To cite this notebook, use the DOI provided by Zenodo for the pymc-examples repository.

Important

Many notebooks are adapted from other sources: blogs, books… In such cases you should cite the original source as well.

Also remember to cite the relevant libraries used by your code.

Here is a citation template in bibtex:

@incollection{citekey,
  author    = "<notebook authors, see above>",
  title     = "<notebook title>",
  editor    = "PyMC Team",
  booktitle = "PyMC examples",
  doi       = "10.5281/zenodo.5654871"
}
