{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A Hierarchical model for Rugby prediction\n", "\n", ":::{post} 19 Mar, 2022\n", ":tags: hierarchical model, sports \n", ":category: intermediate, how-to\n", ":author: Peadar Coyle, Meenal Jhajharia, Oriol Abril-Pla\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we're going to reproduce the first model described in {cite:t}`baio2010bayesian` using PyMC. We then show how to sample from the posterior predictive to simulate championship outcomes from the scores, which are the modeled quantities.\n", "\n", "We apply the results of the paper to the Six Nations Championship, which is a competition between Italy, Ireland, Scotland, England, France and Wales." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Motivation\n", "Your estimate of one team's strength depends on your estimates of the other teams' strengths.\n", "\n", "Ireland are a stronger team than Italy, for example, but by how much?\n", "\n", "The 2014 results come from Wikipedia. I've added the subsequent years (2015, 2016 and 2017), also pulled manually from Wikipedia.\n", "\n", "* We want to infer a latent parameter, the 'strength' of a team, based only on their **scoring intensity**. All we have are their scores and results; we can't measure the 'strength' of a team directly.\n", "* Probabilistic programming is a brilliant paradigm for modeling these **latent** parameters.\n", "* The aim is to build a model for the upcoming Six Nations in 2018." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{include} ../extra_installs.md\n", ":::" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sáb 02 abr 2022 03:24:55 EEST\r\n" ] } ], "source": [ "!date\n", "\n", "import arviz as az\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pymc as pm\n", "import pytensor.tensor as pt\n", "import seaborn as sns\n", "\n", "from matplotlib.ticker import StrMethodFormatter\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "az.style.use(\"arviz-darkgrid\")\n", "plt.rcParams[\"figure.constrained_layout.use\"] = False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a rugby prediction exercise, so we start by loading some data, which we've taken from Wikipedia and BBC Sport." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "try:\n", "    df_all = pd.read_csv(\"../data/rugby.csv\", index_col=0)\n", "except FileNotFoundError:\n", "    df_all = pd.read_csv(pm.get_data(\"rugby.csv\"), index_col=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What do we want to infer?\n", "\n", "* We want to infer the latent parameters (every team's strength) that are generating the data we observe (the scorelines).\n", "* Moreover, we know that the scorelines are a noisy measurement of team strength, so ideally, we want a model that makes it easy to quantify our uncertainty about the underlying strengths.\n", "* Often we can't write down the posterior of a Bayesian model explicitly, so we have to 'estimate' it.\n", "* If we can't solve something exactly, we approximate it.\n", "* Markov chain Monte Carlo (MCMC) does this by drawing samples from the posterior instead.\n", "* Fortunately, this algorithm can be applied to almost any model.\n", "\n", "## What do we want?\n", "\n", "* We want to quantify our 
uncertainty\n", "* We also want to use this to generate a model\n", "* We want the answers as distributions, not point estimates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualization/EDA\n", "We should do some exploratory data analysis of this dataset.\n", "\n", "The plots should be fairly self-explanatory; we'll look at things like the differences between teams in terms of their scores." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>home_score</th><th>away_score</th><th>year</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>count</th><td>60.000000</td><td>60.000000</td><td>60.000000</td></tr>\n",
"    <tr><th>mean</th><td>23.500000</td><td>19.983333</td><td>2015.500000</td></tr>\n",
"    <tr><th>std</th><td>14.019962</td><td>12.911028</td><td>1.127469</td></tr>\n",
"    <tr><th>min</th><td>0.000000</td><td>0.000000</td><td>2014.000000</td></tr>\n",
"    <tr><th>25%</th><td>16.000000</td><td>10.000000</td><td>2014.750000</td></tr>\n",
"    <tr><th>50%</th><td>20.500000</td><td>18.000000</td><td>2015.500000</td></tr>\n",
"    <tr><th>75%</th><td>27.250000</td><td>23.250000</td><td>2016.250000</td></tr>\n",
"    <tr><th>max</th><td>67.000000</td><td>63.000000</td><td>2017.000000</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " home_score away_score year\n", "count 60.000000 60.000000 60.000000\n", "mean 23.500000 19.983333 2015.500000\n", "std 14.019962 12.911028 1.127469\n", "min 0.000000 0.000000 2014.000000\n", "25% 16.000000 10.000000 2014.750000\n", "50% 20.500000 18.000000 2015.500000\n", "75% 27.250000 23.250000 2016.250000\n", "max 67.000000 63.000000 2017.000000" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_all.describe()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>home_team</th><th>away_team</th><th>home_score</th><th>away_score</th><th>year</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>55</th><td>Italy</td><td>France</td><td>18</td><td>40</td><td>2017</td></tr>\n",
"    <tr><th>56</th><td>England</td><td>Scotland</td><td>61</td><td>21</td><td>2017</td></tr>\n",
"    <tr><th>57</th><td>Scotland</td><td>Italy</td><td>29</td><td>0</td><td>2017</td></tr>\n",
"    <tr><th>58</th><td>France</td><td>Wales</td><td>20</td><td>18</td><td>2017</td></tr>\n",
"    <tr><th>59</th><td>Ireland</td><td>England</td><td>13</td><td>9</td><td>2017</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>