{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(item_response_nba)=\n", "# NBA Foul Analysis with Item Response Theory\n", "\n", ":::{post} Apr 17, 2022\n", ":tags: hierarchical model, case study, generalized linear model \n", ":category: intermediate, tutorial\n", ":author: Austin Rochford, Lorenzo Toniazzi\n", ":::" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running on PyMC v4.0.0b6\n" ] } ], "source": [ "import os\n", "\n", "import arviz as az\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pymc as pm\n", "\n", "%matplotlib inline\n", "print(f\"Running on PyMC v{pm.__version__}\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "RANDOM_SEED = 8927\n", "rng = np.random.default_rng(RANDOM_SEED)\n", "az.style.use(\"arviz-darkgrid\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "This tutorial shows an application of Bayesian Item Response Theory {cite:p}`fox2010bayesian` to NBA foul call data using PyMC. It is based on Austin Rochford's blogpost [NBA Foul Calls and Bayesian Item Response Theory](https://www.austinrochford.com/posts/2017-04-04-nba-irt.html).\n", "\n", "### Motivation\n", "Our scenario is that we observe a binary outcome (a foul being called or not) from an interaction (a basketball play) of two agents with two different roles (the player committing the alleged foul and the player disadvantaged in the play). Moreover, each committing or disadvantaged agent is an individual who might be observed several times (say, LeBron James observed committing a foul in more than one play). It might then be that not only the agent's role, but also the abilities of the individual player, contribute to the observed outcome. 
And so we'd like to __estimate the contribution to the observed outcome of each individual's (latent) ability as a committing or disadvantaged agent.__ This would allow us, for example, to rank players from more to less effective, quantify the uncertainty in this ranking and discover extra hierarchical structures involved in foul calls. All pretty useful stuff! \n", " \n", "\n", "So how can we study this common and complex __multi-agent interaction__ scenario, with __hierarchical__ structures between more than a thousand individuals? \n", "\n", "Despite the scenario's overwhelming complexity, Bayesian Item Response Theory combined with powerful modern statistical software allows for quite elegant and effective modeling options. One of these options employs a {term}`Generalized Linear Model` called the [Rasch model](https://en.wikipedia.org/wiki/Rasch_model), which we now discuss in more detail.\n", "\n", "\n", "### Rasch Model\n", "We sourced our data from the official [NBA Last Two Minutes Reports](https://official.nba.com/2020-21-nba-officiating-last-two-minute-reports/), with game data from 2015 to 2021. In this dataset, each row k is one play involving two players (the committing and the disadvantaged) where a foul has either been called or not. So we model the probability p_k that a referee calls a foul in play k as a function of the players involved. Hence we define two latent variables for each player, namely:\n", "- theta: which estimates the player's ability to have a foul called when disadvantaged, and\n", "- b: which estimates the player's ability to have a foul not called when committing.\n", "\n", "Note that the higher a player's parameters, the better the outcome for the player's team. These two parameters are then estimated using a standard Rasch model, by assuming the log-odds of p_k equal theta-b for the corresponding players involved in play k. 
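In code, this link is just a sigmoid applied to a difference of skills. As a toy illustration (the skill values below are made up, purely for illustration):

```python
import numpy as np

# Hypothetical latent skills for a single play
theta_i = 0.8   # disadvantaged player's ability to draw a call
b_j = -0.2      # committing player's ability to avoid a call

eta = theta_i - b_j              # log-odds of a foul being called
p = 1.0 / (1.0 + np.exp(-eta))   # sigmoid(eta), about 0.73
```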
Also, we place hierarchical hyperpriors on all theta's and all b's to account for shared abilities between players and for the largely different numbers of observations for different players.\n", "\n", "\n", "### Discussion \n", "Our analysis gives an estimate of the latent skills theta and b for each player in terms of posterior distributions. We analyze this outcome in three ways. \n", "\n", "We first display the role of shared hyperpriors, by showing how the posteriors of players with few observations are drawn toward the league average.\n", "\n", "Secondly, we rank the posteriors by their mean to view the best and worst committing and disadvantaged players, and observe that several players still rank in the top 10 of the same model estimated in [Austin Rochford's blogpost](https://www.austinrochford.com/posts/2017-04-04-nba-irt.html) on different data.\n", "\n", "Thirdly, we show how we spot that grouping players by their position is likely to be an informative extra hierarchical layer to introduce in our model, and leave this as an exercise for the interested reader. Let us conclude by mentioning that this ease of adding informed hierarchical structure to a model is one of the features that make Bayesian modelling very flexible and powerful for quantifying uncertainty in scenarios where introducing (or discovering) problem-specific knowledge is crucial.\n", "\n", "\n", "The analysis in this notebook is performed in four main steps: \n", "\n", "1. Data collection and processing.\n", "2. Definition and instantiation of the Rasch model. \n", "3. Posterior sampling and convergence checks.\n", "4. Analysis of the posterior results.\n", "\n", "## Data collection and processing\n", "We first import data from the original data set, which can be found at [this URL](https://raw.githubusercontent.com/polygraph-cool/last-two-minute-report/32f1c43dfa06c2e7652cc51ea65758007f2a1a01/output/all_games.csv). Each row corresponds to a play between the NBA seasons 2015-16 and 2020-21. 
We imported only five columns, namely:\n", "- committing: the name of the committing player in the play.\n", "- disadvantaged: the name of the disadvantaged player in the play.\n", "- decision: the reviewed decision of the play, which can take four values, namely:\n", " - CNC: correct noncall, INC: incorrect noncall, IC: incorrect call, CC: correct call.\n", "- committing_position: the position of the committing player, which can take the values\n", " - G: guard, F: forward, C: center, G-F, F-G, F-C, C-F.\n", "- disadvantaged_position: the position of the disadvantaged player, with possible values as above.\n", "\n", "We note that we already removed from the original dataset the plays where fewer than two players are involved (for example travel calls or clock violations). Also, the original dataset does not contain information on the players' positions, which we added ourselves." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>(HTML table output stripped during extraction; see the plain-text rendering of this DataFrame)</div>\n", "
" ], "text/plain": [ " committing disadvantaged decision committing_position \\\n", "play_id \n", "1 Avery Bradley Stephen Curry CNC G \n", "3 Isaiah Thomas Andre Iguodala CNC G \n", "4 Jae Crowder Harrison Barnes INC F \n", "6 Draymond Green Isaiah Thomas CC F \n", "7 Avery Bradley Stephen Curry CC G \n", "\n", " disadvantaged_position \n", "play_id \n", "1 G \n", "3 G-F \n", "4 F \n", "6 G \n", "7 G " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "try:\n", " df_orig = pd.read_csv(os.path.join(\"..\", \"data\", \"item_response_nba.csv\"), index_col=0)\n", "except FileNotFoundError:\n", " df_orig = pd.read_csv(pm.get_data(\"item_response_nba.csv\"), index_col=0)\n", "df_orig.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now process our data in three steps:\n", " 1. We create a dataframe df by removing the position information from df_orig, and we create a dataframe df_position collecting all players with the respective position. (This last dataframe will not be used until the very end of the notebook.)\n", " 2. We add a column to df, called foul_called, that assigns 1 to a play if a foul was called, and 0 otherwise.\n", " 3. We assign IDs to committing and disadvantaged players and use this indexing to identify the respective players in each observed play.\n", "\n", "Finally, we display the head of our main dataframe df along with some basic statistics." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of observed plays: 46861\n", "Number of disadvantaged players: 770\n", "Number of committing players: 789\n", "Global probability of a foul being called: 23.3%\n", "\n", "\n" ] }, { "data": { "text/html": [ "
<div>(HTML table output stripped during extraction; see the plain-text rendering of this DataFrame)</div>\n", "
" ], "text/plain": [ " committing disadvantaged decision foul_called\n", "play_id \n", "1 Avery Bradley Stephen Curry CNC 0\n", "3 Isaiah Thomas Andre Iguodala CNC 0\n", "4 Jae Crowder Harrison Barnes INC 0\n", "6 Draymond Green Isaiah Thomas CC 1\n", "7 Avery Bradley Stephen Curry CC 1" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 1. Construct df and df_position\n", "df = df_orig[[\"committing\", \"disadvantaged\", \"decision\"]]\n", "\n", "df_position = pd.concat(\n", " [\n", " df_orig.groupby(\"committing\").committing_position.first(),\n", " df_orig.groupby(\"disadvantaged\").disadvantaged_position.first(),\n", " ]\n", ").to_frame()\n", "df_position = df_position[~df_position.index.duplicated(keep=\"first\")]\n", "df_position.index.name = \"player\"\n", "df_position.columns = [\"position\"]\n", "\n", "\n", "# 2. Create the binary foul_called variable\n", "def foul_called(decision):\n", " \"\"\"Correct and incorrect noncalls (CNC and INC) take value 0.\n", " Correct and incorrect calls (CC and IC) take value 1.\n", " \"\"\"\n", " out = 0\n", " if (decision == \"CC\") | (decision == \"IC\"):\n", " out = 1\n", " return out\n", "\n", "\n", "df = df.assign(foul_called=lambda df: df[\"decision\"].apply(foul_called))\n", "\n", "# 3 We index observed calls by committing and disadvantaged players\n", "committing_observed, committing = pd.factorize(df.committing, sort=True)\n", "disadvantaged_observed, disadvantaged = pd.factorize(df.disadvantaged, sort=True)\n", "df.index.name = \"play_id\"\n", "\n", "# Display of main dataframe with some statistics\n", "print(f\"Number of observed plays: {len(df)}\")\n", "print(f\"Number of disadvantaged players: {len(disadvantaged)}\")\n", "print(f\"Number of committing players: {len(committing)}\")\n", "print(f\"Global probability of a foul being called: \" f\"{100*round(df.foul_called.mean(),3)}%\\n\\n\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, 
"source": [ "## Item Response Model\n", "\n", "### Model definition\n", "\n", "We denote by:\n", "- $N_d$ and $N_c$ the number of disadvantaged and committing players, respectively, \n", "- $K$ the number of plays,\n", "- $k$ a play, \n", "- $y_k$ the observed call/noncall in play $k$,\n", "- $p_k$ the probability of a foul being called in play $k$,\n", "- $i(k)$ the disadvantaged player in play $k$, and by\n", "- $j(k)$ the committing player in play $k$.\n", "\n", "We assume that each disadvantaged player is described by the latent variable: \n", "- $\\theta_i$ for $i=1,2,...,N_d$,\n", "\n", "and each committing player is described by the latent variable: \n", "- $b_j$ for $j=1,2,...,N_c$.\n", "\n", "Then we model each observation $y_k$ as the result of an independent Bernoulli trial with probability $p_k$, where\n", "\n", "$$\n", "p_k =\\text{sigmoid}(\\eta_k)=\\left(1+e^{-\\eta_k}\\right)^{-1},\\quad\\text{with}\\quad \\eta_k=\\theta_{i(k)}-b_{j(k)},\n", "$$\n", "\n", "for $k=1,2,...,K$, by defining (via a [non-centered parametrisation](https://twiecki.io/blog/2017/02/08/bayesian-hierchical-non-centered/))\n", "\n", "\\begin{align*}\n", "\\theta_{i}&= \\sigma_\\theta\\Delta_{\\theta,i}+\\mu_\\theta\\sim \\text{Normal}(\\mu_\\theta,\\sigma_\\theta^2), &i=1,2,...,N_d,\\\\\n", "b_{j}&= \\sigma_b\\Delta_{b,j}\\sim \\text{Normal}(0,\\sigma_b^2), &j=1,2,...,N_c,\n", "\\end{align*}\n", "\n", "with priors/hyperpriors\n", "\n", "\\begin{align*}\n", "\\Delta_{\\theta,i}&\\sim \\text{Normal}(0,1), &i=1,2,...,N_d,\\\\\n", "\\Delta_{b,j}&\\sim \\text{Normal}(0,1), &j=1,2,...,N_c,\\\\\n", "\\mu_\\theta&\\sim \\text{Normal}(0,100),\\\\\n", "\\sigma_\\theta &\\sim \\text{HalfCauchy}(2.5),\\\\\n", "\\sigma_b &\\sim \\text{HalfCauchy}(2.5).\n", "\\end{align*}\n", "\n", "Note that $p_k$ is always dependent on $\\mu_\\theta,\\,\\sigma_\\theta$ and $\\sigma_b$ (\"pooled priors\") and also depends on the actual players involved in the play due to $\\Delta_{\\theta,i}$ and 
$\Delta_{b,j}$ (\"unpooled priors\"). This means our model features partial pooling. Moreover, note that we do not pool $\theta$'s with $b$'s, hence assuming these skills are independent even for the same player. Also, note that we normalised the mean of $b_{j}$ to zero. \n", "\n", "Finally, notice how we worked backwards from our data to construct this model. This is a very natural way to construct a model, allowing us to quickly see how each variable connects to the others and what intuition it carries. Meanwhile, when instantiating the model below, the construction goes in the opposite direction, i.e. starting from the priors and moving up to the observations.\n", "\n", "### PyMC implementation\n", "We now implement the model above in PyMC. Note that, to easily keep track of the players (as we have hundreds of them being both committing and disadvantaged), we make use of the coords argument for {class}`pymc.Model`. (For tutorials on this functionality see the notebook {ref}`data_container` or [this blogpost](https://oriolabrilpla.cat/python/arviz/pymc3/xarray/2020/09/22/pymc3-arviz.html).) We choose our priors to be the same as in [Austin Rochford's post](https://www.austinrochford.com/posts/2017-04-04-nba-irt.html), to make the comparison consistent."
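,
 "\n",
 "As a quick aside, the generative process defined above can be simulated directly in NumPy (a toy sketch with made-up player counts and fixed hyperparameter values, purely illustrative and not part of the fitted model):\n",
 "\n",
 "```python\n",
 "import numpy as np\n",
 "\n",
 "rng = np.random.default_rng(0)\n",
 "N_d, N_c, K = 5, 4, 10  # toy numbers of players and plays\n",
 "mu_theta, sigma_theta, sigma_b = 0.0, 0.5, 0.5  # fixed for the sketch\n",
 "\n",
 "# Non-centered parametrisation\n",
 "theta = sigma_theta * rng.standard_normal(N_d) + mu_theta\n",
 "b = sigma_b * rng.standard_normal(N_c)\n",
 "\n",
 "i = rng.integers(0, N_d, size=K)  # disadvantaged player in each play\n",
 "j = rng.integers(0, N_c, size=K)  # committing player in each play\n",
 "eta = theta[i] - b[j]\n",
 "p = 1.0 / (1.0 + np.exp(-eta))  # sigmoid\n",
 "y = rng.binomial(1, p)  # simulated foul calls\n",
 "```"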
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "coords = {\"disadvantaged\": disadvantaged, \"committing\": committing}\n", "\n", "with pm.Model(coords=coords) as model:\n", " # Data\n", " foul_called_observed = pm.Data(\"foul_called_observed\", df.foul_called, mutable=False)\n", "\n", " # Hyperpriors\n", " mu_theta = pm.Normal(\"mu_theta\", 0.0, 100.0)\n", " sigma_theta = pm.HalfCauchy(\"sigma_theta\", 2.5)\n", " sigma_b = pm.HalfCauchy(\"sigma_b\", 2.5)\n", "\n", " # Priors\n", " Delta_theta = pm.Normal(\"Delta_theta\", 0.0, 1.0, dims=\"disadvantaged\")\n", " Delta_b = pm.Normal(\"Delta_b\", 0.0, 1.0, dims=\"committing\")\n", "\n", " # Deterministic\n", " theta = pm.Deterministic(\"theta\", Delta_theta * sigma_theta + mu_theta, dims=\"disadvantaged\")\n", " b = pm.Deterministic(\"b\", Delta_b * sigma_b, dims=\"committing\")\n", " eta = pm.Deterministic(\"eta\", theta[disadvantaged_observed] - b[committing_observed])\n", "\n", " # Likelihood\n", " y = pm.Bernoulli(\"y\", logit_p=eta, observed=foul_called_observed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now plot our model to show the hierarchical structure (and the non-centered parametrisation) on the variables theta and b." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pm.model_to_graphviz(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sampling and convergence\n", "\n", "We now sample from our Rasch model." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Auto-assigning NUTS sampler...\n", "Initializing NUTS using jitter+adapt_diag...\n", "Multiprocess sampling (4 chains in 4 jobs)\n", "NUTS: [mu_theta, sigma_theta, sigma_b, Delta_theta, Delta_b]\n" ] }, { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "