{ "cells": [ { "cell_type": "markdown", "id": "domestic-remove", "metadata": {}, "source": [ "(GLM-binomial-regression)=\n", "# Binomial regression\n", "\n", ":::{post} February, 2022\n", ":tags: binomial regression, generalized linear model, \n", ":category: beginner\n", ":author: Benjamin T. Vincent\n", ":::" ] }, { "cell_type": "markdown", "id": "72588976-efc3-4adc-bec2-bc5b6ac4b7e1", "metadata": {}, "source": [ "This notebook covers the logic behind [Binomial regression](https://en.wikipedia.org/wiki/Binomial_regression), a specific instance of Generalized Linear Modelling. The example is kept very simple, with a single predictor variable. \n", "\n", "It helps to recap logistic regression to understand when binomial regression is applicable. Logistic regression is useful when your outcome variable is a set of successes or fails, that is, a series of `0`, `1` observations. An example of this kind of outcome variable is \"Did you go for a run today?\" Binomial regression (aka aggregated binomial regression) is useful when you have a certain number of successes out of $n$ trials. So the example would be, \"How many days did you go for a run in the last 7 days?\" \n", "\n", "The observed data are a set of _counts_ of number of successes out of $n$ total trials. Many people might be tempted to reduce this data to a proportion, but this is not necessarily a good idea. For example, proportions are not directly measured, they are often best treated as latent variables to be estimated. Also, a proportion looses information: a proportion of 0.5 could respond to 1 run out of 2 days, or to 4 runs in the last 4 weeks, or many other things, but you have lost that information by paying attention to the proportion alone. \n", "\n", "The appropriate likelihood for binomial regression is the Binomial distribution:\n", "\n", "$$\n", "y_i \\sim \\text{Binomial}(n, p_i)\n", "$$\n", "\n", "where $y_i$ is a count of the number of successes out of $n$ trials, and $p_i$ is the (latent) probability of success. What we want to achieve with Binomial regression is to use a linear model to accurately estimate $p_i$ (i.e. $p_i = \\beta_0 + \\beta_1 \\cdot x_i$). So we could try to do this with a likelihood term like:\n", "\n", "$$\n", "y_i \\sim \\text{Binomial}(n, \\beta_0 + \\beta_1 \\cdot x_i)\n", "$$\n", "\n", "If we did this, we would quickly run into problems when the linear model generates values of $p$ outside the range of $0-1$. This is where the link function comes in:\n", "\n", "$$\n", "g(p_i) = \\beta_0 + \\beta_1 \\cdot x_i\n", "$$\n", "\n", "where $g()$ is a link function. This can be thought of as a transformation that maps proportions in the range $(0, 1)$ to the domain $(-\\infty, +\\infty)$. There are a number of potential functions that could be used, but a common one to use is the [Logit function](https://en.wikipedia.org/wiki/Logit).\n", "\n", "Although what we actually want to do is to rearrange this equation for $p_i$ so that we can enter it into the likelihood function. This results in:\n", "\n", "$$\n", "p_i= g^{-1}(\\beta_0 + \\beta_1 \\cdot x_i)\n", "$$\n", "\n", "where $g^{-1}()$ is the inverse of the link function, in this case the inverse of the Logit function (i.e. the [logistic sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function) function, also known as the expit function). So if we enter this into our likelihood function we end up with:\n", "\n", "$$\n", "y_i \\sim \\text{Binomial}(n, \\text{InverseLogit}(\\beta_0 + \\beta_1 \\cdot x_i))\n", "$$\n", "\n", "This defines our likelihood function. All you need now to get some Bayesian Binomial regression done is priors over the $\\beta$ parameters. The observed data are $y_i$, $n$, and $x_i$." ] }, { "cell_type": "code", "execution_count": 1, "id": "elect-softball", "metadata": { "tags": [] }, "outputs": [], "source": [ "import arviz as az\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pymc as pm\n", "\n", "from scipy.special import expit" ] }, { "cell_type": "code", "execution_count": 2, "id": "level-balance", "metadata": { "tags": [] }, "outputs": [], "source": [ "%config InlineBackend.figure_format = 'retina'\n", "az.style.use(\"arviz-darkgrid\")\n", "rng = np.random.default_rng(1234)" ] }, { "cell_type": "markdown", "id": "sapphire-yellow", "metadata": {}, "source": [ "## Generate data" ] }, { "cell_type": "code", "execution_count": 3, "id": "aware-frontier", "metadata": {}, "outputs": [], "source": [ "# true params\n", "beta0_true = 0.7\n", "beta1_true = 0.4\n", "# number of yes/no questions\n", "n = 20\n", "\n", "sample_size = 30\n", "x = np.linspace(-10, 20, sample_size)\n", "# Linear model\n", "mu_true = beta0_true + beta1_true * x\n", "# transformation (inverse logit function = expit)\n", "p_true = expit(mu_true)\n", "# Generate data\n", "y = rng.binomial(n, p_true)\n", "# bundle data into dataframe\n", "data = pd.DataFrame({\"x\": x, \"y\": y})" ] }, { "cell_type": "markdown", "id": "imported-company", "metadata": {}, "source": [ "We can see that the underlying data $y$ is count data, out of $n$ total trials." ] }, { "cell_type": "code", "execution_count": 4, "id": "difficult-collaboration", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | x | \n", "y | \n", "
---|---|---|
0 | \n", "-10.000000 | \n", "3 | \n", "
1 | \n", "-8.965517 | \n", "1 | \n", "
2 | \n", "-7.931034 | \n", "3 | \n", "
3 | \n", "-6.896552 | \n", "1 | \n", "
4 | \n", "-5.862069 | \n", "2 | \n", "