{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(GLM-negative-binomial-regression)=\n", "# GLM: Negative Binomial Regression\n", "\n", ":::{post} September, 2023\n", ":tags: negative binomial regression, generalized linear model, \n", ":category: beginner\n", ":author: Ian Ozsvald, Abhipsha Das, Benjamin Vincent\n", ":::\n", "\n", ":::{include} ../extra_installs.md\n", ":::" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import arviz as az\n", "import numpy as np\n", "import pandas as pd\n", "import pymc as pm\n", "import seaborn as sns\n", "\n", "from scipy import stats" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "RANDOM_SEED = 8927\n", "rng = np.random.default_rng(RANDOM_SEED)\n", "\n", "%config InlineBackend.figure_format = \"retina\"\n", "az.style.use(\"arviz-darkgrid\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook closely follows the GLM Poisson regression example by [Jonathan Sedar](https://github.com/jonsedar) (which is in turn inspired by [a project by Ian Osvald](http://ianozsvald.com/2016/05/07/statistically-solving-sneezes-and-sniffles-a-work-in-progress-report-at-pydatalondon-2016/)) except the data here is negative binomially distributed instead of Poisson distributed.\n", "\n", "Negative binomial regression is used to model count data for which the variance is higher than the mean. The [negative binomial distribution](https://en.wikipedia.org/wiki/Negative_binomial_distribution) can be thought of as a Poisson distribution whose rate parameter is gamma distributed, so that rate parameter can be adjusted to account for the increased variance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate Data\n", "\n", "As in the Poisson regression example, we assume that sneezing occurs at some baseline rate, and that consuming alcohol, not taking antihistamines, or doing both, increase its frequency.\n", "\n", "#### Poisson Data\n", "\n", "First, let's look at some Poisson distributed data from the Poisson regression example." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Mean Poisson values\n", "theta_noalcohol_meds = 1 # no alcohol, took an antihist\n", "theta_alcohol_meds = 3 # alcohol, took an antihist\n", "theta_noalcohol_nomeds = 6 # no alcohol, no antihist\n", "theta_alcohol_nomeds = 36 # alcohol, no antihist\n", "\n", "# Create samples\n", "q = 1000\n", "df_pois = pd.DataFrame(\n", " {\n", " \"nsneeze\": np.concatenate(\n", " (\n", " rng.poisson(theta_noalcohol_meds, q),\n", " rng.poisson(theta_alcohol_meds, q),\n", " rng.poisson(theta_noalcohol_nomeds, q),\n", " rng.poisson(theta_alcohol_nomeds, q),\n", " )\n", " ),\n", " \"alcohol\": np.concatenate(\n", " (\n", " np.repeat(False, q),\n", " np.repeat(True, q),\n", " np.repeat(False, q),\n", " np.repeat(True, q),\n", " )\n", " ),\n", " \"nomeds\": np.concatenate(\n", " (\n", " np.repeat(False, q),\n", " np.repeat(False, q),\n", " np.repeat(True, q),\n", " np.repeat(True, q),\n", " )\n", " ),\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meanvar
nomedsalcohol
FalseFalse1.0471.127919
True2.9862.960765
TrueFalse5.9816.218858
True35.92936.064023
\n", "
" ], "text/plain": [ " mean var\n", "nomeds alcohol \n", "False False 1.047 1.127919\n", " True 2.986 2.960765\n", "True False 5.981 6.218858\n", " True 35.929 36.064023" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pois.groupby([\"nomeds\", \"alcohol\"])[\"nsneeze\"].agg([\"mean\", \"var\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the mean and variance of a Poisson distributed random variable are equal, the sample means and variances are very close.\n", "\n", "#### Negative Binomial Data\n", "\n", "Now, suppose every subject in the dataset had the flu, increasing the variance of their sneezing (and causing an unfortunate few to sneeze over 70 times a day). If the mean number of sneezes stays the same but variance increases, the data might follow a negative binomial distribution." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nsneezealcoholnomeds
02FalseFalse
11FalseFalse
22FalseFalse
32FalseFalse
43FalseFalse
............
399528TrueTrue
399663TrueTrue
399719TrueTrue
399834TrueTrue
399940TrueTrue
\n", "

4000 rows × 3 columns

\n", "
" ], "text/plain": [ " nsneeze alcohol nomeds\n", "0 2 False False\n", "1 1 False False\n", "2 2 False False\n", "3 2 False False\n", "4 3 False False\n", "... ... ... ...\n", "3995 28 True True\n", "3996 63 True True\n", "3997 19 True True\n", "3998 34 True True\n", "3999 40 True True\n", "\n", "[4000 rows x 3 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Gamma shape parameter\n", "alpha = 10\n", "\n", "\n", "def get_nb_vals(mu, alpha, size):\n", " \"\"\"Generate negative binomially distributed samples by\n", " drawing a sample from a gamma distribution with mean `mu` and\n", " shape parameter `alpha', then drawing from a Poisson\n", " distribution whose rate parameter is given by the sampled\n", " gamma variable.\n", "\n", " \"\"\"\n", "\n", " g = stats.gamma.rvs(alpha, scale=mu / alpha, size=size)\n", " return stats.poisson.rvs(g)\n", "\n", "\n", "# Create samples\n", "n = 1000\n", "df = pd.DataFrame(\n", " {\n", " \"nsneeze\": np.concatenate(\n", " (\n", " get_nb_vals(theta_noalcohol_meds, alpha, n),\n", " get_nb_vals(theta_alcohol_meds, alpha, n),\n", " get_nb_vals(theta_noalcohol_nomeds, alpha, n),\n", " get_nb_vals(theta_alcohol_nomeds, alpha, n),\n", " )\n", " ),\n", " \"alcohol\": np.concatenate(\n", " (\n", " np.repeat(False, n),\n", " np.repeat(True, n),\n", " np.repeat(False, n),\n", " np.repeat(True, n),\n", " )\n", " ),\n", " \"nomeds\": np.concatenate(\n", " (\n", " np.repeat(False, n),\n", " np.repeat(False, n),\n", " np.repeat(True, n),\n", " np.repeat(True, n),\n", " )\n", " ),\n", " }\n", ")\n", "df" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meanvar
nomedsalcohol
FalseFalse0.9901.137037
True2.9834.030742
TrueFalse6.0048.658643
True35.918169.018294
\n", "