{ "cells": [ { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.043172, "end_time": "2021-02-23T11:26:55.064791", "exception": false, "start_time": "2021-02-23T11:26:55.021619", "status": "completed" }, "tags": [] }, "source": [ "(GLM-poisson-regression)=\n", "# GLM: Poisson Regression\n", "\n", ":::{post} November 30, 2022\n", ":tags: regression, poisson\n", ":category: intermediate\n", ":author: Jonathan Sedar, Benjamin Vincent\n", ":::" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.069202, "end_time": "2021-02-23T11:27:01.489628", "exception": false, "start_time": "2021-02-23T11:27:01.420426", "status": "completed" }, "tags": [] }, "source": [ "This is a minimal reproducible example of Poisson regression to predict counts using dummy data.\n", "\n", "This Notebook is basically an excuse to demo Poisson regression using PyMC, both manually and using bambi to demo interactions using the formulae library. We will create some dummy data, Poisson distributed according to a linear model, and try to recover the coefficients of that linear model through inference.\n", "\n", "For more statistical detail see:\n", "\n", "+ Basic info on [Wikipedia](https://en.wikipedia.org/wiki/Poisson_regression)\n", "+ GLMs: Poisson regression, exposure, and overdispersion in Chapter 6.2 of [ARM, Gelmann & Hill 2006](http://www.stat.columbia.edu/%7Egelman/arm/)\n", "+ This worked example from ARM 6.2 by [Clay Ford](http://www.clayford.net/statistics/poisson-regression-ch-6-of-gelman-and-hill/)\n", "\n", "This very basic model is inspired by [a project by Ian Osvald](http://ianozsvald.com/2016/05/07/statistically-solving-sneezes-and-sniffles-a-work-in-progress-report-at-pydatalondon-2016/), which is concerned with understanding the various effects of external environmental factors upon the allergic sneezing of a test subject." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "#!pip install seaborn" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "papermill": { "duration": 6.051698, "end_time": "2021-02-23T11:27:01.160546", "exception": false, "start_time": "2021-02-23T11:26:55.108848", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "import arviz as az\n", "import bambi as bmb\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pymc as pm\n", "import seaborn as sns\n", "\n", "from formulae import design_matrices" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "papermill": { "duration": 0.111837, "end_time": "2021-02-23T11:27:01.349763", "exception": false, "start_time": "2021-02-23T11:27:01.237926", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "RANDOM_SEED = 8927\n", "rng = np.random.default_rng(RANDOM_SEED)\n", "\n", "%config InlineBackend.figure_format = 'retina'\n", "az.style.use(\"arviz-darkgrid\")" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.06268, "end_time": "2021-02-23T11:27:01.615645", "exception": false, "start_time": "2021-02-23T11:27:01.552965", "status": "completed" }, "tags": [] }, "source": [ "## Local Functions" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.073451, "end_time": "2021-02-23T11:27:01.763249", "exception": false, "start_time": "2021-02-23T11:27:01.689798", "status": "completed" }, "tags": [] }, "source": [ "## Generate Data" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.060542, "end_time": "2021-02-23T11:27:01.884617", "exception": false, "start_time": "2021-02-23T11:27:01.824075", "status": "completed" }, "tags": [] }, "source": [ "This dummy dataset is created to emulate some data created as part of a study into quantified self, and the real data is more complicated than this. Ask Ian Osvald if you'd like to know more [@ianozvald](https://twitter.com/ianozsvald).\n", "\n", "\n", "### Assumptions:\n", "\n", "+ The subject sneezes N times per day, recorded as nsneeze (int)\n", "+ The subject may or may not drink alcohol during that day, recorded as alcohol (boolean)\n", "+ The subject may or may not take an antihistamine medication during that day, recorded as the negative action nomeds (boolean)\n", "+ We postulate (probably incorrectly) that sneezing occurs at some baseline rate, which increases if an antihistamine is not taken, and further increased after alcohol is consumed.\n", "+ The data is aggregated per day, to yield a total count of sneezes on that day, with a boolean flag for alcohol and antihistamine usage, with the big assumption that nsneezes have a direct causal relationship.\n", "\n", "\n", "Create 4000 days of data: daily counts of sneezes which are Poisson distributed w.r.t alcohol consumption and antihistamine usage" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "papermill": { "duration": 0.07367, "end_time": "2021-02-23T11:27:02.023323", "exception": false, "start_time": "2021-02-23T11:27:01.949653", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "# decide poisson theta values\n", "theta_noalcohol_meds = 1 # no alcohol, took an antihist\n", "theta_alcohol_meds = 3 # alcohol, took an antihist\n", "theta_noalcohol_nomeds = 6 # no alcohol, no antihist\n", "theta_alcohol_nomeds = 36 # alcohol, no antihist\n", "\n", "# create samples\n", "q = 1000\n", "df = pd.DataFrame(\n", " {\n", " \"nsneeze\": np.concatenate(\n", " (\n", " rng.poisson(theta_noalcohol_meds, q),\n", " rng.poisson(theta_alcohol_meds, q),\n", " rng.poisson(theta_noalcohol_nomeds, q),\n", " rng.poisson(theta_alcohol_nomeds, q),\n", " )\n", " ),\n", " \"alcohol\": np.concatenate(\n", " (\n", " np.repeat(False, q),\n", " np.repeat(True, q),\n", " np.repeat(False, q),\n", " np.repeat(True, q),\n", " )\n", " ),\n", " \"nomeds\": np.concatenate(\n", " (\n", " np.repeat(False, q),\n", " np.repeat(False, q),\n", " np.repeat(True, q),\n", " np.repeat(True, q),\n", " )\n", " ),\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "papermill": { "duration": 0.093062, "end_time": "2021-02-23T11:27:02.176348", "exception": false, "start_time": "2021-02-23T11:27:02.083286", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nsneezealcoholnomeds
399526TrueTrue
399635TrueTrue
399736TrueTrue
399832TrueTrue
399935TrueTrue
\n", "
" ], "text/plain": [ " nsneeze alcohol nomeds\n", "3995 26 True True\n", "3996 35 True True\n", "3997 36 True True\n", "3998 32 True True\n", "3999 35 True True" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.071086, "end_time": "2021-02-23T11:27:02.312429", "exception": false, "start_time": "2021-02-23T11:27:02.241343", "status": "completed" }, "tags": [] }, "source": [ "##### View means of the various combinations (Poisson mean values)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "papermill": { "duration": 0.082117, "end_time": "2021-02-23T11:27:02.449759", "exception": false, "start_time": "2021-02-23T11:27:02.367642", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nsneeze
nomedsFalseTrue
alcohol
False1.0475.981
True2.98635.929
\n", "