{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(GLM-simpsons-paradox)=\n", "# Simpson's paradox and mixed models\n", "\n", ":::{post} March, 2022\n", ":tags: regression, hierarchical model, linear model, posterior predictive, Simpson's paradox\n", ":category: beginner\n", ":author: Benjamin T. Vincent\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook covers:\n", "- [Simpson's Paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) and its resolution through mixed or hierarchical models. This is a situation where there may be a negative relationship between two variables within each group, but when data from multiple groups are combined, that relationship may disappear or even reverse sign. The gif below (from the [Simpson's Paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) Wikipedia page) demonstrates this very nicely.\n", "- How to build linear regression models, starting with simple linear regression and moving up to hierarchical linear regression. Simpson's paradox is a nice motivation for why we might want to do this, but of course we should aim to build models which incorporate as much of our knowledge about the structure of the data (e.g. its nested nature) as possible.\n", "- Use of `pm.Data` containers to facilitate posterior prediction at different $x$ values with the same model.\n", "- Providing array dimensions (see `coords`) to models to help with shape problems. This involves the use of [xarray](http://xarray.pydata.org/) and is quite helpful in multi-level / hierarchical models.\n", "- Differences between posteriors and posterior predictive distributions.\n", "- How to visualise models in data space and parameter space, using a mixture of [ArviZ](https://arviz-devs.github.io/arviz/) and [matplotlib](https://matplotlib.org/)."
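As a numerical sketch of the sign flip described in the first bullet: the snippet below builds two illustrative groups whose within-group slope is $-0.5$, but whose group means rise from left to right, so an ordinary fit to the pooled data finds a positive slope. All numbers here are made up for illustration and are not the dataset generated later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative groups: within each, y falls with x (slope -0.5),
# but the group means are arranged so the pooled trend rises.
x_all, y_all, within_slopes = [], [], []
for mx, my in [(0.0, 0.0), (4.0, 2.0)]:  # (mean of x, mean of y) per group
    x = rng.normal(mx, 1, 200)
    y = my - 0.5 * (x - mx) + rng.normal(0, 0.2, 200)
    within_slopes.append(np.polyfit(x, y, 1)[0])  # slope of a per-group fit
    x_all.append(x)
    y_all.append(y)

# Slope of a single fit to all the data pooled together.
pooled_slope = np.polyfit(np.concatenate(x_all), np.concatenate(y_all), 1)[0]

print(within_slopes)  # both close to -0.5
print(pooled_slope)   # positive: the relationship reverses when pooled
```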
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://upload.wikimedia.org/wikipedia/commons/f/fb/Simpsons_paradox_-_animation.gif)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import arviz as az\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pymc as pm\n", "import xarray as xr" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%config InlineBackend.figure_format = 'retina'\n", "az.style.use(\"arviz-darkgrid\")\n", "rng = np.random.default_rng(1234)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate data\n", "\n", "This data generation was influenced by this [stackexchange](https://stats.stackexchange.com/questions/479201/understanding-simpsons-paradox-with-random-effects) question." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "def generate():\n", " group_list = [\"one\", \"two\", \"three\", \"four\", \"five\"]\n", " trials_per_group = 20\n", " group_intercepts = rng.normal(0, 1, len(group_list))\n", " group_slopes = np.ones(len(group_list)) * -0.5\n", " group_mx = group_intercepts * 2\n", " group = np.repeat(group_list, trials_per_group)\n", " subject = np.concatenate(\n", " [np.ones(trials_per_group) * i for i in np.arange(len(group_list))]\n", " ).astype(int)\n", " intercept = np.repeat(group_intercepts, trials_per_group)\n", " slope = np.repeat(group_slopes, trials_per_group)\n", " mx = np.repeat(group_mx, trials_per_group)\n", " x = rng.normal(mx, 1)\n", " y = rng.normal(intercept + (x - mx) * slope, 1)\n", " data = pd.DataFrame({\"group\": group, \"group_idx\": subject, \"x\": x, \"y\": y})\n", " return data, group_list" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "data, group_list = generate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"To follow along, it is useful to clearly understand the form of the data. This is [long form](https://en.wikipedia.org/wiki/Wide_and_narrow_data) data (also known as narrow data) in that each row represents one observation. We have a `group` column containing the group label, and an accompanying numerical `group_idx` column. The latter is very useful when it comes to modelling, as we can use it as an index to look up group-level parameter estimates. Finally, we have our core observations: the predictor variable `x` and the outcome `y`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>group</th>\n",
"      <th>group_idx</th>\n",
"      <th>x</th>\n",
"      <th>y</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>one</td><td>0</td><td>-0.294574</td><td>-2.338519</td></tr>\n",
"    <tr><th>1</th><td>one</td><td>0</td><td>-4.686497</td><td>-1.448057</td></tr>\n",
"    <tr><th>2</th><td>one</td><td>0</td><td>-2.262201</td><td>-1.393728</td></tr>\n",
"    <tr><th>3</th><td>one</td><td>0</td><td>-4.873809</td><td>-0.265403</td></tr>\n",
"    <tr><th>4</th><td>one</td><td>0</td><td>-2.863929</td><td>-0.774251</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>95</th><td>five</td><td>4</td><td>3.981413</td><td>0.467970</td></tr>\n",
"    <tr><th>96</th><td>five</td><td>4</td><td>1.889102</td><td>0.553290</td></tr>\n",
"    <tr><th>97</th><td>five</td><td>4</td><td>2.561267</td><td>2.590966</td></tr>\n",
"    <tr><th>98</th><td>five</td><td>4</td><td>0.147378</td><td>2.050944</td></tr>\n",
"    <tr><th>99</th><td>five</td><td>4</td><td>2.738073</td><td>0.517918</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>100 rows × 4 columns</p>\n",
"</div>\n",
"
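As a sketch of why the numeric `group_idx` column is convenient for modelling, the toy example below shows how it indexes straight into an array of group-level parameters. The parameter values and the miniature long-form table are hypothetical, not the notebook's actual data or model output.

```python
import numpy as np
import pandas as pd

# Hypothetical group-level parameter estimates -- illustrative values,
# not output from the model in this notebook.
group_intercepts = np.array([0.5, -1.2, 0.3, 2.0, -0.4])

# A miniature long-form dataset: one row per observation.
data = pd.DataFrame(
    {
        "group": ["one", "one", "two", "five", "five"],
        "group_idx": [0, 0, 1, 4, 4],
        "x": [0.1, -0.3, 1.2, 2.5, 3.0],
    }
)

# group_idx indexes straight into the parameter array, so each
# observation picks up the intercept of its own group.
intercept_per_obs = group_intercepts[data["group_idx"].to_numpy()]
print(intercept_per_obs)  # [ 0.5  0.5 -1.2 -0.4 -0.4]
```

This is exactly the lookup pattern used inside hierarchical PyMC models, where a vector of group-level parameters is indexed by an observation-level integer array.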