{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(GLM-simpsons-paradox)=\n", "# Simpson's paradox and mixed models\n", "\n", ":::{post} March, 2022\n", ":tags: regression, hierarchical model, linear model, posterior predictive, Simpson's paradox\n", ":category: beginner\n", ":author: Benjamin T. Vincent\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook covers:\n", "- [Simpson's Paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) and its resolution through mixed or hierarchical models. This is a situation where there may be a negative relationship between two variables within each group, but when data from multiple groups are combined, that relationship may disappear or even reverse sign. The gif below (from the [Simpson's Paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) Wikipedia page) demonstrates this very nicely.\n", "- How to build linear regression models, starting with simple linear regression and moving up to hierarchical linear regression. Simpson's paradox is a nice motivation for why we might want to do this, but of course we should aim to build models which incorporate as much of our knowledge about the structure of the data (e.g. its nested nature) as possible.\n", "- Use of `pm.Data` containers to facilitate posterior prediction at different $x$ values with the same model.\n", "- Providing array dimensions (see `coords`) to models to help with shape problems. This involves the use of [xarray](http://xarray.pydata.org/) and is quite helpful in multi-level / hierarchical models.\n", "- Differences between posteriors and posterior predictive distributions.\n", "- How to visualise models in data space and parameter space, using a mixture of [ArviZ](https://arviz-devs.github.io/arviz/) and [matplotlib](https://matplotlib.org/)."
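As a numerical sketch of the sign flip described in the first bullet: the snippet below builds two illustrative groups whose within-group slope is $-0.5$, but whose group means rise from left to right, so an ordinary fit to the pooled data finds a positive slope. All numbers here are made up for illustration and are not the dataset generated later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative groups: within each, y falls with x (slope -0.5),
# but the group means are arranged so the pooled trend rises.
x_all, y_all, within_slopes = [], [], []
for mx, my in [(0.0, 0.0), (4.0, 2.0)]:  # (mean of x, mean of y) per group
    x = rng.normal(mx, 1, 200)
    y = my - 0.5 * (x - mx) + rng.normal(0, 0.2, 200)
    within_slopes.append(np.polyfit(x, y, 1)[0])  # slope of a per-group fit
    x_all.append(x)
    y_all.append(y)

# Slope of a single fit to all the data pooled together.
pooled_slope = np.polyfit(np.concatenate(x_all), np.concatenate(y_all), 1)[0]

print(within_slopes)  # both close to -0.5
print(pooled_slope)   # positive: the relationship reverses when pooled
```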
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://upload.wikimedia.org/wikipedia/commons/f/fb/Simpsons_paradox_-_animation.gif)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import arviz as az\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pymc as pm\n", "import xarray as xr" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%config InlineBackend.figure_format = 'retina'\n", "az.style.use(\"arviz-darkgrid\")\n", "rng = np.random.default_rng(1234)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate data\n", "\n", "This data generation was influenced by this [stackexchange](https://stats.stackexchange.com/questions/479201/understanding-simpsons-paradox-with-random-effects) question." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "def generate():\n", " group_list = [\"one\", \"two\", \"three\", \"four\", \"five\"]\n", " trials_per_group = 20\n", " group_intercepts = rng.normal(0, 1, len(group_list))\n", " group_slopes = np.ones(len(group_list)) * -0.5\n", " group_mx = group_intercepts * 2\n", " group = np.repeat(group_list, trials_per_group)\n", " subject = np.concatenate(\n", " [np.ones(trials_per_group) * i for i in np.arange(len(group_list))]\n", " ).astype(int)\n", " intercept = np.repeat(group_intercepts, trials_per_group)\n", " slope = np.repeat(group_slopes, trials_per_group)\n", " mx = np.repeat(group_mx, trials_per_group)\n", " x = rng.normal(mx, 1)\n", " y = rng.normal(intercept + (x - mx) * slope, 1)\n", " data = pd.DataFrame({\"group\": group, \"group_idx\": subject, \"x\": x, \"y\": y})\n", " return data, group_list" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "data, group_list = generate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"To follow along, it is useful to clearly understand the form of the data. This is [long form](https://en.wikipedia.org/wiki/Wide_and_narrow_data) data (also known as narrow data) in that each row represents one observation. We have a `group` column containing the group label, and an accompanying numerical `group_idx` column. The latter is very useful when it comes to modelling, as we can use it as an index to look up group-level parameter estimates. Finally, we have our core observations: the predictor variable `x` and the outcome `y`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>group</th>\n",
"      <th>group_idx</th>\n",
"      <th>x</th>\n",
"      <th>y</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>one</td><td>0</td><td>-0.294574</td><td>-2.338519</td></tr>\n",
"    <tr><th>1</th><td>one</td><td>0</td><td>-4.686497</td><td>-1.448057</td></tr>\n",
"    <tr><th>2</th><td>one</td><td>0</td><td>-2.262201</td><td>-1.393728</td></tr>\n",
"    <tr><th>3</th><td>one</td><td>0</td><td>-4.873809</td><td>-0.265403</td></tr>\n",
"    <tr><th>4</th><td>one</td><td>0</td><td>-2.863929</td><td>-0.774251</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>95</th><td>five</td><td>4</td><td>3.981413</td><td>0.467970</td></tr>\n",
"    <tr><th>96</th><td>five</td><td>4</td><td>1.889102</td><td>0.553290</td></tr>\n",
"    <tr><th>97</th><td>five</td><td>4</td><td>2.561267</td><td>2.590966</td></tr>\n",
"    <tr><th>98</th><td>five</td><td>4</td><td>0.147378</td><td>2.050944</td></tr>\n",
"    <tr><th>99</th><td>five</td><td>4</td><td>2.738073</td><td>0.517918</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>100 rows × 4 columns</p>\n",
"</div>\n",
"
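As a sketch of why the numeric `group_idx` column is convenient for modelling, the toy example below shows how it indexes straight into an array of group-level parameters. The parameter values and the miniature long-form table are hypothetical, not the notebook's actual data or model output.

```python
import numpy as np
import pandas as pd

# Hypothetical group-level parameter estimates -- illustrative values,
# not output from the model in this notebook.
group_intercepts = np.array([0.5, -1.2, 0.3, 2.0, -0.4])

# A miniature long-form dataset: one row per observation.
data = pd.DataFrame(
    {
        "group": ["one", "one", "two", "five", "five"],
        "group_idx": [0, 0, 1, 4, 4],
        "x": [0.1, -0.3, 1.2, 2.5, 3.0],
    }
)

# group_idx indexes straight into the parameter array, so each
# observation picks up the intercept of its own group.
intercept_per_obs = group_intercepts[data["group_idx"].to_numpy()]
print(intercept_per_obs)  # [ 0.5  0.5 -1.2 -0.4 -0.4]
```

This is exactly the lookup pattern used inside hierarchical PyMC models, where a vector of group-level parameters is indexed by an observation-level integer array.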