{ "cells": [ { "cell_type": "markdown", "id": "a15cd228-d1cd-4d52-bc62-92aa975f798c", "metadata": {}, "source": [ "(interrupted_time_series)=\n", "# Interrupted time series analysis\n", "\n", ":::{post} Oct, 2022\n", ":tags: counterfactuals, causal inference, time series, forecasting, causal impact, quasi experiments\n", ":category: intermediate\n", ":author: Benjamin T. Vincent\n", ":::" ] }, { "cell_type": "markdown", "id": "1f342a5c-6d5b-4639-9655-a33cb92befab", "metadata": {}, "source": [ "This notebook focuses on how to conduct a simple Bayesian [interrupted time series analysis](https://en.wikipedia.org/wiki/Interrupted_time_series). This is useful in [quasi-experimental settings](https://en.wikipedia.org/wiki/Quasi-experiment) where an intervention was applied to all treatment units. \n", "\n", "For example, if a change to a website was made and you want to know the causal impact of the website change then _if_ this change was applied selectively and randomly to a test group of website users, then you may be able to make causal claims using the [A/B testing approach](https://en.wikipedia.org/wiki/A/B_testing).\n", "\n", "However, if the website change was rolled out to _all_ users of the website then you do not have a control group. In this case you do not have a direct measurement of the counterfactual, what _would have happened if_ the website change was not made. In this case, if you have data over a 'good' number of time points, then you may be able to make use of the interrupted time series approach.\n", "\n", "Interested readers are directed to the excellent textbook [The Effect](https://theeffectbook.net/) {cite:p}`huntington2021effect`. Chapter 17 covers 'event studies' which the author prefers to the interrupted time series terminology." ] }, { "cell_type": "markdown", "id": "8e5fdc46-1447-4b8a-af5f-d8fcef695858", "metadata": {}, "source": [ "## Causal DAG\n", "\n", "A simple causal DAG for the interrupted time series is given below, but see {cite:p}`huntington2021effect` for a more general DAG. In short it says:\n", "\n", "* The outcome is causally influenced by time (e.g. other factors that change over time) and by the treatment.\n", "* The treatment is causally influenced by time.\n", "\n", "\n", "\n", "Intuitively, we could describe the logic of the approach as:\n", "* We know that the outcome varies over time.\n", "* If we build a model of how the outcome varies over time _before_ the treatment, then we can predit the counterfactual of what we would expect to happen _if_ the treatment had not occurred.\n", "* We can compare this counterfactual with the observations from the time of the intervention onwards. If there is a meaningful discrepancy then we can attribute this as a causal impact of the intervention. \n", "\n", "This is reasonable if we have ruled out other plausible causes occurring at the same point in time as (or after) the intervention. This becomes more tricky to justify the more time has passed since the intervention because it is more likely that other relevant events maye have occurred that could provide alternative causal explanations.\n", "\n", "If this does not make sense immediately, I recommend checking the example data figure below then revisiting this section." ] }, { "cell_type": "code", "execution_count": 1, "id": "f9fbb462-3baf-4b8d-aad4-270bbd0a4018", "metadata": {}, "outputs": [], "source": [ "import arviz as az\n", "import matplotlib.dates as mdates\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pymc as pm\n", "import xarray as xr\n", "\n", "from scipy.stats import norm" ] }, { "cell_type": "code", "execution_count": 2, "id": "861d3310-56d9-43cb-9baa-178e155eba3d", "metadata": {}, "outputs": [], "source": [ "%config InlineBackend.figure_format = 'retina'\n", "RANDOM_SEED = 8927\n", "rng = np.random.default_rng(RANDOM_SEED)\n", "az.style.use(\"arviz-darkgrid\")" ] }, { "cell_type": "markdown", "id": "2bbd94c9-c102-4116-91a1-50fe396ca089", "metadata": {}, "source": [ "Now let's define some helper functions" ] }, { "cell_type": "code", "execution_count": 3, "id": "bdaae928-aabe-410d-b345-237a7c2361d2", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "def format_x_axis(ax, minor=False):\n", " # major ticks\n", " ax.xaxis.set_major_formatter(mdates.DateFormatter(\"%Y %b\"))\n", " ax.xaxis.set_major_locator(mdates.YearLocator())\n", " ax.grid(which=\"major\", linestyle=\"-\", axis=\"x\")\n", " # minor ticks\n", " if minor:\n", " ax.xaxis.set_minor_formatter(mdates.DateFormatter(\"%Y %b\"))\n", " ax.xaxis.set_minor_locator(mdates.MonthLocator())\n", " ax.grid(which=\"minor\", linestyle=\":\", axis=\"x\")\n", " # rotate labels\n", " for label in ax.get_xticklabels(which=\"both\"):\n", " label.set(rotation=70, horizontalalignment=\"right\")\n", "\n", "\n", "def plot_xY(x, Y, ax):\n", " quantiles = Y.quantile((0.025, 0.25, 0.5, 0.75, 0.975), dim=(\"chain\", \"draw\")).transpose()\n", "\n", " az.plot_hdi(\n", " x,\n", " hdi_data=quantiles.sel(quantile=[0.025, 0.975]),\n", " fill_kwargs={\"alpha\": 0.25},\n", " smooth=False,\n", " ax=ax,\n", " )\n", " az.plot_hdi(\n", " x,\n", " hdi_data=quantiles.sel(quantile=[0.25, 0.75]),\n", " fill_kwargs={\"alpha\": 0.5},\n", " smooth=False,\n", " ax=ax,\n", " )\n", " ax.plot(x, quantiles.sel(quantile=0.5), color=\"C1\", lw=3)\n", "\n", "\n", "# default figure sizes\n", "figsize = (10, 5)" ] }, { "cell_type": "markdown", "id": "a09f2651-5817-40c4-b19f-1b7478e6b167", "metadata": {}, "source": [ "## Generate data\n", "\n", "The focus of this example is on making causal claims using the interrupted time series approach. Therefore we will work with some relatively simple synthetic data which only requires a very simple model." ] }, { "cell_type": "code", "execution_count": 4, "id": "3b7480e8-d034-400b-98d5-09579532fa4f", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", " | time | \n", "y | \n", "
---|---|---|
date | \n", "\n", " | \n", " |
2010-01-31 | \n", "0 | \n", "0.302698 | \n", "
2010-02-28 | \n", "1 | \n", "0.396772 | \n", "
2010-03-31 | \n", "2 | \n", "-0.433908 | \n", "
2010-04-30 | \n", "3 | \n", "0.276239 | \n", "
2010-05-31 | \n", "4 | \n", "0.943868 | \n", "
... | \n", "... | \n", "... | \n", "
2019-08-31 | \n", "115 | \n", "12.403634 | \n", "
2019-09-30 | \n", "116 | \n", "13.185151 | \n", "
2019-10-31 | \n", "117 | \n", "14.001674 | \n", "
2019-11-30 | \n", "118 | \n", "13.950388 | \n", "
2019-12-31 | \n", "119 | \n", "14.383951 | \n", "
120 rows × 2 columns
\n", "