{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(bayes_param_survival)=\n", "\n", "# Bayesian Parametric Survival Analysis" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running on PyMC v5.16.2\n" ] } ], "source": [ "import warnings\n", "\n", "import arviz as az\n", "import numpy as np\n", "import pymc as pm\n", "import pytensor.tensor as pt\n", "import scipy as sp\n", "import seaborn as sns\n", "\n", "from matplotlib import pyplot as plt\n", "from matplotlib.ticker import StrMethodFormatter\n", "from statsmodels import datasets\n", "\n", "print(f\"Running on PyMC v{pm.__version__}\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%config InlineBackend.figure_format = 'retina'\n", "az.style.use(\"arviz-darkgrid\")\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Survival analysis](https://en.wikipedia.org/wiki/Survival_analysis) studies the distribution of the time between when a subject comes under observation and when that subject experiences an event of interest. One of the fundamental challenges of survival analysis (which also makes is mathematically interesting) is that, in general, not every subject will experience the event of interest before we conduct our analysis. In more concrete terms, if we are studying the time between cancer treatment and death (as we will in this post), we will often want to analyze our data before every subject has died. This phenomenon is called [censoring](https://en.wikipedia.org/wiki/Censoring_(statistics)) and is fundamental to survival analysis.\n", "\n", "\n", "This post illustrates a parametric approach to Bayesian survival analysis in PyMC. Parametric models of survival are simpler to both implement and understand than semiparametric models; statistically, they are also more [powerful](https://en.wikipedia.org/wiki/Power_(statistics)) than non- or semiparametric methods when they are correctly specified. For an example of a [semiparametric](https://en.wikipedia.org/wiki/Semiparametric_model) [Cox proportional hazards model](https://en.wikipedia.org/wiki/Proportional_hazards_model#The_Cox_model), you can read this [blogpost](http://austinrochford.com/posts/2015-10-05-bayes-survival.html), but be aware that the post used and old version of PyMC and that Implementing a semiparametric model in PyMC involved some fairly complex numpy code and nonobvious probability theory equivalences.\n", "\n", "We will analyze the [mastectomy data](https://vincentarelbundock.github.io/Rdatasets/doc/HSAUR/mastectomy.html) from `R`'s [`HSAUR`](https://cran.r-project.org/web/packages/HSAUR/index.html) package. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "sns.set()\n", "blue, green, red, purple, gold, teal = sns.color_palette(n_colors=6)\n", "\n", "pct_formatter = StrMethodFormatter(\"{x:.1%}\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | time | \n", "event | \n", "metastized | \n", "
---|---|---|---|
0 | \n", "23 | \n", "1.0 | \n", "0.0 | \n", "
1 | \n", "47 | \n", "1.0 | \n", "0.0 | \n", "
2 | \n", "69 | \n", "1.0 | \n", "0.0 | \n", "
3 | \n", "70 | \n", "0.0 | \n", "0.0 | \n", "
4 | \n", "100 | \n", "0.0 | \n", "0.0 | \n", "
<xarray.DataArray 'beta' ()> Size: 8B\n", "array(1.00442271)
<xarray.DataArray 'beta' ()> Size: 8B\n", "array(1.00301488)