{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "(Bayesian Missing Data Imputation)=\n", "# Bayesian Missing Data Imputation\n", "\n", ":::{post} February, 2023\n", ":tags: missing data, bayesian imputation, hierarchical\n", ":category: advanced\n", ":author: Nathaniel Forde\n", ":::" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/nathanielforde/opt/miniconda3/envs/missing_data_clean/lib/python3.11/site-packages/pymc/sampling/jax.py:39: UserWarning: This module is experimental.\n", " warnings.warn(\"This module is experimental.\")\n" ] } ], "source": [ "import random\n", "\n", "import arviz as az\n", "import matplotlib.cm as cm\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pymc as pm\n", "import scipy.optimize\n", "\n", "from matplotlib.lines import Line2D\n", "from pymc.sampling.jax import sample_blackjax_nuts, sample_numpyro_nuts\n", "from scipy.stats import multivariate_normal" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Bayesian Imputation and Degrees of Missing-ness\n", "\n", "The analysis of data with missing values is a gateway into the study of causal inference. \n", "\n", "One of the key features of any analysis plagued by missing data is the assumption which governs the nature of the missing-ness i.e. what is the reason for gaps in our data? Can we ignore them? Should we worry about why? In this notebook we'll see an example of how to handle missing data using maximum likelihood estimation and bayesian imputation techniques. 
This will open up questions about the assumptions governing inference in the presence of missing data, and about inference in counterfactual cases.\n", "\n", "We will make the discussion concrete by considering an example analysis of an employee satisfaction survey and how different working conditions contribute to the responses and non-responses we see in the data.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%config InlineBackend.figure_format = 'retina' # high resolution figures\n", "az.style.use(\"arviz-darkgrid\")\n", "rng = np.random.default_rng(42)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Missing Data Taxonomy\n", "\n", "Rubin's famous taxonomy breaks the question down into a choice between three fundamental options:\n", "\n", " - Missing Completely at Random (MCAR)\n", " - Missing at Random (MAR)\n", " - Missing Not at Random (MNAR)\n", "\n", "Each of these paradigms can be reduced to an explicit definition in terms of the conditional probability of the **pattern of missing data**. The first pattern is the least concerning. The (MCAR) assumption states that the data are missing in a manner that is unrelated to both the observed and unobserved parts of the realised data. The data are missing due to the haphazard circumstances of the world $\\phi$.\n", "\n", "$$P(M = 1 | Y_{obs}, Y_{miss}, \\phi) = P(M = 1 | \\phi)$$\n", "\n", "whereas the second pattern (MAR) allows that the reasons for missingness can be a function of the observed data and circumstances of the world. Sometimes this is called a case of *ignorable* missingness because estimation can proceed in good faith on the basis of the observed data. There may be a loss of precision, but the inference should be sound. 
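The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the variables `x` and `y` are synthetic and unrelated to the survey data analysed later, and the missingness probabilities are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)  # always-observed covariate
y = 2 + 0.5 * x + rng.normal(size=n)  # variable subject to missingness

# MCAR: the missingness indicator ignores both x and y
m_mcar = rng.random(n) < 0.2

# MAR: missingness depends only on the observed covariate x
m_mar = rng.random(n) < np.where(x > 0, 0.4, 0.1)

# MNAR: missingness depends on the unobserved value of y itself
m_mnar = rng.random(n) < np.where(y > 2, 0.4, 0.1)

# Apply one of the masks: the analyst only ever sees y_observed
y_observed = np.where(m_mcar, np.nan, y)
```

Under the first two mechanisms the retained cases remain informative about the relationship between `x` and `y`; under MNAR the observed sample is selected on `y` itself, which is what makes the pattern non-ignorable.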
\n", "\n", "$$P(M =1 | Y_{obs}, Y_{miss}, \\phi) = P(M =1 | Y_{obs}, \\phi)$$ \n", "\n", "The most nefarious sort of missing data is when the missingness is a function of something outside the observed data, and the equation cannot be reduced further. Efforts at imputation and estimation more generally may become more difficulty in this final case because of the risk of confounding. This is a case of *non-ignorable* missing-ness. \n", "\n", "$$P(M =1 | Y_{obs}, Y_{miss}, \\phi)$$\n", "\n", "These assumptions are made before any analysis begins. They are inherently unverifiable. Your analysis will stand or fall depending on how plausible each assumption is in the context you seek to apply them. For example, an another type missing data results from systematic censoring as discussed in {ref}GLM-truncated-censored-regression. In such cases the reason for censoring governs the missing-ness pattern. \n", "\n", "## Employee Satisfaction Surveys\n", "\n", "We'll follow the presentation of Craig Enders' *Applied Missing Data Analysis* {cite:t}enders2022 and work with employee satisifaction data set. The data set comprises of a few composite measures reporting employee working conditions and satisfactions. Of particular note are empowerment (empower), work satisfaction (worksat) and two composite survey scores recording the employees leadership climate (climate), and the relationship quality with their supervisor lmx. \n", "\n", "The key question is what assumptions governs our patterns of missing data.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeeteamturnovermaleempowerlmxworksatclimatecohesion
0110.0132.011.03.018.03.5
1211.01NaN13.04.018.03.5
2311.0130.09.04.018.03.5
3411.0129.08.03.018.03.5
4511.0026.07.04.018.03.5
\n", "
" ], "text/plain": [ " employee team turnover male empower lmx worksat climate cohesion\n", "0 1 1 0.0 1 32.0 11.0 3.0 18.0 3.5\n", "1 2 1 1.0 1 NaN 13.0 4.0 18.0 3.5\n", "2 3 1 1.0 1 30.0 9.0 4.0 18.0 3.5\n", "3 4 1 1.0 1 29.0 8.0 3.0 18.0 3.5\n", "4 5 1 1.0 0 26.0 7.0 4.0 18.0 3.5" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "try:\n", " df_employee = pd.read_csv(\"../data/employee.csv\")\n", "except FileNotFoundError:\n", " df_employee = pd.read_csv(pm.get_data(\"employee.csv\"))\n", "df_employee.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "worksat 0.047619\n", "empower 0.161905\n", "lmx 0.041270\n", "dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Percentage Missing\n", "df_employee[[\"worksat\", \"empower\", \"lmx\"]].isna().sum() / len(df_employee)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
worksatempowerlmx
0FalseFalseFalse
1FalseTrueFalse
2TrueTrueFalse
3FalseFalseTrue
4TrueFalseFalse
\n", "