Modeling Events#

This notebook is part of the PyMC port of the Statistical Rethinking 2023 lecture series by Richard McElreath.

Video - Lecture 09 - Modeling Events#

# Ignore warnings
import warnings

import arviz as az
import numpy as np
import pandas as pd
import pymc as pm
import statsmodels.formula.api as smf
import utils
import xarray as xr

from matplotlib import pyplot as plt
from matplotlib import style
from scipy import stats

warnings.filterwarnings("ignore")

# Set matplotlib style
STYLE = "statistical-rethinking-2023.mplstyle"
style.use(STYLE)

UC Berkeley Admissions#

Dataset#

  • 4526 Graduate student applications

  • Stratified by

    • Department

    • Gender

Goal is to identify gender discrimination by admissions officers

ADMISSIONS = utils.load_data("UCBadmit")
ADMISSIONS
dept applicant.gender admit reject applications
1 A male 512 313 825
2 A female 89 19 108
3 B male 353 207 560
4 B female 17 8 25
5 C male 120 205 325
6 C female 202 391 593
7 D male 138 279 417
8 D female 131 244 375
9 E male 53 138 191
10 E female 94 299 393
11 F male 22 351 373
12 F female 24 317 341

Modeling Events#

  • Events: discrete, unordered outcomes

  • Observations are counts/aggregates

  • Unknowns are probabilities \(p\) of the event occurring, or the odds of those probabilities \(\frac{p}{1-p}\)

  • The things we stratify by interact – they are generally never independent in real life

  • We often work with the log-odds of \(p\): \(\text{logit}(p) = \log \left( \frac{p}{1-p} \right)\) (see the short sketch below)
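
A minimal numeric sketch of the log-odds transform and its inverse (the `logit` helper here is illustrative; `utils.invlogit` is used throughout this notebook):

def logit(p):
    """Log-odds of probability p."""
    return np.log(p / (1 - p))


for p in [0.1, 0.5, 0.9]:
    x = logit(p)
    print(f"p={p:.1f} -> log-odds={x:+.2f} -> back to p={utils.invlogit(x):.1f}")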

Admissions: Drawing the owl 🦉#

  1. Estimand(s)

  2. Scientific Model(s)

  3. Statistical Model(s)

  4. Analysis

1. Estimand(s)#

Which path defines “discrimination”?#

Direct Discrimination (Causal Effect)#

utils.draw_causal_graph(
    edge_list=[("G", "D"), ("G", "A"), ("D", "A")],
    node_props={
        "G": {"label": "Gender, G", "color": "red"},
        "D": {"label": "Department, D"},
        "A": {"label": "Admission rate, A", "color": "red"},
    },
    edge_props={("G", "A"): {"color": "red"}},
    graph_direction="LR",
)
../_images/36481e2592bbf6a5057aa0e742f873f7ec7751037be4a25d61ac4718e313d0ae.svg
  • aka “Institutional discrimination”

  • Referees are biased for or against a particular group

  • Usually the type we’re interested in identifying if it exists

  • Requires strong causal assumptions

Here, department, D, is a mediator – a common structure in the social sciences, where a categorical status (e.g. gender) affects some mediating context (e.g. occupation), and both affect a target outcome (e.g. wage). Examples include:

  • wage discrimination

  • hiring

  • scientific awards

Indirect Discrimination#

utils.draw_causal_graph(
    edge_list=[("G", "D"), ("G", "A"), ("D", "A")],
    node_props={
        "G": {"label": "Gender, G", "color": "red"},
        "D": {"label": "Department, D"},
        "A": {"label": "Admission rate, A", "color": "red"},
    },
    edge_props={("G", "D"): {"color": "red"}, ("D", "A"): {"color": "red"}},
    graph_direction="LR",
)
../_images/1e4536009a97476089b1ae85736f832899d4fbe78d41aa172c087b3543e6a5f1.svg
  • aka “Structural discrimination”

  • e.g. Gender affects a person’s interests, and therefore the department they will apply to

  • Affects overall admission rates

  • Requires strong causal assumptions

Total Discrimination#

utils.draw_causal_graph(
    edge_list=[("G", "D"), ("G", "A"), ("D", "A")],
    node_props={
        "G": {"label": "Gender, G", "color": "red"},
        "D": {"label": "Department, D"},
        "A": {"label": "Admission rate, A", "color": "red"},
    },
    edge_props={
        ("G", "D"): {"color": "red"},
        ("G", "A"): {"color": "red"},
        ("D", "A"): {"color": "red"},
    },
    graph_direction="LR",
)
../_images/9c7a23caa46f1b7648693b51fb9b7f4716a3c5310ac53daac5110781eb8d7bd8.svg
  • aka “Experienced discrimination”

  • through both direct and indirect routes

  • Requires mild assumptions

Estimands & Estimators#

  • Each estimand requires a different estimator

  • The thing we can estimate often isn’t the thing we want to estimate

  • e.g. Total discrimination may be easier to estimate, but is less actionable

Unobserved Confounds#

utils.draw_causal_graph(
    edge_list=[("G", "D"), ("G", "A"), ("D", "A"), ("U", "D"), ("U", "A")],
    node_props={"U": {"style": "dashed"}, "unobserved": {"style": "dashed"}},
    graph_direction="LR",
)
../_images/7fe9702ac0f014ba3f018166f8bc709d597ee153e0631977f57ab656b057fca6.svg

It’s always possible that there are unobserved confounds affecting both the mediator and the outcome. We will ignore these for now.

2. Scientific Model(s)#

Simulate the process#

Below is a generative model of the review/admission process

np.random.seed(123)
n_samples = 1000

GENDER = stats.bernoulli.rvs(p=0.5, size=n_samples)

# Groundtruth parameters
# Gender 1 tends to apply to department 1
P_DEPARTMENT = np.where(GENDER == 0, 0.3, 0.8)

# Acceptance rates matrices -- Department x Gender
UNBIASED_ACCEPTANCE_RATES = np.array([[0.1, 0.1], [0.3, 0.3]])  # No *direct* gender bias

# Biased acceptance:
# - dept 0 accepts gender 0 at 50% of unbiased rate
# - dept 1 accepts gender 0 at 67% of unbiased rate
BIASED_ACCEPTANCE_RATES = np.array([[0.05, 0.1], [0.2, 0.3]])  # *direct* gender bias present

DEPARTMENT = stats.bernoulli.rvs(p=P_DEPARTMENT, size=n_samples).astype(int)


def simulate_admissions_data(bias_type):
    """Simulate admissions data using the global params above"""
    acceptance_rate = (
        BIASED_ACCEPTANCE_RATES if bias_type == "biased" else UNBIASED_ACCEPTANCE_RATES
    )
    acceptance = stats.bernoulli.rvs(p=acceptance_rate[DEPARTMENT, GENDER])

    return (
        pd.DataFrame(
            np.vstack([GENDER, DEPARTMENT, acceptance]).T,
            columns=["gender", "department", "accepted"],
        ),
        acceptance_rate,
    )


for bias_type in ["unbiased", "biased"]:

    fake_admissions, acceptance_rate = simulate_admissions_data(bias_type)

    gender_acceptance_counts = fake_admissions.groupby(["gender", "accepted"]).count()["department"]
    gender_acceptance_counts.name = None

    gender_department_counts = fake_admissions.groupby(["gender", "department"]).count()["accepted"]
    gender_department_counts.name = None

    observed_acceptance_rates = fake_admissions.groupby("gender").mean()["accepted"]
    observed_acceptance_rates.name = None

    print()
    print("-" * 30)
    print(bias_type.upper())
    print("-" * 30)
    print(f"Department Acceptance rate:\n{acceptance_rate}")
    print(f"\nGender-Department Frequency:\n{gender_department_counts}")
    print(f"\nGender-Acceptance Frequency:\n{gender_acceptance_counts}")
    print(f"\nOverall Acceptance Rates:\n{observed_acceptance_rates}")
------------------------------
UNBIASED
------------------------------
Department Acceptance rate:
[[0.1 0.1]
 [0.3 0.3]]

Gender-Department Frequency:
gender  department
0       0             335
        1             173
1       0             106
        1             386
dtype: int64

Gender-Acceptance Frequency:
gender  accepted
0       0           437
        1            71
1       0           359
        1           133
dtype: int64

Overall Acceptance Rates:
gender
0    0.139764
1    0.270325
dtype: float64

------------------------------
BIASED
------------------------------
Department Acceptance rate:
[[0.05 0.1 ]
 [0.2  0.3 ]]

Gender-Department Frequency:
gender  department
0       0             335
        1             173
1       0             106
        1             386
dtype: int64

Gender-Acceptance Frequency:
gender  accepted
0       0           465
        1            43
1       0           370
        1           122
dtype: int64

Overall Acceptance Rates:
gender
0    0.084646
1    0.247967
dtype: float64

Simulated data properties#

Both scenarios#

  • Gender 0 tends to apply to department 0

  • Gender 1 tends to apply to department 1

Unbiased scenario:#

  • due to the lower acceptance rate in dept 0 and the tendency of gender 0 to apply to dept 0, gender 0 has a lower overall acceptance rate than gender 1

  • due to the higher acceptance rate in dept 1 and the tendency of gender 1 to apply to dept 1, gender 1 has a higher overall acceptance rate than gender 0

  • even with no (direct) gender discrimination, there is still indirect discrimination, driven by the tendency of the genders to apply to different departments and the unequal likelihood that each department accepts students (the quick check below confirms the implied marginal rates)
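
As a quick arithmetic check of the unbiased scenario, we can compute the marginal acceptance rates implied by the ground-truth parameters (a small sketch; the department probabilities are read directly off `P_DEPARTMENT` above):

# P(accept | gender) = sum over departments of P(dept | gender) * P(accept | dept, gender)
p_dept_given_gender = {0: [0.7, 0.3], 1: [0.2, 0.8]}  # from P_DEPARTMENT above

for g in [0, 1]:
    p_accept = sum(
        p_d * UNBIASED_ACCEPTANCE_RATES[d, g]
        for d, p_d in enumerate(p_dept_given_gender[g])
    )
    print(f"gender {g}: implied overall acceptance rate = {p_accept:.2f}")
# gender 0: 0.16, gender 1: 0.26 -- close to the simulated 0.14 / 0.27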

Biased scenario#

  • overall acceptance rates are lower (due to the baseline reduction in gender 0’s acceptance rate across both departments)

  • when there is actual direct gender bias, we see a similar overall pattern of discrimination as in the unbiased case, because the indirect effect still operates.

3. Statistical Models#

Modeling Events#

  • We observe counts of events

  • We estimate probabilities – or, rather, the log-odds of events occurring

Linear Models#

The expected value is a linear (additive) combination of the parameters:

\[\begin{split} \begin{align*} Y_i &\sim \mathcal{N}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta_X X_i + ... \end{align*} \end{split}\]

Because the Normal distribution is unbounded, so too is the expected value of the linear model.

Event Models#

Discrete events either occur, taking the value 1, or they do not, taking the value 0. This bounds the expected value of an event: namely, it must lie in the interval \((0, 1)\).

Priors for logistic regression#

The logit link function is a harsh transform:

  • Logit compresses parameter distributions

  • \(x > +4 \rightarrow \) event basically always occurs

  • \(x < -4 \rightarrow\) event basically never occurs (see the numeric check below)
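
A quick numeric check of that saturation, using scipy’s `expit` (the inverse logit):

from scipy.special import expit  # inverse logit

for x in [-4, 0, 4]:
    print(f"logit={x:+d} -> p(event)={expit(x):.3f}")
# logit=-4 -> 0.018, logit=+0 -> 0.500, logit=+4 -> 0.982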

n_samples = 10000
fig, axs = plt.subplots(3, 2, figsize=(8, 8))
for ii, std in enumerate([10, 1.5, 1]):

    # Alpha prior distribution
    alphas = stats.norm(0, std).rvs(n_samples)
    plt.sca(axs[ii][0])
    az.plot_dist(alphas, color="C1")
    plt.xlim([-30, 30])
    plt.ylabel("density")
    if ii == 2:
        plt.xlabel("alpha")

    # Resulting event probability distribution
    ps = utils.invlogit(alphas)
    plt.sca(axs[ii][1])
    az.plot_dist(ps, color="C0")
    plt.xlim([0, 1])
    plt.ylabel("density")
    plt.title(f"$\\alpha \sim \mathcal{{N}}(0, {std})$")
    if ii == 2:
        plt.xlabel("p(event)")
../_images/89d0718aba6a7e06f1a8c1c8c4887c77c6e88c11846aa5859e6c7ae34c94f611.png

Prior Predictive Simulations#

# Demonstrating the effect of alpha / beta on p(x)
from functools import partial


def gaussian_2d_pdf(xs, ys, mean=(0, 0), covariance=np.eye(2)):
    """Evaluate a 2D Gaussian PDF at paired (x, y) points."""
    return np.array(stats.multivariate_normal(mean, covariance).pdf(np.vstack([xs, ys]).T))


def plot_logistic_prior_predictive(n_samples=20, prior_std=0.5):

    _, axs = plt.subplots(1, 2, figsize=(10, 4))
    plt.sca(axs[0])
    min_x, max_x = -prior_std, prior_std
    xs = np.linspace(min_x, max_x, 100)
    ys = xs

    pdf = partial(gaussian_2d_pdf, covariance=np.array([[prior_std**2, 0], [0, prior_std**2]]))
    utils.plot_2d_function(xs, ys, pdf, cmap="gray_r")

    plt.axis("square")
    plt.xlim([min_x, max_x])
    plt.ylim([min_x, max_x])

    # Sample some parameters from prior
    βs = stats.norm.rvs(0, prior_std, size=n_samples)
    αs = stats.norm.rvs(0, prior_std, size=n_samples)

    for b, a in zip(βs, αs):
        plt.scatter(a, b)

    plt.xlabel("$\\alpha$")
    plt.ylabel("$\\beta$")
    plt.title(f"Samples from prior with std={prior_std}")

    # Resulting sigmoid functions
    plt.sca(axs[1])
    min_x, max_x = -3, 3
    xs = np.linspace(min_x, max_x, 100)
    for a, b in zip(αs, βs):
        logit_p = a + xs * b
        p = utils.invlogit(logit_p)
        plt.plot(xs, p)

    plt.xlabel("x")
    plt.ylabel("p = invlogit(x)")
    plt.xlim([-3, 3])
    plt.ylim([-0.05, 1.05])
    plt.axhline(0.5, linestyle="--", color="k")
    plt.title(f"Resulting Logistic Models")
plot_logistic_prior_predictive(prior_std=0.5)
../_images/acd14733c244c8b5e43958bc9c24b112559f2db5baa06bb008ec0d89b99ed116.png
plot_logistic_prior_predictive(prior_std=1.0)
../_images/98fbf3ff89e2c9255cb73c7c6cfc376dcb45e43038b88a1db2931b64be25f1d6.png
plot_logistic_prior_predictive(prior_std=10)
../_images/15f7fccb522dcca48bf0a4082e8a0fbfbeb3e42dbfe7de9dd406636c103a62de.png

Statistical models for admissions#

Again, the estimator will depend on the estimand

Total Causal Effect#

utils.draw_causal_graph(
    edge_list=[("G", "D"), ("G", "A"), ("D", "A")],
    node_props={
        "G": {"color": "red"},
        "A": {"color": "red"},
    },
    edge_props={
        ("G", "A"): {"color": "red"},
        ("D", "A"): {"color": "red"},
        ("G", "D"): {"color": "red"},
    },
    graph_direction="LR",
)
../_images/75c2b72eec68ed3513e34cd78749c8d664e23266b22fe92d3fb83676b0c40b33.svg

Stratify by Gender only. Don’t stratify by Department, because it’s a Pipe (mediator) that we do not want to block.

\[\begin{split} \begin{align*} A_i &\sim \text{Bernoulli}(p=p_i) \\ \text{logit}(p_i) &= \alpha[G_i] \\ \alpha &= [\alpha_0, \alpha_1] \\ \alpha_j &\sim \text{Normal}(0, 1) \end{align*} \end{split}\]

Direct Causal Effect#

utils.draw_causal_graph(
    edge_list=[("G", "D"), ("G", "A"), ("D", "A")],
    node_props={
        "G": {"color": "red"},
        "A": {"color": "red"},
    },
    edge_props={("G", "A"): {"color": "red"}},
    graph_direction="LR",
)
../_images/c24d42205a8c20380c913fa67ffafc5cd530c08b829324849d696ce9454db55d.svg

Stratify by Gender and Department to block the Pipe

\[\begin{split} \begin{align*} A_i &\sim \text{Bernoulli}(p_i) \\ \text{logit}(p_i) &= \alpha[G_i, D_i] \\ \alpha &= \left[{\begin{array}{cc} \alpha_{0,0}, \alpha_{0,1} \\ \alpha_{1,0}, \alpha_{1,1} \end{array}}\right] \\ \alpha_{j,k} &\sim \text{Normal}(0, 1) \end{align*} \end{split}\]

Fitting Total Causal Effect Model#

def fit_total_effect_model_admissions(data):
    """Fit total effect model, stratifying by gender.

    Note
    ----
    We use Binomial regression, so the simulated observation-level
    data from above must be pre-aggregated (see the function
    `aggregate_admissions_data` below). We could have used a
    Bernoulli likelihood (logistic regression) for the simulated
    data, but given that we only have aggregates for the real
    UC Berkeley data, we choose this more general formulation.
    """

    # Set up observations / coords
    n_admitted = data["admit"].values
    n_applications = data["applications"].values

    gender_coded, gender = pd.factorize(data["applicant.gender"].values)

    with pm.Model(coords={"gender": gender}) as total_effect_model:

        # Mutable data for running any gender-based counterfactuals
        gender_coded = pm.Data("gender", gender_coded, dims="obs_ids")

        alpha = pm.Normal("alpha", 0, 1, dims="gender")

        # Record the inverse linked probabilities for reporting
        pm.Deterministic("p_accept", pm.math.invlogit(alpha), dims="gender")

        p = pm.math.invlogit(alpha[gender_coded])

        likelihood = pm.Binomial(
            "accepted", n=n_applications, p=p, observed=n_admitted, dims="obs_ids"
        )
        total_effect_inference = pm.sample()

    return total_effect_model, total_effect_inference


def aggregate_admissions_data(raw_admissions_data):
    """Aggregate simulated data from `simulate_admissions_data`, which
    is in long format to short format, by so that it can be modeled
    using Binomial likelihood. We also recode column names to have the
    same fields as the UC Berkeley dataset so that we can use the
    same model-fitting functions.
    """

    # Aggregate applications, admissions, and rejections. Rename
    # columns to have the same fields as the UC Berkeley dataset
    applications = raw_admissions_data.groupby(["gender", "department"]).count().reset_index()
    applications.rename(columns={"accepted": "applications"}, inplace=True)

    admits = raw_admissions_data.groupby(["gender", "department"]).sum().reset_index()
    admits.rename(columns={"accepted": "admit"}, inplace=True)

    data = applications.merge(admits)
    data.loc[:, "reject"] = data.applications - data.admit

    # Code gender & department. For easier comparison to lecture,
    # we use 1-indexed gender/department indicators like McElreath
    data.loc[:, "applicant.gender"] = data.gender.apply(lambda x: 2 if x else 1)
    data.loc[:, "dept"] = data.department.apply(lambda x: 2 if x else 1)

    return data[["applicant.gender", "dept", "applications", "admit", "reject"]]


def plot_admissions_model_posterior(inference, figsize=(6, 2)):
    _, ax = plt.subplots(figsize=figsize)

    # Plot alphas
    az.plot_forest(inference, var_names=["alpha"], combined=True, hdi_prob=0.89, ax=ax)

    # Plot inverse-linked acceptance probabilities
    _, ax = plt.subplots(figsize=figsize)
    az.plot_forest(inference, var_names=["p_accept"], combined=True, hdi_prob=0.89, ax=ax)

    return az.summary(inference)

Fit Total Effect model on BIASED simulated admissions#

SIMULATED_BIASED_ADMISSIONS, _ = simulate_admissions_data("biased")
SIMULATED_BIASED_ADMISSIONS = aggregate_admissions_data(SIMULATED_BIASED_ADMISSIONS)
SIMULATED_BIASED_ADMISSIONS
applicant.gender dept applications admit reject
0 1 1 335 13 322
1 1 2 173 34 139
2 2 1 106 11 95
3 2 2 386 108 278
total_effect_model_simulated_biased, total_effect_inference_simulated_biased = (
    fit_total_effect_model_admissions(SIMULATED_BIASED_ADMISSIONS)
)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
plot_admissions_model_posterior(total_effect_inference_simulated_biased, (6, 1.5))
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
alpha[1] -2.240 0.151 -2.531 -1.958 0.002 0.002 3876.0 2626.0 1.0
alpha[2] -1.133 0.106 -1.335 -0.939 0.002 0.001 4108.0 3100.0 1.0
p_accept[1] 0.097 0.013 0.071 0.121 0.000 0.000 3876.0 2626.0 1.0
p_accept[2] 0.244 0.019 0.208 0.281 0.000 0.000 4108.0 3100.0 1.0
../_images/3c9cb7ead48aa4b6c88d0daa4b7b9c2679778c121ee9a9976dfe7c24fcccc2d3.png ../_images/a9e2d4919a9a882f6fd19fb886850176e6693db3b446d604c19209db646fdd06.png

Fit Total Effect model on UNBIASED simulated admissions#

SIMULATED_UNBIASED_ADMISSIONS, _ = simulate_admissions_data("unbiased")
SIMULATED_UNBIASED_ADMISSIONS = aggregate_admissions_data(SIMULATED_UNBIASED_ADMISSIONS)
SIMULATED_UNBIASED_ADMISSIONS
applicant.gender dept applications admit reject
0 1 1 335 33 302
1 1 2 173 60 113
2 2 1 106 8 98
3 2 2 386 111 275
total_effect_model_simulated_unbiased, total_effect_inference_simulated_unbiased = (
    fit_total_effect_model_admissions(SIMULATED_UNBIASED_ADMISSIONS)
)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
plot_admissions_model_posterior(total_effect_inference_simulated_unbiased, (6, 1.5))
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
alpha[1] -1.481 0.112 -1.694 -1.264 0.002 0.001 3700.0 2457.0 1.0
alpha[2] -1.132 0.107 -1.327 -0.937 0.002 0.001 4389.0 2864.0 1.0
p_accept[1] 0.186 0.017 0.150 0.215 0.000 0.000 3700.0 2457.0 1.0
p_accept[2] 0.244 0.020 0.210 0.281 0.000 0.000 4389.0 2864.0 1.0
../_images/279bf2f460bf6f9e76966003da9469b5cbbae6ae7193e0e9898eba1964f6ca89.png ../_images/1007e213a0dd112dfb22b6a9f0dabdf83e0613e63ef5aadefc05ce1b5b843b83.png

Fitting Direct Causal Effect Model#

def fit_direct_effect_model_admissions(data):
    """Fit total effect model, stratifying by gender.

    Note
    ----
    We use the Binomial likelihood, so simulated observation-level data from
    above must be pre-aggregated. (see the function `aggregate_admissions_data`
    below). We could have used a Bernoulli likelihood for the simulated data,
    but given that we only have aggregates for the real UC Berkeley data we
    choose this more general formulation.
    """

    # Set up observations / coords
    n_admitted = data["admit"].values
    n_applications = data["applications"].values

    department_coded, department = pd.factorize(data["dept"].values)
    gender_coded, gender = pd.factorize(data["applicant.gender"].values)

    with pm.Model(coords={"gender": gender, "department": department}) as direct_effect_model:

        # Mutable data for running any gender-based or department-based counterfactuals
        gender_coded = pm.Data("gender", gender_coded, dims="obs_ids")
        department_coded = pm.Data("department", department_coded, dims="obs_ids")
        n_applications = pm.Data("n_applications", n_applications, dims="obs_ids")

        alpha = pm.Normal("alpha", 0, 1, dims=["department", "gender"])

        # Record inverse linked probabilities for reporting
        pm.Deterministic("p_accept", pm.math.invlogit(alpha), dims=["department", "gender"])

        p = pm.math.invlogit(alpha[department_coded, gender_coded])

        likelihood = pm.Binomial(
            "accepted", n=n_applications, p=p, observed=n_admitted, dims="obs_ids"
        )
        direct_effect_inference = pm.sample()

    return direct_effect_model, direct_effect_inference

Fit Direct Effect model to BIASED simulated admissions data#

direct_effect_model_simulated_biased, direct_effect_inference_simulated_biased = (
    fit_direct_effect_model_admissions(SIMULATED_BIASED_ADMISSIONS)
)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
plot_admissions_model_posterior(direct_effect_inference_simulated_biased)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
alpha[1, 1] -3.021 0.256 -3.505 -2.563 0.003 0.002 7099.0 2936.0 1.0
alpha[1, 2] -1.997 0.278 -2.516 -1.488 0.003 0.003 6422.0 3414.0 1.0
alpha[2, 1] -1.371 0.188 -1.733 -1.034 0.002 0.002 7865.0 3270.0 1.0
alpha[2, 2] -0.938 0.114 -1.145 -0.722 0.001 0.001 7740.0 3283.0 1.0
p_accept[1, 1] 0.048 0.011 0.029 0.072 0.000 0.000 7099.0 2936.0 1.0
p_accept[1, 2] 0.123 0.029 0.067 0.176 0.000 0.000 6422.0 3414.0 1.0
p_accept[2, 1] 0.204 0.030 0.148 0.259 0.000 0.000 7865.0 3270.0 1.0
p_accept[2, 2] 0.282 0.023 0.239 0.324 0.000 0.000 7740.0 3283.0 1.0
../_images/b21b5a7e2f065c4faa42c90a6e7c859df726682fa82d2260da98370c366b4c4c.png ../_images/02be9cbdb6d5e5144e4aee07063258d7e5ef015d6c27e941288c6c27f9652c1b.png

For comparison, here are the ground-truth biased admission rates, which we’re able to mostly recover:

# Department x Gender
BIASED_ACCEPTANCE_RATES
array([[0.05, 0.1 ],
       [0.2 , 0.3 ]])

Fit Direct Effect model to UNBIASED simulated admissions data#

direct_effect_model_simulated_unbiased, direct_effect_inference_simulated_unbiased = (
    fit_direct_effect_model_admissions(SIMULATED_UNBIASED_ADMISSIONS)
)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
plot_admissions_model_posterior(direct_effect_inference_simulated_unbiased)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
alpha[1, 1] -2.155 0.175 -2.460 -1.805 0.002 0.002 5264.0 3200.0 1.0
alpha[1, 2] -2.276 0.313 -2.901 -1.722 0.004 0.003 5684.0 2986.0 1.0
alpha[2, 1] -0.618 0.161 -0.919 -0.322 0.002 0.001 7013.0 3158.0 1.0
alpha[2, 2] -0.898 0.114 -1.114 -0.684 0.001 0.001 6859.0 3242.0 1.0
p_accept[1, 1] 0.105 0.016 0.076 0.136 0.000 0.000 5264.0 3200.0 1.0
p_accept[1, 2] 0.096 0.027 0.050 0.148 0.000 0.000 5684.0 2986.0 1.0
p_accept[2, 1] 0.351 0.036 0.284 0.419 0.000 0.000 7013.0 3158.0 1.0
p_accept[2, 2] 0.290 0.023 0.243 0.331 0.000 0.000 6859.0 3242.0 1.0
../_images/08221fe4839e60ee04c4fbdea91b44b4d10a1325e94b0b762d30595727cdddce.png ../_images/0c2eee666100ee3e46206bc624b42d8e8765d61b955f8660e736c325af76e62d.png

For comparison, here are the ground-truth unbiased admission rates, which we’re able to recover:

# Department x Gender
UNBIASED_ACCEPTANCE_RATES
array([[0.1, 0.1],
       [0.3, 0.3]])

4. Analyze the UC Berkeley Admissions data#

Review of the counts dataset#

  • we’ll use these count data to model the log-odds of acceptance for each gender/department

  • note that we’ll be using Binomial regression, which is equivalent to Bernoulli regression but operates on aggregate counts, as opposed to individual binary trials

    • The examples above were actually implemented as Binomial regressions so that we can re-use code and demonstrate general patterns of analysis

    • Either way, you’ll get the same inference from both approaches (see the sketch below)
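
A small sketch of that equivalence: the Bernoulli log-likelihood of the individual trials and the Binomial log-likelihood of their aggregate differ only by a constant that does not depend on \(p\), so they support identical inference (the 10-applicant department below is hypothetical):

# Hypothetical department: 10 applicants, 3 admitted
p = 0.3
outcomes = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

bernoulli_loglike = stats.bernoulli.logpmf(outcomes, p).sum()
binomial_loglike = stats.binom.logpmf(k=3, n=10, p=p)

# The difference is log(10 choose 3), a constant in p
print(binomial_loglike - bernoulli_loglike)  # ~4.79 = log(120)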

ADMISSIONS
dept applicant.gender admit reject applications
1 A male 512 313 825
2 A female 89 19 108
3 B male 353 207 560
4 B female 17 8 25
5 C male 120 205 325
6 C female 202 391 593
7 D male 138 279 417
8 D female 131 244 375
9 E male 53 138 191
10 E female 94 299 393
11 F male 22 351 373
12 F female 24 317 341

Fit Total Effect model to UC Berkeley admissions data#

Don’t forget to look at diagnostics, which we’ll skip here

total_effect_model, total_effect_inference = fit_total_effect_model_admissions(ADMISSIONS)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
plot_admissions_model_posterior(total_effect_inference, figsize=(6, 1.5))
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
alpha[male] -0.221 0.038 -0.294 -0.151 0.001 0.000 4202.0 2936.0 1.0
alpha[female] -0.828 0.050 -0.917 -0.732 0.001 0.001 3905.0 2831.0 1.0
p_accept[male] 0.445 0.009 0.427 0.462 0.000 0.000 4202.0 2936.0 1.0
p_accept[female] 0.304 0.011 0.286 0.325 0.000 0.000 3905.0 2831.0 1.0
../_images/c27c438f4095bd50af410c6e9c3c954924e9b43ed36cbdd568fa2338d6bc1ce1.png ../_images/26007db02f8220015cd49c95c590e9aa040216e29d6ae0878528c1dd5d59a0f8.png

Total Causal Effect#

# Total Causal Effect
female_p_accept = total_effect_inference.posterior["p_accept"].sel(gender="female")
male_p_accept = total_effect_inference.posterior["p_accept"].sel(gender="male")
contrast = male_p_accept - female_p_accept

plt.subplots(figsize=(8, 3))
az.plot_dist(contrast)
plt.axvline(0, linestyle="--", color="black", label="No difference")
plt.xlabel("gender contrast (acceptance probability)")
plt.ylabel("density")
plt.title(
    "Total Causal Effect (Experienced Discrimination):\npositive values indicate male advantage"
)
plt.legend();
../_images/99b458765f16558672fb5499b0da78adf9a777177c864fb7375054534b147f88.png

Fit Direct Effect model to UC Berkeley admissions data#

direct_effect_model, direct_effect_inference = fit_direct_effect_model_admissions(ADMISSIONS)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
plot_admissions_model_posterior(direct_effect_inference, figsize=(6, 3))
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
alpha[A, male] 0.490 0.072 0.353 0.620 0.001 0.001 5850.0 3067.0 1.0
alpha[A, female] 1.474 0.243 1.029 1.937 0.003 0.002 5579.0 2964.0 1.0
alpha[B, male] 0.531 0.087 0.377 0.703 0.001 0.001 5106.0 2554.0 1.0
alpha[B, female] 0.657 0.395 -0.075 1.387 0.005 0.004 6701.0 3242.0 1.0
alpha[C, male] -0.532 0.115 -0.743 -0.310 0.001 0.001 5940.0 3091.0 1.0
alpha[C, female] -0.660 0.086 -0.831 -0.506 0.001 0.001 5786.0 3051.0 1.0
alpha[D, male] -0.698 0.102 -0.892 -0.513 0.001 0.001 6055.0 3004.0 1.0
alpha[D, female] -0.618 0.108 -0.828 -0.425 0.001 0.001 6215.0 2943.0 1.0
alpha[E, male] -0.935 0.158 -1.239 -0.645 0.002 0.002 5672.0 2865.0 1.0
alpha[E, female] -1.146 0.119 -1.364 -0.928 0.002 0.001 5891.0 2814.0 1.0
alpha[F, male] -2.667 0.205 -3.057 -2.291 0.003 0.002 5224.0 2612.0 1.0
alpha[F, female] -2.499 0.195 -2.863 -2.131 0.003 0.002 6109.0 3036.0 1.0
p_accept[A, male] 0.620 0.017 0.587 0.650 0.000 0.000 5850.0 3067.0 1.0
p_accept[A, female] 0.811 0.037 0.742 0.878 0.000 0.000 5579.0 2964.0 1.0
p_accept[B, male] 0.629 0.020 0.595 0.670 0.000 0.000 5106.0 2554.0 1.0
p_accept[B, female] 0.653 0.086 0.488 0.805 0.001 0.001 6701.0 3242.0 1.0
p_accept[C, male] 0.370 0.027 0.322 0.423 0.000 0.000 5940.0 3091.0 1.0
p_accept[C, female] 0.341 0.019 0.303 0.376 0.000 0.000 5786.0 3051.0 1.0
p_accept[D, male] 0.333 0.023 0.291 0.374 0.000 0.000 6055.0 3004.0 1.0
p_accept[D, female] 0.351 0.025 0.304 0.395 0.000 0.000 6215.0 2943.0 1.0
p_accept[E, male] 0.283 0.032 0.222 0.341 0.000 0.000 5672.0 2865.0 1.0
p_accept[E, female] 0.242 0.022 0.204 0.283 0.000 0.000 5891.0 2814.0 1.0
p_accept[F, male] 0.066 0.012 0.043 0.089 0.000 0.000 5224.0 2612.0 1.0
p_accept[F, female] 0.077 0.014 0.052 0.103 0.000 0.000 6109.0 3036.0 1.0
../_images/5c5f9ff73e139d0eac6e0b45d4a70630054460c6b58ee7283f2105658a31feb6.png ../_images/3010d4b8d85dd77d24a3a3ee76719dffc9f8178a7c2ff8c7dd82daf042b3d746.png
# Fancier plot of department/gender acceptance probability distributions
_, ax = plt.subplots(figsize=(10, 5))
for ii, dept in enumerate(ADMISSIONS.dept.unique()):
    color = f"C{ii}"
    for gend in ADMISSIONS["applicant.gender"].unique():
        label = f"{dept}:{gend}"
        linestyle = "-." if gend == "female" else "-"
        post = direct_effect_inference.posterior["p_accept"].sel(department=dept, gender=gend)
        az.plot_dist(post, label=label, color=color, plot_kwargs={"linestyle": linestyle})
plt.xlabel("acceptance probability")
plt.legend(title="Deptartment:Gender");
../_images/415ead0f5eb46bcfc4e325e994a2b3c7037697f738f32d9ea967b25cf401601a.png

Direct Causal Effect#

# Direct Causal Effect
_, ax = plt.subplots(figsize=(8, 4))

for ii, dept in enumerate(ADMISSIONS.dept.unique()):
    color = f"C{ii}"
    label = f"{dept}"

    female_p_accept = direct_effect_inference.posterior["p_accept"].sel(
        gender="female", department=dept
    )
    male_p_accept = direct_effect_inference.posterior["p_accept"].sel(
        gender="male", department=dept
    )

    contrast = male_p_accept - female_p_accept
    az.plot_dist(contrast, color=color, label=label)

plt.xlabel("gender contrast (acceptance probability)")
plt.ylabel("density")
plt.title(
    "Direct Causal Effect (Institutional Discrimination):\npositive values indicate male advantage"
)
plt.axvline(0, linestyle="--", color="black", label="No difference")
plt.legend(title="Department");
../_images/6eabdde5bbcbad977e684de962381eabfad76709e45c1684722af59be8e99a79.png

Average Direct Effect#

  • The causal effect of gender averaged across departments (marginalizing)

  • Depends on the distribution of applications to each department

  • This is easy to do as a simulation

Post-stratification#

  • Re-weight estimates for the target population

  • allows us to apply a model fit at one university to estimate the causal impact at another university with a different distribution of departments

  • Here, we use the empirical distribution of applications for re-weighting estimates

# Use the empirical distribution of departments -- we'd update this for a different university
applications_per_dept = ADMISSIONS.groupby("dept").sum()["applications"].values
total_applications = applications_per_dept.sum()
department_weight = applications_per_dept / total_applications

female_alpha = direct_effect_inference.posterior.sel(gender="female")["alpha"]
male_alpha = direct_effect_inference.posterior.sel(gender="male")["alpha"]

weighted_female_alpha = female_alpha * department_weight
weighted_male_alpha = male_alpha * department_weight

contrast = weighted_male_alpha - weighted_female_alpha

_, ax = plt.subplots(figsize=(8, 4))
az.plot_dist(contrast)

plt.axvline(0, linestyle="--", color="k", label="No Difference")
plt.xlabel("causal effect of perceived gender")
plt.ylabel("density")
plt.title("Average Direct Causal of Gender (perception),\npositive values indicate male advantage")
plt.xlim([-0.3, 0.3])
plt.legend();
../_images/bf5a715fc85760603d78e9013f8d45cfa7f6feaad8f4fd889e829847579c1444.png
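
To post-stratify to a different target population, we simply swap in that population’s department weights. As a toy example, here is the same contrast under a hypothetical university whose applications are spread uniformly across departments (the uniform weights are assumed, not estimated):

# Hypothetical target university: uniform distribution of applications
uniform_weight = np.ones(len(department_weight)) / len(department_weight)

contrast_uniform = male_alpha * uniform_weight - female_alpha * uniform_weight

_, ax = plt.subplots(figsize=(8, 4))
az.plot_dist(contrast_uniform)
plt.axvline(0, linestyle="--", color="k", label="No Difference")
plt.xlabel("causal effect of perceived gender (uniform departments)")
plt.ylabel("density")
plt.legend();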

To verify the averaging process, we can look at the contrast of the p_accept samples from the posterior, which gives similar results. However, looking at the posterior alone wouldn’t work for making predictions for an out-of-sample university.

direct_posterior = direct_effect_inference.posterior

# Select by gender; note: p_accept already has the inverse link function applied
female_posterior = direct_posterior.sel(gender="female")[
    "p_accept"
]  # shape: (chain x draw x department)
male_posterior = direct_posterior.sel(gender="male")[
    "p_accept"
]  # shape: (chain x draw x department)
contrast = male_posterior - female_posterior  # shape: (chain x draw x department)

_, ax = plt.subplots(figsize=(8, 4))
# marginalize / collapse the contrast across all departments
az.plot_dist(contrast)
plt.axvline(0, linestyle="--", color="k", label="No Difference")
plt.xlim([-0.3, 0.3])
plt.xlabel("causal effect of perceived gender")
plt.ylabel("density")
plt.title("Posterior contrast of $p_{accept}$,\npositive values indicate male advantage")
plt.legend();
../_images/0761bc15a4341cc20369f2c055b92d42af311819f68a4990413aeba9e94ccca1.png

Discrimination?#

Hard to say

  • Big structural effects

  • Distribution of applicants could be due to discrimination

  • Confounds are likely

BONUS: Survival Analysis#

  • Counts are often modeled as time-to-event (e.g. with Exponential or Gamma distributions) – the goal is to estimate the event rate

  • Tricky because ignoring censored data can lead to inferential errors (see the small demo below)

    • Left censored: we don’t know when the timer started

    • Right censored: the observation period ended before the event had time to happen
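
A small demo of that inferential error, with made-up numbers: if the true mean wait is 50 days but we observe only a 60-day window and drop the individuals still waiting, the naive estimate badly underestimates the true mean.

rng = np.random.default_rng(0)

true_mean_wait = 50.0
waits = rng.exponential(true_mean_wait, size=10_000)

# Observation window ends at day 60; longer waits are right-censored.
# Naively keeping only the fully observed events:
window = 60.0
observed_only = waits[waits <= window]

print(f"true mean: {true_mean_wait:.0f}, naive mean: {observed_only.mean():.1f}")  # ~24 days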

Example: Austin Cat Adoption#

  • 20k cats

  • Events: adopted (1) or not (0)

  • Censoring mechanisms

    • death before adoption

    • escape

    • not adopted yet

Goal: determine if Black cats are adopted at a lower rate than non-Black cats.

CATS = utils.load_data("AustinCats")
CATS.head()
id days_to_event date_out out_event date_in in_event breed color intake_age
0 A730601 1 07/08/2016 09:00:00 AM Transfer 07/07/2016 12:11:00 PM Stray Domestic Shorthair Mix Blue Tabby 7
1 A679549 25 06/16/2014 01:54:00 PM Transfer 05/22/2014 03:43:00 PM Stray Domestic Shorthair Mix Black/White 1
2 A683656 4 07/17/2014 04:57:00 PM Adoption 07/13/2014 01:20:00 PM Stray Snowshoe Mix Lynx Point 2
3 A709749 41 09/22/2015 12:49:00 PM Transfer 08/12/2015 06:29:00 PM Stray Domestic Shorthair Mix Calico 12
4 A733551 9 09/01/2016 12:00:00 AM Transfer 08/23/2016 02:35:00 PM Stray Domestic Shorthair Mix Brown Tabby/White 1

Modeling outcome variable: days_to_event#

Two go-to distributions for modeling time-to-event

Exponential Distribution:

  • simpler of the two (single parameter)

  • assumes constant rate

  • maximum entropy distribution amongst the set of non-negative continuous distributions that have the same average rate.

  • assumes a single thing must occur (e.g. a part failure) before the event occurs (machine breakdown)

Gamma Distribution:

  • more complex of the two (two parameters)

  • maximum entropy distribution amongst the set of distributions with the same mean and the same average logarithm

  • assumes multiple things must happen (e.g. several part failures) before the event occurs (machine breakdown); see the small simulation below
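
A small simulation of that last point, with assumed numbers: if a machine breaks down only after 3 parts fail in sequence, each at rate \(\lambda\), the total waiting time is a sum of Exponentials, which is exactly Gamma distributed.

rng = np.random.default_rng(1)

lam, n_parts = 0.1, 3  # assumed per-part failure rate and part count
part_failures = rng.exponential(scale=1 / lam, size=(100_000, n_parts))
machine_breakdowns = part_failures.sum(axis=1)  # ~ Gamma(alpha=3, beta=0.1)

print(machine_breakdowns.mean(), stats.gamma(a=n_parts, scale=1 / lam).mean())  # both ~30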

_, axs = plt.subplots(1, 2, figsize=(10, 4))

days = np.linspace(0, 100, 100)

plt.sca(axs[0])
for scale in [20, 50]:
    expon = stats.expon(loc=0.01, scale=scale)
    utils.plot_line(days, expon.pdf(days), label=f"$\\lambda$=1/{scale}")
plt.xlabel("days")
plt.ylabel("density")
plt.legend()
plt.title("Exponential Distribution")


plt.sca(axs[1])
alpha = 10  # Gamma shape parameter
for scale in [2, 5]:
    gamma = stats.gamma(a=alpha, loc=alpha, scale=scale)
    utils.plot_line(days, gamma.pdf(days), label=f"$\\alpha$={alpha}, $\\beta$={scale}")
plt.xlabel("days")
plt.ylabel("density")
plt.legend()
plt.title("Gamma Distribution");
../_images/16b63aef4e140181238b6520c95502c2655a139a9dd425bdde257647b2e35651.png

Censored and un-censored observations#

  • Observed (uncensored) events contribute the probability density; the Cumulative distribution (CDF) gives the probability that the event occurred by time x

  • Unobserved (censored) data instead require the complement of the CDF (the CCDF); i.e. the probability that the event hasn’t happened yet (see the sketch below)
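
A sketch of the two likelihood contributions for a single cat (the numbers are assumed): an adoption observed at day 30 contributes the Exponential density, while a cat still waiting at day 30 contributes the CCDF, scipy’s `logsf`.

lam = 1 / 50  # assumed adoption rate: a mean wait of 50 days
d = 30.0

loglik_observed = stats.expon(scale=1 / lam).logpdf(d)  # adopted at day 30
loglik_censored = stats.expon(scale=1 / lam).logsf(d)  # still waiting at day 30

print(f"observed: {loglik_observed:.2f}, censored: {loglik_censored:.2f}")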

xs = np.linspace(0, 5, 100)
_, axs = plt.subplots(1, 2, figsize=(10, 5))
plt.sca(axs[0])
cdfs = {}
for lambda_ in [0.5, 1]:
    cdfs[lambda_] = ys = 1 - np.exp(-lambda_ * xs)
    utils.plot_line(xs, ys, label=f"$\\lambda={lambda_}$")

plt.xlabel("x")
plt.ylabel("$1 - \\exp(-x)$")
plt.legend()
plt.title("CDF\np(event happened before or at time x| $\\lambda$)", fontsize=12)

plt.sca(axs[1])
for lambda_ in [0.5, 1]:
    ys = 1 - cdfs[lambda_]
    utils.plot_line(xs, ys, label=f"$\\lambda={lambda_}$")

plt.xlabel("x")
plt.ylabel("$\\exp(-x)$")
plt.legend()
plt.title("CCDF\np(event hasn't happened before or at time x| $\\lambda$)", fontsize=12);
../_images/884b739928a0c012c75e4266f451f73ed32799562971b0b7c9c837d9cee7c562.png

Statistical Model#

\[\begin{split} \begin{align*} D_i | A_i = 1 &\sim \text{Exponential}(\lambda_i) \\ D_i | A_i = 0 &\sim \text{ExponentialCCDF}(\lambda_i) \\ \lambda_i &= \frac{1}{\mu_i} \\ \mu_i &= \alpha_{\text{cat color}[i]} \\ \alpha_{\text{Black, Other}} &\sim \text{Exponential}(\gamma) \end{align*} \end{split}\]
  • \(D_i | A_i = 1\) – observed adoptions

  • \(D_i | A_i = 0\) – not-yet-observed (censored) adoptions

  • \(\alpha_{\text{cat color}[i]}\) – average time-to-adoption for each cat color; the Exponential prior keeps the \(\alpha\)s positive (in place of a log link)

  • \(\lambda_i = \frac{1}{\mu_i}\) – parameterizing via the mean \(\mu_i\) simplifies estimating the average time-to-adoption

Finding a reasonable hyperparameter for \(\alpha\)#

We’ll need to determine a reasonable value for the Exponential prior rate parameter \(\gamma\). To do so, we’ll look at the empirical distribution of time to adoption:

# determine reasonable prior for alpha
plt.subplots(figsize=(6, 3))
CATS[CATS.out_event == "Adoption"].days_to_event.hist(bins=100)
plt.xlim([0, 500]);
../_images/f4d7f8c47f46052747ae4143792e9d8ffd89fced4f2a1ffd8e4a55d3f0942e89.png

Using the empirical histogram above, we see that the majority of the probability mass is between zero and 200 days, so let’s use 50 days as the expected wait time.
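
As a quick sanity check on that choice: an Exponential prior with rate \(\gamma = 1/50\) puts the prior mean of \(\alpha\) at 50 days, on the same scale as the histogram above (a sketch of the prior, not the fitted model):

# Prior sanity check: alpha ~ Exponential(rate=1/50) has mean 50 days
prior_alpha_draws = stats.expon(scale=50).rvs(size=10_000, random_state=1)
print(f"prior mean wait: {prior_alpha_draws.mean():.1f} days")  # ~50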

CAT_COLOR_ID, CAT_COLOR = pd.factorize(CATS.color.apply(lambda x: "Other" if x != "Black" else x))
ADOPTED_ID, ADOPTED = pd.factorize(
    CATS.out_event.apply(lambda x: "Other" if x != "Adoption" else "Adopted")
)
DAYS_TO_ADOPTION = CATS.days_to_event.values.astype(float)
LAMBDA = 50

with pm.Model(coords={"cat_color": CAT_COLOR}) as adoption_model:

    # Censoring: cats that were not adopted are right-censored at their
    # observed days_to_event; adopted cats are fully observed events, so
    # they get an upper bound they can never reach
    is_adopted = ADOPTED[ADOPTED_ID] == "Adopted"
    right_censoring = DAYS_TO_ADOPTION.copy()
    right_censoring[is_adopted] = np.inf

    # Priors
    gamma = 1 / LAMBDA
    alpha = pm.Exponential("alpha", gamma, dims="cat_color")

    # Likelihood: alpha is the average time-to-adoption, so the rate is its reciprocal
    adoption_rate = 1 / alpha[CAT_COLOR_ID]
    pm.Censored(
        "adopted",
        pm.Exponential.dist(lam=adoption_rate),
        lower=None,
        upper=right_censoring,
        observed=DAYS_TO_ADOPTION,
    )
    adoption_inference = pm.sample()
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 14 seconds.

Poor black kitties 🐈‍⬛#

It appears that black cats DO take longer to get adopted.

Posterior Summary#

_, ax = plt.subplots(figsize=(4, 2))
az.plot_forest(adoption_inference, var_names=["alpha"], combined=True, hdi_prob=0.89, ax=ax)
plt.xlabel("days to adoption");
../_images/7463a9a3cc421cbb443b994720e1bed8cee6605deb849895b6701f3fae3130d8.png

Posterior distributions#

for ii, cat_color in enumerate(["Black", "Other"]):
    color = "black" if cat_color == "Black" else "C0"
    posterior_alpha = adoption_inference.posterior.sel(cat_color=cat_color)["alpha"]
    az.plot_dist(posterior_alpha, color=color, label=cat_color)
plt.xlabel("waiting time")
plt.ylabel("density")
plt.legend();
../_images/c782345a63e71e4be9174b60c54c5ac3b0c49bb59391b20356de0530fa35b5e1.png
days_until_adoption_ = np.linspace(1, 100)
n_posterior_samples = 25

for cat_color in ["Black", "Other"]:
    color = "black" if cat_color == "Black" else "C0"
    posterior = adoption_inference.posterior.sel(cat_color=cat_color)
    alpha_samples = posterior.alpha.sel(chain=0)[:n_posterior_samples].values
    for ii, alpha in enumerate(alpha_samples):
        label = cat_color if ii == 0 else None
        plt.plot(
            days_until_adoption_,
            np.exp(-days_until_adoption_ / alpha),
            color=color,
            alpha=0.25,
            label=label,
        )

plt.xlabel("days until adoption")
plt.ylabel("fraction")
plt.legend();
../_images/df73d1638b5890cef1285abab12e483f79231d7598ea9d3ee59ef39348747888.png

Authors#

  • Ported to PyMC by Dustin Stansbury (2024)

  • Based on Statistical Rethinking (2023) lectures by Richard McElreath

%load_ext watermark
%watermark -n -u -v -iv -w -p pytensor,aeppl,xarray
Last updated: Tue Dec 17 2024

Python implementation: CPython
Python version       : 3.12.5
IPython version      : 8.27.0

pytensor: 2.26.4
aeppl   : not installed
xarray  : 2024.7.0

statsmodels: 0.14.2
numpy      : 1.26.4
pandas     : 2.2.2
matplotlib : 3.9.2
scipy      : 1.14.1
xarray     : 2024.7.0
pymc       : 5.19.1
arviz      : 0.19.0

Watermark: 2.5.0

License notice#

All the notebooks in this example gallery are provided under the MIT License which allows modification, and redistribution for any use provided the copyright and license notices are preserved.

Citing PyMC examples#

To cite this notebook, use the DOI provided by Zenodo for the pymc-examples repository.

Important

Many notebooks are adapted from other sources: blogs, books… In such cases you should cite the original source as well.

Also remember to cite the relevant libraries used by your code.

Here is a citation template in BibTeX:

@incollection{citekey,
  author    = "<notebook authors, see above>",
  title     = "<notebook title>",
  editor    = "PyMC Team",
  booktitle = "PyMC examples",
  doi       = "10.5281/zenodo.5654871"
}

which once rendered could look like: