Diagnosing Biased Inference with Divergences#

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm

print(f"Running on PyMC v{pm.__version__}")
Running on PyMC v5.28.0+58.gf58491a3
%config InlineBackend.figure_format = 'retina'
az.style.use("arviz-variat")
SEED = 20134234

This notebook is inspired by Michael Betancourt’s post on mc-stan, but we have adapted to reflect improvements in diagnostics since then and to show best practices. For discussion on the theory behind divergences and how they relate to biased inference, you can read A Conceptual Introduction to Hamiltonian Monte Carlo.

Bayesian statistics is all about building a model and estimating the parameters in that model. However, a naive or direct parameterization of our probability model can sometimes be ineffective, you can check out Thomas Wiecki’s blog post, Why hierarchical models are awesome, tricky, and Bayesian on the same issue in PyMC. Suboptimal parameterization often leads to slow sampling, and more problematic, biased MCMC estimators.

More formally, as explained in the original post, Diagnosing Biased Inference with Divergences:

Markov chain Monte Carlo (MCMC) approximates expectations with respect to a given target distribution,

\(\mathbb{E}{\pi} [ f ] = \int \mathrm{d}q \, \pi (q) \, f(q)\)

using the states of a Markov chain, \({q{0}, \ldots, q_{N} }\),

\(\mathbb{E}{\pi} [ f ] \approx \hat{f}{N} = \frac{1}{N + 1} \sum_{n = 0}^{N} f(q_{n})\)

These estimators, however, are guaranteed to be accurate only asymptotically as the chain grows to be infinitely long,

\(\lim_{N \rightarrow \infty} \hat{f}{N} = \mathbb{E}{\pi} [ f ]\)

To be useful in applied analyses, we need MCMC to converge to the true expectation values before we exhaust our finite computational resources. This fast convergence requires strong ergodicity conditions to hold, in particular geometric ergodicity between a Markov transition and a target distribution. Geometric ergodicity is usually the necessary condition for MCMC estimators to follow a central limit theorem, which ensures not only that they are unbiased even after only a finite number of iterations but also that we can empirically quantify their precision using the Monte Carlo standard error (mcse).

Unfortunately, proving geometric ergodicity is infeasible for any nontrivial problem. Instead, we must rely on empirical diagnostics that identify obstructions to geometric ergodicity, and hence well-behaved MCMC estimators. For a general Markov transition and target distribution, the best known diagnostic is \(\hat{R}\) statistic over an ensemble of Markov chains initialized from diffuse points in parameter space; to do any better we need to exploit the particular structure of a given transition or target distribution.

Hamiltonian Monte Carlo, for example, is especially powerful in this regard as its failures to be geometrically ergodic with respect to any target distribution manifest in distinct behaviors that have been developed into sensitive diagnostics. One of these behaviors is the appearance of divergences that indicate the Hamiltonian Markov chain has encountered regions of high curvature in the target distribution which it cannot adequately explore.

In this notebook we aim to identify divergences and the underlying pathologies in PyMC.

The Eight Schools Model#

The hierarchical model of the Eight Schools dataset (Rubin 1981) as seen in Stan:

\(\mu \sim \mathcal{N}(0, 5)\\\) \(\tau \sim \text{Half-Cauchy}(0, 5)\\\) \(\theta_{n} \sim \mathcal{N}(\mu, \tau)\\\) \(y_{n} \sim \mathcal{N}(\theta_{n}, \sigma_{n}),\)

where \(n \in \{1, \ldots, 8 \}\) and the \(\{ y_{n}, \sigma_{n} \}\) are given as data.

Inferring the hierarchical hyperparameters, \(\mu\) and \(\sigma\), together with the group-level parameters, \(\theta_{1}, \ldots, \theta_{8}\), allows the model to pool data across the groups and reduce their posterior variance. Unfortunately, the direct centered parameterization also squeezes the posterior distribution into a particularly challenging geometry that obstructs geometric ergodicity and hence biases MCMC estimation.

# Data of the Eight Schools Model
J = 8
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])
# tau = 25.
with pm.Model() as Centered_eight:
    mu = pm.Normal("mu", mu=0, sigma=5)
    tau = pm.HalfCauchy("tau", beta=5)
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=J)
    obs = pm.Normal("obs", mu=theta, sigma=sigma, observed=y)

Unfortunately, this direct implementation of the model exhibits a pathological geometry that frustrates geometric ergodicity. To understand this bias, let’s consider first a short Markov chain, commonly used when computational expediency is a motivating factor, and only afterwards a longer Markov chain.

A Dangerously-Short Markov Chain#

with Centered_eight:
    short_trace = pm.sample(200, random_seed=SEED)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, tau, theta]

Sampling 4 chains for 1_000 tune and 200 draw iterations (4_000 + 800 draws total) took 8 seconds.
There were 18 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

We can see a warning message indicating that we have several issues. \(\hat{R}\) is too high, the effective samples size is too low and we have divergences. We can check the summary of the diagnostics:

az.summary(short_trace, kind="diagnostics", round_to=2)
ess_bulk ess_tail r_hat mcse_mean mcse_sd
mu 143.38 159.95 1.02 0.27 0.17
theta[0] 216.35 292.38 1.02 0.40 0.47
theta[1] 265.87 372.29 1.02 0.28 0.25
theta[2] 255.69 389.88 1.01 0.31 0.31
theta[3] 278.02 334.57 1.01 0.29 0.27
theta[4] 218.00 369.91 1.02 0.31 0.27
theta[5] 256.01 435.87 1.01 0.30 0.25
theta[6] 211.84 216.72 1.01 0.33 0.32
theta[7] 282.80 404.96 1.02 0.31 0.32
tau 73.77 37.21 1.04 0.30 0.36

Visually we can check if the chains have converged by plotting the trace:

az.plot_trace(short_trace);

In this example the problems are very clear to see, for many of the parameters, we can see that the chains have not mixed well and are stuck in different regions of the parameter space. But sometimes the problems are not so obvious. An alternative way to visualize sampling problems is through rank plots. Usually rank plots will detect more subtle problems with convergence that are not obvious in trace plots. For rank plots, we expect to see a uniform distribution of ranks across chains. The gray envelope indicates expected variability.

az.plot_rank(short_trace);

Let’s focus now one parameter \(\tau\), or more specifically, its logarithm, \(log(\tau)\). Because \(\tau\) is constrained to be positive, its logarithm will allow us to better resolve behavior for small values. Indeed the chains seems to be exploring both small and large values reasonably well.

Unfortunately, the resulting estimate for the mean of \(log(\tau)\) is strongly biased away from the true value, here shown in grey.

# plot the estimate for the cumulative mean of log(τ)
short_trace.posterior["log_tau"] = np.log(short_trace.posterior["tau"])
logtau = az.extract(short_trace, var_names=["log_tau"])
mlogtau = logtau.cumulative("sample").mean()

_, ax = plt.subplots(figsize=(10, 4))
ax.axhline(0.7657852, color="gray")
ax.plot(mlogtau, lw=2.5)
ax.set(xlabel="Iteration", ylabel="MCMC mean of log(tau)", title="MCMC estimation of log(tau)");

Hamiltonian Monte Carlo, however, is not so oblivious to these issues as \(\approx\) 3% of the iterations in our lone Markov chain ended with a divergence.

# display the total number divergences and its percentage
divergent = short_trace.sample_stats["diverging"]
print(f"Number of Divergences {divergent.sum().item()}")
print(f"Percentage of Divergent {divergent.mean().item() * 100:.1f}")
Number of Divergences 18
Percentage of Divergent 2.2

Additionally, because the divergent transitions, here shown in green, tend to be located near the pathologies we can use them to identify the location of the problematic neighbourhoods in parameter space.

az.plot_pair(
    short_trace,
    var_names=["theta", "log_tau"],
    coords={"theta_dim_0": 0},
    visuals={"divergence": True},
);

It is important to point out that the pathological samples from the trace are not necessarily concentrated at the funnel: when a divergence is encountered, the subtree being constructed is rejected and the transition samples uniformly from the existing discrete trajectory. Consequently, divergent samples will not be located exactly in the region of high curvature.

There are many other ways to explore and visualize the pathological region in the parameter space. For example, we can reproduce Figure 5b in Visualization in Bayesian workflow

az.plot_parallel(short_trace);

A Safer, Longer Markov Chain#

The recommended practice is to run multiple chains for a longer number of iterations. For diagnostic like \(\hat{R}\) to be reliable it is recommended to run at least 4 chains and that each chain has at least 100 effective samples (a total of 400).

with Centered_eight:
    longer_trace = pm.sample(1000, chains=4, tune=1000, random_seed=SEED)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, tau, theta]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 9 seconds.
There were 137 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
az.summary(longer_trace, kind="diagnostics", round_to=2)
ess_bulk ess_tail r_hat mcse_mean mcse_sd
mu 588.54 1023.19 1.01 0.13 0.09
theta[0] 863.45 1207.49 1.02 0.20 0.23
theta[1] 1158.17 1625.65 1.02 0.14 0.13
theta[2] 1029.00 1155.82 1.02 0.15 0.16
theta[3] 1009.68 1149.76 1.02 0.15 0.14
theta[4] 872.26 1471.00 1.02 0.16 0.14
theta[5] 1120.49 1139.74 1.02 0.14 0.13
theta[6] 845.07 1278.19 1.01 0.18 0.18
theta[7] 1229.71 1184.11 1.02 0.15 0.14
tau 41.33 37.16 1.07 0.39 0.50
pc = az.plot_trace(longer_trace, var_names=["tau"], visuals={"divergence": False})

# identify the sticky intervals in the "purple" chain (chain 3)
chain = 3
min_length = 20
low = (longer_trace.posterior.sel(chain=chain)["tau"] < 2).values
changes = np.diff(low.astype(int))
starts = np.where(changes == 1)[0] + 1
ends = np.where(changes == -1)[0]
sticky = [(s, e) for s, e in zip(starts, ends) if e - s > min_length]
# add gray bands to the trace plot
az.add_bands(pc, sticky);

We see problems with \(\hat{R}\), the effective sample size (in particular for tau) does not indicate any serious issues. As shown in the trace plot, chains occasionally “sticks” as it approaches small values of \(\tau\) (see for example the “purple” chain, at the region highlighted by the gray band), exactly where we saw the divergences concentrating. This is a clear indication of the underlying pathologies. These sticky intervals induce severe oscillations in the MCMC estimators early on, until they seem to finally settle into biased values.

In fact the sticky intervals are the Markov chain trying to correct the biased exploration. If we ran the chain even longer then it would eventually get stuck again and drag the MCMC estimator down towards the true value. Given an infinite number of iterations this delicate balance asymptotes to the true expectation as we’d expect given the consistency guarantee of MCMC. Stopping after any finite number of iterations, however, destroys this balance and leaves us with a significant bias. More details can be found in the work by .

Mitigating Divergences by Adjusting PyMC’s Adaptation Routine#

Divergences in Hamiltonian Monte Carlo arise when the Hamiltonian transition encounters regions of extremely large curvature, such as the opening of the hierarchical funnel. Unable to accurate resolve these regions, the transition malfunctions and flies off towards infinity. With the transitions unable to completely explore these regions of extreme curvature, we lose geometric ergodicity and our MCMC estimators become biased.

The default algorithm implemented in PyMC uses a heuristic to quickly identify these misbehaving trajectories, and hence label divergences, without having to wait for them to run all the way to infinity. This heuristic can be a bit aggressive, however, and sometimes label transitions as divergent even when we have not lost geometric ergodicity.

To resolve this potential ambiguity we can adjust the step size, \(\epsilon\), of the Hamiltonian transition. The smaller the step size the more accurate the trajectory and the less likely it will be mislabeled as a divergence. In other words, if we have geometric ergodicity between the Hamiltonian transition and the target distribution then decreasing the step size will reduce and then ultimately remove the divergences entirely. If we do not have geometric ergodicity, however, then decreasing the step size will not completely remove the divergences.

The step size in PyMC is tuned automatically during warm up, but we can coerce smaller step sizes by tweaking the configuration of PyMC’s adaptation routine. In particular, we can increase the target_accept parameter from its default value of 0.8 closer to its maximum value of 1.

Adjusting Adaptation Routine#

with Centered_eight:
    fit_cp80 = pm.sample(tune=3000, target_accept=0.80, random_seed=SEED)
    fit_cp85 = pm.sample(tune=3000, target_accept=0.85, random_seed=SEED)
    fit_cp90 = pm.sample(tune=3000, target_accept=0.90, random_seed=SEED)
    fit_cp95 = pm.sample(tune=3000, target_accept=0.95, random_seed=SEED)
    fit_cp99 = pm.sample(tune=3000, target_accept=0.99, random_seed=SEED)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, tau, theta]

Sampling 4 chains for 3_000 tune and 1_000 draw iterations (12_000 + 4_000 draws total) took 10 seconds.
There were 89 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, tau, theta]

Sampling 4 chains for 3_000 tune and 1_000 draw iterations (12_000 + 4_000 draws total) took 11 seconds.
There were 75 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, tau, theta]

Sampling 4 chains for 3_000 tune and 1_000 draw iterations (12_000 + 4_000 draws total) took 12 seconds.
There were 191 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, tau, theta]

Sampling 4 chains for 3_000 tune and 1_000 draw iterations (12_000 + 4_000 draws total) took 14 seconds.
There were 150 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, tau, theta]

Sampling 4 chains for 3_000 tune and 1_000 draw iterations (12_000 + 4_000 draws total) took 35 seconds.
There were 707 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
df = pd.DataFrame([".80", ".85", ".90", ".95", ".99"], columns=["Delta_target"])

df["Step_size"] = pd.Series(
    [
        longer_trace.sample_stats["step_size"].mean().item(),
        fit_cp85.sample_stats["step_size"].mean().item(),
        fit_cp90.sample_stats["step_size"].mean().item(),
        fit_cp95.sample_stats["step_size"].mean().item(),
        fit_cp99.sample_stats["step_size"].mean().item(),
    ],
)

df["Divergences"] = pd.Series(
    [
        longer_trace.sample_stats["diverging"].sum().item(),
        fit_cp85.sample_stats["diverging"].sum().item(),
        fit_cp90.sample_stats["diverging"].sum().item(),
        fit_cp95.sample_stats["diverging"].sum().item(),
        fit_cp99.sample_stats["diverging"].sum().item(),
    ]
)
df
Delta_target Step_size Divergences
0 .80 0.280170 137
1 .85 0.200222 75
2 .90 0.125180 191
3 .95 0.109921 150
4 .99 0.031289 707

Here, the number of divergent tends to be smaller when delta increases. This behavior has a nice geometric intuition. The more we decrease the step size the more the Hamiltonian Markov chain can explore the neck of the funnel. Consequently, the marginal posterior distribution for \(log (\tau)\) stretches further and further towards negative values with the decreasing step size.

However, the Hamiltonian transition is still not geometrically ergodic with respect to the centered implementation of the Eight Schools model. Indeed, this is expected given the observed bias.

longer_trace.posterior["log_tau"] = np.log(longer_trace.posterior["tau"])
fit_cp80.posterior["log_tau"] = np.log(fit_cp80.posterior["tau"])
fit_cp85.posterior["log_tau"] = np.log(fit_cp85.posterior["tau"])
fit_cp90.posterior["log_tau"] = np.log(fit_cp90.posterior["tau"])
fit_cp95.posterior["log_tau"] = np.log(fit_cp95.posterior["tau"])
fit_cp99.posterior["log_tau"] = np.log(fit_cp99.posterior["tau"])
pc = az.plot_dist(
    {"fit_cp99": fit_cp99, "longer_trace": longer_trace},
    var_names=["log_tau"],
)
pc.add_legend("model");
_, ax = plt.subplots(figsize=(10, 4))

for target_accept, model in zip((0.8, 0.9, 0.99), (fit_cp85, fit_cp90, fit_cp99)):
    logtau = az.extract(model, var_names=["log_tau"])
    mlogtau = logtau.cumulative("sample").mean()
    ax.plot(mlogtau, lw=2.5, label=f"target_accept={target_accept:.2f}")

ax.axhline(0.7657852, color="gray")
ax.set(xlabel="Iteration", ylabel="MCMC mean of log(tau)", title="MCMC estimation of log(tau)")
ax.legend();

A Non-Centered Eight Schools Implementation#

Although reducing the step size improves exploration, ultimately it only reveals the true extent the pathology in the centered implementation. Fortunately, there is another way to implement hierarchical models that does not suffer from the same pathologies.

In a non-centered parameterization we do not try to fit the group-level parameters directly, rather we fit a latent Gaussian variable from which we can recover the group-level parameters with a scaling and a translation.

\[\mu \sim \mathcal{N}(0, 5)\]
\[\tau \sim \text{Half-Cauchy}(0, 5)\]
\[\tilde{\theta}_{n} \sim \mathcal{N}(0, 1)\]
\[\theta_{n} = \mu + \tau \cdot \tilde{\theta}_{n}.\]
with pm.Model() as NonCentered_eight:
    mu = pm.Normal("mu", mu=0, sigma=5)
    tau = pm.HalfCauchy("tau", beta=5)
    theta_tilde = pm.Normal("theta_t", mu=0, sigma=1, shape=J)
    theta = pm.Deterministic("theta", mu + tau * theta_tilde)
    obs = pm.Normal("obs", mu=theta, sigma=sigma, observed=y)
with NonCentered_eight:
    fit_ncp80 = pm.sample(random_seed=SEED)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, tau, theta_t]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 9 seconds.
There were 6 divergences after tuning. Increase `target_accept` or reparameterize.

We no longer see warnings about \(\hat R\) or the effective sample size, still we get a few divergences. These infrequent divergences do not seem concentrate anywhere in parameter space, which is indicative of the divergences being false positives.

fit_ncp80.posterior["log_tau"] = np.log(fit_ncp80.posterior["tau"])

az.plot_pair(
    fit_ncp80,
    var_names=["theta", "log_tau"],
    coords={"theta_dim_0": 0},
    visuals={"divergence": True},
);

As expected of false positives, we can remove the divergences entirely by decreasing the step size.

with NonCentered_eight:
    fit_ncp90 = pm.sample(random_seed=SEED, target_accept=0.90)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, tau, theta_t]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 8 seconds.

The more agreeable geometry of the non-centered implementation allows the Markov chain to explore deep into the neck of the funnel, capturing even the smallest values of tau (\(\tau\)) that are consistent with the measurements. Consequently, MCMC estimators from the non-centered chain rapidly converge towards their true expectation values.

pc = az.plot_dist(
    {"fit_ncp80": fit_ncp80, "fit_cp99": fit_cp99, "fit_cp90": fit_cp90},
    var_names=["log_tau"],
    visuals={"point_estimate_text": False},
)
pc.add_legend("model");

Authors#

%load_ext watermark
%watermark -n -u -v -iv -w
Last updated: Fri, 01 May 2026

Python implementation: CPython
Python version       : 3.14.4
IPython version      : 9.12.0

arviz     : 1.1.0
matplotlib: 3.10.8
numpy     : 2.4.4
pandas    : 3.0.2
pymc      : 5.28.0+58.gf58491a3

Watermark: 2.6.0

License notice#

All the notebooks in this example gallery are provided under the MIT License which allows modification, and redistribution for any use provided the copyright and license notices are preserved.

Citing PyMC examples#

To cite this notebook, use the DOI provided by Zenodo for the pymc-examples repository.

Important

Many notebooks are adapted from other sources: blogs, books… In such cases you should cite the original source as well.

Also remember to cite the relevant libraries used by your code.

Here is an citation template in bibtex:

@incollection{citekey,
  author    = "<notebook authors, see above>",
  title     = "<notebook title>",
  editor    = "PyMC Team",
  booktitle = "PyMC examples",
  doi       = "10.5281/zenodo.5654871"
}

which once rendered could look like:

Agustina Arroyuelo . "Diagnosing Biased Inference with Divergences". In: PyMC Examples. Ed. by PyMC Team. DOI: 10.5281/zenodo.5654871