Revisiting Bangladesh Fertility#
utils.draw_causal_graph(
    edge_list=[
        ("A", "C"),
        ("K", "C"),
        ("U", "C"),
        ("D", "C"),
        ("D", "U"),
        ("U", "K"),
        ("A", "K"),
        ("D", "K"),
    ]
)
Estimand 1: Contraceptive use \(C\) in each District \(D\)
Estimand 2: Effect of “urbanity” \(U\)
Estimand 3: Effect of # of kids \(K\) and age \(A\)
Districts are CLUSTERS#
Rural / Urban breakdown#
Below we show the contraceptive use for rural (only including a district-level intercept \(\alpha_{D[i]}\)) vs urban (including an additional intercept \(\beta_{D[i]}\) for urban districts) groups. These plots were generated in Lecture 13 - Multilevel Adventures.
utils.display_image("fertility_posterior_means_rural_urban.png", width=1000)

The plot shows:
Rural populations are, on average, less likely to use contraception than urban populations
the dashed line in the top plot is lower than that in the lower plot
There are fewer urban samples, so there is a larger range of uncertainty for urban areas
the blue error bars associated with urban populations are larger than those for rural areas, where more samples were collected
🤔#
If we know about the contraceptive use in rural areas, we can make a better guess about urban contraceptive use because of this correlation (see the sketch below)
We are leaving information on the table by not letting the Golem (i.e., our statistical model) leverage this feature correlation
This lecture focuses on building statistical models that can capture correlation information amongst features.
makes inference more efficient by using partial pooling across features
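To make that intuition concrete, here is a small sketch that is not part of the lecture: conditioning a bivariate Normal on one of its components. The 0.7 correlation, the prior means, and the observed rural value are made-up numbers purely for illustration.

import numpy as np

# Hypothetical joint prior over a district's (rural, urban) effects on the logit scale.
rho = 0.7                               # illustrative correlation, not an estimate
mu = np.array([-0.5, 0.5])              # illustrative prior means: rural, urban
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])          # unit variances for simplicity

# Condition the urban effect on an observed rural effect.
rural_obs = -1.5
cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (rural_obs - mu[0])
cond_sd = np.sqrt(Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0])

print(f"urban effect, unconditional: mean={mu[1]:.2f}, sd={np.sqrt(Sigma[1, 1]):.2f}")
print(f"urban effect | rural={rural_obs}: mean={cond_mean:.2f}, sd={cond_sd:.2f}")

The conditional standard deviation is smaller than the marginal one: knowing the rural effect genuinely sharpens our guess about the urban effect, and that is the information a correlated model gets to use.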
Estimating Correlation and Partial Pooling Demo#
McElreath goes on to show a demo of Bayesian updating in a model with correlated features. It’s great, and I recommend going over the demo a few times, but I’m too lazy to implement it (and not clever enough to do so without an animation, which I’m trying to avoid). It’ll go on the TODO list.
BONUS: Non-centered (aka Transformed) Priors#
Inconvenient Posteriors#
Inefficient MCMC caused by steep curvature in the neg-log-likelihood
Hamiltonian Monte Carlo has trouble exploring steep surface
leading to “divergent” transitions
“punching through wall of skate park”
detected when the Hamiltonian changes value dramatically between start/end of proposal
Transforming the prior to be a “smoother skate park” helps
Example: Devil’s Funnel prior#
As \(\sigma_v\) increases, a nasty trough forms in the prior probability surface that is difficult for Hamiltonian dynamics to sample; i.e., the skate park is too steep and narrow.
I’m too lazy to code up the fancy HMC animation that McElreath uses to visualize each divergent path. However, we can still verify that the number of divergences increases when sampling the devil’s funnel as we increase the \(v\) prior’s \(\sigma_v\).
Centered-prior models#
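Written out, the centered model sampled in the cell below is

$$
\begin{align*}
v &\sim \text{Normal}(0, \sigma_v) \\
x &\sim \text{Normal}(0, e^v)
\end{align*}
$$

where the standard deviation of \(x\) depends on \(v\): small values of \(v\) squeeze \(x\) into the narrow neck of the funnel.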
from functools import partial

xs = vs = np.linspace(-4, 4, 100)


def prior_devils_funnel(xs, vs, sigma_v=0.5):
    log_prior_v = stats.norm(0, sigma_v).logpdf(vs)
    log_prior_x = stats.norm(0, np.exp(vs)).logpdf(xs)
    log_prior = log_prior_v + log_prior_x
    return np.exp(log_prior - log_prior.max())


# Loop over sigma_v, and show that divergences increase
# as the depth and narrowness of the trough increases
# for low values of v
sigma_vs = [0.5, 1, 3]
n_sigma_vs = len(sigma_vs)
fig, axs = plt.subplots(3, 1, figsize=(4, 4 * n_sigma_vs))
for ii, sigma_v in enumerate(sigma_vs):
    with pm.Model() as centered_devils_funnel:
        v = pm.Normal("v", 0, sigma_v)
        x = pm.Normal("x", 0, pm.math.exp(v))
        centered_devils_funnel_inference = pm.sample()
    n_divergences = centered_devils_funnel_inference.sample_stats.diverging.sum().values

    prior = partial(prior_devils_funnel, sigma_v=sigma_v)
    plt.sca(axs[ii])
    utils.plot_2d_function(xs, vs, prior, colors="gray", ax=axs[ii])
    plt.xlabel("x")
    plt.ylabel("v")
    plt.title(f"$v \\sim Normal(0, {sigma_v})$\n# Divergences: {n_divergences}")
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [v, x]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
There were 9 divergences after tuning. Increase `target_accept` or reparameterize.
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [v, x]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
There were 405 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters. A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [v, x]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
There were 622 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters. A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

What to do?#
Smaller step size: handles steepness better, but takes longer to explore the posterior (see the sketch below)
Re-parameterize (transform) to make the surface smoother
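As a minimal sketch of the first option (not run in this notebook): PyMC controls the step size indirectly through target_accept, and raising it toward 1 forces NUTS to take smaller steps. The 0.99 value here is an arbitrary illustrative choice.

with pm.Model() as small_step_devils_funnel:
    v = pm.Normal("v", 0, 3)  # the steepest funnel from above
    x = pm.Normal("x", 0, pm.math.exp(v))
    # Raising target_accept shrinks the step size, which typically reduces
    # (but may not eliminate) divergences, at the cost of slower sampling.
    small_step_inference = pm.sample(target_accept=0.99)

The second option, reparameterization, is what the remainder of the lecture pursues.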
Non-centered prior#
We add an auxiliary variable \(z\) that has a smooth probability surface. We then sample that auxiliary variable and transform it to obtain the target variable’s distribution. For the devil’s funnel prior, the transformed model (matching the code below) is:
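$$
\begin{align*}
v &\sim \text{Normal}(0, \sigma_v) \\
z &\sim \text{Normal}(0, 1) \\
x &= z \, e^v
\end{align*}
$$

Sampling happens in the smooth \((z, v)\) space, and \(x\) is recovered deterministically afterward, which is what pm.Deterministic does in the code below.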
sigma_vs = [0.5, 1, 3]
n_sigma_vs = len(sigma_vs)
fig, axs = plt.subplots(3, 1, figsize=(4, 4 * n_sigma_vs))


def prior_noncentered_devils_funnel(xs, vs, sigma_v=0.5):
    """Reparameterize into a smoother auxiliary variable space"""
    log_prior_v = stats.norm(0, sigma_v).logpdf(vs)
    log_prior_z = stats.norm(0, 1).logpdf(xs)
    log_prior = log_prior_v + log_prior_z
    return np.exp(log_prior - log_prior.max())


for ii, sigma_v in enumerate(sigma_vs):
    with pm.Model() as noncentered_devils_funnel:
        v = pm.Normal("v", 0, sigma_v)
        z = pm.Normal("z", 0, 1)

        # Record x for reporting
        x = pm.Deterministic("x", z * pm.math.exp(v))
        noncentered_devils_funnel_inference = pm.sample()
    n_divergences = noncentered_devils_funnel_inference.sample_stats.diverging.sum().values

    prior = partial(prior_noncentered_devils_funnel, sigma_v=sigma_v)
    plt.sca(axs[ii])
    utils.plot_2d_function(xs, vs, prior, colors="gray", ax=axs[ii])
    plt.xlabel("z")
    plt.ylabel("v")
    plt.title(f"$v \\sim$ Normal(0, {sigma_v})\n# Divergences: {n_divergences}")
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [v, z]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [v, z]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [v, z]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.

By reparameterizing, we get to sample independent Normal distributions, which are smooth parabolas in log space.
We can see that the number of divergences is consistently reduced for all values of the prior variance.
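To make “smooth parabolas in log space” concrete, compare the two log priors (up to constants):

$$
\begin{align*}
\text{centered:} \quad \log p(x, v) &= -\frac{x^2}{2e^{2v}} - v - \frac{v^2}{2\sigma_v^2} + \text{const} \\
\text{non-centered:} \quad \log p(z, v) &= -\frac{z^2}{2} - \frac{v^2}{2\sigma_v^2} + \text{const}
\end{align*}
$$

In the centered form the curvature in \(x\) scales with \(e^{-2v}\), which is the funnel; in the non-centered form the two quadratic terms are decoupled, so HMC sees a well-behaved bowl.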
Check HMC Diagnostics#
az.summary(noncentered_devils_funnel_inference)["r_hat"]
v 1.0
x 1.0
z 1.0
Name: r_hat, dtype: float64
az.plot_trace(noncentered_devils_funnel_inference);

Diagnostics look good 👍
Rhats = 1
\(v\) and \(z\) chains exhibit “fuzzy caterpillar” behavior
Cholesky Factors & Correlation Matrices#
The Cholesky factor \(\bf L\) provides an efficient means of encoding a correlation matrix \(\Omega\) (it requires fewer floating point numbers than the full correlation matrix)
\(\Omega = \bf LL^T\)
\(\bf L\) is a lower-triangular matrix
we can sample data with correlation \(\Omega\) and standard deviations \(\sigma_i\), using the Cholesky factorization \(\bf L\) of \(\Omega\), as follows:

$$
\mathbf{X} = \text{diag}(\sigma) \, \mathbf{L} \mathbf{Z}
$$

where \(\bf Z\) is a matrix of z-scores sampled from a standard normal distribution
Demonstrate reconstructing correlation matrix from Cholesky factor#
orig_corr = np.array([[1, 0.6], [0.6, 1]])
print("Ground-truth Correlation Matrix\n", orig_corr)
L = np.linalg.cholesky(orig_corr)
print("\nCholesky Matrix of correlation matrix, L\n", L)
print("\nReconstructed correlation matrix from L\n", L @ L.T)
Ground-truth Correlation Matrix
[[1. 0.6]
[0.6 1. ]]
Cholesky Matrix of correlation matrix, L
[[1. 0. ]
[0.6 0.8]]
Reconstructed correlation matrix from L
[[1. 0.6]
[0.6 1. ]]
Demonstrate sampling correlated data from z-scores and the Cholesky factor#
np.random.seed(12345)
N = 100_000
orig_sigmas = [2, 0.5]
# Matrix of z-score samples
Z = stats.norm().rvs(size=(2, N))
print("Raw z-score samples are indpendent\n", np.corrcoef(Z[0], Z[1]))
# Transform Z-scores using the Cholesky factorization of the correlation matrix
sLZ = np.diag(orig_sigmas) @ L @ Z
corr_sLZ = np.corrcoef(sLZ[0], sLZ[1])
print("\nCholesky-transformed z-scores have original correlation encoded\n", corr_sLZ)
assert np.isclose(orig_corr[0, 1], corr_sLZ[0, 1], atol=0.01)
std_sLZ = sLZ.std(axis=1)
print("\nOriginal std devs also encoded in transformed z-scores:\n", std_sLZ)
assert np.isclose(orig_sigmas[0], std_sLZ[0], atol=0.01)
assert np.isclose(orig_sigmas[1], std_sLZ[1], atol=0.01)
Raw z-score samples are independent
[[1. 0.00293945]
[0.00293945 1. ]]
Cholesky-transformed z-scores have original correlation encoded
[[1. 0.60268456]
[0.60268456 1. ]]
Original std devs also encoded in transformed z-scores:
[2.00212936 0.50024976]
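In practice we usually don’t build \(\bf L\) by hand inside a model: PyMC ships an LKJ prior over Cholesky factors, pm.LKJCholeskyCov, which can be combined with z-scores exactly as in the numpy demo above. The following is a hedged sketch, not part of the lecture code; the eta value, the Exponential sd_dist, and the number of districts are illustrative choices.

N_DISTRICTS = 60  # illustrative number of clusters

with pm.Model() as correlated_effects_sketch:
    # LKJ prior over the Cholesky factor of a 2x2 covariance matrix
    # (e.g., rural and urban effects per district). The returned `chol`
    # already folds in the standard deviations drawn from sd_dist.
    chol, corr, stds = pm.LKJCholeskyCov(
        "chol_cov", n=2, eta=2.0, sd_dist=pm.Exponential.dist(1.0), compute_corr=True
    )

    # Non-centered construction: standard-normal z-scores...
    z = pm.Normal("z", 0.0, 1.0, shape=(2, N_DISTRICTS))

    # ...transformed by the Cholesky factor, mirroring diag(sigma) L Z above.
    effects = pm.Deterministic("effects", pm.math.dot(chol, z))

This is the multivariate analogue of the \(x = z e^v\) trick: sample in a smooth standard-normal space, then transform.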
When to use Transformed Priors#
depends on context (a sketch of both forms follows this list)
Centered:
Lots of data in each cluster/subgroup
likelihood-dominant scenario
Non-centered:
Many clusters/subgroups, with sparse data in some of the clusters
prior-dominant scenario
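As a rough illustration of the two forms for a varying intercept over clusters (the names, priors, and cluster count here are placeholders, not the fertility model):

N_CLUSTERS = 30  # illustrative

# Centered form: fine when each cluster has plenty of data.
with pm.Model() as centered_intercepts:
    a_bar = pm.Normal("a_bar", 0.0, 1.0)
    sigma_a = pm.Exponential("sigma_a", 1.0)
    alpha = pm.Normal("alpha", a_bar, sigma_a, shape=N_CLUSTERS)

# Non-centered form: usually samples better when many clusters are data-sparse.
with pm.Model() as noncentered_intercepts:
    a_bar = pm.Normal("a_bar", 0.0, 1.0)
    sigma_a = pm.Exponential("sigma_a", 1.0)
    z = pm.Normal("z", 0.0, 1.0, shape=N_CLUSTERS)
    alpha = pm.Deterministic("alpha", a_bar + z * sigma_a)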
License notice#
All the notebooks in this example gallery are provided under the MIT License, which allows modification and redistribution for any use provided the copyright and license notices are preserved.
Citing PyMC examples#
To cite this notebook, use the DOI provided by Zenodo for the pymc-examples repository.
Important
Many notebooks are adapted from other sources: blogs, books… In such cases you should cite the original source as well.
Also remember to cite the relevant libraries used by your code.
Here is a citation template in bibtex:
@incollection{citekey,
  author = "<notebook authors, see above>",
  title = "<notebook title>",
  editor = "PyMC Team",
  booktitle = "PyMC examples",
  doi = "10.5281/zenodo.5654871"
}
which once rendered could look like: