Dependent density regression#

In another example, we showed how to use Dirichlet processes to perform Bayesian nonparametric density estimation. This example expands on the previous one, illustrating dependent density regression.

Just as Dirichlet process mixtures can be thought of as infinite mixture models that select the number of active components as part of inference, dependent density regression can be thought of as infinite mixtures of experts that select the active experts as part of inference. Their flexibility and modularity make them powerful tools for performing nonparametric Bayesian Data analysis.

from io import StringIO

import arviz as az
import numpy as np
import pandas as pd
import pymc as pm
import pytensor.tensor as pt
import requests
import seaborn as sns

from matplotlib import animation as ani
from matplotlib import pyplot as plt

print(f"Running on PyMC v{pm.__version__}")
Running on PyMC v5.16.2
%config InlineBackend.figure_format = 'retina'
plt.rc("animation", writer="ffmpeg")
blue, *_ = sns.color_palette()
az.style.use("arviz-darkgrid")
SEED = 1972917  # from random.org; for reproducibility
np.random.seed(SEED)

We will use the LIDAR data set from Larry Wasserman’s excellent book, All of Nonparametric Statistics. We standardize the data set to improve the rate of convergence of our samples.

DATA_URI = "http://www.stat.cmu.edu/~larry/all-of-nonpar/=data/lidar.dat"


def standardize(x):
    return (x - x.mean()) / x.std()


response = requests.get(DATA_URI, verify=False)
df = pd.read_csv(StringIO(response.text), sep=r"\s{1,3}", engine="python").assign(
    std_range=lambda df: standardize(df.range), std_logratio=lambda df: standardize(df.logratio)
)
/var/home/fonnesbeck/repos/pymc-examples/.pixi/envs/default/lib/python3.12/site-packages/urllib3/connectionpool.py:1099: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.stat.cmu.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings
  warnings.warn(
df.head()
range logratio std_range std_logratio
0 390 -0.050356 -1.717725 0.852467
1 391 -0.060097 -1.707299 0.817981
2 393 -0.041901 -1.686447 0.882398
3 394 -0.050985 -1.676020 0.850240
4 396 -0.059913 -1.655168 0.818631

We plot the LIDAR data below.

fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(df.std_range, df.std_logratio, color=blue)

ax.set_xticklabels([])
ax.set_xlabel("Standardized range")

ax.set_yticklabels([])
ax.set_ylabel("Standardized log ratio");

This data set has a two interesting properties that make it useful for illustrating dependent density regression.

  1. The relationship between range and log ratio is nonlinear, but has locally linear components.

  2. The observation noise is heteroskedastic; that is, the magnitude of the variance varies with the range.

The intuitive idea behind dependent density regression is to reduce the problem to many (related) density estimates, conditioned on fixed values of the predictors. The following animation illustrates this intuition.

fig, (scatter_ax, hist_ax) = plt.subplots(ncols=2, figsize=(16, 6))

scatter_ax.scatter(df.std_range, df.std_logratio, color=blue, zorder=2)

scatter_ax.set_xticklabels([])
scatter_ax.set_xlabel("Standardized range")

scatter_ax.set_yticklabels([])
scatter_ax.set_ylabel("Standardized log ratio")

bins = np.linspace(df.std_range.min(), df.std_range.max(), 25)

hist_ax.hist(df.std_logratio, bins=bins, color="k", lw=0, alpha=0.25, label="All data")

hist_ax.set_xticklabels([])
hist_ax.set_xlabel("Standardized log ratio")

hist_ax.set_yticklabels([])
hist_ax.set_ylabel("Frequency")

hist_ax.legend(loc=2)

endpoints = np.linspace(1.05 * df.std_range.min(), 1.05 * df.std_range.max(), 15)

frame_artists = []

for low, high in zip(endpoints[:-1], endpoints[2:]):
    interval = scatter_ax.axvspan(low, high, color="k", alpha=0.5, lw=0, zorder=1)
    *_, bars = hist_ax.hist(
        df[df.std_range.between(low, high)].std_logratio, bins=bins, color="k", lw=0, alpha=0.5
    )

    frame_artists.append((interval,) + tuple(bars))

animation = ani.ArtistAnimation(fig, frame_artists, interval=500, repeat_delay=3000, blit=True)
plt.close()
# prevent the intermediate figure from showing
from IPython.display import HTML

HTML(animation.to_html5_video())

As we slice the data with a window sliding along the x-axis in the left plot, the empirical distribution of the y-values of the points in the window varies in the right plot. An important aspect of this approach is that the density estimates that correspond to close values of the predictor are similar.

In the previous example, we saw that a Dirichlet process estimates a probability density as a mixture model with infinitely many components. In the case of normal component distributions,

\[y \sim \sum_{i = 1}^{\infty} w_i \cdot N(\mu_i, \tau_i^{-1}),\]

where the mixture weights, \(w_1, w_2, \ldots\), are generated by a stick-breaking process.

Dependent density regression generalizes this representation of the Dirichlet process mixture model by allowing the mixture weights and component means to vary conditioned on the value of the predictor, \(x\). That is,

\[y\ |\ x \sim \sum_{i = 1}^{\infty} w_i\ |\ x \cdot N(\mu_i\ |\ x, \tau_i^{-1}).\]

In this example, we will follow Chapter 23 of Bayesian Data Analysis and use a probit stick-breaking process to determine the conditional mixture weights, \(w_i\ |\ x\). The probit stick-breaking process starts by defining

\[v_i\ |\ x = \Phi(\alpha_i + \beta_i x),\]

where \(\Phi\) is the cumulative distribution function of the standard normal distribution. We then obtain \(w_i\ |\ x\) by applying the stick breaking process to \(v_i\ |\ x\). That is,

\[w_i\ |\ x = v_i\ |\ x \cdot \prod_{j = 1}^{i - 1} (1 - v_j\ |\ x).\]

For the LIDAR data set, we use independent normal priors \(\alpha_i \sim N(0, 5^2)\) and \(\beta_i \sim N(0, 5^2)\). We now express this model for the conditional mixture weights using PyMC.

def norm_cdf(z):
    return 0.5 * (1 + pt.erf(z / np.sqrt(2)))


def stick_breaking(v):
    return v * pt.concatenate(
        [pt.ones_like(v[:, :1]), pt.extra_ops.cumprod(1 - v[:, :-1], axis=1)], axis=1
    )
N = len(df)
K = 20

std_range = df.std_range.values
std_logratio = df.std_logratio.values

with pm.Model(coords={"N": np.arange(N), "K": np.arange(K) + 1}) as model:
    alpha = pm.Normal("alpha", 0, 5, dims="K")
    beta = pm.Normal("beta", 0, 5, dims="K")
    x = pm.Data("x", std_range, dims="N")
    v = norm_cdf(alpha + pt.outer(x, beta))
    w = pm.Deterministic("w", stick_breaking(v), dims=["N", "K"])

We have defined x as a pm.Data container in order to use PyMC’s posterior prediction capabilities later.

While the dependent density regression model theoretically has infinitely many components, we must truncate the model to finitely many components (in this case, twenty) in order to express it using PyMC. After sampling from the model, we will verify that truncation did not unduly influence our results.

Since the LIDAR data seems to have several linear components, we use the linear models

\[\begin{split} \begin{align*} \mu_i\ |\ x & \sim \gamma_i + \delta_i x \\ \gamma_i & \sim N(0, 10^2) \\ \delta_i & \sim N(0, 10^2) \end{align*} \end{split}\]

for the conditional component means.

with model:
    gamma = pm.Normal("gamma", 0, 3, dims="K")
    delta = pm.Normal("delta", 0, 3, dims="K")
    mu = pm.Deterministic("mu", gamma + pt.outer(x, delta), dims=("N", "K"))

Finally, we specify a NormalMixture likelihood function, using the weights we have modeled above.

with model:
    sigma = pm.HalfNormal("sigma", 3, dims="K")
    y = pm.Data("y", std_logratio, dims="N")
    obs = pm.NormalMixture("obs", w, mu, sigma=sigma, observed=y, dims="N")

pm.model_to_graphviz(model)
../_images/e739d465a1205139513957d80513bea0e672a30ef4fac4ffe2330a71d8f082cf.svg

We now sample from the dependent density regression model using a Metropolis sampler. The default NUTS sampler has a difficult time sampling the stick-breaking model, so we will employ a CompoundSampler, using a slice sampler for alpha and beta while leaving NUTS for the rest of the parameters.

with model:
    trace = pm.sample(random_seed=SEED, step=pm.Slice([alpha, beta]), tune=5_000, cores=2)
Multiprocess sampling (2 chains in 2 jobs)
CompoundStep
>CompoundStep
>>Slice: [alpha]
>>Slice: [beta]
>NUTS: [gamma, delta, sigma]

Sampling 2 chains for 5_000 tune and 1_000 draw iterations (10_000 + 2_000 draws total) took 299 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

We can see from the R-hat diagnostics below (all near 1.0) that the model is reasonably well converged.

az.summary(trace, var_names=["beta"])
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
beta[1] 6.657 2.228 3.317 10.882 0.196 0.139 143.0 375.0 1.04
beta[2] 9.232 2.502 5.060 14.044 0.068 0.050 1462.0 1403.0 1.00
beta[3] -2.719 3.672 -10.514 3.325 0.113 0.080 1061.0 1204.0 1.00
beta[4] 0.070 5.044 -9.227 10.007 0.111 0.108 2050.0 1481.0 1.00
beta[5] -0.032 4.961 -9.759 8.799 0.119 0.113 1725.0 1400.0 1.00
beta[6] -0.092 5.142 -9.506 9.342 0.117 0.136 1942.0 1313.0 1.00
beta[7] -0.145 4.993 -8.767 9.865 0.113 0.104 1951.0 1416.0 1.00
beta[8] -0.120 4.882 -8.992 8.987 0.108 0.100 2072.0 1409.0 1.00
beta[9] 0.190 5.001 -9.486 9.327 0.114 0.121 1923.0 1244.0 1.00
beta[10] 0.026 4.926 -9.552 8.820 0.110 0.112 2022.0 1635.0 1.00
beta[11] -0.057 4.936 -9.251 9.237 0.120 0.112 1707.0 1460.0 1.00
beta[12] -0.044 5.095 -9.652 9.425 0.124 0.115 1709.0 1399.0 1.00
beta[13] -0.046 4.874 -8.849 9.447 0.105 0.110 2141.0 1500.0 1.00
beta[14] 0.016 4.910 -9.891 8.236 0.140 0.111 1245.0 1303.0 1.00
beta[15] -0.117 4.847 -8.594 9.390 0.112 0.108 1876.0 1510.0 1.00
beta[16] -0.027 5.003 -8.753 9.791 0.116 0.116 1863.0 1496.0 1.00
beta[17] -0.169 5.087 -9.898 8.967 0.119 0.118 1828.0 1523.0 1.00
beta[18] 0.225 4.990 -9.447 9.367 0.112 0.111 1986.0 1425.0 1.00
beta[19] -0.007 4.885 -9.784 8.684 0.110 0.111 1966.0 1533.0 1.00
beta[20] -0.013 4.912 -9.668 8.658 0.111 0.110 1956.0 1533.0 1.00

To verify that truncation did not unduly influence our results, we plot the largest posterior expected mixture weight for each component. (In this model, each point has a mixture weight for each component, so we plot the maximum mixture weight for each component across all data points in order to judge if the component exerts any influence on the posterior.)

fig, ax = plt.subplots(figsize=(8, 6))

max_mixture_weights = trace.posterior["w"].mean(("chain", "draw")).max("N")
ax.bar(max_mixture_weights.coords.to_index(), max_mixture_weights)

ax.set_xlim(1 - 0.5, K + 0.5)
ax.set_xticks(np.arange(0, K, 2) + 1)
ax.set_xlabel("Mixture component")

ax.set_ylabel("Largest posterior expected\nmixture weight");

Since only six mixture components have appreciable posterior expected weight for any data point, we can be fairly certain that truncation did not unduly influence our results. (If most components had appreciable posterior expected weight, truncation may have influenced the results, and we would have increased the number of components and sampled again.)

Visually, it is reasonable that the LIDAR data has three linear components, so these posterior expected weights seem to have identified the structure of the data well. We now sample from the posterior predictive distribution to get a better understand the model’s performance.

lidar_pp_x = np.linspace(std_range.min() - 0.05, std_range.max() + 0.05, 100)

with model:
    pm.set_data(
        {"x": lidar_pp_x, "y": np.zeros_like(lidar_pp_x)}, coords={"N": np.arange(len(lidar_pp_x))}
    )

    pm.sample_posterior_predictive(
        trace, predictions=True, extend_inferencedata=True, random_seed=SEED
    )
Sampling: [obs]

Below we plot the posterior expected value and the 95% posterior credible interval.

fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(df.std_range, df.std_logratio, color=blue, zorder=10, label=None)

low, high = np.percentile(az.extract(trace.predictions)["obs"].T, [2.5, 97.5], axis=0)
ax.fill_between(
    lidar_pp_x, low, high, color="k", alpha=0.35, zorder=5, label="95% posterior credible interval"
)

ax.plot(
    lidar_pp_x,
    trace.predictions["obs"].mean(("chain", "draw")).values,
    c="k",
    zorder=6,
    label="Posterior expected value",
)

ax.set_xticklabels([])
ax.set_xlabel("Standardized range")

ax.set_yticklabels([])
ax.set_ylabel("Standardized log ratio")

ax.legend(loc=1)
ax.set_title("LIDAR Data");

The model has fit the linear components of the data well, and also accommodated its heteroskedasticity. This flexibility, along with the ability to modularly specify the conditional mixture weights and conditional component densities, makes dependent density regression an extremely useful nonparametric Bayesian model.

To learn more about dependent density regression and related models, consult Bayesian Data Analysis, Bayesian Nonparametric Data Analysis, or Bayesian Nonparametrics.

This example first appeared here.

Authors#

  • authored by Austin Rochford in 2017

  • updated to PyMC v5 by Christopher Fonnesbeck in September 2024

References#

%load_ext watermark
%watermark -n -u -v -iv -w
Last updated: Mon Sep 30 2024

Python implementation: CPython
Python version       : 3.12.5
IPython version      : 8.27.0

seaborn   : 0.13.2
IPython   : 8.27.0
requests  : 2.32.3
matplotlib: 3.9.2
pymc      : 5.16.2
numpy     : 1.26.4
arviz     : 0.19.0
pandas    : 2.2.2
pytensor  : 2.25.4

Watermark: 2.5.0
Austin Rochford . "Dependent density regression". In: PyMC Examples. Ed. by PyMC Team. DOI: 10.5281/zenodo.5654871