Gaussian Processes#
This notebook is part of the PyMC port of the Statistical Rethinking 2023 lecture series by Richard McElreath.
Video - Lecture 16 - Gaussian Processes#
# Ignore warnings
import warnings
import arviz as az
import numpy as np
import pandas as pd
import pymc as pm
import statsmodels.formula.api as smf
import utils as utils
import xarray as xr
from matplotlib import pyplot as plt
from matplotlib import style
from scipy import stats as stats
warnings.filterwarnings("ignore")
# Set matplotlib style
STYLE = "statistical-rethinking-2023.mplstyle"
style.use(STYLE)
Returning to Oceanic Technology Dataset#
Now that we have some more advanced statistical tools, we’ll deal with some of the confounds we glossed over earlier.
KLINE = utils.load_data("Kline2")
KLINE
  | culture | population | contact | total_tools | mean_TU | lat | lon | lon2 | logpop |
---|---|---|---|---|---|---|---|---|---|
0 | Malekula | 1100 | low | 13 | 3.2 | -16.3 | 167.5 | -12.5 | 7.003065 |
1 | Tikopia | 1500 | low | 22 | 4.7 | -12.3 | 168.8 | -11.2 | 7.313220 |
2 | Santa Cruz | 3600 | low | 24 | 4.0 | -10.7 | 166.0 | -14.0 | 8.188689 |
3 | Yap | 4791 | high | 43 | 5.0 | 9.5 | 138.1 | -41.9 | 8.474494 |
4 | Lau Fiji | 7400 | high | 33 | 5.0 | -17.7 | 178.1 | -1.9 | 8.909235 |
5 | Trobriand | 8000 | high | 19 | 4.0 | -8.7 | 150.9 | -29.1 | 8.987197 |
6 | Chuuk | 9200 | high | 40 | 3.8 | 7.4 | 151.6 | -28.4 | 9.126959 |
7 | Manus | 13000 | low | 28 | 6.6 | -2.1 | 146.9 | -33.1 | 9.472705 |
8 | Tonga | 17500 | high | 55 | 5.4 | -21.2 | -175.2 | 4.8 | 9.769956 |
9 | Hawaii | 275000 | low | 71 | 6.6 | 19.9 | -155.6 | 24.4 | 12.524526 |
utils.draw_causal_graph(
edge_list=[("P", "T"), ("C", "T"), ("U", "T"), ("U", "P")],
node_props={"U": {"style": "dashed"}, "unobserved": {"style": "dashed"}},
)
Total tools in society \(T\) (outcome)
Population \(P\) (treatment)
Contact level \(C\)
Unobserved confounds \(U\), e.g.
materials
weather
There should be spatial covariation:
islands closer to one another should have similar unobserved confounds
islands should share historical context
we don’t have to know the exact dynamics of the contexts; we can handle them statistically at a macro scale
Review of Technological Innovation/Loss model#
Derived from steady state of difference equation \(\Delta T = \alpha P^ \beta - \gamma T\)
\(\alpha\) is rate of innovation
\(P\) is population
\(\beta\) is elasticity rate (diminishing returns)
\(\gamma\) is loss (e.g. forgetting) rate over time
Provides the expected number of tools over the long-run:
\(\hat T = \frac{\alpha P ^ \beta}{\gamma}\)
use \(\lambda= \hat T\) as the expected rate parameter for a \(\text{Poisson}(\lambda)\) distribution
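As a quick sanity check, we can iterate the difference equation and confirm it converges to \(\hat T\); the parameter values below are arbitrary, chosen purely for illustration.
# Iterate dT = alpha * P^beta - gamma * T and compare with the analytical steady state
# (parameter values are arbitrary; illustration only)
alpha_, beta_, gamma_, P_ = 0.05, 0.3, 0.1, 10_000
T_ = 0.0
for _ in range(500):
    T_ = T_ + alpha_ * P_**beta_ - gamma_ * T_
print(T_, alpha_ * P_**beta_ / gamma_)  # both ~7.9 expected tools in the long run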
How can we encode spatial covariance in this model?
Let’s start off with an easier model that ignores population#
\(\alpha_{S[i]}\) are the variable intercepts for each society
These will have covariation based on their spatial distance from one another
closer societies have similar offsets
\(\mathbf{K}\) the covariance amongst societies
models covariance as a function of distance
though \(\mathbf{K}\) has many parameters (45 covariances), we should be able to leverage spatial regularity to estimate far fewer effective parameters. This is good because we only have 10 data points 😅
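Concretely, the model we’re building toward is roughly
$$
\begin{align*}
T_i &\sim \text{Poisson}(\lambda_i) \\
\log \lambda_i &= \bar{\alpha} + \alpha_{S[i]} \\
\alpha &\sim \text{MVNormal}(0, \mathbf{K}) \\
\mathbf{K}_{ij} &= \eta^2 \exp(-\rho^2 d_{ij}^2)
\end{align*}
$$
where \(d_{ij}\) is the distance between societies \(i\) and \(j\). The kernel function that generates \(\mathbf{K}\) is the subject of the next section.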
Gaussian Processes (in the abstract)#
Uses a kernel function to generate covariance matrices
Requires far fewer parameters than the covariance matrix entries
Can generate covariance matrices of any dimension (i.e. an “infinite-dimensional generalization of the MVNormal”)
Can generate a prediction for any point
Kernel function calculates the covariance of two points based on metric of comparison, e.g.
spatial distance
difference in time
difference in age
Different Kernel Functions#
def plot_kernel_function(
kernel_function, max_distance=1, resolution=100, label=None, ax=None, **line_kwargs
):
"""Helper to plot a kernel function"""
X = np.linspace(0, max_distance, resolution)[:, None]
covariance = kernel_function(X, X)
distances = np.linspace(0, max_distance, resolution)
if ax is not None:
plt.sca(ax)
utils.plot_line(distances, covariance[0, :], label=label, **line_kwargs)
plt.xlim([0, max_distance])
plt.ylim([-0.05, 1.05])
plt.xlabel("|X1-X2|")
plt.ylabel("covariance")
if label is not None:
plt.legend()
The Quadratic (L2) Kernel Function#
aka RBF (Radial Basis Function)
aka Gaussian Kernel
def quadratic_distance_kernel(X0, X1, eta=1, sigma=0.5):
# Use linear algebra identity: ||x-y||^2 = ||x||^2 + ||y||^2 - 2 * x^T * y
X0_norm = np.sum(X0**2, axis=-1)
X1_norm = np.sum(X1**2, axis=-1)
squared_distances = X0_norm[:, None] + X1_norm[None, :] - 2 * X0 @ X1.T
rho = 1 / sigma**2
return eta**2 * np.exp(-rho * squared_distances)
plot_kernel_function(quadratic_distance_kernel)

Ornstein-Uhlenbeck Kernel#
def ornstein_uhlenbeck_kernel(X0, X1, eta_squared=1, rho=4):
distances = np.abs(X1[None, :] - X0[:, None])
return eta_squared * np.exp(-rho * distances)
plot_kernel_function(ornstein_uhlenbeck_kernel)

Periodic Kernel#
with additional periodicity parameter, \(p\)
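The lecture doesn’t implement the periodic kernel; for completeness, here is a minimal sketch of one common form (the exponentiated sine-squared kernel). The function name and default period are my own choices for illustration.
def periodic_kernel(X0, X1, eta=1, sigma=0.5, p=0.5):
    """Exponentiated sine-squared ('periodic') kernel with period p"""
    distances = np.abs(X0.ravel()[:, None] - X1.ravel()[None, :])
    return eta**2 * np.exp(-2 * np.sin(np.pi * distances / p) ** 2 / sigma**2)
plot_kernel_function(periodic_kernel)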
Gaussian Process Prior#
Using the Gaussian/L2 Kernel Function to generate our covariance \(\mathbf{K}\), we can adjust a few parameters
Varying \(\sigma\) controls the bandwidth of the kernel function
smaller \(\sigma\) makes covariance fall off quickly with space
larger \(\sigma\) allows covariance to extend over larger distances
Varying \(\eta\) controls the maximum degree of covariance
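For reference, the kernel implemented in quadratic_distance_kernel above is
$$
k(x_i, x_j) = \eta^2 \exp\left(-\frac{(x_i - x_j)^2}{\sigma^2}\right)
$$
so \(\sigma\) sets how quickly covariance decays with distance, and \(\eta\) sets its maximum value.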
1-D Examples#
Below we draw functions from a 1-D Gaussian process, varying either \(\sigma\) or \(\eta\) to demonstrate the effect of the parameters on the spatial correlation of the samples.
# Helper functions
def plot_gaussian_process(
X, samples=None, mean=None, cov=None, X_obs=None, Y_obs=None, uncertainty_prob=0.89
):
X = X.ravel()
    # Plot GP samples, if provided
    if samples is not None:
        for ii, sample in enumerate(samples):
            label = "GP samples" if not ii else None
            utils.plot_line(X, sample, color=f"C{ii}", linewidth=1, label=label)
# Add GP mean, if provided
if mean is not None:
mean = mean.ravel()
utils.plot_line(X, mean, color="k", label="GP mean")
# Add uncertainty around mean; requires covariance matrix
if cov is not None:
z = stats.norm.ppf(1 - (1 - uncertainty_prob) / 2)
uncertainty = z * np.sqrt(np.diag(cov))
plt.fill_between(
X,
mean + uncertainty,
mean - uncertainty,
alpha=0.1,
color="gray",
zorder=1,
label="GP uncertainty",
)
# Add any training data points, if provided
if X_obs is not None:
utils.plot_scatter(X_obs, Y_obs, color="k", label="observations", zorder=100, alpha=1)
plt.xlim([X.min(), X.max()])
plt.ylim([-5, 5])
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
def plot_gaussian_process_prior(kernel_function, n_samples=3, figsize=(10, 5), resolution=100):
X = np.linspace(-5, 5, resolution)[:, None]
prior = gaussian_process_prior(X, kernel_function)
samples = prior.rvs(n_samples)
_, axs = plt.subplots(1, 2, figsize=figsize)
plt.sca(axs[0])
plot_gaussian_process(X, samples=samples)
plt.sca(axs[1])
plot_kernel_function(kernel_function, color="k")
plt.title(f"kernel function")
return axs
def gaussian_process_prior(X_pred, kernel_function):
"""Initializes a Gaussian Process prior distribution for provided Kernel function"""
mean = np.zeros(X_pred.shape).ravel()
cov = kernel_function(X_pred, X_pred)
return stats.multivariate_normal(mean=mean, cov=cov, allow_singular=True)
Varying \(\sigma\)#
from functools import partial
eta = 1
for sigma in [0.25, 0.5, 1]:
kernel_function = partial(quadratic_distance_kernel, eta=eta, sigma=sigma)
axs = plot_gaussian_process_prior(kernel_function, n_samples=5)
axs[0].set_title(f"prior: $\\eta$={eta}; $\\sigma=${sigma}")



Varying \(\eta\)#
sigma = 0.5
for eta in [0.25, 0.75, 1]:
kernel_function = partial(quadratic_distance_kernel, eta=eta, sigma=sigma)
axs = plot_gaussian_process_prior(kernel_function, n_samples=5)
axs[0].set_title(f"prior: $\\eta$={eta}; $\\sigma=${sigma}")



Gaussian Process Posterior#
As the model observes data, it adjusts the posterior to account for that data while also incorporating the smoothness constraints of the prior.
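The helper below implements the standard Gaussian process regression formulas (assuming a zero prior mean): writing \(K = k(X, X) + \sigma_y^2 I\) for the noisy observation covariance, \(K_* = k(X, X_*)\) for the cross-covariance with the prediction grid, and \(K_{**} = k(X_*, X_*)\), the posterior over function values at \(X_*\) has
$$
\begin{align*}
\mu_* &= K_*^\top K^{-1} \mathbf{y} \\
\Sigma_* &= K_{**} - K_*^\top K^{-1} K_*
\end{align*}
$$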
def gaussian_process_posterior(
X_obs, Y_obs, X_pred, kernel_function, sigma_y=1e-6, smoothing_factor=1e-6
):
"""Initializes a Gaussian Process Posterior distribution"""
# Observation covariance
K_obs = kernel_function(X_obs, X_obs)
K_obs_noise = sigma_y**2 * np.eye(len(X_obs))
K_obs += K_obs_noise
K_obs_inv = np.linalg.inv(K_obs)
# Prediction grid covariance
K_pred = kernel_function(X_pred, X_pred)
K_pred_smooth = smoothing_factor * np.eye(len(X_pred))
K_pred += K_pred_smooth
# Covariance between observations and prediction grid
K_obs_pred = kernel_function(X_obs, X_pred)
# Posterior
posterior_mean = K_obs_pred.T.dot(K_obs_inv).dot(Y_obs)
posterior_cov = K_pred - K_obs_pred.T.dot(K_obs_inv).dot(K_obs_pred)
return stats.multivariate_normal(
mean=posterior_mean.ravel(), cov=posterior_cov, allow_singular=True
)
def plot_gaussian_process_posterior(
    X_obs,
    Y_obs,
    X_pred,
    kernel_function,
    sigma_y=1e-6,
    n_samples=3,
    figsize=(10, 5),
):
    # Build the posterior on the prediction grid and reuse a single set of samples
    posterior = gaussian_process_posterior(X_obs, Y_obs, X_pred, kernel_function, sigma_y=sigma_y)
    samples = posterior.rvs(n_samples)
    _, axs = plt.subplots(1, 2, figsize=figsize)
    plt.sca(axs[0])
    plot_gaussian_process(
        X_pred,
        samples=samples,
        mean=posterior.mean,
        cov=posterior.cov,
        X_obs=X_obs,
        Y_obs=Y_obs,
    )
    plt.sca(axs[1])
    plot_kernel_function(kernel_function, color="k")
    plt.title("kernel function")
    return axs
# Generate some training data
X_pred = np.linspace(-5, 5, 100)[:, None]
X_obs = np.linspace(-4, 4, 4)[:, None]
Y_obs = np.sin(X_obs) ** 2 + X_obs
# Initialize the kernel function
sigma_y = 0.25
sigma_kernel = 0.75
eta_kernel = 1
kernel_function = partial(quadratic_distance_kernel, eta=eta_kernel, sigma=sigma_kernel)
# Plot posterior
axs = plot_gaussian_process_posterior(
X_obs, Y_obs, X_pred, kernel_function, sigma_y=sigma_y, n_samples=3
)
axs[0].set_title(f"posterior")
axs[1].set_title(f"kernel function:\n$\\eta$={eta}; $\\sigma=${sigma}");

Distance-based model#
Islands Distance Matrix#
# Distance matrix D_{i, j}
ISLANDS_DISTANCE_MATRIX = utils.load_data("islands_distance_matrix")
ISLANDS_DISTANCE_MATRIX.set_index(ISLANDS_DISTANCE_MATRIX.columns, inplace=True)
ISLANDS_DISTANCE_MATRIX.style.format(precision=1).background_gradient(cmap="viridis").set_caption(
"Distance in Thousands of km"
)
  | Malekula | Tikopia | Santa Cruz | Yap | Lau Fiji | Trobriand | Chuuk | Manus | Tonga | Hawaii |
---|---|---|---|---|---|---|---|---|---|---|
Malekula | 0.0 | 0.5 | 0.6 | 4.4 | 1.2 | 2.0 | 3.2 | 2.8 | 1.9 | 5.7 |
Tikopia | 0.5 | 0.0 | 0.3 | 4.2 | 1.2 | 2.0 | 2.9 | 2.7 | 2.0 | 5.3 |
Santa Cruz | 0.6 | 0.3 | 0.0 | 3.9 | 1.6 | 1.7 | 2.6 | 2.4 | 2.3 | 5.4 |
Yap | 4.4 | 4.2 | 3.9 | 0.0 | 5.4 | 2.5 | 1.6 | 1.6 | 6.1 | 7.2 |
Lau Fiji | 1.2 | 1.2 | 1.6 | 5.4 | 0.0 | 3.2 | 4.0 | 3.9 | 0.8 | 4.9 |
Trobriand | 2.0 | 2.0 | 1.7 | 2.5 | 3.2 | 0.0 | 1.8 | 0.8 | 3.9 | 6.7 |
Chuuk | 3.2 | 2.9 | 2.6 | 1.6 | 4.0 | 1.8 | 0.0 | 1.2 | 4.8 | 5.8 |
Manus | 2.8 | 2.7 | 2.4 | 1.6 | 3.9 | 0.8 | 1.2 | 0.0 | 4.6 | 6.7 |
Tonga | 1.9 | 2.0 | 2.3 | 6.1 | 0.8 | 3.9 | 4.8 | 4.6 | 0.0 | 5.0 |
Hawaii | 5.7 | 5.3 | 5.4 | 7.2 | 4.9 | 6.7 | 5.8 | 6.7 | 5.0 | 0.0 |
# Data / coords
CULTURE_ID, CULTURE = pd.factorize(KLINE.culture.values)
ISLAND_DISTANCES = ISLANDS_DISTANCE_MATRIX.values.astype(float)
TOOLS = KLINE.total_tools.values.astype(int)
coords = {"culture": CULTURE}
with pm.Model(coords=coords) as distance_model:
# Priors
alpha_bar = pm.Normal("alpha_bar", 3, 0.5)
eta_squared = pm.Exponential("eta_squared", 2)
rho_squared = pm.Exponential("rho_squared", 0.5)
# Gaussian Process
kernel_function = eta_squared * pm.gp.cov.ExpQuad(input_dim=1, ls=rho_squared)
GP = pm.gp.Latent(cov_func=kernel_function)
alpha = GP.prior("alpha", X=ISLAND_DISTANCES, dims="culture")
# Likelihood
lambda_T = pm.math.exp(alpha_bar + alpha[CULTURE_ID])
pm.Poisson("T", lambda_T, dims="culture", observed=TOOLS)
pm.model_to_graphviz(distance_model)
Check model with prior-predictive simulation#
def plot_predictive_covariance(predictive, n_samples=30, color="C0", label=None):
eta_samples = predictive["eta_squared"].values[0, :n_samples] ** 0.5
sigma_samples = 1 / predictive["rho_squared"].values[0, :n_samples] ** 0.5
for ii, (eta, sigma) in enumerate(zip(eta_samples, sigma_samples)):
label = label if ii == 0 else None
kernel_function = partial(quadratic_distance_kernel, eta=eta, sigma=sigma)
plot_kernel_function(
kernel_function, color=color, label=label, alpha=0.5, linewidth=5, max_distance=7
)
with distance_model:
prior_predictive = pm.sample_prior_predictive(random_seed=12).prior
plot_predictive_covariance(prior_predictive, label="prior")
plt.ylim([0, 2])
plt.title("Prior Covariance Functions");
Sampling: [T, alpha_bar, alpha_rotated_, eta_squared, rho_squared]

Model the data#
with distance_model:
distance_inference = pm.sample(target_accept=0.99)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha_bar, eta_squared, rho_squared, alpha_rotated_]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 18 seconds.
The effective sample size per chain is smaller than 100 for some parameters. A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
az.summary(distance_inference, var_names=["~alpha_rotated_"])
parameter | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
alpha[Malekula] | -0.677 | 0.310 | -1.273 | -0.114 | 0.008 | 0.006 | 1471.0 | 1306.0 | 1.00 |
alpha[Tikopia] | -0.297 | 0.275 | -0.811 | 0.221 | 0.007 | 0.005 | 1561.0 | 2070.0 | 1.00 |
alpha[Santa Cruz] | -0.221 | 0.270 | -0.714 | 0.299 | 0.007 | 0.005 | 1543.0 | 2215.0 | 1.00 |
alpha[Yap] | 0.326 | 0.261 | -0.164 | 0.805 | 0.008 | 0.006 | 1238.0 | 1457.0 | 1.00 |
alpha[Lau Fiji] | 0.077 | 0.264 | -0.414 | 0.593 | 0.007 | 0.005 | 1413.0 | 1749.0 | 1.00 |
alpha[Trobriand] | -0.280 | 0.339 | -0.850 | 0.418 | 0.021 | 0.015 | 307.0 | 271.0 | 1.01 |
alpha[Chuuk] | 0.240 | 0.262 | -0.243 | 0.741 | 0.007 | 0.005 | 1560.0 | 1828.0 | 1.00 |
alpha[Manus] | -0.056 | 0.286 | -0.621 | 0.456 | 0.010 | 0.007 | 979.0 | 742.0 | 1.00 |
alpha[Tonga] | 0.484 | 0.268 | -0.007 | 1.008 | 0.009 | 0.006 | 1010.0 | 676.0 | 1.00 |
alpha[Hawaii] | 0.811 | 0.246 | 0.348 | 1.275 | 0.007 | 0.005 | 1383.0 | 1662.0 | 1.00 |
alpha_bar | 3.402 | 0.220 | 2.973 | 3.805 | 0.007 | 0.005 | 1122.0 | 1244.0 | 1.00 |
eta_squared | 0.379 | 0.257 | 0.074 | 0.814 | 0.010 | 0.007 | 900.0 | 1114.0 | 1.00 |
rho_squared | 0.293 | 0.521 | 0.000 | 1.233 | 0.051 | 0.036 | 158.0 | 134.0 | 1.01 |
Posterior Predictions#
Compare the Prior to the Posterior#
Ensure that we’ve learned something
plot_predictive_covariance(prior_predictive, color="k", label="prior")
plot_predictive_covariance(distance_inference.posterior, label="posterior")
plt.ylim([0, 2]);

def distance_to_covariance(distance, eta_squared, rho_squared):
return eta_squared * np.exp(-rho_squared * distance**2)
def calculate_posterior_mean_covariance_matrix(inference):
posterior_mean = inference.posterior.mean(dim=("chain", "draw"))
posterior_eta_squared = posterior_mean["eta_squared"].values
posterior_rho_squared = posterior_mean["rho_squared"].values
print("Posterior Mean Kernel parameters:")
print("eta_squared:", posterior_eta_squared)
print("rho_squared:", posterior_rho_squared)
model_covariance = np.zeros_like(ISLANDS_DISTANCE_MATRIX).astype(float)
for ii in range(10):
for jj in range(10):
model_covariance[ii, jj] = distance_to_covariance(
ISLAND_DISTANCES[ii, jj],
eta_squared=posterior_eta_squared,
rho_squared=posterior_rho_squared,
)
return model_covariance
def plot_posterior_mean_covariance_matrix(covariance_matrix, clim=(0, 0.4)):
plt.matshow(covariance_matrix)
plt.xticks(np.arange(10), labels=KLINE.culture, rotation=90)
plt.yticks(np.arange(10), labels=KLINE.culture)
plt.clim(clim)
plt.grid(False)
plt.colorbar()
plt.title("posterior mean covariance matrix");
distance_model_covariance = calculate_posterior_mean_covariance_matrix(distance_inference)
plot_posterior_mean_covariance_matrix(distance_model_covariance)
Posterior Mean Kernel parameters:
eta_squared: 0.3787328036103038
rho_squared: 0.2930628945484335

def plot_kline_model_covarance(covariance_matric, min_alpha=0.01, alpha_gain=1, max_cov=None):
plt.subplots(figsize=(10, 5))
# Plot covariance
max_cov = covariance_matric.max() if max_cov is None else max_cov
for ii in range(10):
for jj in range(10):
if ii != jj:
lat_a = KLINE.loc[ii, "lat"]
lon_a = KLINE.loc[ii, "lon2"]
lat_b = KLINE.loc[jj, "lat"]
lon_b = KLINE.loc[jj, "lon2"]
cov = covariance_matric[ii, jj]
alpha = (min_alpha + (1 - min_alpha) * (cov / max_cov)) ** (1 / alpha_gain)
plt.plot((lon_a, lon_b), (lat_a, lat_b), linewidth=3, color="k", alpha=alpha)
plt.scatter(KLINE.lon2, KLINE.lat, s=KLINE.population * 0.01, zorder=10, alpha=0.9)
# Annotate
for ii in range(10):
plt.annotate(
KLINE.loc[ii, "culture"],
(KLINE.loc[ii, "lon2"] + 1.5, KLINE.loc[ii, "lat"]),
zorder=11,
color="C1",
fontsize=12,
fontweight="bold",
)
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.axis("tight")
plot_kline_model_covarance(distance_model_covariance, max_cov=0.4)
plt.title("Pure Spatial, nothing else");

Stratify by population size#
We now fold our steady state equation into the intercept/baseline tools equation
Include an additional \(\exp\) multiplier to handle society-level offsets (see the expression below). The idea is that:
if \(\alpha_{S[i]}=0\), the society is average, and we fall back to the expected value of the difference equation, i.e. the multiplier is one
if \(\alpha_{S[i]}>0\), there is a boost in tool production; this boost is shared amongst spatially-local societies
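Putting the pieces together, the expected tool count used in the model below is
$$
\lambda_i = \frac{\bar{\alpha} \, P_i^{\beta}}{\gamma} \exp\left(\alpha_{S[i]}\right)
$$
with \(\alpha \sim \text{MVNormal}(0, \mathbf{K})\) as before.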
with pm.Model(coords=coords) as distance_population_model:
population = pm.Data("log_population", KLINE.logpop.values.astype(float))
CULTURE_ID_ = pm.Data("CULTURE_ID", CULTURE_ID)
    # Priors
alpha_bar = pm.Exponential("alpha_bar", 1)
gamma = pm.Exponential("gamma", 1)
beta = pm.Exponential("beta", 1)
eta_squared = pm.Exponential("eta_squared", 2)
rho_squared = pm.Exponential("rho_squared", 2)
# Gaussian Process
kernel_function = eta_squared * pm.gp.cov.ExpQuad(input_dim=1, ls=rho_squared)
GP = pm.gp.Latent(cov_func=kernel_function)
alpha = GP.prior("alpha", X=ISLAND_DISTANCES, dims="culture")
# Likelihood
lambda_T = (alpha_bar / gamma * population[CULTURE_ID_] ** beta) * pm.math.exp(
alpha[CULTURE_ID_]
)
pm.Poisson("T", lambda_T, observed=TOOLS, dims="culture")
pm.model_to_graphviz(distance_population_model)
Prior Predictive#
with distance_population_model:
prior_predictive = pm.sample_prior_predictive(random_seed=12).prior
plot_predictive_covariance(prior_predictive, label="prior")
plt.ylim([0, 1])
plt.title("Prior Covariance Functions");
Sampling: [T, alpha_bar, alpha_rotated_, beta, eta_squared, gamma, rho_squared]

with distance_population_model:
distance_population_inference = pm.sample(target_accept=0.99)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha_bar, gamma, beta, eta_squared, rho_squared, alpha_rotated_]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 82 seconds.
plot_predictive_covariance(distance_inference.posterior, color="k", label="empty model")
plot_predictive_covariance(distance_population_inference.posterior, label="population")
plt.ylim([0, 1]);

distance_population_model_covariance = calculate_posterior_mean_covariance_matrix(
distance_population_inference
)
plot_posterior_mean_covariance_matrix(distance_population_model_covariance)
Posterior Mean Kernel parameters:
eta_squared: 0.1545764701828998
rho_squared: 0.19242529419650606

We can see that by including population, the degree of covariance required to explain tool use is diminished compared to the distance-only model; i.e. population has explained away a lot more of the variation.
plot_kline_model_covarance(distance_population_model_covariance, max_cov=0.4)
plt.title("Stratifying by Population");

Fit population-only model for comparison#
I think this is what McElreath is doing to construct the plot that combines population size, tool use, and island covariance (next plot below). Specifically, we estimate a population-only model using the analytical solution, which includes no covariance amongst islands. We then add the covariance plot on top of that model to demonstrate that there is still some information left on the table regarding covariances, even when including population in the model.
ETA = 5.1  # Exponential hyperparameter taken from Lecture 10 notes
with pm.Model() as population_model:
    # Note: log population, as in the distance + population model above
    population = pm.Data("population", KLINE.logpop.values)
    # Priors
    # Innovation rate
    alpha = pm.Exponential("alpha", ETA)
    # Elasticity (diminishing returns)
    beta = pm.Exponential("beta", ETA)
    # Global technology loss rate
    gamma = pm.Exponential("gamma", ETA)
# Likelihood using difference equation equilibrium as mean Poisson rate
lamb = (alpha * (population**beta)) / gamma
pm.Poisson("tools", lamb, observed=TOOLS)
population_inference = pm.sample(tune=2000, target_accept=0.98)
population_inference = pm.sample_posterior_predictive(
population_inference, extend_inferencedata=True
)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta, gamma]
Sampling 4 chains for 2_000 tune and 1_000 draw iterations (8_000 + 4_000 draws total) took 10 seconds.
Sampling: [tools]
def plot_kline_model_population_covariance(
covariance_matric, min_alpha=0.01, alpha_gain=1, max_cov=None
):
    # Plot covariance
max_cov = covariance_matric.max() if max_cov is None else max_cov
for ii in range(10):
for jj in range(10):
if ii != jj:
logpop_a = KLINE.loc[ii, "logpop"]
tools_a = KLINE.loc[ii, "total_tools"]
logpop_b = KLINE.loc[jj, "logpop"]
tools_b = KLINE.loc[jj, "total_tools"]
cov = covariance_matric[ii, jj]
alpha = (min_alpha + (1 - min_alpha) * (cov / max_cov)) ** (1 / alpha_gain)
plt.plot(
(logpop_a, logpop_b), (tools_a, tools_b), linewidth=3, color="k", alpha=alpha
)
plt.scatter(KLINE.logpop, KLINE.total_tools, s=KLINE.population * 0.01, zorder=10)
# Annotate
for ii in range(10):
plt.annotate(
KLINE.loc[ii, "culture"],
(KLINE.loc[ii, "logpop"] + 0.1, KLINE.loc[ii, "total_tools"]),
zorder=11,
color="C1",
fontsize=12,
fontweight="bold",
)
plt.xlabel("log population")
plt.ylabel("total tools")
plt.axis("tight")
def plot_kline_population_only_model_posterior_predictive(inference):
ppd = inference.posterior_predictive["tools"]
az.plot_hdi(
x=KLINE.logpop,
y=ppd,
color="C1",
hdi_prob=0.89,
fill_kwargs={"alpha": 0.1},
)
plt.plot(
KLINE.logpop,
ppd.mean(dim=("chain", "draw")),
color="C1",
linewidth=4,
alpha=0.5,
label="Population-only Model",
)
plt.legend()
plot_kline_model_population_covariance(distance_population_model_covariance, max_cov=0.4)
plot_kline_population_only_model_posterior_predictive(population_inference)

Though Population alone accounts for a lot of the variance in tool use (blue curve), there’s still quite a bit of residual island-island similarity that explains tool use (i.e. some of the black lines still have some weight).
Phylogenetic Regression#
Dataset: Primates301#
PRIMATES301 = utils.load_data("Primates301")
PRIMATES301.head()
  | name | genus | species | subspecies | spp_id | genus_id | social_learning | research_effort | brain | body | group_size | gestation | weaning | longevity | sex_maturity | maternal_investment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Allenopithecus_nigroviridis | Allenopithecus | nigroviridis | NaN | 1 | 1 | 0.0 | 6.0 | 58.02 | 4655.00 | 40.0 | NaN | 106.15 | 276.0 | NaN | NaN |
1 | Allocebus_trichotis | Allocebus | trichotis | NaN | 2 | 2 | 0.0 | 6.0 | NaN | 78.09 | 1.0 | NaN | NaN | NaN | NaN | NaN |
2 | Alouatta_belzebul | Alouatta | belzebul | NaN | 3 | 3 | 0.0 | 15.0 | 52.84 | 6395.00 | 7.4 | NaN | NaN | NaN | NaN | NaN |
3 | Alouatta_caraya | Alouatta | caraya | NaN | 4 | 3 | 0.0 | 45.0 | 52.63 | 5383.00 | 8.9 | 185.92 | 323.16 | 243.6 | 1276.72 | 509.08 |
4 | Alouatta_guariba | Alouatta | guariba | NaN | 5 | 3 | 0.0 | 37.0 | 51.70 | 5175.00 | 7.4 | NaN | NaN | NaN | NaN | NaN |
301 Primate species
Life history traits
Body Mass \(M\), Brain Size \(B\), Social Group Size \(G\)
Measurement error, unobserved confounding
Missing data; we’ll focus on the complete-case analysis in this lecture
Filter down to the 151 Complete Cases#
Drop observations missing any of Body Mass, Brain Size, or Group Size
PRIMATES = PRIMATES301.query(
"brain.notnull() and body.notnull() and group_size.notnull()", engine="python"
)
PRIMATES
  | name | genus | species | subspecies | spp_id | genus_id | social_learning | research_effort | brain | body | group_size | gestation | weaning | longevity | sex_maturity | maternal_investment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Allenopithecus_nigroviridis | Allenopithecus | nigroviridis | NaN | 1 | 1 | 0.0 | 6.0 | 58.02 | 4655.0 | 40.00 | NaN | 106.15 | 276.0 | NaN | NaN |
2 | Alouatta_belzebul | Alouatta | belzebul | NaN | 3 | 3 | 0.0 | 15.0 | 52.84 | 6395.0 | 7.40 | NaN | NaN | NaN | NaN | NaN |
3 | Alouatta_caraya | Alouatta | caraya | NaN | 4 | 3 | 0.0 | 45.0 | 52.63 | 5383.0 | 8.90 | 185.92 | 323.16 | 243.6 | 1276.72 | 509.08 |
4 | Alouatta_guariba | Alouatta | guariba | NaN | 5 | 3 | 0.0 | 37.0 | 51.70 | 5175.0 | 7.40 | NaN | NaN | NaN | NaN | NaN |
5 | Alouatta_palliata | Alouatta | palliata | NaN | 6 | 3 | 3.0 | 79.0 | 49.88 | 6250.0 | 13.10 | 185.42 | 495.60 | 300.0 | 1578.42 | 681.02 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
294 | Trachypithecus_obscurus | Trachypithecus | obscurus | NaN | 295 | 67 | 0.0 | 6.0 | 62.12 | 7056.0 | 10.00 | 146.63 | 362.93 | 300.0 | NaN | 509.56 |
295 | Trachypithecus_phayrei | Trachypithecus | phayrei | NaN | 296 | 67 | 0.0 | 16.0 | 72.84 | 7475.0 | 12.90 | 180.61 | 305.87 | NaN | NaN | 486.48 |
296 | Trachypithecus_pileatus | Trachypithecus | pileatus | NaN | 297 | 67 | 0.0 | 5.0 | 103.64 | 11794.0 | 8.50 | NaN | NaN | NaN | NaN | NaN |
298 | Trachypithecus_vetulus | Trachypithecus | vetulus | NaN | 299 | 67 | 0.0 | 2.0 | 61.29 | 6237.0 | 8.35 | 204.72 | 245.78 | 276.0 | 1113.70 | 450.50 |
300 | Varecia_variegata_variegata | Varecia | variegata | variegata | 301 | 68 | 0.0 | 57.0 | 32.12 | 3575.0 | 2.80 | 102.50 | 90.73 | 384.0 | 701.52 | 193.23 |
151 rows × 16 columns
There’s lots of covariation amongst variables#

There’s no way from the sample alone to know what causes what
Causal Salad#
Throwing all predictors into a model, then interpreting their coefficients causally
Using prediction methods as a stand-in for causality
Throwing all factors into “salad” of regression terms
performing model selection based on prediction (we showed earlier in Lecture 07 - Fitting Over & Under why this is bad to do)
and interpreting the resulting coefficients as causal effects (we’ve shown why this is bad many times)
Controlling for phylogeny is important to do, but it’s often done mindlessly (with the “salad” approach)
Regression with phylogeny still requires a causal model
Different Hypotheses require different causal models (DAGs)#
Social Brain Hypothesis#
Group Size \(G\) affects Brain Size \(B\)
Body Mass \(M\) is a shared cause of both \(G\) and \(B\)#
This is the DAG we’ll focus on in this Lecture
Body Mass \(M\) is a mediator between \(G\) and \(B\)#
Brain Size \(B\) actually causes Group Size \(G\); \(M\) still a shared cause of both#
Unobserved confounds make naive regression on traits without controlling for history dangerous#
History induces a number of unobserved shared confounds that we need to try to control for.
phylogenetic history can work as a proxy for these shared confounds.
We only have current measurements of the outcome of history
actual phylogenies do not exist
different parts of a genome can have different histories
BUT, say that we do have an inferred phylogeny (as we do in this lecture); we can use this phylogeny, in conjunction with Gaussian Process Regression, to model the similarity amongst species as a proxy for shared history
Phylogenetic Regression#
There’s a long history of genetic evolution
We only get to measure the current end state of that process
In principle, common genetic histories should induce covariation amongst species
We should hopefully be able to model this covariation to average over all the possible histories (micro state) that could have led to genetic similarity (macro state)
Two conjoint problems#
What is the history (phylogeny)?
How do we use it to model causes and control for confounds?
1. What is the history (phylogeny)?#
How do we estimate or identify a phylogeny?
Difficulties#
high degree of uncertainty
processes are non-stationary
different parts of the genome have different histories
crossing-over effects
available inferential tools
exploring tree space is difficult
repurposing software from other domains for studying biology
phylogenies (just like social networks) do not exist.
They are abstractions we use to capture regularities in data
2. How do we use it to model causes?#
…say we’ve obtained a phylogeny, now what?
Approaches#
no single agreed-upon approach
Gaussian Processes are the de facto default approach
use phylogeny as proxy for “genetic distance”
Two equivalent formulations of Linear Regression#
You can always re-express a linear regression with a draw from a multi-variate Normal distribution
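The models below refer to \(B\), \(M\), and \(G\). The preprocessing cell isn’t shown in this section, but the assumed setup (following the lecture) is standardized log brain size, body mass, and group size; a minimal sketch:
# Assumed preprocessing (not shown above): standardized log-scale variables
def standardize(x):
    return (x - np.mean(x)) / np.std(x)
B = standardize(np.log(PRIMATES.brain.values.astype(float)))
M = standardize(np.log(PRIMATES.body.values.astype(float)))
G = standardize(np.log(PRIMATES.group_size.values.astype(float)))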
Classical Linear Regression#
The dataset \(\textbf{B}\) is modeled as independent samples \(B_i\), one per species, from a univariate normal:
$$
\begin{align*}
B_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_G G_i + \beta_M M_i
\end{align*}
$$
Classic Linear Regression#
with pm.Model() as vanilla_lr_model:
# Priors
alpha = pm.Normal("alpha", 0, 1)
beta_G = pm.Normal("beta_G", 0, 0.5)
beta_M = pm.Normal("beta_M", 0, 0.5)
sigma = pm.Exponential("sigma", 1)
# Likelihood
mu = alpha + beta_G * G + beta_M * M
pm.Normal("B", mu=mu, sigma=sigma, observed=B)
vanilla_lr_inference = pm.sample()
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta_G, beta_M, sigma]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.
\(\text{MVNormal}\) Linear Regression#
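Equivalently, the whole vector \(\textbf{B}\) is modeled as a single draw from a multivariate normal whose covariance matrix is diagonal:
$$
\begin{align*}
\textbf{B} &\sim \text{MVNormal}(\boldsymbol{\mu}, \mathbf{K}) \\
\mu_i &= \alpha + \beta_G G_i + \beta_M M_i \\
\mathbf{K} &= \sigma^2 I
\end{align*}
$$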
with pm.Model() as vector_lr_model:
# Priors
alpha = pm.Normal("alpha", 0, 1)
beta_G = pm.Normal("beta_G", 0, 0.5)
beta_M = pm.Normal("beta_M", 0, 0.5)
sigma = pm.Exponential("sigma", 1)
# Likelihood
K = np.eye(len(B)) * sigma**2
mu = alpha + beta_G * G + beta_M * M
pm.MvNormal("B", mu=mu, cov=K, observed=B)
vector_lr_inference = pm.sample()
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta_G, beta_M, sigma]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 18 seconds.
Verify the two formulations provide the same parameter estimates#
We’re able to recover the same estimates as in the lecture
pm.summary(vanilla_lr_inference, round_to=2)
parameter | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
alpha | -0.00 | 0.02 | -0.03 | 0.03 | 0.0 | 0.0 | 3539.69 | 3113.60 | 1.0 |
beta_G | 0.12 | 0.02 | 0.08 | 0.17 | 0.0 | 0.0 | 2677.36 | 3066.65 | 1.0 |
beta_M | 0.89 | 0.02 | 0.85 | 0.94 | 0.0 | 0.0 | 2753.50 | 2995.51 | 1.0 |
sigma | 0.22 | 0.01 | 0.20 | 0.24 | 0.0 | 0.0 | 3669.05 | 2850.15 | 1.0 |
pm.summary(vector_lr_inference, round_to=2)
parameter | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
alpha | 0.00 | 0.02 | -0.03 | 0.03 | 0.0 | 0.0 | 3738.52 | 3261.00 | 1.0 |
beta_G | 0.12 | 0.02 | 0.08 | 0.17 | 0.0 | 0.0 | 2972.04 | 2931.39 | 1.0 |
beta_M | 0.89 | 0.02 | 0.85 | 0.94 | 0.0 | 0.0 | 2958.08 | 2856.22 | 1.0 |
sigma | 0.22 | 0.01 | 0.19 | 0.24 | 0.0 | 0.0 | 3697.40 | 2635.74 | 1.0 |
From Model to Kernel#
We’d like to incorporate a residual \(u_i\) into our linear regression that adjusts the expectation in a way that encodes the shared history of species
Phylogenetic distance as a proxy for species covariance#
Covariance falls off with phylogenetic distance
path length between leaf nodes in the tree (e.g. see the dendrogram below)
Can use a Gaussian process in much the same way we did for island distances in the Oceanic Tools analysis
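Concretely, the full phylogenetic regression fit later in this section keeps the same linear mean as above but replaces the diagonal covariance with an Ornstein-Uhlenbeck kernel of phylogenetic distance \(d_{ij}\):
$$
\begin{align*}
\textbf{B} &\sim \text{MVNormal}(\boldsymbol{\mu}, \mathbf{K}) \\
\mu_i &= \alpha + \beta_G G_i + \beta_M M_i \\
\mathbf{K}_{ij} &= \eta^2 \exp(-\rho \, d_{ij})
\end{align*}
$$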
PRIMATES_DISTANCE_MATRIX301 = utils.load_data("Primates301_distance_matrix").values
# Filter out incomplete cases
PRIMATES_DISTANCE_MATRIX = PRIMATES_DISTANCE_MATRIX301[PRIMATES.index][:, PRIMATES.index]
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.matshow(PRIMATES_DISTANCE_MATRIX)
ax.grid(None)
fig.colorbar(im, orientation="vertical")
plt.xlabel("species ID")
plt.ylabel("species ID")
ax.set_title("Primate Phylogenetic Distance Matrix");

# Plot the distance matrix
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
dists = squareform(PRIMATES_DISTANCE_MATRIX)
linkage_matrix = linkage(dists, "single")
plt.subplots(figsize=(8, 20))
dendrogram(linkage_matrix, orientation="right", distance_sort=True, labels=PRIMATES.name.tolist())
plt.title("primate phylogeny")
plt.grid(False)
plt.xticks([]);

Kernel Function as proxy for evolutionary dynamics#
def brownian_motion_kernel(X0, X1, a=1, c=1):
    # Linear kernel (x0 - c)(x1 - c) / a; the [::-1, ::-1] flip orients the matrix so that
    # the slice plotted by plot_kernel_function decreases along the distance axis
    k = 1 / a * ((X0 - c) @ (X1 - c).T)[::-1, ::-1]
    return k
plt.subplots(figsize=(5, 5))
plot_kernel_function(
partial(brownian_motion_kernel, a=22000), max_distance=150, label='Linear\n"Brownian Motion"'
)
plot_kernel_function(
partial(ornstein_uhlenbeck_kernel, rho=1 / 20),
max_distance=150,
color="C1",
label='Ornstein-Uhlenbeck\n"Damped Brownian Motion"',
)
plt.xlabel("phylogenetic distance");

Evolutionary model + tree structure = pattern of covariation observed at the tips#
Common simple models of evolutionary dynamics:
Brownian motion - implies a linear covariance kernel function
Damped Brownian motion - implies an L1 / Ornstein-Uhlenbeck covariance kernel function
Phylogenetic “regression” model#
Prior Samples for the Ornstein-Uhlenbeck Kernel Model#
eta_squared_samples = pm.draw(pm.TruncatedNormal.dist(mu=1, sigma=0.25, lower=0.01), 100) ** 0.5
rho_samples = pm.draw(pm.TruncatedNormal.dist(mu=3, sigma=0.25, lower=0.01), 100)
plt.subplots(figsize=(5, 5))
for ii, (eta_squared_, rho_) in enumerate(zip(eta_squared_samples, rho_samples)):
label = "prior" if ii == 0 else None
kernel_function = partial(ornstein_uhlenbeck_kernel, eta_squared=eta_squared_, rho=rho_)
plot_kernel_function(
kernel_function, color="C0", label=label, alpha=0.5, linewidth=2, max_distance=1
)
plt.ylim([-0.05, 1.5])
plt.xlabel("phylogenetic distance")
plt.ylabel("covariance")
plt.title("Covariance Function Prior");

Distance-only model#
Get the Gaussian-process part to work, then add details
Set \(\beta_M = \beta_G = 0\)
# Rescale distances from 0 to 1
D = PRIMATES_DISTANCE_MATRIX / PRIMATES_DISTANCE_MATRIX.max()
assert D.max() == 1
PyMC Implementation#
Below is an implementation that uses PyMC’s Gaussian process module.
In the Oceanic Tools models above, we didn’t need the Gaussian process to have a mean function because all variables were standardized, so we could use the default mean function of \(0\).
For phylogenetic regression we no longer want a zero mean function; we want a mean function that depends on \(M\) and \(G\), but not on the phylogenetic distance matrix. We can implement a custom mean function in PyMC to handle this.
Also, because our likelihood is an MvNormal, we can take advantage of conjugacy between the GP prior and the Normal likelihood, which returns a GP posterior in closed form. This means we should use pm.gp.Marginal instead of pm.gp.Latent; Marginal uses these closed-form solutions to the posterior to speed up learning.
class MeanBodyMassSocialGroupSize(pm.gp.mean.Linear):
    """Custom mean function that separates covariates from phylogeny"""

    def __init__(self, alpha, beta_G, beta_M):
        self.alpha = alpha
        self.beta_G = beta_G
        self.beta_M = beta_M

    def __call__(self, X):
        # Note: X (the phylogenetic distances) is ignored; the mean depends only
        # on the global covariates G and M defined above
        return self.alpha + self.beta_G * G + self.beta_M * M
PRIMATE_ID, PRIMATE = pd.factorize(PRIMATES["name"], sort=False)
with pm.Model(coords={"primate": PRIMATE}) as intercept_only_phylogenetic_mvn_model:
# Priors
alpha = pm.Normal("alpha", 0, 1)
sigma = pm.Exponential("sigma", 1)
# Intercept-only model
beta_M = 0
beta_G = 0
# Define the mean function
mean_func = MeanBodyMassSocialGroupSize(alpha, beta_G, beta_M)
# Phylogenetic distance covariance
eta_squared = pm.TruncatedNormal("eta_squared", 1, 0.25, lower=0.01)
rho = pm.TruncatedNormal("rho", 3, 0.25, lower=0.01)
# For Ornstein-Uhlenbeck kernel we can use Matern 1/2 or Exponential covariance Function
cov_func = eta_squared * pm.gp.cov.Matern12(1, ls=rho)
# cov_func = eta_squared * pm.gp.cov.Exponential(1, ls=rho)
# Init the GP
gp = pm.gp.Marginal(mean_func=mean_func, cov_func=cov_func)
# Likelihood
gp.marginal_likelihood("B", X=D, y=B, noise=sigma)
intercept_only_phylogenetic_mvn_inference = pm.sample(target_accept=0.9)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, sigma, eta_squared, rho]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 30 seconds.
Below is an alternative implementation that builds the covariance function by hand and directly models the dataset as an MvNormal with mean alpha and covariance defined by the kernel. I find that these direct MvNormal implementations track better with the lecture.
def generate_L1_kernel_matrix(D, eta_squared, rho, smoothing=0.1):
K = eta_squared * pm.math.exp(-rho * D)
# Smooth the diagonal of the covariance matrix
N = D.shape[0]
K += np.eye(N) * smoothing
return K
# with pm.Model() as intercept_only_phylogenetic_mvn_model:
# # Priors
# alpha = pm.Normal("alpha", 0, 1)
# # Phylogenetic distance covariance
# eta_squared = pm.TruncatedNormal("eta_squared", 1, 0.25, lower=.001)
# rho = pm.TruncatedNormal("rho", 3, 0.25, lower=.001)
# # Ornstein-Uhlenbeck kernel
# K = pm.Deterministic('K', generate_L1_kernel_matrix(D, eta_squared, rho))
# # K = pm.Deterministic('K', eta_squared * pm.math.exp(-rho * D))
# # Likelihood
# mu = pm.math.ones_like(B) * alpha
# pm.MvNormal("B", mu=mu, cov=K, observed=B)
# intercept_only_phylogenetic_mvn_inference = pm.sample(target_accept=.98)
az.summary(intercept_only_phylogenetic_mvn_inference, var_names=["alpha", "eta_squared", "rho"])
parameter | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
alpha | 0.080 | 0.704 | -1.279 | 1.355 | 0.011 | 0.011 | 3952.0 | 2810.0 | 1.0 |
eta_squared | 1.063 | 0.236 | 0.633 | 1.519 | 0.005 | 0.003 | 2611.0 | 1753.0 | 1.0 |
rho | 2.973 | 0.251 | 2.510 | 3.448 | 0.004 | 0.003 | 4027.0 | 2757.0 | 1.0 |
# Sample the prior for comparison
with intercept_only_phylogenetic_mvn_model:
intercept_only_phylogenetic_mvn_prior = pm.sample_prior_predictive().prior
az.summary(intercept_only_phylogenetic_mvn_prior, var_names=["alpha", "eta_squared", "rho"])
Sampling: [B, alpha, eta_squared, rho, sigma]
arviz - WARNING - Shape validation failed: input_shape: (1, 500), minimum_shape: (chains=2, draws=4)
parameter | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
alpha | -0.053 | 0.984 | -1.989 | 1.755 | 0.045 | 0.032 | 484.0 | 412.0 | NaN |
eta_squared | 1.001 | 0.254 | 0.548 | 1.463 | 0.012 | 0.009 | 444.0 | 339.0 | NaN |
rho | 3.014 | 0.247 | 2.541 | 3.431 | 0.011 | 0.008 | 465.0 | 447.0 | NaN |
Compare the Prior and Posterior for distance-only model#
def calculate_mean_covariance_matrix(inference):
def distance_to_covariance(distance, eta_squared, rho):
return eta_squared * np.exp(-rho * np.abs(distance))
inference_mean = inference.mean(dim=("chain", "draw"))
eta_squared = inference_mean["eta_squared"].values
rho = inference_mean["rho"].values
print("Mean Kernel parameters:")
print("eta_squared:", eta_squared)
print("rho:", rho)
n_species = D.shape[0]
model_covariance = np.zeros_like(D).astype(float)
for ii in range(n_species):
for jj in range(n_species):
model_covariance[ii, jj] = distance_to_covariance(
D[ii, jj], eta_squared=eta_squared, rho=rho
)
return model_covariance
def plot_mean_covariance_matrix(inference, title=None, max_cov=1.0):
mean_cov = calculate_mean_covariance_matrix(inference)
plt.imshow(mean_cov)
# plt.imshow(inference['K'].mean(dim=('chain', 'draw')).values)
plt.grid(None)
plt.clim((0, max_cov))
plt.colorbar()
plt.xlabel("species ID")
plt.ylabel("species ID")
plt.title(title)
# Prior
plot_mean_covariance_matrix(
intercept_only_phylogenetic_mvn_prior,
title="Prior Covariance\nIntercept-only phylogenetic Model",
)
Mean Kernel parameters:
eta_squared: 1.0006788360203365
rho: 3.0138977110438527

# Posterior
plot_mean_covariance_matrix(
intercept_only_phylogenetic_mvn_inference.posterior,
title="Prior Covariance\nIntercept-only phylogenetic Model",
)
Mean Kernel parameters:
eta_squared: 1.062638229865811
rho: 2.9733158548987166

Compare the Prior and Posterior kernel functions#
def plot_predictive_kernel_function(predictive, n_samples=200, color="C0", label=None):
eta_squared_samples = predictive["eta_squared"].values[0, :n_samples]
rho_samples = predictive["rho"].values[0, :n_samples]
for ii, (eta_squared_, rho_) in enumerate(zip(eta_squared_samples, rho_samples)):
label = label if ii == 0 else None
kernel_function = partial(ornstein_uhlenbeck_kernel, eta_squared=eta_squared_, rho=rho_)
plot_kernel_function(
kernel_function, color=color, label=label, alpha=0.1, linewidth=2, max_distance=1
)
plt.ylim([-0.05, 2.5])
plt.xlabel("phylogenetic distance")
plot_predictive_kernel_function(intercept_only_phylogenetic_mvn_prior, label="prior", color="k")
plot_predictive_kernel_function(
intercept_only_phylogenetic_mvn_inference.posterior, label="posterior B", color="C1"
)

Comparing the prior and posterior distributions over kernel functions, we can see that the model has learned from the data and attenuated our expectation of the maximum covariance.
Fit the full Phylogenetic model#
This model stratifies by Group size \(G\) and Body Mass \(M\)
Below is an implementation that uses PyMC’s Gaussian process module.
with pm.Model(coords={"primate": PRIMATE}) as full_phylogenetic_mvn_model:
# Priors
alpha = pm.Normal("alpha", 0, 1)
sigma = pm.Exponential("sigma", 1)
    # Body mass and group size effects
beta_M = pm.Normal("beta_M", 0, 1)
beta_G = pm.Normal("beta_G", 0, 1)
# Define the mean function
mean_func = MeanBodyMassSocialGroupSize(alpha, beta_G, beta_M)
# Phylogenetic distance covariance
eta_squared = pm.TruncatedNormal("eta_squared", 1, 0.25, lower=0.01)
rho = pm.TruncatedNormal("rho", 3, 0.25, lower=0.01)
# For Ornstein-Uhlenbeck kernel we can use Matern 1/2 or Exponential covariance Function
cov_func = eta_squared * pm.gp.cov.Matern12(1, ls=rho)
# cov_func = eta_squared * pm.gp.cov.Exponential(1, ls=rho)
# Init the GP
gp = pm.gp.Marginal(mean_func=mean_func, cov_func=cov_func)
# Likelihood
gp.marginal_likelihood("B", X=D, y=B, noise=sigma)
full_phylogenetic_mvn_inference = pm.sample(target_accept=0.9)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, sigma, beta_M, beta_G, eta_squared, rho]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 53 seconds.
There were 7 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
Below is an alternative implementation that builds the covariance function by hand and directly models the dataset as an MvNormal with the mean being a linear function of \(G\) and \(M\), and covariance defined by the kernel. I find that these MvNormal implementations track better with the lecture.
# with pm.Model() as full_phylogenetic_mvn_model:
# # Priors
# alpha = pm.Normal("alpha", 0, 1)
# beta_G = pm.Normal("beta_G", 0, 0.5)
# beta_M = pm.Normal("beta_M", 0, 0.5)
# # Phylogenetic distance covariance
# eta_squared = pm.TruncatedNormal("eta_squared", 1, .25, lower=.001)
# rho = pm.TruncatedNormal("rho", 3, .25, lower=.001)
# K = pm.Deterministic('K', generate_L1_kernel_matrix(D, eta_squared, rho))
# # K = pm.Deterministic('K', eta_squared * pm.math.exp(-rho * D))
# mu = alpha + beta_G * G + beta_M * M
# # Likelihood
# pm.MvNormal("B", mu=mu, cov=K, observed=B)
# full_phylogenetic_mvn_inference = pm.sample(target_accept=.98)
az.summary(
full_phylogenetic_mvn_inference, var_names=["alpha", "eta_squared", "rho", "beta_M", "beta_G"]
)
parameter | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
alpha | -0.002 | 0.630 | -1.197 | 1.194 | 0.011 | 0.011 | 3395.0 | 1985.0 | 1.00 |
eta_squared | 0.726 | 0.318 | 0.015 | 1.199 | 0.010 | 0.007 | 972.0 | 471.0 | 1.01 |
rho | 3.051 | 0.249 | 2.601 | 3.518 | 0.004 | 0.003 | 4011.0 | 2389.0 | 1.00 |
beta_M | 0.901 | 0.023 | 0.858 | 0.942 | 0.000 | 0.000 | 2733.0 | 2739.0 | 1.00 |
beta_G | 0.108 | 0.024 | 0.065 | 0.156 | 0.000 | 0.000 | 2669.0 | 2849.0 | 1.00 |
Compare covariance terms in distance-only and full phylogenetic regression models#
# Posterior
plot_mean_covariance_matrix(
full_phylogenetic_mvn_inference.posterior,
title="Prior Covariance\nIntercept-only phylogenetic Model",
)
Mean Kernel parameters:
eta_squared: 0.7258975020904833
rho: 3.0511929927563934

plot_predictive_kernel_function(intercept_only_phylogenetic_mvn_prior, label="prior", color="k")
plot_predictive_kernel_function(
intercept_only_phylogenetic_mvn_inference.posterior, label="posterior B", color="C1"
)
plot_predictive_kernel_function(full_phylogenetic_mvn_inference.posterior, label="posterior BMG")

We can see that when including regressors for body mass and group size, the amount of phylogenetic covariation used to explain brain size is greatly diminished, compared to the phylogenetic-distance-only model.
The results for the full phylogenetic model reported here make sense to me too: body mass alone explains away a lot of the variability in brain size, which in turn leaves less for phylogenetic history to explain, so it receives less weight in the model.
Influence of Group Size on Brain Size#
utils.draw_causal_graph(
edge_list=[("G", "B"), ("M", "G"), ("M", "B"), ("u", "M"), ("u", "G"), ("u", "B"), ("h", "u")],
node_props={
"h": {"label": "history\n(phylogeny)", "color": "lightgray", "fontcolor": "lightgray"},
"u": {
"style": "dashed",
"label": "confounds, u",
"color": "lightgray",
"fontcolor": "lightgray",
},
"unobserved": {"style": "dashed", "color": "lightgray", "fontcolor": "lightgray"},
},
edge_props={
("G", "B"): {"color": "blue"},
("M", "G"): {"color": "red"},
("M", "B"): {"color": "red"},
("u", "B"): {"color": "lightgray"},
("u", "M"): {"color": "lightgray"},
("u", "G"): {"color": "lightgray"},
("h", "u"): {
"color": "lightgray",
},
},
graph_direction="LR",
)
In the ordinary regression below we ignore the confounds due to genetic history (light gray), i.e. we don’t try to control for them.
To get the direct effect of \(G\) on \(B\) we also stratify by \(M\), blocking the backdoor path through the fork at \(M\).
with pm.Model() as ordinary_model:
# Priors
alpha = pm.Normal("alpha", 0, 1)
beta_G = pm.Normal("beta_G", 0, 0.5)
beta_M = pm.Normal("beta_M", 0, 0.5)
# Independent species (equal variance)
sigma = pm.Exponential("sigma", 1)
    K = np.eye(len(B)) * sigma**2
mu = alpha + beta_G * G + beta_M * M
pm.MvNormal("B", mu=mu, cov=K, observed=B)
ordinary_inference = pm.sample()
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta_G, beta_M, sigma]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 18 seconds.
az.summary(ordinary_inference, var_names=["alpha", "beta_M", "beta_G"])
parameter | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
alpha | -0.000 | 0.018 | -0.035 | 0.032 | 0.0 | 0.0 | 4259.0 | 2653.0 | 1.0 |
beta_M | 0.892 | 0.023 | 0.850 | 0.936 | 0.0 | 0.0 | 3054.0 | 3040.0 | 1.0 |
beta_G | 0.123 | 0.023 | 0.078 | 0.165 | 0.0 | 0.0 | 2941.0 | 2882.0 | 1.0 |
def plot_primate_posterior_comparison(variable, ax, title=None):
az.plot_dist(ordinary_inference.posterior[variable], color="k", label="ordinary", ax=ax)
az.plot_dist(
full_phylogenetic_mvn_inference.posterior[variable],
color="C0",
label="Ornstein-Uhlenbeck",
ax=ax,
)
ax.axvline(0, color="k", linestyle="--")
ax.set_xlabel(f"$\\{variable}$")
ax.set_ylabel("density")
ax.set_title(title)
_, axs = plt.subplots(1, 2, figsize=(10, 5))
plot_primate_posterior_comparison("beta_G", axs[0], "Effect of Group Size\non Brain Size")
plot_primate_posterior_comparison("beta_M", axs[1], "Effect of Body Mass\non Brain size")

By trying to control for potential confounds due to genetic history, we have nearly halved the estimated effect of Group size on Brain size \(\beta_G\), though it’s still mostly positive.
This estimator can also give us the direct effect of Body Mass \(M\) on Brain Size \(B\) (by blocking the pipe through \(G\)). When we look at \(\beta_M\), we see a slight increase in the effect of body mass on brain size when controlling for genetic-history confounds, though the effect is large for both estimators, indicating that body mass is a major driver of brain size.
Summary: Phylogenetic Regression#
Potential Issues#
What about uncertainty in phylogeny?
better to perform phylogenetic inference simultaneously – i.e. estimate a posterior over phylogenies
What about reciprocal causation?
feedback between organism and environment
there’s no unidirectional cause
thus regression is likely not the best option
newer methods include differential equations that pose the problem as a multi-objective optimization problem
Summary: Gaussian Processes#
Provides means for (local) partial pooling for continuous groups/categories
General approximation engine – good for prediction
Robust to overfitting
Sensitive to priors: prior predictive simulation is very important
License notice#
All the notebooks in this example gallery are provided under the MIT License which allows modification, and redistribution for any use provided the copyright and license notices are preserved.
Citing PyMC examples#
To cite this notebook, use the DOI provided by Zenodo for the pymc-examples repository.
Important
Many notebooks are adapted from other sources: blogs, books… In such cases you should cite the original source as well.
Also remember to cite the relevant libraries used by your code.
Here is a citation template in bibtex:
@incollection{citekey,
author = "<notebook authors, see above>",
title = "<notebook title>",
editor = "PyMC Team",
booktitle = "PyMC examples",
doi = "10.5281/zenodo.5654871"
}