Gaussian Processes: HSGP Reference & First Steps#

The Hilbert Space Gaussian processes approximation is a low-rank GP approximation that is particularly well-suited to usage in probabilistic programming languages like PyMC. It approximates the GP using a pre-computed and fixed set of basis functions that don’t depend on the form of the covariance kernel or its hyperparameters. It’s a parametric approximation, so prediction in PyMC can be done as one would with a linear model via pm.Data or pm.set_data. You don’t need to define the .conditional distribution that non-parameteric GPs rely on. This makes it much easier to integrate an HSGP, instead of a GP, into your existing PyMC model. Additionally, unlike many other GP approximations, HSGPs can be used anywhere within a model and with any likelihood function.

It’s also fast. The computational cost for unapproximated GPs per MCMC step is \(\mathcal{O}(n^3)\), where \(n\) is the number of data points. For HSGPs, it is \(\mathcal{O}(mn + m)\), where \(m\) is the number of basis vectors. It’s important to note that sampling speeds are also very strongly determined by posterior geometry.

The HSGP approximation does carry some restrictions:

It can only be used with stationary covariance kernels such as the Matern family. The HSGP class is compatible with any Covariance class that implements the power_spectral_density method. There is a special case made for the Periodic covariance, which is implemented in PyMC by The HSGPPeriodic.
It does not scale well with the input dimension. The HSGP approximation is a good choice if your GP is over a one dimensional process like a time series, or a two dimensional spatial point process. It’s likely not an efficient choice where the input dimension is larger than three.
It may struggle with more rapidly varying processes. If the process you’re trying to model changes very quickly relative to the extent of the domain, the HSGP approximation may fail to accurately represent it. We’ll show in later sections how to set the accuracy of the approximation, which involves a trade-off between the fidelity of the approximation and the computational complexity.
For smaller data sets, the full unapproximated GP may still be more efficient.

A secondary goal of this implementation is flexibility via an accessible implementation where the core computations are implemented in a modular way. For basic usage, users can use the .prior and .conditional methods and essentially treat the HSGP class as a drop in replacement for pm.gp.Latent, the unapproximated GP. More advanced users can bypass those methods and work with .prior_linearized instead, which exposes the HSGP as a parametric model. For more complex models with multiple HSGPs, users can work directly with functions like pm.gp.hsgp_approx.calc_eigenvalues and pm.gp.hsgp_approx.calc_eigenvectors.

References:#

Original reference: Solin & Sarkka, 2019.
HSGPs in probabilistic programming languages: Riutort-Mayol et al., 2020.
HSTPs (Student-t process): Sellier & Dellaportas, 2023.
Kronecker HSGPs: Dan et al., 2022
PyMC’s HSGP API

Example 1: Basic HSGP Usage#

We’ll use simulated data to motivate an overview of the usage of HSGP. Refer to this section if you’re interested in:

Seeing a simple example of HSGP in action.
Replacing a standard GP, i.e. pm.gp.Latent, with a faster approximation – as long as you’re using one of the more common covariance kernels, like ExpQuad, Matern52 or Matern32.
Understanding when to use the centered or the non-centered parameterization.
A quick example of additive GPs

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import preliz as pz
import pymc as pm
import pytensor.tensor as pt

# Sample on the CPU
%env CUDA_VISIBLE_DEVICES=''
# import jax
# import numpyro
# numpyro.set_host_device_count(6)

env: CUDA_VISIBLE_DEVICES=''

az.style.use("arviz-whitegrid")
plt.rcParams["figure.figsize"] = [12, 5]
%config InlineBackend.figure_format = 'retina'
seed = sum(map(ord, "hsgp"))
rng = np.random.default_rng(seed)

Simulate data#

def simulate_1d(x, ell_true, eta_true, sigma_true):
    """Given a domain x, the true values of the lengthscale ell, the
    scale eta, and the noise sigma, simulate a one-dimensional GP
    observed at the given x-locations.
    """

    # Draw one sample from the underlying GP.
    n = len(x)
    cov_func = eta_true**2 * pm.gp.cov.Matern52(1, ell_true)
    cov_mat_stabilized = pm.gp.util.stabilize(cov_func(x[:, None]))
    gp_true = pm.MvNormal.dist(mu=np.zeros(n), cov=cov_mat_stabilized)
    f_true = pm.draw(gp_true, draws=1, random_seed=rng)

    # The observed data is the latent function plus a small amount
    # of Gaussian distributed noise.
    noise_dist = pm.Normal.dist(mu=0.0, sigma=sigma_true)
    y_obs = f_true + pm.draw(noise_dist, draws=n, random_seed=rng)
    return y_obs, f_true

fig = plt.figure(figsize=(10, 4))
ax = fig.gca()

x = 100.0 * np.sort(np.random.rand(2000))
y_obs, f_true = simulate_1d(x=x, ell_true=1.0, eta_true=1.0, sigma_true=1.0)
ax.plot(x, f_true, color="dodgerblue", lw=2, label="True underlying GP 'f'")
ax.scatter(x, y_obs, marker="o", color="black", s=5, label="Observed data")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend(frameon=True)
ax.grid(False)

../_images/c1dfd858e3612e7b00d4f8e35274178fee3488b00067d86d5f48c01d715801d2.png

Define and fit the HSGP model#

First we use pz.maxent to choose our prior for the lengthscale parameter, maxent return the maximum entropy prior with the specified mass within the interval [lower, upper].

We use a Lognormal to penalize very small lengthscales while having a heavy right tail. When the signal from the GP is high relative to the noise, we are able to use more informative priors.

lower, upper = 0.5, 5.0
ell_dist, ax = pz.maxent(
    pz.LogNormal(),
    lower=lower,
    upper=upper,
    mass=0.9,
    plot_kwargs={"support": (0, 7), "legend": None},
)

ax.set_title(r"Prior for the lengthscale, $\ell$")

Text(0.5, 1.0, 'Prior for the lengthscale, $\\ell$')

../_images/46feec01de95d3d481933fb07b2894c0c81dfe7c80c4d4b37edc1aced67bedfd.png

There are a few things to note about the model code below:

The approximation parameters m and c control the approximation fidelity to computational complexity tradeoff. We’ll see in a later section how to choose these values. In short, choosing a larger m helps improve the approximation of smaller lengthscales and other short distance variations that the GP has to fit. Choosing a larger c helps improve the approximation of longer and slower changes.
We chose the centered parameterization because the true underlying GP is strongly informed by the data. You can read more about centered vs. non-centered here and here. In the HSGP class, the default is non-centered, which works better for the, arguably more common, case where the underlying GP is weakly informed by the observed data.

with pm.Model(coords={"basis_coeffs": np.arange(200), "obs_id": np.arange(len(y_obs))}) as model:
    ell = ell_dist.to_pymc("ell")
    eta = pm.Exponential("eta", scale=1.0)
    cov_func = eta**2 * pm.gp.cov.Matern52(input_dim=1, ls=ell)

    # m and c control the fidelity of the approximation
    m, c = 200, 1.5
    parametrization = "centered"
    gp = pm.gp.HSGP(m=[m], c=c, parametrization=parametrization, cov_func=cov_func)
    # Compare to the code for the full, unapproximated GP:
    # gp = pm.gp.Latent(cov_func=cov_func)
    f = gp.prior("f", X=x[:, None], hsgp_coeffs_dims="basis_coeffs", gp_dims="obs_id")

    sigma = pm.Exponential("sigma", scale=1.0)
    pm.Normal("y_obs", mu=f, sigma=sigma, observed=y_obs, dims="obs_id")

    idata = pm.sample(chains=4)
    idata.extend(pm.sample_posterior_predictive(idata, random_seed=rng))

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 2 jobs)
NUTS: [ell, eta, f_hsgp_coeffs, sigma]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 280 seconds.
Sampling: [y_obs]

az.summary(idata, var_names=["eta", "ell", "sigma"], round_to=2)

	mean	sd	hdi_3%	hdi_97%	ess_bulk	ess_tail	r_hat
eta	0.92	0.08	0.77	1.08	3417.69	3082.10	1.0
ell	0.99	0.12	0.77	1.20	1223.26	2022.17	1.0
sigma	1.01	0.02	0.98	1.04	6696.98	2869.34	1.0

az.plot_trace(
    idata,
    var_names=["eta", "ell", "sigma"],
    lines=[("eta", {}, [1]), ("ell", {}, [1]), ("sigma", {}, [1])],
);

../_images/4004fbb06ef63dd3616aa5239ee1443bba50378d35adb1a5bbc411905fb87543.png

Fitting went all good, so we can go ahead and plot the inferred GP, as well as the posterior predictive samples.

Posterior predictive plot#

fig = plt.figure(figsize=(10, 4))
ax = fig.gca()

f = az.extract(idata.posterior.sel(draw=slice(None, None, 10)), var_names="f")
y_preds = az.extract(idata.posterior_predictive.sel(draw=slice(None, None, 10)), var_names="y_obs")

ax.plot(x, y_preds, color="#AAC4E6", alpha=0.02)
ax.plot(x, f, color="#70133A", alpha=0.1)
ax.scatter(x, y_obs, marker="o", color="grey", s=15, label="Observed data")
ax.plot(x, f_true, color="#FBE64D", lw=2, label="True underlying GP 'f'")

ax.set(title="The HSGP Fit", xlabel="x", ylabel="y")
ax.legend(frameon=True, fontsize=11, ncol=2, loc="lower left");

../_images/3c1925ece9f151068e7dcf1d4e409465986b93e498dfe151e21cbb1daaff3a56.png

The inferred underlying GP (in bordeaux) accurately matches the true underlying GP (in yellow). We also see that the posterior predictive samples (in light blue) fit the observed data really well.

Additive GPs

HSGP is compatible with additive covariances, instead of defining two completely independent HSGPs.

Instead of constructing and then directly adding them, the sum of two HSGPs can be computed more efficiently by first taking the sum of their power spectral densities, and then creating a single GP from the combined power spectral density. This reduces the number of unknown parameters because the two GPs can share the same basis set.

The code for this would look similar to:

cov1 = eta1**2 * pm.gp.cov.ExpQuad(input_dim, ls=ell1)
cov2 = eta2**2 * pm.gp.cov.Matern32(input_dim, ls=ell2)
cov = cov1 + cov2

gp = pm.gp.HSGP(m=[m], c=c, cov_func=cov_func)

Choosing the HSGP approximation parameters, `m`, `L`, and `c`#

Before fitting a model with an HSGP, you have to choose m and c or L. m is the number of basis vectors. Recall that the computational complexity of the HSGP approximation is \(\mathcal{O}(mn + m)\), where \(n\) is the number of data points.

This choice is a balance between three concerns:

The accuracy of the approximation.
Reducing the computational burden.
The X locations where predictions or forecasts will need to be made.

At the end of this section, we’ll give the rules of thumb given in Ruitort-Mayol et. al.. The best way to understand how to choose these parameters is to understand how m, c and L relate to each other, which requires understanding a bit more about how the approximation works under the hood.

How `L` and `c` affect the basis#

Speaking non-technically, the HSGP approximates the GP prior as a linear combination of sinusoids. The coefficients of the linear combination are IID normal random variables whose standard deviation depends on GP hyperparameters (which are an amplitude and lengthscale for the Matern family). Users are who are interested in further introductory details should see this fantastic blog post by Juan Ordiz.

To see this, we’ll make a few plots of the \(m=3\) and \(m=5\) basis vectors and pay careful attention to how they behave at the boundaries of the domain. Note that we have to center the x data first, and then choose L in relation to the centered data. It’s worth mentioning here that the basis vectors we’re plotting do not depend on either the choice of the covariance kernel or on any unknown parameters the covariance function has.

# Our data goes from x=-5 to x=5
x = np.linspace(-5, 5, 1000)

# (plotting code)
fig, axs = plt.subplots(1, 3, figsize=(14, 4), sharey=True, constrained_layout=True)

ylim = 0.55
axs[0].set_ylim([-ylim, ylim])
axs[1].set_yticks([])
axs[1].set_xlabel("xs (mean subtracted x)")
axs[2].set_yticks([])

# change L as we create the basis vectors
L_options = [5.0, 6.0, 20.0]
m_options = [3, 3, 5]
for i, ax in enumerate(axs.flatten()):
    L = L_options[i]
    m = m_options[i]

    eigvals = pm.gp.hsgp_approx.calc_eigenvalues(pt.as_tensor([L]), [m])
    phi = pm.gp.hsgp_approx.calc_eigenvectors(
        x[:, None],
        pt.as_tensor([L]),
        eigvals,
        [m],
    ).eval()

    for j in range(phi.shape[1]):
        ax.plot(x, phi[:, j])

    ax.set_xticks(np.arange(-5, 6, 5))

    S = 5.0
    c = L / S
    ax.text(-4.9, -0.45, f"L = {L}\nc = {c}", fontsize=15)
    ax.set(title=f"{m} basis functions")

plt.suptitle("The effect of changing $L$ on the HSGP basis vectors", fontsize=18);

../_images/0db0622ed8eada6a02921888ea4ae2f18fa39b3f39004ab1eff835463e51bf20.png

The first and middle panels have 3 basis functions, and the rightmost has 5.

Notice that both L and m are specified as lists, to allow setting L and m per input dimension. In this example these are both one element lists since our example is in a one dimensional, time series like context. Before continuing, it’s helpful to define \(S\) as the half range of the centered data, or the distance from the midpoint at \(x=0\) to the edge, \(x=5\). In this example \(S=5\) for each plot panel. Then, we can define \(c\) such that it relates \(S\) to \(L\),

\[ L = c \cdot S \,. \]

It’s usually easier to set \(L\) by choosing \(c\), which acts as a multiplier on \(S\).

In the left-most plot we chose \(L=S=5\), which is exactly on the edge of our x locations. For any \(m\), all the basis vectors are forced to pinch to zero at the edges, at \(x=-5\) and \(x=5\). This means that the HSGP approximation becomes poor as you get closer to \(x=-5\) and \(x=5\). How quickly depends on the lengthscale. Large lengthscales require larger values of \(L\) and \(c\), and smaller lengthscales attenuate this issue. Ruitort-Mayol et. al. recommend using 1.2 as a minimum value. The effect of this choice on the basis vectors is shown in the center panel. In particular, we can now see that the basis vectors are not forced to pinch to zero.

The right panel shows the effect of choosing a larger \(L\), or setting \(c=4\). Larger values of \(L\) or \(c\) make the boundary conditions less problematic, and are required to accurately approximate GPs with longer lengthscales. You also need to consider where predictions will need to be made. In addition to the locations of the observed \(x\) values, the locations of the new \(x\) locations also need to be away from the “pinch” caused by the boundary condition. The period of the basis functions also increases as we increase \(L\) or \(c\). This means that we will need to increase \(m\), the number of basis vectors, in order to compensate if we wish to approximate GPs with smaller lengthscales.

With large \(L\) or \(c\), the first eigenvector can flatten so much that it becomes partially or completely unidentifiable with the intercept in the model. The right-most panel is an example of this (see the blue basis vector). It can be very beneficial to sampling to drop the first eigenvector in these situations. The HSGP and HSGPPeriodic class in PyMC both have the option drop_first to do this, or if you’re using .prior_linearized you can control this yourself. Be sure to check the basis vectors if the sampler is having issues.

To summarize:

Increasing \(m\) helps the HSGP approximate GPs with smaller lengthscales, at the cost of increasing computational complexity.
Increasing \(c\) or \(L\) helps the HSGP approximate GPs with larger lengthscales, but may require increasing \(m\) to compensate for the loss of fidelity at smaller lengthscales.
When choosing \(m\), \(c\) or \(L\), it’s important to consider the locations where you will need to make predictions, such that they also aren’t affected by the boundary condition.
The first basis vector may be unidentified with the intercept, especially when \(L\) or \(c\) are larger.

Heuristics for choosing \(m\) and \(c\)#

In practice, you’ll need to infer the lengthscale from the data, so the HSGP needs to approximate a GP across a range of lengthscales that are representative of your chosen prior. You’ll need to choose \(c\) large enough to handle the largest lengthscales you might fit, and also choose \(m\) large enough to accommodate the smallest lengthscales.

Ruitort-Mayol et. al. give some handy heuristics for the range of lengthscales that are accurately reproduced for given values of \(m\) and \(c\). Below, we provide a function that uses their heuristics to recommend minimum \(m\) and \(c\) value. Note that these recommendations are based on a one-dimensional GP.

For example, if you’re using the Matern52 covariance and your data ranges from \(x=-5\) to \(x=95\), and the bulk of your lengthscale prior is between \(\ell=1\) and \(\ell=50\), then the smallest recommended values are \(m=543\) and \(c=4.1\), as you can see below:

m52_m, m52_c = pm.gp.hsgp_approx.approx_hsgp_hyperparams(
    x_range=[-5, 95], lengthscale_range=[1, 50], cov_func="matern52"
)

print("Recommended smallest number of basis vectors for Matern 5/2 (m):", m52_m)
print("Recommended smallest scaling factor for Matern 5/2 (c):", np.round(m52_c, 1))

Recommended smallest number of basis vectors for Matern 5/2 (m): 543
Recommended smallest scaling factor for Matern 5/2 (c): 4.1

The HSGP approximate Gram matrix#

You may not be able to rely on these heuristics for a few reasons. You may be using a different covariance function than ExpQuad, Matern52, or Matern32. Also, they’re only defined for one dimensional GPs. Another way to check HSGP fidelity is to directly compare the unapproximated Gram matrix (the Gram matrix is the matrix obtained after calculating the covariance function over the inputs X), \(\mathbf{K}\), to the one resulting from the HSGP approximation,

\[ \tilde{\mathbf{K}} = \Phi \Delta \Phi^T \,, \]

where \(\Phi\) is the matrix of eigenvectors we use as the basis (plotted previously), and \(\Delta\) has the spectral densities computed at the eigenvalues down the diagonal. Below we show an example with a two dimensional grid of input X. It’s important to notice that the HSGP approximation requires us to center the input X data, which is done by converting X to Xs in the code below. We plot the approximate Gram matrix for varying \(L\) and \(c\) values, to see when the approximation starts to degrade for the given X locations and lengthscale choices.

## Define the X locations and calculate the Gram matrix from a given covariance function
x1, x2 = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 10, 4))
X = np.vstack((x2.flatten(), x1.flatten())).T

# X is two dimensional, so we set input_dim=2
chosen_ell = 3.0
cov_func = pm.gp.cov.ExpQuad(input_dim=2, ls=chosen_ell)
K = cov_func(X).eval()

## Calculate the HSGP approximate Gram matrix
# Center or "scale" X so we can work with Xs (important)
X_center = (np.max(X, axis=0) + np.min(X, axis=0)) / 2.0
Xs = X - X_center

# Calculate L given Xs and c
m, c = [20, 20], 2.0
L = pm.gp.hsgp_approx.set_boundary(Xs, c)

def calculate_Kapprox(Xs, L, m):
    # Calculate Phi and the diagonal matrix of power spectral densities
    eigvals = pm.gp.hsgp_approx.calc_eigenvalues(L, m)
    phi = pm.gp.hsgp_approx.calc_eigenvectors(Xs, L, eigvals, m)
    omega = pt.sqrt(eigvals)
    psd = cov_func.power_spectral_density(omega)
    return (phi @ pt.diag(psd) @ phi.T).eval()

fig, axs = plt.subplots(2, 4, figsize=(14, 7), sharey=True, layout="tight")

axs[0, 0].imshow(K, cmap="inferno", vmin=0, vmax=1)
axs[0, 0].set(xlabel="x1", ylabel="x2", title=f"True Gram matrix\nTrue $\\ell$ = {chosen_ell}")
axs[1, 0].axis("off")
im_kwargs = {
    "cmap": "inferno",
    "vmin": 0,
    "vmax": 1,
    "interpolation": "none",
}

## column 1
m, c = [30, 30], 5.0
L = pm.gp.hsgp_approx.set_boundary(Xs, c)
K_approx = calculate_Kapprox(Xs, L, m)
axs[0, 1].imshow(K_approx, **im_kwargs)
axs[0, 1].set_title(f"m = {m}, c = {c}")

m, c = [30, 30], 1.2
L = pm.gp.hsgp_approx.set_boundary(Xs, c)
K_approx = calculate_Kapprox(Xs, L, m)
axs[1, 1].imshow(K_approx, **im_kwargs)
axs[1, 1].set(xlabel="x1", ylabel="x2", title=f"m = {m}, c = {c}")

## column 2
m, c = [15, 15], 5.0
L = pm.gp.hsgp_approx.set_boundary(Xs, c)
K_approx = calculate_Kapprox(Xs, L, m)
axs[0, 2].imshow(K_approx, **im_kwargs)
axs[0, 2].set_title(f"m = {m}, c = {c}")

m, c = [15, 15], 1.2
L = pm.gp.hsgp_approx.set_boundary(Xs, c)
K_approx = calculate_Kapprox(Xs, L, m)
axs[1, 2].imshow(K_approx, **im_kwargs)
axs[1, 2].set_title(f"m = {m}, c = {c}")

## column 3
m, c = [2, 2], 5.0
L = pm.gp.hsgp_approx.set_boundary(Xs, c)
K_approx = calculate_Kapprox(Xs, L, m)
axs[0, 3].imshow(K_approx, **im_kwargs)
axs[0, 3].set_title(f"m = {m}, c = {c}")

m, c = [2, 2], 1.2
L = pm.gp.hsgp_approx.set_boundary(Xs, c)
K_approx = calculate_Kapprox(Xs, L, m)
axs[1, 3].imshow(K_approx, **im_kwargs)
axs[1, 3].set_title(f"m = {m}, c = {c}")

for ax in axs.flatten():
    ax.grid(False)

../_images/157410688980a495784e06423bd96689d6c8ce3b12fa1424e97b38f4243726a6.png

The plots above compare the approximate Gram matrices to the unapproximated Gram matrix in the top left panel. The goal is to compare the approximated Gram matrices to the true one (upper left). Qualitatively, the more similar they look the better the approximation. Also, these results are only relevant to the context of the particular domain defined by X and the chosen lengthscale, \(\ell = 3\) – just because it looks good for \(\ell = 3\) doesn’t mean it will look good for, for instance, \(\ell = 10\).

We can make a few observations:

The approximation visually looks good for the two panels with \(m = 15\) or \(m = 30\), and with \(c=5.0\). The rest show clear differences to the unapproximated Gram matrix.
\(c=1.2\) is generally too small, regardless of \(m\).
Perhaps surprisingly, the \(m=[2, 2]\), \(c=1.2\) approximation looks better than the \(m=[2, 2]\), \(c=5\) one. As we showed earlier, when we “stretch” the eigenvector basis to fill a larger domain than our X (larger by the multiple \(c\)), we can lose fidelity at smaller lengthscales. In other words, in the second case. \(m\) is too small for the value of \(c\). That’s why the first option looks better.
The second row (\(c=1.2\)) doesn’t really improve as \(m\) increases. That’s because \(m\) is good enough to capture the smaller lengthscales, but \(c\) is always too small to capture the larger lengthscales.
The first row on the other hand shows that \(c=5\) is good enough for the larger lengthscales, and once we hit \(m=15\) we’re also able to capture the smaller ones.

For your particular situation, you will need to experiment across your range of lengthscales and quantify how much approximation error is acceptable. Often, when prototyping a model, you can use a lower fidelity HSGP approximation for faster sampling. Then, once you understand the range of relevant lengthscales, you can dial in the correct \(m\) and \(L\) (or \(c\)) values.

Be aware that it’s also possible to encounter scenarios where a low fidelity HSGP approximation gives a more parsimonious fit than a high fidelity HSGP approximation. A low fidelity HSGP approximation is still a valid prior for some unknown function, if somewhat contrived. Whether that matters will depend on your context.

Avoiding underflow issues

As noted above, the diagonal matrix \(\Delta\) used in the calculation of the approximate Gram matrix contains information on the power spectral density, \(\mathcal{S}\), of a given kernel. Thus, for the Gram matrix to be defined, \(\mathcal{S} > 0\). Consequently, when picking HSGP hyperparameters \(m\) and \(L\) it is important to check \(\mathcal{S} > 0\) for the suggested \(m\) and \(L\) values. The code in the next few cell compares the suitability of the suggested hyperparameters \(m\) and \(L\) for matern52 to that of ExpQuad for the data spanning \(x=-5\) to \(x=95\), with the lengthscale prior between \(\ell=1\) and \(\ell=50\). As we shall see, the suggested hyperparameters for ExpQuad are for not suitable for \(\ell=50\).

Matern \(\mathbf{\nu = 5/2}\)

>>> m52_L = m52_c * 50  # c * s, s is the half-range of the data i.e 0.5*(95 - -5).
>>> print(
... f"""m52_m = {m52_m:.1f}, 
... m52_c = {m52_c:.1f}, 
... m52_s = {50:.1f}
... and m52_L = {m52_L:.1f}"""
... )
m52_m = 543.0, 
m52_c = 4.1, 
m52_s = 50.0
and m52_L = 205.0

>>> m52_eigvals = pm.gp.hsgp_approx.calc_eigenvalues(m52_L, [m52_m])
>>> m52_omega = pt.sqrt(m52_eigvals)
>>> matern52_cov_ell_1 = pm.gp.cov.Matern52(1, ls=1)
>>> matern52_cov_ell_50 = pm.gp.cov.Matern52(1, ls=50)

>>> # check none have underflowed to zero.
>>> assert np.all(matern52_cov_ell_1.power_spectral_density(m52_omega).eval() > 0)
>>> assert np.all(matern52_cov_ell_50.power_spectral_density(m52_omega).eval() > 0)

Squared exponential

>>> # get ExpQuad suggested hyperparams.
>>> epq_m, epq_c = pm.gp.hsgp_approx.approx_hsgp_hyperparams(
...    x_range=[-5, 95], lengthscale_range=[1, 50], cov_func="ExpQuad"
... )

>>> print("Recommended smallest number of basis vectors for ExpQuad (m):", epq_m)
Recommended smallest number of basis vectors for ExpQuad (m): 280
>>> print("Recommended smallest scaling factor for ExpQuad (c):", np.round(epq_c, 1))
Recommended smallest scaling factor for ExpQuad (c): 3.2

>>> epq_L = epq_c * 50  # c * s
>>> print(
... f"""epq_m = {epq_m:.1f},
... epq_c = {epq_c:.1f},
... epq_s = {50:.1f},
... and epq_L = {epq_L:.1f}"""
... )
m52_m = 543.0, 
m52_c = 4.1, 
m52_s = 50.0,
and m52_L = 205.0

>>> epq_eigvals = pm.gp.hsgp_approx.calc_eigenvalues(epq_L, [epq_m])
>>> epq_omega = pt.sqrt(epq_eigvals)
>>> epq_cov_ell_1 = pm.gp.cov.ExpQuad(1, ls=1)
>>> epq_cov_ell_50 = pm.gp.cov.ExpQuad(1, ls=50)

>>> # repeat check as in the Matern52.
>>> assert np.all(epq_cov_ell_1.power_spectral_density(epq_omega).eval() > 0)
>>> assert np.all(
...     epq_cov_ell_50.power_spectral_density(epq_omega).eval() > 0,
...     "Power spectral density underflows when ls = 50.",
... )  # this will not pass assertion.

We see that not all values of \(\mathcal{S}\) are defined for the squared exponential kernel when \(\ell=50\).

To see why, the covariance of the kernels considered are plotted below along with their power spectral densities in log space. The covariance plot shows that for a set \(\ell\), the tails of matern52 are heavier than ExpQuad, while a higher \(\ell\) for a given kernel type gives rise to higher covariance. The power spectral density is inversely proportional to the covariance - essentially the flatter the shape of the covariance function, the narrower the bandwidth and the lower the power spectral density at higher values of \(\omega\). As a result, we see that for ExpQuad with \(\ell = 50\), \(\mathcal{S}\left(\omega\right)\) rapidly decreases towards \(0\) before the domain of \(\omega\) is exhausted, and hence we reach values at which we underflow to \(0\).

>>> x = np.linspace(0, 10, 101)[:, None]
>>> fig, ax = plt.subplots(2, layout="tight", figsize=(10, 6))

>>> ax[0].set_title(f"Covariance")
>>> ax[0].plot(x, epq_cov_ell_1(x).eval()[0], label=r"ExpQuad, $\ell = 1$")
>>> ax[0].plot(x, epq_cov_ell_50(x).eval()[0], label=r"ExpQuad, $\ell = 50$")
>>> ax[0].plot(x, matern52_cov_ell_1(x).eval()[0], label=r"Matern 5/2, $\ell = 1$", linestyle="--")
>>> ax[0].plot(x, matern52_cov_ell_50(x).eval()[0], label=r"Matern 5/2, $\ell = 50$", linestyle="--")
>>> ax[0].set_xlabel(r"$x_\mathrm{p}-x_\mathrm{q}$")
>>> ax[0].set_ylabel(r"$k\left(x_\mathrm{p}-x_\mathrm{q}\right)$")
>>> ax[0].set_yscale("log")
>>> ax[0].set_ylim(1e-10, 1e1)
>>> ax[0].legend(frameon=False, loc="lower left")


>>> ax[1].plot(epq_omega.eval(), epq_cov_ell_1.power_spectral_density(epq_omega).eval())
>>> ax[1].plot(epq_omega.eval(), epq_cov_ell_50.power_spectral_density(epq_omega).eval())
>>> ax[1].plot(
...     m52_omega.eval(), matern52_cov_ell_1.power_spectral_density(m52_omega).eval(), linestyle="--"
... )
>>> ax[1].plot(
...     m52_omega.eval(), matern52_cov_ell_50.power_spectral_density(m52_omega).eval(), linestyle="--"
... )
>>> ax[1].set_title("Power Spectral Density")
>>> ax[1].set_xlabel(r"$\omega$")
>>> ax[1].set_ylabel(r"$\mathcal{S}\left(\omega\right)$")
>>> ax[1].set_yscale("log")
>>> ax[1].set_ylim(1e-10, 3e2)
>>> plt.show()

alt text These underflow issues can arise when using a broad prior on \(\ell\) as you need an \(m\) large enough to cover small lengthscales, but these may cause underflow in \(\mathcal{S}\) when \(\ell\) is large. As the graphs above suggest, one can consider a different kernel with heavier tails such as matern52 or matern32.

Alternatively, if you are certain you need a specific kernel, you can use the linear form of HSGPs (see below) with a boolean mask. In doing so, the sinusoids with vanishingly small coefficients in the linear combination are effectively screened out. E.g:

>>> import pymc as pm
>>> import numpy as np

>>> x = np.sort(np.random.uniform(-1, 1, 10))

>>> large_m, large_l = pm.gp.hsgp_approx.approx_hsgp_hyperparams(
...    x_range=[-1, 1], lengthscale_range=[1E-2, 4], cov_func="ExpQuad"
... )

>>> print(large_m, large_l)
2240, 12.8

>>> with pm.Model() as model:
...     # some broad prior on the lengthscale.
...     ell = pm.HalfNormal('ell', sigma=1)
...     cov_func = pm.gp.cov.ExpQuad(input_dim=1, ls=ell)
...     # setup HSGP.
...     gp = pm.gp.HSGP(m=[large_m], L=[large_l], parametrization="noncentered", cov_func=cov_func)
...     phi, sqrt_psd = gp.prior_linearized(x[:, None])
...     basis_coeffs = pm.Normal("basis_coeffs", size=gp.n_basis_vectors)
...     # create mask that screens out frequencies with underflowing power spectral densities.
...     mask = sqrt_psd > 0
...     # now apply the mask over the m dimension & calculate HSGP function.
...     f = pm.Deterministic("f", phi[:, mask] @ (basis_coeffs[mask] * sqrt_psd[mask]))
...     # setup your observation model
...     ...

Example 2: Working with HSGPs as a parametric, linear model#

One of the main benefits of the HSGP approximation is the ability to integrate it into existing models, especially if you need to do prediction in new x-locations after sampling. Unlike other GP implementations in PyMC, you can bypass the .prior and .conditional API, and instead use HSGP.prior_linearized, which allows you to use pm.Data and pm.set_data for making predictions.

Refer to this section if you’re interested in:

Seeing a two dimensional, or spatial, HSGP example with other predictors in the model.
Using HSGPs for prediction within larger PyMC models.
Convert your HSGP approximation into an HSTP approximation, or an approximation to a TP, or Student-t process.

Data generation#

def simulate_2d(
    beta0_true,
    beta1_true,
    ell_true,
    eta_true,
    sigma_true,
):
    # Create the 2d X locations
    from scipy.stats import qmc

    sampler = qmc.Sobol(d=2, scramble=False, optimization="lloyd")
    X = 20 * sampler.random_base2(m=11) - 10.0

    # add the fixed effect at specific intervals
    ix = 1.0 * (np.abs(X[:, 0] // 5) == 1)
    X = np.hstack((X, ix[:, None]))

    # Draw one sample from the underlying GP
    n = X.shape[0]
    cov_func = eta_true**2 * pm.gp.cov.Matern52(3, ell_true, active_dims=[0, 1])
    gp_true = pm.MvNormal.dist(mu=np.zeros(n), cov=cov_func(X))
    f_true = pm.draw(gp_true, draws=1, random_seed=rng)

    # Add the fixed effects
    mu = beta0_true + beta1_true * X[:, 2] + f_true

    # The observed data is the latent function plus a small amount
    # of Gaussian distributed noise.
    noise_dist = pm.Normal.dist(mu=0.0, sigma=sigma_true)
    y_obs = mu + pm.draw(noise_dist, draws=n, random_seed=rng)
    return y_obs, f_true, mu, X

y_obs, f_true, mu, X = simulate_2d(
    beta0_true=3.0,
    beta1_true=2.0,
    ell_true=1.0,
    eta_true=1.0,
    sigma_true=0.75,
)

# Split into train and test sets
ix_tr = (X[:, 1] < 2) | (X[:, 1] > 4)
ix_te = (X[:, 1] > 2) & (X[:, 1] < 4)

X_tr, X_te = X[ix_tr, :], X[ix_te, :]
y_tr, y_te = y_obs[ix_tr], y_obs[ix_te]

fig = plt.figure(figsize=(13, 4))

ax1 = plt.subplot(131)
ax1.scatter(X_tr[:, 0], X_tr[:, 1], c=mu[ix_tr] - f_true[ix_tr])
ax1.set_title("$\\beta_0 + \\beta_1 X$")
ax1.set_ylabel("$x_1$", rotation=0)

ax2 = plt.subplot(132)
ax2.scatter(X_tr[:, 0], X_tr[:, 1], c=f_true[ix_tr])
ax2.set_title("The spatial GP, $f$")
ax2.set_yticks([])
ax2.set_xlabel("$x_0$")

ax3 = plt.subplot(133)
im = ax3.scatter(X_tr[:, 0], X_tr[:, 1], c=y_obs[ix_tr])
ax3.set_title("The observed data, $y$")
ax3.set_yticks([])

fig.colorbar(im, ax=[ax1, ax2, ax3]);

../_images/46ad2395e911e32a9e2d6bc5b970af802f152cd161bd68ea3877750f9788babb.png

As expected, we clearly see that the test set is in the region where \(2 < x1 < 4\).

Here is the model structure corresponding to our generative scenario. Below we describe its main components.

Model structure#

with pm.Model() as model:
    # Set mutable data
    X_gp = pm.Data("X_gp", X_tr[:, :2])
    X_fe = pm.Data("X_fe", X_tr[:, 2])

    # Priors on regression coefficients
    beta = pm.Normal("beta", mu=0.0, sigma=10.0, shape=2)

    # Prior on the HSGP
    eta = pm.Exponential("eta", scale=2.0)
    ell_dist = pz.maxent(pz.LogNormal(), lower=0.5, upper=5.0, mass=0.9, plot=False)
    ell = ell_dist.to_pymc("ell")

    cov_func = eta**2 * pm.gp.cov.Matern52(input_dim=2, ls=ell)

    # m and c control the fidelity of the approximation
    m0, m1, c = 30, 30, 2.0
    gp = pm.gp.HSGP(m=[m0, m1], c=c, cov_func=cov_func)

    phi, sqrt_psd = gp.prior_linearized(X=X_gp)

    basis_coeffs = pm.Normal("basis_coeffs", size=gp.n_basis_vectors)
    f = pm.Deterministic("f", phi @ (basis_coeffs * sqrt_psd))

    mu = pm.Deterministic("mu", beta[0] + beta[1] * X_fe + f)

    sigma = pm.Exponential("sigma", scale=2.0)
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y_tr, shape=X_gp.shape[0])

    idata = pm.sample_prior_predictive(random_seed=rng)

pm.model_to_graphviz(model)

Sampling: [basis_coeffs, beta, ell, eta, sigma, y_obs]

../_images/c3576b3c81101d24af64f3c5010296df03f050511c92c6081a6c6cc347ac72f3.svg

Before sampling and looking at the results, there are a few things to pay attention to in the model above.

Setting the coefficients, centered and non-centered#

First, prior_linearized returns the eigenvector basis, phi, and the square root of the power spectrum at the eigenvalues, sqrt_psd. You have to construct the HSGP approximation from these. The following are the relevant lines of code, showing both the centered and non-centered parameterization.

phi, sqrt_psd = gp.prior_linearized(X=X)

## non-centered
basis_coeffs= pm.Normal("basis_coeffs", size=gp.n_basis_vectors)
f = pm.Deterministic("f", phi @ (beta * sqrt_psd)) 

## centered
basis_coeffs= pm.Normal("basis_coeffs", sigma=sqrt_psd, size=gp.n_basis_vectors)
f = pm.Deterministic("f", phi @ beta) 

Be sure to set the size of basis_coeffs using the n_basis_vectors attribute of the HSGP object (or the number of columns of phi), \(m^* = \prod_i m_i\). In the above example, \(m^* = 30 \cdot 30 = 900\), and is the total number of basis vectors used in the approximation.

Approximating a TP instead of a GP#

We can slightly modify the code above to obtain a Student-t process,

nu = pm.Gamma("nu", alpha=2, beta=0.1)
basis_coeffs= pm.StudentT("basis_coeffs", nu=nu, size=gp.n_basis_vectors)
f = pm.Deterministic("f", phi @ (beta * sqrt_psd)) 

where we use a \(\text{Gamma}(\alpha=2, \beta=0.1)\) prior for \(\nu\), which places around 50% probability that \(\nu > 30\), the point where a Student-T roughly becomes indistinguishable from a Gaussian. See this link for more information.

Results#

Now, let’s sample the model and quickly check the results:

with model:
    idata.extend(pm.sample(chains=4, nuts_sampler="numpyro", random_seed=rng))

idata.sample_stats.diverging.sum().data

array(0)

var_names = [var.name for var in model.free_RVs if var.size.eval() <= 2]
az.summary(idata, var_names=var_names, round_to=2)

	mean	sd	hdi_3%	hdi_97%	ess_bulk	ess_tail	r_hat
beta[0]	2.92	0.12	2.71	3.14	6346.91	2880.37	1.0
beta[1]	2.11	0.11	1.89	2.32	7778.31	2742.37	1.0
eta	0.98	0.06	0.88	1.10	1500.77	2535.33	1.0
ell	0.89	0.10	0.71	1.07	1957.41	2648.51	1.0
sigma	0.85	0.01	0.82	0.87	4008.49	3324.29	1.0

az.plot_trace(
    idata,
    var_names=var_names,
    lines=[("beta", {}, [3, 2]), ("ell", {}, [1]), ("eta", {}, [1]), ("sigma", {}, [0.75])],
);

../_images/c64130a86340e552355f79814e8b9c03eaf80fa1d83cc62459b18b9f3737861c.png

Sampling went great, but, interestingly, we seem to have a bias in the posterior for sigma. It’s not the focus of this notebook, but it’d be interesting to dive into this in a real use-case.

Out-of-sample predictions#

Then, we can just use pm.set_data to make predictions at new points. We’ll show the fit and the predictions together in the plot below.

with model:
    pm.set_data({"X_gp": X[:, :2], "X_fe": X[:, 2]})

    idata_thinned = idata.sel(draw=slice(None, None, 10))
    idata.extend(
        pm.sample_posterior_predictive(idata_thinned, var_names=["f", "mu"], random_seed=rng)
    )

Sampling: []

pm.model_to_graphviz(model)

../_images/b7ecffa2c1e2f646bd73009357596b10c91d05bd8f7a8e4948ddf6a87d50db46.svg

fig = plt.figure(figsize=(13, 4))

ax1 = plt.subplot(131)
ax1.scatter(X[:, 0], X[:, 1], c=f_true)
ax1.set_title("True underlying GP")
ax1.set_xlabel("$x_0$")
ax1.set_ylabel("$x_1$", rotation=0)

ax2 = plt.subplot(132)
f_sd = az.extract(idata.posterior_predictive, var_names="f").std(dim="sample")
ax2.scatter(X[:, 0], X[:, 1], c=f_sd)
ax2.set_title("Std. dev. of the inferred GP")
ax2.set_yticks([])
ax2.set_xlabel("$x_0$")

ax3 = plt.subplot(133)
f_mu = az.extract(idata.posterior_predictive, var_names="f").mean(dim="sample")
im = ax3.scatter(X[:, 0], X[:, 1], c=f_mu)
ax3.set_title("Mean of the inferred GP")
ax3.set_yticks([])
ax3.set_xlabel("$x_0$")

fig.colorbar(im, ax=[ax1, ax2, ax3]);

../_images/ae7c255acc20dda249b009665ffcd4e89981e4bfc485a403e0eff29e7ac8cab0.png

Sampling diagnostics all look good, and we can see that the underlying GP was inferred nicely. We can also see the increase in uncertainty outside of our training data as a horizontal stripe in the middle panel, showing the increased standard deviation of the inferred GP here.

Authors#

Created by Bill Engels and Alexandre Andorra in 2024 (pymc-examples#647)
Use pz.maxent instead of pm.find_constrained_prior, and add random seed. Osvaldo Martin. August 2024
Use pm.gp.util.stabilize in simulate_1d. Use pz.maxent rather than pm.find_constrained_prior in linearized HSGP model. Added comparison between matern52 and ExpQuad power spectral densities. Alexander Armstrong, July-August 2025.

Watermark#

%load_ext watermark
%watermark -n -u -v -iv -w -p xarray

Last updated: Fri Oct 10 2025

Python implementation: CPython
Python version       : 3.13.5
IPython version      : 9.4.0

xarray: 2025.7.1

numpy     : 2.2.6
matplotlib: 3.10.3
pytensor  : 2.31.7
pymc      : 5.25.1
arviz     : 0.22.0
preliz    : 0.20.0

Watermark: 2.5.0

License notice#

All the notebooks in this example gallery are provided under the MIT License which allows modification, and redistribution for any use provided the copyright and license notices are preserved.

Citing PyMC examples#

To cite this notebook, use the DOI provided by Zenodo for the pymc-examples repository.

Important

Many notebooks are adapted from other sources: blogs, books… In such cases you should cite the original source as well.

Also remember to cite the relevant libraries used by your code.

Here is an citation template in bibtex:

@incollection{citekey,
  author    = "<notebook authors, see above>",
  title     = "<notebook title>",
  editor    = "PyMC Team",
  booktitle = "PyMC examples",
  doi       = "10.5281/zenodo.5654871"
}

which once rendered could look like:

Categories

Tags

Gaussian Processes: HSGP Reference & First Steps#

References:#

Example 1: Basic HSGP Usage#

Simulate data#

Define and fit the HSGP model#

Posterior predictive plot#

Choosing the HSGP approximation parameters, `m`, `L`, and `c`#

How `L` and `c` affect the basis#

Heuristics for choosing \(m\) and \(c\)#

The HSGP approximate Gram matrix#

Example 2: Working with HSGPs as a parametric, linear model#

Data generation#

Model structure#

Setting the coefficients, centered and non-centered#

Approximating a TP instead of a GP#

Results#

Out-of-sample predictions#

Authors#

Watermark#

License notice#

Citing PyMC examples#

Categories

Tags

Gaussian Processes: HSGP Reference & First Steps#

References:#

Example 1: Basic HSGP Usage#

Simulate data#

Define and fit the HSGP model#

Posterior predictive plot#

Choosing the HSGP approximation parameters, m, L, and c#

How L and c affect the basis#

Heuristics for choosing \(m\) and \(c\)#

The HSGP approximate Gram matrix#

Example 2: Working with HSGPs as a parametric, linear model#

Data generation#

Model structure#

Setting the coefficients, centered and non-centered#

Approximating a TP instead of a GP#

Results#

Out-of-sample predictions#

Authors#

Watermark#

License notice#

Citing PyMC examples#

Choosing the HSGP approximation parameters, `m`, `L`, and `c`#

How `L` and `c` affect the basis#