DEMetropolis(Z) Sampler Tuning#
import time
import arviz as az
az.rcParams["plot.max_subplots"] = 100
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
from matplotlib import cm, gridspec
print(f"Running on PyMC v{pm.__version__}")
Running on PyMC v0+untagged.9358.g8ea092d
%load_ext watermark
az.style.use("arviz-darkgrid")
rng = np.random.default_rng(1234)
Background#
For continuous variables, the default PyMC sampler (NUTS) requires gradients, which PyMC computes through automatic differentiation. However, in some cases a PyMC model may not have gradients available (for example, when a numerical model is evaluated outside of PyMC) and an alternative sampler is necessary. The DEMetropolisZ sampler is an efficient choice for gradient-free inference. The implementation of DEMetropolisZ in PyMC is based on ter Braak and Vrugt [2008], but with a modified tuning scheme. This notebook compares various tuning parameter settings for the sampler, including the tune_drop_fraction parameter, which was introduced in PyMC.
Key Take-Away (TL;DR)#
The DEMetropolisZ sampler implementation in PyMC has sensible defaults; however, two manual adjustments improved sampler performance in this notebook: setting tune to "scaling", and setting proposal_dist to pm.NormalProposal. In the simplest form, the manual adjustments look like the following:
with model:
    step = pm.DEMetropolisZ(
        tune="scaling",
        proposal_dist=pm.NormalProposal,
    )
    idata = pm.sample(step=step)
Sampler Approach#
The different flavors of the Metropolis algorithm for continuous variables in PyMC have different methods for deciding how far and in which direction to make jumps. The vanilla pm.Metropolis sampler performs a tuning process to scale the size of the jumps. Each jump is then a random draw from a proposal distribution, multiplied by the tuned scaling factor.
The differential evolution (DE) Metropolis algorithms use randomly selected draws from other chains (DEMetropolis) or from past draws of the current chain (DEMetropolis(Z)) to make more educated jumps. Here is the formula for making jumps for the DEMetropolis algorithms using the nomenclature in ter Braak and Vrugt [2008]:
Differential Evolution Jump Equation#
\[x^* = x_i + \gamma \, (x_{R1} - x_{R2}) + \epsilon\]
where:
\(x^*=\) proposal
\(x_i=\) current state (sample)
\(\gamma=\) factor that determines the size of the jump relative to the random vector. This is referred to as lamb in the PyMC implementation and is tuned by default, with a starting value of \({2.38}/{\sqrt{2d}}\), where \(d=\) the number of dimensions (i.e., parameters).
\((x_{R1}-x_{R2})=\) a random vector based on the subtraction of two randomly selected past states (samples). DEMetropolis selects from other chains, while DEMetropolis(Z) selects from past draws of the current chain.
\(\epsilon=\) additional noise that is added to each move:
\[\epsilon \sim b \, \mathscr{D}_{p}(A)\]
where:
\(\mathscr{D}_{p}(A)=\) a proposal distribution centered on zero with a variance term \(A\) (defaults to a Uniform proposal with \(A=1\) [i.e., -1 to 1] in PyMC).
\(b=\) scaling factor. This is scaling in PyMC and can be tuned (defaults to 0.001).
In PyMC, we can tune either lamb (\(\gamma\)) or scaling (\(b\)); whichever one is not tuned stays fixed.
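To make the proposal mechanics concrete, here is an illustrative sketch of a single DEMetropolis(Z)-style jump following the equation above. This is not PyMC's internal code; the history array and the sample values are stand-ins for demonstration only.

# Illustrative sketch of one DEMetropolis(Z)-style proposal (not PyMC internals).
import numpy as np

rng = np.random.default_rng(0)
d = 10                                  # number of dimensions
history = rng.normal(size=(500, d))     # stand-in for the chain's past states
x_i = history[-1]                       # current state
gamma = 2.38 / np.sqrt(2 * d)           # default starting value of `lamb`
b = 0.001                               # default `scaling`

# random vector from two distinct past states (excluding the current one)
r1, r2 = rng.choice(len(history) - 1, size=2, replace=False)
# noise drawn from the default Uniform(-1, 1) proposal, scaled by b
epsilon = b * rng.uniform(-1, 1, size=d)
x_star = x_i + gamma * (history[r1] - history[r2]) + epsilon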
Problem Statement#
In this notebook, a 10-dimensional multivariate normal target density will be sampled with DEMetropolisZ while varying four parameters to identify efficient sampling schemes. The four parameters, mapped onto the sampler's keyword arguments in the sketch below, are the following:
tune_drop_fraction, which determines how many samples from the tuning phase are recycled for the purpose of selecting the random vector \((x_{R1}-x_{R2})\),
lamb (\(\gamma\)), which scales the size of the jumps relative to the random vector,
scaling (\(b\)), which scales the size of the jumps for the noise term \(\epsilon\), and
proposal_dist (\(\mathscr{D}_{p}\)), which determines the shape of the distribution from which \(\epsilon\) is drawn.
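As an orientation, the sketch below shows how these four knobs appear as keyword arguments of the step method. The values shown are the defaults as described in this notebook; check your PyMC version's documentation for the authoritative signature, and note the throwaway model exists only so the step method has something to attach to.

# Orientation sketch: the four knobs as DEMetropolisZ keyword arguments.
with pm.Model():
    pm.Normal("y")  # throwaway variable so the step method has a target
    step = pm.DEMetropolisZ(
        lamb=None,                         # gamma; defaults to 2.38 / sqrt(2 * d)
        scaling=0.001,                     # b, the scale of the noise term
        tune="lambda",                     # tune gamma ("lambda"), b ("scaling"), or neither (None)
        tune_drop_fraction=0.9,            # fraction of the tuning history to drop
        proposal_dist=pm.UniformProposal,  # shape D_p from which epsilon is drawn
    )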
We will evaluate these sampling parameters based on three sampling metrics: effective sample size, autocorrelation, and sample acceptance rates.
Target Distribution#
We use a multivariate normal target density with some correlation in the first few dimensions. The function gen_mvnormal_params generates the parameters for the target distribution.
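The original cell defining gen_mvnormal_params is not shown in this rendering. Below is a minimal sketch of what such a function could look like, assuming a zero mean vector and unit variances with correlation added between the first few dimensions; the specific values are illustrative assumptions, not the original code.

# Minimal sketch of a parameter generator for the target distribution.
# The mean, variances, and correlation values are assumptions for illustration.
def gen_mvnormal_params(D):
    mu = np.zeros(D)
    cov = np.eye(D)
    # correlate the first few dimensions (kept positive definite)
    cov[0, 1] = cov[1, 0] = 0.7
    cov[0, 2] = cov[2, 0] = -0.4
    return mu, cov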
The function make_model_sample accepts the multivariate normal parameters mu and cov and outputs a PyMC model and a random sample of the target distribution.
def make_model_sample(mu, cov, draws=1000):
    with pm.Model() as model:
        x = pm.MvNormal("x", mu=mu, cov=cov, shape=(len(mu),))
        target_sample = pm.draw(x, draws=draws)
    target_idata = az.data.convert_to_inference_data(target_sample[np.newaxis, :], group="target")
    return model, target_idata
The example will be 10-dimensional:
D = 10
mu, cov = gen_mvnormal_params(D)
model, target_idata = make_model_sample(mu, cov)
It is handy to have a picture of the distribution in our minds.
az.plot_pair(target_idata.target, kind="kde", figsize=(10, 8))
plt.suptitle("10-Dimensional Target Distribution");

Helper Functions#
We will define a number of helper functions upfront to reuse throughout the notebook.
Sampling#
sample_model performs MCMC and returns the trace along with the sampling duration.
def sample_model(model, run=0, step_kwargs={}, sample_kwargs={}):
    with model:
        step = pm.DEMetropolisZ(**step_kwargs)
        t_start = time.time()
        idata = pm.sample(
            step=step,
            chains=1,
            initvals={"x": [7.0] * D},
            discard_tuned_samples=False,
            progressbar=False,
            random_seed=2020 + run,
            **sample_kwargs,
        )
        t = time.time() - t_start
    return idata, t
MCMC Metrics#
To evaluate the MCMC samples, we will use three different metrics.
Effective sample size (ESS) is the sample size adjusted for autocorrelation in the trace. The ESS will be expressed as a percentage \({(ESS)}/{(\text{Total Draws})}*100\). A higher ESS is better. For a Metropolis sampler, 2% ESS is fairly high.
Autocorrelation in the trace is a sign of sampler inefficiency, and should be close to zero. The autocorrelation at sample 100 (after tuning) will be used for comparison.
Acceptance rate of sample draws should be roughly flat over the course of the trace. A decreasing acceptance rate indicates that the step size started off too small, and an increasing acceptance rate indicates that the step size started off too large.
calc_ess_pct calculates \({(ESS)}/{(\text{Total Draws})}*100\).
def calc_ess_pct(idata, chains=1):
    draws = idata.posterior.draw.shape[0] * chains
    ess = az.ess(idata).x.values
    return ess * 100 / draws
calc_autocorr calculates the autocorrelation for the trace at a specified sample lag.
def calc_autocorr(idata, sample=100):
    return az.autocorr(az.extract(idata).x.values)[:, sample]
calc_acceptance calculates the slope and intercept of the acceptance rate for the trace.
def calc_acceptance(idata, window=500):
    rolling_rate = (
        pd.Series(idata.sample_stats.accepted.values.flatten()).rolling(window=window).mean().values
    )
    # fit a line to the rolling acceptance rate (ignoring the initial NaN window)
    x = np.arange(0, len(rolling_rate))
    A = np.vstack([x, np.ones(len(x))]).T
    slope, intercept = np.linalg.lstsq(A[window - 1 :, :], rolling_rate[window - 1 :], rcond=None)[0]
    return slope, intercept, rolling_rate
Repeated Sampling and Calculations#
sample_model_calc_metrics samples the model repeatedly for different parameter values, calculates the metrics for each trace, and returns a dataframe with the results.
def sample_model_calc_metrics(
    model,
    param="tune_drop_fraction",
    param_values=[0, 0.5, 0.9, 1.0],
    step_kwargs={},
    sample_kwargs={},
    N_tune=10000,
    N_draws=10000,
    N_runs=5,
    window=500,
):
    cols = (param + ",run,ess_pct,autocorr100,accept_slope,time,idata").split(",")
    idx = (param + ",run").split(",")
    df_results = pd.DataFrame(columns=cols).set_index(idx)
    idatas = []
    for param_value in param_values:
        for r in range(N_runs):
            all_step_kwargs = dict(step_kwargs, **{param: param_value})
            idata, t = sample_model(
                model,
                run=r,
                step_kwargs=all_step_kwargs,
                sample_kwargs=dict(sample_kwargs, **dict(tune=N_tune, draws=N_draws)),
            )
            if isinstance(param_value, type):
                # proposal distributions are passed as classes; strip the "Proposal" suffix
                index_name = param_value.__name__[:-8]
            else:
                index_name = param_value
            df_results.loc[(index_name, r), "ess_pct"] = calc_ess_pct(idata).mean()
            df_results.loc[(index_name, r), "autocorr100"] = calc_autocorr(idata).mean()
            slope, _, _ = calc_acceptance(idata, window=window)
            df_results.loc[(index_name, r), "accept_slope"] = slope
            df_results.loc[(index_name, r), "time"] = t
            idatas.append(idata)
    # fill the pre-created "idata" column with the traces
    df_results.idata = idatas
    num_cols = ["ess_pct", "autocorr100", "accept_slope", "time"]
    for num_col in num_cols:
        df_results[num_col] = pd.to_numeric(df_results[num_col])
    return df_results
average_results takes the large results dataframe and averages replicate runs.
def average_results(df_results, param="tune_drop_fraction"):
    return df_results.reset_index().groupby(param).mean(numeric_only=True).drop(columns=["run"])
Plotting#
plot_metric_results plots a 2x2 summary of results for multiple inferences.
def plot_metric_results(
    df_results,
    param="tune_drop_fraction",
    titles=dict(
        ess_pct="Effective Sample Size\n(% of Total Samples)",
        autocorr100="Autocorrelation\n(Samples 0 to 100)",
        accept_slope="Trend in Fraction Accepted\nDuring Sampling",
        time="Duration of Sampling\n(seconds)",
    ),
):
    plot_df = average_results(df_results.iloc[:, :-1], param=param)
    cols = plot_df.columns
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    for col, ax in zip(cols, axes.flatten()):
        ax.plot(plot_df.index, plot_df[col], "o-")
        ax.set_title(titles[col], weight="bold")
        if col == "accept_slope":
            if plot_df[col].max() < 0:
                ax.set_ylim(top=0)
        elif col == "autocorr100":
            if plot_df[col].min() > 0:
                ax.set_ylim(0)
            # else don't mess with the ylim
        else:
            ax.set_ylim(0)
        ax.set_xlabel(param)
    plt.suptitle(param + " Results", fontsize=20)
    plt.tight_layout()
    return axes
plot_ess_trace_drop_tune_fraction plots the effective sample size and the traces for various values of tune_drop_fraction.
def plot_ess_trace_drop_tune_fraction(df_results):
    df_temp = df_results.ess_pct.unstack("run").T
    fig = plt.figure(dpi=100, figsize=(12, 8))
    gs = gridspec.GridSpec(4, 2, figure=fig, width_ratios=[1, 2])
    ax_left = plt.subplot(gs[:, 0])
    ax_right_bottom = plt.subplot(gs[3, 1])
    axs_right = [
        plt.subplot(gs[0, 1], sharex=ax_right_bottom),
        plt.subplot(gs[1, 1], sharex=ax_right_bottom),
        plt.subplot(gs[2, 1], sharex=ax_right_bottom),
        ax_right_bottom,
    ]
    for ax in axs_right[:-1]:
        plt.setp(ax.get_xticklabels(), visible=False)
    # ESS plot
    ax_left.bar(
        x=df_temp.columns,
        height=df_temp.mean(),
        width=0.05,
        alpha=0,
        yerr=df_temp.sem(),
    )
    ax_left.set_xlabel("tune_drop_fraction")
    ax_left.set_ylabel("$ESS$ [%]")
    ax_left.plot(df_temp.columns, df_temp.mean(), ":")
    # traceplots
    notes = [
        "Wide variance in the chain after tuning",
        "",
        "",
        "Bottleneck of reduced variance after tuning",
    ]
    for ax, drop_fraction, note in zip(axs_right, df_temp.columns, notes):
        ax.set_ylabel("$f_{drop}$=" + f"{drop_fraction}")
        for r, idata in enumerate(df_results.loc[(drop_fraction)].idata):
            # combine warmup and draw iterations into one array:
            samples = np.vstack(
                [
                    idata.warmup_posterior.x.sel(chain=0).values,
                    idata.posterior.x.sel(chain=0).values,
                ]
            )
            ax.plot(samples, linewidth=0.25)
        ax.axvline(N_tune, linestyle="--", linewidth=0.5, label="End of Tuning")
        ax.text(10100, -12, note)
    axs_right[0].legend(loc="upper right")
    axs_right[0].set_title("Traces")
    ax_left.set_title("Effective Sample Size")
    ax_right_bottom.set_xlabel("Iteration")
    plt.suptitle("Tune_drop_fraction Effective Sample Size Results", fontsize=18)
    plt.tight_layout()
plot_autocorr_drop_tune_fraction plots the autocorrelation of the traces for various values of tune_drop_fraction.
def plot_autocorr_drop_tune_fraction(df_results, param_values=[0, 0.5, 0.9, 1]):
    fig, axs = plt.subplots(ncols=4, figsize=(10, 3), sharey="row")
    for ax, drop_fraction in zip(axs, param_values):
        az.plot_autocorr(df_results.loc[(drop_fraction, 0), "idata"].posterior.x.T, ax=ax)
        ax.set_title("$f_{drop}=$" + f"{drop_fraction}")
        ax.set_ylim(-0.1, 1)
        ax.set_xlabel("Sample\n(Post-tuning)")
    axs[0].set_ylabel("Autocorrelation")
    plt.suptitle("Tune_drop_fraction Autocorrelation Results", fontsize=16)
    plt.tight_layout()
plot_acceptance_rate_drop_tune_fraction plots the acceptance rate of the traces for various values of tune_drop_fraction.
def plot_acceptance_rate_drop_tune_fraction(df_results):
    df_temp = df_results.ess_pct.unstack("run").T
    fig, ax = plt.subplots(ncols=1, figsize=(10, 6), sharey="row")
    for drop_fraction in df_temp.columns:
        # combine warmup and draw iterations into one array:
        idata = df_results.loc[(drop_fraction, 0), "idata"]
        S = np.hstack(
            [
                idata.warmup_sample_stats["accepted"].sel(chain=0),
                idata.sample_stats["accepted"].sel(chain=0),
            ]
        )
        for c in range(idata.posterior.dims["chain"]):
            ax.plot(
                pd.Series(S).rolling(window=500).mean().iloc[500 - 1 :].values,
                label="$f_{drop}$=" + f"{drop_fraction}",
            )
    ax.set_xlabel("Iteration")
    handles, labels = ax.get_legend_handles_labels()
    ax.legend(handles[::-1], labels[::-1])
    ax.axvline(N_tune, linestyle="--", linewidth=0.5)
    ax.text(N_tune + 100, 0.05, "End of Tuning")
    ax.set_ylabel("Rolling mean acceptance rate (w=500)")
    plt.ylim(0, 1)
    plt.suptitle("Tune_drop_fraction Acceptance Rate Results", fontsize=16)
Tune_drop_fraction#
Now we will compare different values of tune_drop_fraction, leaving all other parameters at their defaults.
N_tune = N_draws = 10000
df_results = sample_model_calc_metrics(
    model,
    param="tune_drop_fraction",
    param_values=[0, 0.5, 0.9, 1.0],
    N_tune=N_tune,
    N_draws=N_draws,
)
Effective Sample Size#
Here, the mean effective sample size is plotted with standard errors. Next to it, the traces of all chains in one dimension are shown to better understand why the effective sample sizes are different. A larger effective sample size is better.
plot_ess_trace_drop_tune_fraction(df_results)

We can see that the ESS is higher for tune_drop_fraction equal to 0.5 or 0.9 than for 0.0 or 1.0. From the traces, we see that the variance in the chains is wide after tuning when tune_drop_fraction = 0.0. At the other extreme, the chains go through a low-variance bottleneck when tune_drop_fraction = 1.0.
Autocorrelation#
A diagnostic measure for the effect we can see above is the autocorrelation in the sampling phase. A lower autocorrelation is better.
plot_autocorr_drop_tune_fraction(df_results)

By step 100, the autocorrelation for tune_drop_fraction = 0.9 has already declined to near zero. The autocorrelation remains higher for the other values either because sample steps are too large (tune_drop_fraction = 0.0) or too small (tune_drop_fraction = 1.0). When the entire tuning history is dropped (tune_drop_fraction = 1.0), the chain has to work its way from its current position back into the typical set, which takes much longer.
Acceptance Rate#
The rolling mean over the 'accepted' sampler stat shows the difference in sampler performance for various values of tune_drop_fraction. A downward trend in the acceptance rate after tuning indicates that the proposals started off too narrow, and an upward trend indicates that the proposals started off too wide.
plot_acceptance_rate_drop_tune_fraction(df_results)

The trend in acceptance rate for tune_drop_fraction = 0.9 is approximately flat. When the entire tuning history is dropped (tune_drop_fraction = 1.0), the acceptance rate shoots up to almost 100% because the proposals are too narrow.
Summary#
The results across all three metrics plus the duration of sampling are included below.
plot_metric_results(df_results);

Lamb#
lamb is the factor that determines the size of the jump relative to the random vector and is tuned by default. The starting value is \({2.38}/{\sqrt{2d}}\), where \(d=\) the number of dimensions (i.e., parameters). We will vary lamb by one order of magnitude above and below the default, setting tune="scaling" so that each lamb value stays fixed while scaling is tuned.
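The cell defining param and lambs is hidden in this rendering. The following reconstruction is an assumption, chosen because it reproduces the array printed below by rescaling the dimension count in the default formula.

# Assumed reconstruction of the hidden cell: lamb values around the default
# 2.38 / sqrt(2 * D), chosen to match the array printed below.
param = "lamb"
lambs = 2.38 / np.sqrt(2 * D * np.array([10, 2, 1, 0.5, 0.1]))
lambs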
array([0.16829141, 0.37631104, 0.53218418, 0.75262208, 1.68291414])
df_results = sample_model_calc_metrics(
    model,
    param=param,
    param_values=lambs,
    step_kwargs=dict(tune="scaling"),
    N_tune=N_tune,
    N_draws=N_draws,
)
plot_metric_results(df_results, param=param);

The results show that the default starting value for lamb (about 0.5) is good. Also note that these results can be compared to the results for tune_drop_fraction = 0.9 above, for which lamb was tuned and scaling was fixed at 0.001 (the defaults). The ESS % was about 1.8 for that tuning set-up. Therefore, these results indicate that fixing lamb and tuning scaling was better than fixing scaling and tuning lamb for this problem.
In hindsight, this result makes sense: the random vector term \(\gamma \, (x_{R1}-x_{R2})\) is inherently scaled during sampling, because it draws from previous samples. The noise term \(\epsilon \sim b \, \mathscr{D}_{p}(A)\) is fixed during sampling, so it makes sense to tune scaling (\(b\)) before sampling.
Scaling#
scaling determines the variance of the noise term \(\epsilon\) and defaults to 0.001. We will vary scaling across six orders of magnitude and tune lamb.
scalings = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0.00001])
param = "scaling"
df_results = sample_model_calc_metrics(
    model,
    param="scaling",
    param_values=scalings,
    step_kwargs=dict(tune="lambda"),
    N_tune=N_tune,
    N_draws=N_draws,
)
axes = plot_metric_results(df_results, param=param)
for ax in axes.flatten():
    ax.set_xscale("log")

Larger values of scaling performed better for this problem. However, tuning scaling and fixing lamb (the previous experiment) performed better still.
Proposal_distribution#
Finally, we can also vary the proposal_dist (\(\mathscr{D}_{p}\)), which determines the shape of the distribution from which \(\epsilon\) is drawn. PyMC provides four proposal distributions for use with continuous variables: Uniform (the default), Normal, Cauchy, and Laplace. For this experiment, we will tune scaling, allowing each proposal distribution to be optimized, and keep lamb fixed.
param = "proposal_dist"
proposal_dists = [pm.UniformProposal, pm.NormalProposal, pm.CauchyProposal, pm.LaplaceProposal]
df_results = sample_model_calc_metrics(
    model,
    param=param,
    param_values=proposal_dists,
    step_kwargs=dict(tune="scaling"),
    N_tune=N_tune,
    N_draws=N_draws,
)
axes = plot_metric_results(df_results, param=param)
for ax in axes.flatten():
    labels = ax.get_xticklabels()
    ax.set_xticklabels(labels, rotation=45)

All the distributions perform reasonably well. Considering all three metrics, the Normal and Uniform distributions appear to perform slightly better than the others. Note that ter Braak and Vrugt [2008] suggest that the proposal distribution should have unbounded support to maintain ergodicity; therefore, the Normal distribution is preferred.
Conclusions#
Based on the experimentation above, the best-performing sampling scheme for this distribution makes two modifications to the defaults. Here is what it looks like in simplified form:
with model:
    step = pm.DEMetropolisZ(tune="scaling", proposal_dist=pm.NormalProposal)
    idata = pm.sample(step=step, chains=1, tune=N_tune, draws=N_draws)
Sequential sampling (1 chains in 1 job)
DEMetropolisZ: [x]
Sampling 1 chain for 10_000 tune and 10_000 draw iterations (10_000 + 10_000 draws total) took 10 seconds.
Only one chain was sampled, this makes it impossible to run some convergence checks
print("ESS % =", calc_ess_pct(idata).mean().round(1))
ESS % = 2.9
References#
Cajo J. F. ter Braak and Jasper A. Vrugt. Differential evolution Markov chain with snooker updater and fewer chains. Statistics and Computing, pages 435–446, 2008. URL: https://link.springer.com/content/pdf/10.1007/s11222-008-9104-9.pdf?pdf=button.
%watermark -n -u -v -iv -w
Last updated: Fri Feb 10 2023
Python implementation: CPython
Python version : 3.11.0
IPython version : 8.7.0
arviz : 0.14.0
sys : 3.11.0 | packaged by conda-forge | (main, Oct 25 2022, 06:12:32) [MSC v.1929 64 bit (AMD64)]
numpy : 1.24.0
pandas : 1.5.2
matplotlib: 3.6.2
pymc : 5.0.1+5.ga7f361bd
Watermark: 2.3.1
License notice#
All the notebooks in this example gallery are provided under the MIT License which allows modification, and redistribution for any use provided the copyright and license notices are preserved.
Citing PyMC examples#
To cite this notebook, use the DOI provided by Zenodo for the pymc-examples repository.
Important
Many notebooks are adapted from other sources: blogs, books… In such cases you should cite the original source as well.
Also remember to cite the relevant libraries used by your code.
Here is a citation template in BibTeX:
@incollection{citekey,
  author = "<notebook authors, see above>",
  title = "<notebook title>",
  editor = "PyMC Team",
  booktitle = "PyMC examples",
  doi = "10.5281/zenodo.5654871"
}
which once rendered could look like: