pymc.sample_posterior_predictive
pymc.sample_posterior_predictive(trace, model=None, var_names=None, sample_dims=None, random_seed=None, progressbar=True, progressbar_theme=<rich.theme.Theme object>, return_inferencedata=True, extend_inferencedata=False, predictions=False, idata_kwargs=None, compile_kwargs=None)
Generate forward samples for var_names, conditioned on the posterior samples of variables found in the trace.
This method can be used to perform different kinds of model predictions, including posterior predictive checks.
The matching of unobserved model variables and posterior samples in the trace is based on variable names. Therefore, a model different from the one used for posterior sampling may be used for posterior predictive sampling, as long as the variables whose posterior we want to condition on have the same names and compatible shapes and coordinates.
Parameters:
- trace : backend, list, xarray.Dataset, arviz.InferenceData, or MultiTrace
    Trace generated from MCMC sampling, or a list of dicts (e.g. points or from find_MAP()), or an xarray.Dataset (e.g. InferenceData.posterior or InferenceData.prior).
- model : Model, optional (if in `with` context)
    Model to be used to generate the posterior predictive samples. It will generally be the model used to generate the trace, but it doesn't need to be.
- var_names : Iterable[str], optional
    Names of variables for which to compute the posterior predictive samples. By default, only observed variables are sampled. See the examples below for what happens when this argument is customized.
- sample_dims : list of str, optional
    Dimensions over which to loop and generate posterior predictive samples. When sample_dims is None (default), both "chain" and "draw" are considered sample dimensions. Only taken into account when trace is an InferenceData or Dataset.
- random_seed : int, RandomState or Generator, optional
    Seed for the random number generator.
- progressbar : bool
    Whether to display a progress bar in the command line. The bar shows the percentage of completion, the sampling speed in samples per second (SPS), and the estimated remaining time until completion ("expected time of arrival"; ETA).
- return_inferencedata : bool, default True
    Whether to return an arviz.InferenceData object (True) or a dictionary (False).
- extend_inferencedata : bool, default False
    Whether to automatically use arviz.InferenceData.extend() to add the posterior predictive samples to trace or not. If True, trace is modified in-place but still returned.
- predictions : bool, default False
    Flag used to set the location of posterior predictive samples within the returned arviz.InferenceData object. If False, assumes samples are generated based on the fitting data, to be used for posterior predictive checks, and stores them in the posterior_predictive group. If True, assumes samples are generated based on out-of-sample data as predictions, and stores them in the predictions group.
- idata_kwargs : dict, optional
    Keyword arguments for pymc.to_inference_data() if predictions=False, or for pymc.predictions_to_inference_data() otherwise.
- compile_kwargs : dict, optional
    Keyword arguments for pymc.pytensorf.compile_pymc().
Returns:
- arviz.InferenceData or dict
    An ArviZ InferenceData object containing the posterior predictive samples (default), or a dictionary with variable names as keys and samples as numpy arrays.
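For instance, with return_inferencedata=False the draws come back as a plain dictionary of numpy arrays. A minimal sketch (the toy model and the names mu and y here are illustrative, not part of this reference):

import pymc as pm

with pm.Model() as m:
    mu = pm.Normal("mu")
    y = pm.Normal("y", mu, 1, observed=[0.1, -0.3, 0.2])
    idata = pm.sample()
    # A dict keyed by variable name instead of an InferenceData object
    pp = pm.sample_posterior_predictive(idata, return_inferencedata=False)

pp["y"].shape  # (n_chains, n_draws, 3)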
Examples
Posterior predictive checks and predictions
The most common use of sample_posterior_predictive is to perform posterior predictive checks (in-sample predictions) and new model predictions (out-of-sample predictions).
import pymc as pm

with pm.Model(coords_mutable={"trial": [0, 1, 2]}) as model:
    x = pm.MutableData("x", [-1, 0, 1], dims=["trial"])
    beta = pm.Normal("beta")
    noise = pm.HalfNormal("noise")
    y = pm.Normal("y", mu=x * beta, sigma=noise, observed=[-2, 0, 3], dims=["trial"])

    idata = pm.sample()
    # in-sample predictions
    posterior_predictive = pm.sample_posterior_predictive(idata).posterior_predictive

with model:
    pm.set_data({"x": [-2, 2]}, coords={"trial": [3, 4]})
    # out-of-sample predictions
    predictions = pm.sample_posterior_predictive(idata, predictions=True).predictions
Using different models
It’s common to use the same model for posterior and posterior predictive sampling, but this is not required. The matching between unobserved model variables and posterior samples is based on the name alone.
For the last example we could have created a new predictions model. Note that we have to specify var_names explicitly, because the newly defined y was not given any observations:
with pm.Model(coords_mutable={"trial": [3, 4]}) as predictions_model:
    x = pm.MutableData("x", [-2, 2], dims=["trial"])
    beta = pm.Normal("beta")
    noise = pm.HalfNormal("noise")
    y = pm.Normal("y", mu=x * beta, sigma=noise, dims=["trial"])

    predictions = pm.sample_posterior_predictive(idata, var_names=["y"], predictions=True).predictions
The new model may even have a different structure and unobserved variables that don't exist in the trace. These variables will also be forward sampled. In the following example we add a new extra_noise variable between the inferred posterior noise and the new StudentT observational distribution y:

with pm.Model(coords_mutable={"trial": [3, 4]}) as distinct_predictions_model:
    x = pm.MutableData("x", [-2, 2], dims=["trial"])
    beta = pm.Normal("beta")
    noise = pm.HalfNormal("noise")
    extra_noise = pm.HalfNormal("extra_noise", sigma=noise)
    y = pm.StudentT("y", nu=4, mu=x * beta, sigma=extra_noise, dims=["trial"])

    predictions = pm.sample_posterior_predictive(idata, var_names=["y"], predictions=True).predictions
For more about out-of-model predictions, see this blog post.
The behavior of var_names
The function returns forward samples for any variable included in var_names, conditioned on the values of other random variables found in the trace.
To ensure the samples are internally consistent, any random variable that depends on another random variable that is being sampled is itself sampled, even if this variable is present in the trace and was not included in var_names. The final list of variables being sampled is shown in the log output.
Note that if a random variable has no dependency on other random variables, these forward samples are equivalent to its prior samples. Likewise, if all random variables are being sampled, the behavior of this function is equivalent to that of sample_prior_predictive().

Warning

A random variable included in var_names will never be copied from the posterior group. It will always be sampled as described above. If you want, you can copy manually via idata.posterior_predictive["var_name"] = idata.posterior["var_name"].

The following code block explores how the behavior changes with different var_names:
from logging import getLogger

import pymc as pm

# Some environments like google colab suppress
# the default logging output of PyMC
getLogger("pymc").setLevel("INFO")

kwargs = {"progressbar": False, "random_seed": 0}

with pm.Model() as model:
    x = pm.Normal("x")
    y = pm.Normal("y")
    z = pm.Normal("z", x + y**2)
    det = pm.Deterministic("det", pm.math.exp(z))
    obs = pm.Normal("obs", det, 1, observed=[20])

    idata = pm.sample(tune=10, draws=10, chains=2, **kwargs)
Default behavior. Generate samples of obs, conditioned on the posterior samples of z found in the trace. These are often referred to as posterior predictive samples in the literature:

with model:
    pm.sample_posterior_predictive(idata, var_names=["obs"], **kwargs)
# Sampling: [obs]
Re-compute the deterministic variable det, conditioned on the posterior samples of z found in the trace:

with model:
    pm.sample_posterior_predictive(idata, var_names=["det"], **kwargs)
# Sampling: []
Generate samples of z and det, conditioned on the posterior samples of x and y found in the trace:

with model:
    pm.sample_posterior_predictive(idata, var_names=["z", "det"], **kwargs)
# Sampling: [z]
Generate samples of y, z and det, conditioned on the posterior samples of x found in the trace.

Note: The samples of y are equivalent to its prior, since it does not depend on any other variables. In contrast, the samples of z and det depend on the new samples of y and the posterior samples of x found in the trace.

with model:
    pm.sample_posterior_predictive(idata, var_names=["y", "z", "det"], **kwargs)
# Sampling: [y, z]
Same as before, except z is not stored in the returned trace. For computing det we still have to sample z, as it depends on y, which is also being sampled:

with model:
    pm.sample_posterior_predictive(idata, var_names=["y", "det"], **kwargs)
# Sampling: [y, z]
Every random variable is sampled. This is equivalent to calling sample_prior_predictive():

with model:
    pm.sample_posterior_predictive(idata, var_names=["x", "y", "z", "det", "obs"], **kwargs)
# Sampling: [x, y, z, obs]
Danger

Including a Deterministic() in var_names may incorrectly force a random variable to be resampled, as happens with z in the following example:

with pm.Model() as model:
    x = pm.Normal("x")
    y = pm.Normal("y")
    det_xy = pm.Deterministic("det_xy", x + y**2)
    z = pm.Normal("z", det_xy)
    det_z = pm.Deterministic("det_z", pm.math.exp(z))
    obs = pm.Normal("obs", det_z, 1, observed=[20])

    idata = pm.sample(tune=10, draws=10, chains=2, **kwargs)

    pm.sample_posterior_predictive(idata, var_names=["det_xy", "det_z"], **kwargs)
# Sampling: [z]
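If the goal is only to recover the deterministic quantities without resampling z, one possible workaround (a sketch, not part of the original example) is to recompute them directly from the posterior draws with xarray arithmetic:

import numpy as np

# Rebuild the deterministics from the posterior draws themselves,
# so z keeps its posterior values instead of being resampled.
post = idata.posterior
idata.posterior["det_xy"] = post["x"] + post["y"] ** 2
idata.posterior["det_z"] = np.exp(post["z"])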
Controlling the number of samples
You can manipulate the InferenceData passed as trace to control the number of posterior predictive samples:
import pymc as pm

with pm.Model() as model:
    ...
    idata = pm.sample()
Generate 1 posterior predictive sample for every 5 posterior samples.
thinned_idata = idata.sel(draw=slice(None, None, 5))

with model:
    idata.extend(pm.sample_posterior_predictive(thinned_idata))
Generate 5 posterior predictive samples for every posterior sample.
expanded_idata = idata.copy()
expanded_idata.posterior = idata.posterior.expand_dims(pred_id=5)

with model:
    pm.sample_posterior_predictive(
        expanded_idata,
        sample_dims=["chain", "draw", "pred_id"],
        extend_inferencedata=True,
    )
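The resulting posterior predictive group then carries the extra pred_id sample dimension. A quick check (hypothetical, since the model body is elided above; "y" stands in for whichever observed variable it defines):

# "y" is a placeholder name for the model's observed variable
expanded_idata.posterior_predictive["y"].dims
# ('chain', 'draw', 'pred_id', ...)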