Probability Distributions in PyMC#

The most fundamental step in building Bayesian models is the specification of a full probability model for the problem at hand. This primarily involves assigning parametric statistical distributions to unknown quantities in the model, in addition to appropriate functional forms for likelihoods to represent the information from the data. To this end, PyMC includes a comprehensive set of pre-defined statistical distributions that can be used as model building blocks.

For example, if we wish to define a particular variable as having a normal prior, we can specify that using an instance of the Normal class.

with pm.Model():

    x = pm.Normal('x', mu=0, sigma=1)

A variable requires at least a name argument, and zero or more model parameters, depending on the distribution. Parameter names vary by distribution, using conventional names wherever possible. The example above defines a scalar variable. To make a vector-valued variable, a shape argument should be provided; for example, a 3x3 matrix of beta random variables could be defined with:

with pm.Model():

    p = pm.Beta('p', 1, 1, shape=(3, 3))

Probability distributions are all subclasses of Distribution, which in turn has two major subclasses: Discrete and Continuous. In terms of data types, a Continuous random variable is given whichever floating point type is defined by pytensor.config.floatX, while Discrete variables are given int16 types when pytensor.config.floatX is float32, and int64 otherwise.

All distributions in pm.distributions will have two important methods: random() and logp() with the following signatures:

class SomeDistribution(Continuous):

    def random(self, point=None, size=None):
        ...
        return random_samples

    def logp(self, value):
        ...
        return total_log_prob

PyMC expects the logp() method to return a log-probability evaluated at the passed value argument. This method is used internally by all of the inference methods to calculate the model log-probability that is used for fitting models. The random() method is used to simulate values from the variable, and is used internally for posterior predictive checks.

Custom distributions#

Despite the fact that PyMC ships with a large set of the most common probability distributions, some problems may require the use of functional forms that are less common, and not available in pm.distributions. One example of this is in survival analysis, where time-to-event data is modeled using probability densities that are designed to accommodate censored data.

An exponential survival function, where \(c=0\) denotes failure (or non-survival), is defined by:

\[\begin{split}f(c, t) = \left\{ \begin{array}{l} \exp(-\lambda t), \text{if c=1} \\ \lambda \exp(-\lambda t), \text{if c=0} \end{array} \right.\end{split}\]

Such a function can be implemented as a PyMC distribution by writing a function that specifies the log-probability, then passing that function as a keyword argument to the DensityDist function, which creates an instance of a PyMC distribution with the custom function as its log-probability.

For the exponential survival function, this is:

def logp(value, t, lam):
    return (value * log(lam) - lam * t).sum()

exp_surv = pm.DensityDist('exp_surv', t, lam, logp=logp, observed=failure)

Similarly, if a random number generator is required, a function returning random numbers corresponding to the probability distribution can be passed as the random argument.

Using PyMC distributions without a Model#

Distribution objects, as we have defined them so far, are only usable inside of a Model context. If they are created outside of the model context manager, it raises an error.

y = Binomial('y', n=10, p=0.5)

TypeError: No context on context stack

This is because the distribution classes are designed to integrate themselves automatically inside of a PyMC model. When a model cannot be found, it fails. However, each Distribution has a dist class method that returns a stripped-down distribution object that can be used outside of a PyMC model.

For example, a standalone binomial distribution can be created by:

y = pm.Binomial.dist(n=10, p=0.5)

This allows for probabilities to be calculated and random numbers to be drawn.

>>> y.logp(4).eval()
array(-1.5843639373779297, dtype=float32)

>>> y.random(size=3)
array([5, 4, 3])

Auto-transformation#

To aid efficient MCMC sampling, any continuous variables that are constrained to a sub-interval of the real line are automatically transformed so that their support is unconstrained. This frees sampling algorithms from having to deal with boundary constraints.

For example, the gamma distribution is positive-valued. If we define one for a model:

with pm.Model() as model:
    g = pm.Gamma('g', 1, 1)

We notice a modified variable inside the model value_vars attribute. These variables represent the values of each random variable in the model’s log-likelihood.

>>> model.value_vars
[g_log__]

As the name suggests, the variable g has been log-transformed, and this is the space over which posterior sampling takes place.

The value of the transformed variable is simply back-transformed when a sample is drawn in order to recover the original variable.

By default, auto-transformed variables are ignored when summarizing and plotting model output.