# PyMC Developer Guide#

PyMC is a Python package for Bayesian statistical modeling built on top of Aesara. This document aims to explain the design and implementation of probabilistic programming in PyMC, with comparisons to other PPLs like TensorFlow Probability (TFP) and Pyro. A user-facing API introduction can be found in the API quickstart. A more accessible, user facing deep introduction can be found in Peadar Coyle’s probabilistic programming primer.

## Distribution#

Probability distributions in PyMC are implemented as classes that inherit from Continuous or Discrete. Either of these inherit Distribution which defines the high level API.

For a detailed introduction on how a new distribution should be implemented check out the guide on implementing distributions.

## Reflection#

How tensor/value semantics for probability distributions are enabled in PyMC:

In PyMC, model variables are defined by calling probability distribution classes with parameters:

z = Normal("z", 0, 5)


This is done inside the context of a pm.Model, which intercepts some information, for example to capture known dimensions. The notation aligns with the typically used math notation:

$z \sim \text{Normal}(0, 5)$

A call to a Distribution constructor as shown above returns an Aesara TensorVariable, which is a symbolic representation of the model variable and the graph of inputs it depends on. Under the hood, the variables are created through the dist() API, which calls the RandomVariable Op corresponding to the distribution.

At a high level of abstraction, the idea behind RandomVariable Ops is to create symbolic variables (TensorVariables) that can be associated with the properties of a probability distribution. For example, the RandomVariable Op which becomes part of the symbolic computation graph is associated with the random number generators or probability mass/density functions of the distribution.

In the example above, where the TensorVariable z is created to be $$\text{Normal}(0, 5)$$ random variable, we can get a handle on the corresponding RandomVariable Op instance:

with pm.Model():
z = pm.Normal("z", 0, 5)
print(type(z.owner.op))
# ==> aesara.tensor.random.basic.NormalRV
isinstance(z.owner.op, aesara.tensor.random.basic.RandomVariable)
# ==> True


Now, because the NormalRV can be associated with the probability density function of the Normal distribution, we can now evaluate it through the special pm.logp function:

with pm.Model():
z = pm.Normal("z", 0, 5)
symbolic = pm.logp(z, 2.5)
numeric = symbolic.eval()
# array(-2.65337645)


We can, of course, also work out the math by hand:

\begin{split} \begin{aligned} pdf_{\mathcal{N}}(\mu, \sigma, x) &= \frac{1}{\sigma \sqrt{2 \pi}} \exp^{- 0.5 (\frac{x - \mu}{\sigma})^2} \\ pdf_{\mathcal{N}}(0, 5, 0.3) &= 0.070413 \\ ln(0.070413) &= -2.6533 \end{aligned} \end{split}

In the probabilistic programming context, this enables PyMC and its backend libraries aeppl and Aesara to create and evaluate computation graphs to compute, for example log-prior or log-likelihood values.

## PyMC in Comparison#

Within the PyMC model context, random variables are essentially Aesara tensors that can be used in all kinds of operations as if they were NumPy arrays. This is different compared to TFP and pyro, where one needs to be more explicit about the conversion from random variables to tensors.

Consider the following examples, which implement the below model.

\begin{split} \begin{aligned} z &\sim \mathcal{N}(0, 5) \\ x &\sim \mathcal{N}(z, 1) \\ \end{aligned} \end{split}

### PyMC#

with pm.Model() as model:
z = pm.Normal('z', mu=0., sigma=5.)             # ==> aesara.tensor.var.TensorVariable
x = pm.Normal('x', mu=z, sigma=1., observed=5.) # ==> aesara.tensor.var.TensorVariable
# The log-prior of z=2.5
pm.logp(z, 2.5).eval()                              # ==> -2.65337645
# ???????
x.logp({'z': 2.5})                                  # ==> -4.0439386
# ???????
model.logp({'z': 2.5})                              # ==> -6.6973152


### Tensorflow Probability#


import tensorflow.compat.v1 as tf
from tensorflow_probability import distributions as tfd

with tf.Session() as sess:
z_dist = tfd.Normal(loc=0., scale=5.)            # ==> <class 'tfp.python.distributions.normal.Normal'>
z = z_dist.sample()                              # ==> <class 'tensorflow.python.framework.ops.Tensor'>
x = tfd.Normal(loc=z, scale=1.).log_prob(5.)     # ==> <class 'tensorflow.python.framework.ops.Tensor'>
model_logp = z_dist.log_prob(z) + x
print(sess.run(x, feed_dict={z: 2.5}))           # ==> -4.0439386
print(sess.run(model_logp, feed_dict={z: 2.5}))  # ==> -6.6973152


### Pyro#

z_dist = dist.Normal(loc=0., scale=5.)           # ==> <class 'pyro.distributions.torch.Normal'>
z = pyro.sample("z", z_dist)                     # ==> <class 'torch.Tensor'>
# reset/specify value of z
z.data = torch.tensor(2.5)
x = dist.Normal(loc=z, scale=1.).log_prob(5.)    # ==> <class 'torch.Tensor'>
model_logp = z_dist.log_prob(z) + x
x                                                # ==> -4.0439386
model_logp                                       # ==> -6.6973152


## Behind the scenes of the logp function#

The logp function is straightforward - it is an Aesara function within each distribution. It has the following signature:

Warning

The code block is outdated.

def logp(self, value):
# GET PARAMETERS
param1, param2, ... = self.params1, self.params2, ...
# EVALUATE LOG-LIKELIHOOD FUNCTION, all inputs are (or array that could be convert to) Aesara tensor
total_log_prob = f(param1, param2, ..., value)


In the logp method, parameters and values are either Aesara tensors, or could be converted to tensors. It is rather convenient as the evaluation of logp is represented as a tensor (RV.logpt), and when we linked different logp together (e.g., summing all RVs.logpt to get the model total logp) the dependence is taken care of by Aesara when the graph is built and compiled. Again, since the compiled function depends on the nodes that already in the graph, whenever you want to generate a new function that takes new input tensors you either need to regenerate the graph with the appropriate dependencies, or replace the node by editing the existing graph. In PyMC we use the second approach by using aesara.clone_replace() when it is needed.

As explained above, distribution in a pm.Model() context automatically turn into a tensor with distribution property (PyMC random variable). To get the logp of a free_RV is just evaluating the logp() on itself:

# self is a aesara.tensor with a distribution attached
self.logp_sum_unscaledt = distribution.logp_sum(self)
self.logp_nojac_unscaledt = distribution.logp_nojac(self)


Or for an observed RV. it evaluate the logp on the data:

self.logp_sum_unscaledt = distribution.logp_sum(data)
self.logp_nojac_unscaledt = distribution.logp_nojac(data)


### Model context and Random Variable#

I like to think that the with pm.Model() ... is a key syntax feature and the signature of PyMC model language, and in general a great out-of-the-box thinking/usage of the context manager in Python (with some critics, of course).

Essentially what a context manager does is:

with EXPR as VAR:
USERCODE


which roughly translates into this:

VAR = EXPR
VAR.__enter__()
try:
USERCODE
finally:
VAR.__exit__()


or conceptually:

with EXPR as VAR:
# DO SOMETHING
USERCODE
# DO SOME ADDITIONAL THINGS


So what happened within the with pm.Model() as model: ... block, besides the initial set up model = pm.Model()? Starting from the most elementary:

### Random Variable#

From the above session, we know that when we call e.g. pm.Normal('x', ...) within a Model context, it returns a random variable. Thus, we have two equivalent ways of adding random variable to a model:

with pm.Model() as m:
x = pm.Normal('x', mu=0., sigma=1.)

print(type(x))                              # ==> <class 'aesara.tensor.var.TensorVariable'>
print(m.free_RVs)                           # ==> [x]
print(logpt(x, 5.0))                        # ==> Elemwise{switch,no_inplace}.0
print(logpt(x, 5.).eval({}))                # ==> -13.418938533204672
print(m.logp({'x': 5.}))                    # ==> -13.418938533204672


In general, if a variable has observations (observed parameter), the RV is an observed RV, otherwise if it has a transformed (transform parameter) attribute, it is a transformed RV otherwise, it will be the most elementary form: a free RV. Note that this means that random variables with observations cannot be transformed.

### Additional things that pm.Model does#

In a way, pm.Model is a tape machine that records what is being added to the model, it keeps track the random variables (observed or unobserved) and potential term (additional tensor that to be added to the model logp), and also deterministic transformation (as bookkeeping):

• named_vars

• free_RVs

• observed_RVs

• deterministics

• potentials

• missing_values The model context then computes some simple model properties, builds a bijection mapping that transforms between dictionary and numpy/Aesara ndarray, thus allowing the logp/dlogp functions to have two equivalent versions: One takes a dict as input and the other takes an ndarray as input. More importantly, a pm.Model() contains methods to compile Aesara functions that take Random Variables (that are also initialised within the same model) as input, for example:

with pm.Model() as m:
z = pm.Normal('z', 0., 10., shape=10)
x = pm.Normal('x', z, 1., shape=10)

print(m.initial_point)
print(m.dict_to_array(m.initial_point))  # ==> m.bijection.map(m.initial_point)
print(m.bijection.rmap(np.arange(20)))
# {'z': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]), 'x': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])}
# [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
# {'z': array([10., 11., 12., 13., 14., 15., 16., 17., 18., 19.]), 'x': array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])}

list(filter(lambda x: "logp" in x, dir(pm.Model)))
#['d2logp',
# 'd2logp_nojac',
# 'datalogpt',
# 'dlogp',
# 'dlogp_array',
# 'dlogp_nojac',
# 'fastd2logp',
# 'fastd2logp_nojac',
# 'fastdlogp',
# 'fastdlogp_nojac',
# 'fastlogp',
# 'fastlogp_nojac',
# 'logp',
# 'logp_array',
# 'logp_dlogp_function',
# 'logp_elemwise',
# 'logp_nojac',
# 'logp_nojact',
# 'logpt',
# 'varlogpt']


### Logp and dlogp#

The model collects all the random variables (everything in model.free_RVs and model.observed_RVs) and potential term, and sum them together to get the model logp:

@property
def logpt(self):
"""Aesara scalar of log-probability of the model"""
with self:
factors = [var.logpt for var in self.basic_RVs] + self.potentials
logp = at.sum([at.sum(factor) for factor in factors])
...
return logp


which returns an Aesara tensor that its value depends on the free parameters in the model (i.e., its parent nodes from the Aesara graph). You can evaluate or compile into a python callable (that you can pass numpy as input args). Note that the logp tensor depends on its input in the Aesara graph, thus you cannot pass new tensor to generate a logp function. For similar reason, in PyMC we do graph copying a lot using aesara.clone_replace to replace the inputs to a tensor.

with pm.Model() as m:
z = pm.Normal('z', 0., 10., shape=10)
x = pm.Normal('x', z, 1., shape=10)
y = pm.Normal('y', x.sum(), 1., observed=2.5)

print(m.basic_RVs)    # ==> [z, x, y]
print(m.free_RVs)     # ==> [z, x]

type(m.logpt)
# aesara.tensor.var.TensorVariable

m.logpt.eval({x: np.random.randn(*x.tag.test_value.shape) for x in m.free_RVs})
# array(-51.25369126)


PyMC then compiles a logp function with gradient that takes model.free_RVs as input and model.logpt as output. It could be a subset of tensors in model.free_RVs if we want a conditional logp/dlogp function:

def logp_dlogp_function(self, grad_vars=None, **kwargs):
if grad_vars is None:
grad_vars = list(typefilter(self.free_RVs, continuous_types))
else:
...
varnames = [var.name for var in grad_vars]  # In a simple case with only continous RVs,
# this is all the free_RVs
extra_vars = [var for var in self.free_RVs if var.name not in varnames]


ValueGradFunction is a callable class which isolates part of the Aesara graph to compile additional Aesara functions. PyMC relies on aesara.clone_replace to copy the model.logpt and replace its input. It does not edit or rewrite the graph directly.

The important parts of the above function is highlighted and commented. On a high level, it allows us to build conditional logp function and its gradient easily. Here is a taste of how it works in action:

inputlist = [np.random.randn(*x.tag.test_value.shape) for x in m.free_RVs]

func = m.logp_dlogp_function()
func.set_extra_values({})
input_dict = {x.name: y for x, y in zip(m.free_RVs, inputlist)}
print(input_dict)
input_array = func.dict_to_array(input_dict)
print(input_array)
print(" ===== ")
func(input_array)

# {'z': array([-0.7202002 ,  0.58712205, -1.44120196, -0.53153001, -0.36028732,
#         -1.49098414, -0.80046792, -0.26351819,  1.91841949,  1.60004128]), 'x': array([ 0.01490006,  0.60958275, -0.06955203, -0.42430833, -1.43392303,
#         1.13713493,  0.31650495, -0.62582879,  0.75642811,  0.50114527])}
# [-0.7202002   0.58712205 -1.44120196 -0.53153001 -0.36028732 -1.49098414
#     -0.80046792 -0.26351819  1.91841949  1.60004128  0.01490006  0.60958275
#     -0.06955203 -0.42430833 -1.43392303  1.13713493  0.31650495 -0.62582879
#     0.75642811  0.50114527]
#     =====
# (array(-51.0769075),
#     array([ 0.74230226,  0.01658948,  1.38606194,  0.11253699, -1.07003284,
#             2.64302891,  1.12497754, -0.35967542, -1.18117557, -1.11489642,
#             0.98281586,  1.69545542,  0.34626619,  1.61069443,  2.79155183,
#         -0.91020295,  0.60094326,  2.08022672,  2.8799075 ,  2.81681213]))

irv = 1
print("Condition Logp: take %s as input and conditioned on the rest."%(m.free_RVs[irv].name))
func_conditional.set_extra_values(input_dict)
input_array2 = func_conditional.dict_to_array(input_dict)
print(input_array2)
print(" ===== ")
func_conditional(input_array2)

# Condition Logp: take x as input and conditioned on the rest.
# [ 0.01490006  0.60958275 -0.06955203 -0.42430833 -1.43392303  1.13713493
#     0.31650495 -0.62582879  0.75642811  0.50114527]
#     =====
# (array(-51.0769075),
#     array([ 0.98281586,  1.69545542,  0.34626619,  1.61069443,  2.79155183,
#         -0.91020295,  0.60094326,  2.08022672,  2.8799075 ,  2.81681213]))


So why is this necessary? One can imagine that we just compile one logp function, and do bookkeeping ourselves. For example, we can build the logp function in Aesara directly:

import aesara
func = aesara.function(m.free_RVs, m.logpt)
func(*inputlist)
# array(-51.0769075)

func_d = aesara.function(m.free_RVs, logpt_grad)
func_d(*inputlist)
# [array([ 0.74230226,  0.01658948,  1.38606194,  0.11253699, -1.07003284,
#          2.64302891,  1.12497754, -0.35967542, -1.18117557, -1.11489642]),
#  array([ 0.98281586,  1.69545542,  0.34626619,  1.61069443,  2.79155183,
#         -0.91020295,  0.60094326,  2.08022672,  2.8799075 ,  2.81681213])]


Similarly, build a conditional logp:

shared = aesara.shared(inputlist)
func2 = aesara.function([m.free_RVs], m.logpt, givens=[(m.free_RVs, shared)])
print(func2(inputlist))
# -51.07690750130328


The above also gives the same logp and gradient as the output from model.logp_dlogp_function. But the difficulty is to compile everything into a single function:
func_logp_and_grad = aesara.function(m.free_RVs, [m.logpt, logpt_grad])

We want to have a function that return the evaluation and its gradient re each input: value, grad = f(x), but the naive implementation does not work. We can of course wrap 2 functions - one for logp one for dlogp - and output a list. But that would mean we need to call 2 functions. In addition, when we write code using python logic to do bookkeeping when we build our conditional logp. Using aesara.clone_replace, we always have the input to the Aesara function being a 1d vector (instead of a list of RV that each can have very different shape), thus it is very easy to do matrix operation like rotation etc.