PyMC Dimensionality#

PyMC provides a number of ways to specify the dimensionality of its distributions. In this document we will not provide an exhaustive explanation but rather an overview and current best practices.

Glossary#

In this document we’ll be using the term dimensionality to refer to the idea of dimensions. Each of the terms below has a specific semantic and computational definition in PyMC. While we share them here, they will make much more sense when seen in the examples below.

  • Implied dimensions → dimensionality that follows from inputs to the RV

  • Support dimensions → dimensions you can NEVER get rid of

  • ndim_supp → the number of support dimensions, i.e. the dimensionality of the smallest possible random draw. This is a fixed attribute in the distribution definition

  • Shape → final resulting tensor shape

  • Size → shape minus the support dimensions

  • Dims → An array of dimension names

  • Coords → A dictionary mapping dimension names to coordinate values

General Recommendations#

When prototyping implied and size are convenient#

Implied dimensions are easy to specify and great for quickly expanding an existing RV.

For reusable code we suggest dims#

For any more important or reusable work we suggest dims and coords, as the labels will be passed to {class}`arviz.InferenceData`. This is best practice for transparency and readability for others. It is also useful in single-developer workflows: for example, with a 3-dimensional or higher RV it helps indicate which dimension corresponds to which model concept.

Use shape if you’d like to be explicit#

Use shape if you’d like to bypass any dimensionality calculations implicit in PyMC. This will strictly specify the dimensionality to Aesara.

When debugging use unique prime numbers#

By using prime numbers it will be easier to determine how input dimensionalities are being converted to output dimensionalities. Once confident with the result, change the dimensionalities to match your data or modeling needs.

Code Examples#

import pymc as pm
import numpy as np

Scalar distribution example#

We can start with the simplest case, a single Normal distribution. We specify one as shown below.

normal_dist = pm.Normal.dist()

We can then take a random sample from that same distribution and print both the draw and its shape.

random_sample = normal_dist.eval()
random_sample, random_sample.shape
(array(0.0038772), ())

In this case we end up with a single scalar value. This is consistent with the distribution’s ndim_supp, as the smallest random draw is a scalar, which has a dimension of zero.

pm.Normal.rv_op.ndim_supp
0

Implied Example#

If we wanted three draws from differently centered Normals we could instead pass a vector to the parameters. When generating a random draw we would now expect a vector value, in this case a vector of size 3. This is a case of implied dimensions.

random_sample = pm.Normal.dist(mu=[1, 10, 100], sigma=0.0001).eval()
random_sample, random_sample.shape
(array([ 1.00006789,  9.99999102, 99.99973951]), (3,))

Shape and Size#

Alternatively we may just want three draws from identical distributions. In this case we could use either shape or size to specify this.

random_sample = pm.Normal.dist(size=(3,)).eval()
random_sample, random_sample.shape
(array([-0.25679503,  0.66627063, -0.64598181]), (3,))
random_sample = pm.Normal.dist(shape=(3,)).eval()
random_sample, random_sample.shape
(array([-1.33834928,  0.98445504, -0.49595222]), (3,))

Inspecting dimensionality with a model graph#

A powerful tool to understand and debug dimensionality in PyMC is the pm.model_to_graphviz functionality. Rather than inspecting array outputs we instead can read the Graphviz output to understand the dimensionality.

In the example below the number on the bottom left of each box indicates the dimensionality of the Random Variable. With the scalar distribution it is implied to be one random draw of ndim_supp.

with pm.Model() as pmodel:
    pm.Normal("scalar")  # shape=()
    pm.Normal("vector (implied)", mu=[1, 2, 3])
    pm.Normal("vector (from shape)", shape=(4,))
    pm.Normal("vector (from size)", size=(5,))

pm.model_to_graphviz(pmodel)

Dims#

A new feature of PyMC is dims support. With many random variables it can become confusing which dimensionality corresponds to which “real world” idea, e.g. number of observations, number of treated units etc. The dims argument is an additional label to help.

with pm.Model() as pmodel:
    pm.Normal("red", size=2, dims="B")

    pm.Normal("one", [1, 2, 3, 4], dims="Dim_A")  # (4,)
    pm.Normal("two", dims="Dim_A")


pm.model_to_graphviz(pmodel)

Where dims can become increasingly powerful is with the use of coords specified in the model itself. With coords it becomes easy to track which dimension corresponds to which coordinate values. As an added bonus, the coords and dims will also be present in the returned {class}`arviz.InferenceData`, simplifying the entire workflow.

with pm.Model(
    coords={
        "year": [2020, 2021, 2022],
    }
) as pmodel:

    pm.Normal("Normal_RV", dims="year")

pm.model_to_graphviz(pmodel)

Vector Distributions#

Some distributions by definition cannot return scalar values as random samples, but instead return an array as their result. An example is the Multivariate Normal. The simplest possible return shape can be verified using ndim_supp. The value here indicates the smallest shape that can be returned is a vector.

pm.MvNormal.rv_op.ndim_supp
1

This can be verified with a random sample as well.

pm.MvNormal.dist(mu=[[1, 2, 3], [4, 5, 6]], cov=np.eye(3) * 0.0001).eval()
array([[1.00160162, 1.99206375, 2.98902506],
       [4.00703891, 4.99735671, 5.98341287]])

Like with scalar distributions, we can use all our dimensionality tools to specify a set of Multivariate Normals.

with pm.Model(
    coords={
        "year": [2020, 2021, 2022],
    }
) as pmodel:
    mv = pm.MvNormal("implied", mu=[0, 0, 0], cov=np.eye(3))
    print(mv.shape.eval())

    # Multivariate RVs (ndim_supp > 0)
    assert mv.ndim == 1

    mv = pm.MvNormal("with size", mu=[0, 0], cov=np.eye(2), size=3, dims=("repeats", "implied"))
    print(mv.shape.eval())

    # ⚠ Size dims are always __prepended__
    mv = pm.MvNormal("with shape", mu=[0, 0], cov=np.eye(2), shape=(3, ...), dims=("repeats", ...))
    print(mv.shape.eval())

    mv = pm.MvNormal("with coords", mu=[0, 0], cov=np.eye(2), dims=("year", ...))
    print(mv.shape.eval())

pm.model_to_graphviz(pmodel)
[3]
[3 2]
[3 2]
[3 2]

User caution and practical tips#

While we provide all these tools for convenience, and while PyMC does its best to understand user intent, mixing dimensionality tools may not always result in the final dimensionality intended. Sometimes the model may not indicate an error until sampling, or may not indicate an issue at all. When working with dimensionality, particularly in more complex cases, we suggest:

  • Using GraphViz to visualize your model before sampling

  • Using the prior predictive to catch errors early

  • Inspecting the returned az.InferenceData object to ensure all array sizes are as intended