import arviz as az
import numpy as np
import pandas as pd
import pymc3 as pm
import scipy.stats as stats
import theano
import theano.tensor as tt

from pymc3.distributions import continuous, distribution
from theano import scan, shared

%config InlineBackend.figure_format = 'retina'
RANDOM_SEED = 8927
np.random.seed(RANDOM_SEED)
az.style.use("arviz-darkgrid")
floatX = "float32"


# Conditional Autoregressive (CAR) model#

A walkthrough of implementing a Conditional Autoregressive (CAR) model in PyMC3, with WinBUGS/PyMC2 and Stan code as references.

As a probabilistic language, there are some fundamental differences between PyMC3 and other alternatives such as WinBUGS, JAGS, and Stan. In this notebook, I will summarise some heuristics and intuition I got over the past two years using PyMC3. I will outline some thinking in how I approach a modelling problem using PyMC3, and how thinking in linear algebra solves most of the programming problems. I hope this notebook will shed some light onto the design and features of PyMC3, and similar languages that are built on linear algebra packages with a static world view (e.g., Edward, which is based on Tensorflow).

For more resources comparing between PyMC3 codes and other probabilistic languages:

## Background information#

Suppose we want to implement a Conditional Autoregressive (CAR) model with examples in WinBUGS/PyMC2 and Stan.
For the sake of brevity, I will not go into the details of the CAR model. The essential idea is autocorrelation, which is informally “correlation with itself”. In a CAR model, the probability of values estimated at any given location $$y_i$$ are conditional on some neighboring values $$y_j, _{j \neq i}$$ (in another word, correlated/covariated with these values):

$y_i \mid y_j, j \neq i \sim \mathcal{N}(\alpha \sum_{j = 1}^n b_{ij} y_j, \sigma_i^{2})$

where $$\sigma_i^{2}$$ is a spatially varying covariance parameter, and $$b_{ii} = 0$$.

Here we will demonstrate the implementation of a CAR model using a canonical example: the lip cancer risk data in Scotland between 1975 and 1980. The original data is from (Kemp et al. 1985). This dataset includes observed lip cancer case counts at 56 spatial units in Scotland, with the expected number of cases as intercept, and an area-specific continuous variable coded for the proportion of the population employed in agriculture, fishing, or forestry (AFF). We want to model how lip cancer rates (O below) relate to AFF (aff below), as exposure to sunlight is a risk factor.

$O_i \sim \mathcal{Poisson}(\text{exp}(\beta_0 + \beta_1*aff + \phi_i + \log(\text{E}_i)))$
$\phi_i \mid \phi_j, j \neq i \sim \mathcal{N}(\alpha \sum_{j = 1}^n b_{ij} \phi_j, \sigma_i^{2})$

Setting up the data:

# Read the data from file containing columns: NAME, CANCER, CEXP, AFF, ADJ, WEIGHTS

# name of the counties
county = df_scot_cancer["NAME"].values

# observed
O = df_scot_cancer["CANCER"].values
N = len(O)

# expected (E) rates, based on the age of the local population
E = df_scot_cancer["CEXP"].values
logE = np.log(E)

# proportion of the population engaged in agriculture, forestry, or fishing (AFF)
aff = df_scot_cancer["AFF"].values / 10.0

# Spatial adjacency information: column (ADJ) contains list entries which are preprocessed to obtain adj as list of lists
df_scot_cancer["ADJ"].apply(lambda x: [int(val) for val in x.strip("][").split(",")]).to_list()
)

# Change to Python indexing (i.e. -1)

# spatial weight: column (WEIGHTS) contains list entries which are preprocessed to obtain weights as list of lists
weights = (
df_scot_cancer["WEIGHTS"]
.apply(lambda x: [int(val) for val in x.strip("][").split(",")])
.to_list()
)
Wplus = np.asarray([sum(w) for w in weights])


## A WinBUGS/PyMC2 implementation#

The classical WinBUGS implementation (more information here):

model
{
for (i in 1 : regions) {
O[i] ~ dpois(mu[i])
log(mu[i]) <- log(E[i]) + beta0 + beta1*aff[i]/10 + phi[i] + theta[i]
theta[i] ~ dnorm(0.0,tau.h)
}
phi[1:regions] ~ car.normal(adj[], weights[], Wplus[], tau.c)

beta0 ~ dnorm(0.0, 1.0E-5)  # vague prior on grand intercept
beta1 ~ dnorm(0.0, 1.0E-5)  # vague prior on covariate effect

tau.h ~ dgamma(3.2761, 1.81)
tau.c ~ dgamma(1.0, 1.0)

sd.h <- sd(theta[]) # marginal SD of heterogeneity effects
sd.c <- sd(phi[])   # marginal SD of clustering (spatial) effects

alpha <- sd.c / (sd.h + sd.c)
}


The main challenge to porting this model to PyMC3 is the car.normal function in WinBUGS. It is a likelihood function that conditions each realization on some neighbour realization (a smoothed property). In PyMC2, it could be implemented as a custom likelihood function (a @stochastic node) mu_phi:

@stochastic
def mu_phi(tau=tau_c, value=np.zeros(N)):
# Calculate mu based on average of neighbours
mu = np.array([ sum(weights[i]*value[adj[i]])/Wplus[i] for i in xrange(N)])
# Scale precision to the number of neighbours
taux = tau*Wplus
return normal_like(value,mu,taux)


We can just define mu_phi similarly and wrap it in a pymc3.DensityDist, however, doing so usually results in a very slow model (both in compiling and sampling). In general, porting pymc2 code into pymc3 (or even generally porting WinBUGS, JAGS, or Stan code into PyMC3) that use a for loops tend to perform poorly in theano, the backend of PyMC3.

The underlying mechanism in PyMC3 is very different compared to PyMC2, using for loops to generate RV or stacking multiple RV with arguments such as [pm.Binomial('obs%'%i, p[i], n) for i in range(K)] generate unnecessary large number of nodes in theano graph, which then slows down compilation appreciably.

The easiest way is to move the loop out of pm.Model. And usually is not difficult to do. For example, in Stan you can have a transformed data{} block; in PyMC3 you just need to compute it before defining your Model.

If it is absolutely necessary to use a for loop, you can use a theano loop (i.e., theano.scan), which you can find some introduction on the theano website and see a usecase in PyMC3 timeseries distribution.

## PyMC3 implementation using theano.scan#

So lets try to implement the CAR model using theano.scan. First we create a theano function with theano.scan and check if it really works by comparing its result to the for-loop.

value = np.asarray(
np.random.randn(
N,
),
dtype=theano.config.floatX,
)

maxwz = max([sum(w) for w in weights])
N = len(weights)
wmat = np.zeros((N, maxwz))
amat = np.zeros((N, maxwz), dtype="int32")
for i, w in enumerate(weights):
wmat[i, np.arange(len(w))] = w

# defining the tensor variables
x = tt.vector("x")
x.tag.test_value = value
w = tt.matrix("w")
# provide Theano with a default test-value
w.tag.test_value = wmat
a = tt.matrix("a", dtype="int32")
a.tag.test_value = amat

def get_mu(w, a):
a1 = tt.cast(a, "int32")
return tt.sum(w * x[a1]) / tt.sum(w)

results, _ = theano.scan(fn=get_mu, sequences=[w, a])
compute_elementwise = theano.function(inputs=[x, w, a], outputs=results)

print(compute_elementwise(value, wmat, amat))

def mu_phi(value):
N = len(weights)
# Calculate mu based on average of neighbours
mu = np.array([np.sum(weights[i] * value[adj[i]]) / Wplus[i] for i in range(N)])
return mu

print(mu_phi(value))

[-0.03256903  1.15369772 -0.88367923  0.19739633  0.2086316  -0.60637122
0.0821804  -1.37764807 -0.29132894  0.05172231 -0.14216446  0.08377788
0.39272752 -0.44500841  0.09997776  0.26546058  0.66625458  0.02811006
0.28666674  0.78806007 -0.13624863  0.03447581  0.07280037 -0.57371348
-0.14919659 -0.03653583 -0.39487311  1.04195921  0.14203152 -0.01594515
-0.01352197  0.00947096  0.37859961 -0.38082245  0.55939024 -0.10175728
-0.02190934  0.34201749 -0.41573475 -0.07256629  0.39001316  0.06973286
0.0839806   0.3099865  -0.27618815 -0.55965603  0.39102644 -0.4616728
-0.78256164  0.11348734  0.29724215  0.37958576  0.22718645  0.09669855
0.0420775   0.26230918]
[-0.03256903  1.15369772 -0.88367923  0.19739633  0.2086316  -0.60637122
0.0821804  -1.37764807 -0.29132894  0.05172231 -0.14216446  0.08377788
0.39272752 -0.44500841  0.09997776  0.26546058  0.66625458  0.02811006
0.28666674  0.78806007 -0.13624863  0.03447581  0.07280037 -0.57371348
-0.14919659 -0.03653583 -0.39487311  1.04195921  0.14203152 -0.01594515
-0.01352197  0.00947096  0.37859961 -0.38082245  0.55939024 -0.10175728
-0.02190934  0.34201749 -0.41573475 -0.07256629  0.39001316  0.06973286
0.0839806   0.3099865  -0.27618815 -0.55965603  0.39102644 -0.4616728
-0.78256164  0.11348734  0.29724215  0.37958576  0.22718645  0.09669855
0.0420775   0.26230918]


Since it produces the same result as the original for-loop, we will wrap it as a new distribution with a log-likelihood function in PyMC3.

class CAR(distribution.Continuous):
"""
Conditional Autoregressive (CAR) distribution

Parameters
----------
a : list of adjacency information
w : list of weight information
tau : precision at each location
"""

def __init__(self, w, a, tau, *args, **kwargs):
super().__init__(*args, **kwargs)
self.a = a = tt.as_tensor_variable(a)
self.w = w = tt.as_tensor_variable(w)
self.tau = tau * tt.sum(w, axis=1)
self.mode = 0.0

def get_mu(self, x):
def weight_mu(w, a):
a1 = tt.cast(a, "int32")
return tt.sum(w * x[a1]) / tt.sum(w)

mu_w, _ = scan(fn=weight_mu, sequences=[self.w, self.a])

return mu_w

def logp(self, x):
mu_w = self.get_mu(x)
tau = self.tau
return tt.sum(continuous.Normal.dist(mu=mu_w, tau=tau).logp(x))


We then use it in our PyMC3 version of the CAR model:

with pm.Model() as model1:
# Vague prior on intercept
beta0 = pm.Normal("beta0", mu=0.0, tau=1.0e-5)
# Vague prior on covariate effect
beta1 = pm.Normal("beta1", mu=0.0, tau=1.0e-5)

# Random effects (hierarchial) prior
tau_h = pm.Gamma("tau_h", alpha=3.2761, beta=1.81)
# Spatial clustering prior
tau_c = pm.Gamma("tau_c", alpha=1.0, beta=1.0)

# Regional random effects
theta = pm.Normal("theta", mu=0.0, tau=tau_h, shape=N)
mu_phi = CAR("mu_phi", w=wmat, a=amat, tau=tau_c, shape=N)

# Zero-centre phi
phi = pm.Deterministic("phi", mu_phi - tt.mean(mu_phi))

# Mean model
mu = pm.Deterministic("mu", tt.exp(logE + beta0 + beta1 * aff + theta + phi))

# Likelihood
Yi = pm.Poisson("Yi", mu=mu, observed=O)

# Marginal SD of heterogeniety effects
sd_h = pm.Deterministic("sd_h", tt.std(theta))
# Marginal SD of clustering (spatial) effects
sd_c = pm.Deterministic("sd_c", tt.std(phi))
# Proportion sptial variance
alpha = pm.Deterministic("alpha", sd_c / (sd_h + sd_c))

infdata1 = pm.sample(
1000,
tune=500,
cores=4,
target_accept=0.9,
max_treedepth=15,
return_inferencedata=True,
)

Auto-assigning NUTS sampler...

8.20% [16398/200000 00:16<03:03 Average Loss = 203.36]
Convergence achieved at 16500
Interrupted at 16,499 [8%]: Average Loss = 345.94
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu_phi, theta, tau_c, tau_h, beta1, beta0]

100.00% [6000/6000 05:00<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 500 tune and 1_000 draw iterations (2_000 + 4_000 draws total) took 301 seconds.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.


Note: there are some hidden problems with the model, some regions of the parameter space are quite difficult to sample from. Here I am using ADVI as initialization, which gives a smaller variance of the mass matrix. It keeps the sampler around the mode.

az.plot_trace(infdata1, var_names=["alpha", "sd_h", "sd_c"]);


We also got a lot of Rhat warning, that’s because the Zero-centre phi introduce unidentification to the model:

summary1 = az.summary(infdata1)
summary1[summary1["r_hat"] > 1.05]

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
mu_phi[0] 175.365 368.522 -217.156 799.593 178.151 137.750 4.0 4.0 5.0 11.0 2.41
mu_phi[1] 175.264 368.514 -217.062 799.332 178.147 137.744 4.0 4.0 5.0 11.0 2.41
mu_phi[2] 175.289 368.512 -217.037 799.785 178.148 137.745 4.0 4.0 5.0 11.0 2.41
mu_phi[3] 174.516 368.515 -217.975 798.649 178.148 137.719 4.0 4.0 5.0 11.0 2.41
mu_phi[4] 175.343 368.515 -217.005 799.412 178.148 137.746 4.0 4.0 5.0 11.0 2.41
mu_phi[5] 175.188 368.503 -217.167 799.809 178.143 137.738 4.0 4.0 5.0 11.0 2.41
mu_phi[6] 175.139 368.513 -217.232 799.189 178.147 137.739 4.0 4.0 5.0 11.0 2.41
mu_phi[7] 175.207 368.504 -217.303 799.763 178.145 137.741 4.0 4.0 5.0 11.0 2.41
mu_phi[8] 175.007 368.516 -217.048 799.499 178.149 137.736 4.0 4.0 5.0 11.0 2.41
mu_phi[9] 175.007 368.508 -217.321 799.023 178.145 137.733 4.0 4.0 5.0 11.0 2.41
mu_phi[10] 175.326 368.514 -217.090 799.384 178.148 137.746 4.0 4.0 5.0 11.0 2.41
mu_phi[11] 175.310 368.511 -217.104 799.357 178.147 137.744 4.0 4.0 5.0 11.0 2.41
mu_phi[12] 175.198 368.518 -217.064 799.598 178.149 137.742 4.0 4.0 5.0 11.0 2.41
mu_phi[13] 174.180 368.509 -218.106 798.798 178.146 137.705 4.0 4.0 5.0 11.0 2.41
mu_phi[14] 174.533 368.503 -217.864 798.439 178.143 137.716 4.0 4.0 5.0 11.0 2.41
mu_phi[15] 174.772 368.511 -217.671 798.663 178.146 137.726 4.0 4.0 5.0 11.0 2.41
mu_phi[16] 174.974 368.513 -217.195 799.203 178.147 137.733 4.0 4.0 5.0 11.0 2.41
mu_phi[17] 174.299 368.513 -218.131 798.458 178.147 137.711 4.0 4.0 5.0 11.0 2.41
mu_phi[18] 175.184 368.521 -217.285 799.266 178.151 137.743 4.0 4.0 5.0 11.0 2.41
mu_phi[19] 174.391 368.511 -218.213 798.576 178.147 137.714 4.0 4.0 5.0 11.0 2.41
mu_phi[20] 174.487 368.509 -218.084 798.590 178.146 137.716 4.0 4.0 5.0 11.0 2.41
mu_phi[21] 174.706 368.510 -217.803 798.459 178.145 137.723 4.0 4.0 5.0 11.0 2.41
mu_phi[22] 174.255 368.506 -218.227 798.131 178.145 137.708 4.0 4.0 5.0 11.0 2.41
mu_phi[23] 173.866 368.511 -218.869 797.673 178.147 137.695 4.0 4.0 5.0 11.0 2.41
mu_phi[24] 174.513 368.508 -217.918 798.342 178.145 137.716 4.0 4.0 5.0 11.0 2.41
mu_phi[25] 174.244 368.505 -218.070 798.446 178.143 137.706 4.0 4.0 5.0 11.0 2.41
mu_phi[26] 174.052 368.511 -218.570 798.030 178.146 137.701 4.0 4.0 5.0 11.0 2.41
mu_phi[27] 174.277 368.517 -218.181 798.366 178.149 137.711 4.0 4.0 5.0 11.0 2.41
mu_phi[28] 174.437 368.507 -217.987 798.423 178.145 137.714 4.0 4.0 5.0 11.0 2.41
mu_phi[29] 173.827 368.507 -218.562 797.972 178.144 137.691 4.0 4.0 5.0 11.0 2.41
mu_phi[30] 173.997 368.510 -218.535 797.960 178.146 137.699 4.0 4.0 5.0 11.0 2.41
mu_phi[31] 174.095 368.508 -218.345 798.311 178.145 137.702 4.0 4.0 5.0 11.0 2.41
mu_phi[32] 174.113 368.515 -218.415 798.264 178.148 137.705 4.0 4.0 5.0 11.0 2.41
mu_phi[33] 173.855 368.507 -218.763 797.645 178.144 137.693 4.0 4.0 5.0 11.0 2.41
mu_phi[34] 174.056 368.506 -218.382 798.185 178.145 137.701 4.0 4.0 5.0 11.0 2.41
mu_phi[35] 174.049 368.506 -218.506 798.046 178.145 137.701 4.0 4.0 5.0 11.0 2.41
mu_phi[36] 174.002 368.500 -218.261 798.144 178.142 137.697 4.0 4.0 5.0 11.0 2.41
mu_phi[37] 173.679 368.508 -218.934 797.950 178.146 137.688 4.0 4.0 5.0 11.0 2.41
mu_phi[38] 173.925 368.506 -218.584 797.757 178.144 137.696 4.0 4.0 5.0 11.0 2.41
mu_phi[39] 173.744 368.511 -218.538 798.085 178.147 137.691 4.0 4.0 5.0 11.0 2.41
mu_phi[40] 173.782 368.508 -218.434 797.948 178.146 137.692 4.0 4.0 5.0 11.0 2.41
mu_phi[41] 173.789 368.502 -218.704 797.772 178.142 137.688 4.0 4.0 5.0 11.0 2.42
mu_phi[42] 173.979 368.506 -218.531 798.007 178.143 137.696 4.0 4.0 5.0 11.0 2.41
mu_phi[43] 173.670 368.505 -218.977 797.518 178.144 137.686 4.0 4.0 5.0 11.0 2.41
mu_phi[44] 173.936 368.505 -218.600 797.965 178.143 137.694 4.0 4.0 5.0 11.0 2.41
mu_phi[45] 173.785 368.508 -218.366 798.158 178.146 137.692 4.0 4.0 5.0 11.0 2.41
mu_phi[46] 173.668 368.508 -218.812 797.661 178.147 137.688 4.0 4.0 5.0 11.0 2.41
mu_phi[47] 173.610 368.513 -219.018 797.681 178.148 137.687 4.0 4.0 5.0 11.0 2.41
mu_phi[48] 173.568 368.507 -219.224 797.455 178.146 137.684 4.0 4.0 5.0 11.0 2.41
mu_phi[49] 174.281 368.503 -218.298 798.272 178.142 137.706 4.0 4.0 5.0 11.0 2.41
mu_phi[50] 173.636 368.503 -219.057 797.613 178.143 137.683 4.0 4.0 5.0 11.0 2.41
mu_phi[51] 173.595 368.507 -218.613 797.867 178.145 137.684 4.0 4.0 5.0 11.0 2.41
mu_phi[52] 173.574 368.506 -218.595 798.050 178.145 137.683 4.0 4.0 5.0 11.0 2.41
mu_phi[53] 173.548 368.510 -218.972 797.742 178.147 137.684 4.0 4.0 5.0 11.0 2.41
mu_phi[54] 174.022 368.511 -218.659 798.018 178.146 137.700 4.0 4.0 5.0 11.0 2.41
mu_phi[55] 173.948 368.511 -218.511 797.932 178.146 137.697 4.0 4.0 5.0 11.0 2.41
az.plot_forest(
infdata1,
kind="ridgeplot",
var_names=["phi"],
combined=False,
ridgeplot_overlap=3,
ridgeplot_alpha=0.25,
colors="white",
figsize=(9, 7),
);

az.plot_posterior(infdata1, var_names=["alpha"]);


theano.scan is much faster than using a python for loop, but it is still quite slow. One approach for improving it is to use linear algebra. That is, we should try to find a way to use matrix multiplication instead of looping (if you have experience in using MATLAB, it is the same philosophy). In our case, we can totally do that.

For a similar problem, you can also have a look of my port of Lee and Wagenmakers’ book. For example, in Chapter 19, the Stan code use a for loop to generate the likelihood function, and I generate the matrix outside and use matrix multiplication etc to archive the same purpose.

## PyMC3 implementation using matrix “trick”#

Again, we try on some simulated data to make sure the implementation is correct.

maxwz = max([sum(w) for w in weights])
N = len(weights)
wmat2 = np.zeros((N, N))
amat2 = np.zeros((N, N), dtype="int32")
amat2[i, a] = 1
wmat2[i, a] = weights[i]

value = np.asarray(
np.random.randn(
N,
),
dtype=theano.config.floatX,
)

print(np.sum(value * amat2, axis=1) / np.sum(wmat2, axis=1))

def mu_phi(value):
N = len(weights)
# Calculate mu based on average of neighbours
mu = np.array([np.sum(weights[i] * value[adj[i]]) / Wplus[i] for i in range(N)])
return mu

print(mu_phi(value))

[ 2.25588227e-01 -8.20210477e-01  7.14165670e-01  3.07251143e-01
2.92335128e-01 -5.55562972e-01 -7.35797893e-01  5.94110931e-01
-2.72989587e-01 -6.21479051e-01  2.82642230e-01 -7.27562577e-01
9.72894293e-03  2.15441546e-01  3.16608110e-01 -5.38879039e-01
1.94526179e-01  1.19573877e-01 -2.02771767e-01  3.22618936e-01
-1.65824491e-01 -1.21021601e+00  3.18736883e-01 -4.98013820e-01
6.09599805e-02  2.40591863e-01 -6.29610020e-04  4.11071335e-02
-2.92231409e-01 -4.30276690e-01  2.06124487e-01 -6.86948082e-02
-1.26157102e-01 -9.18024137e-02  6.81225357e-01 -3.23761738e-01
5.23813476e-01 -1.10626670e-01 -6.48965436e-01 -9.01164776e-01
2.04888768e-01 -2.66684442e-01 -2.17412828e-01 -1.33620266e-01
-1.46843946e-01 -2.11092061e-01  1.34030315e-01 -3.39439739e-01
-7.96140962e-01  2.34936057e-01 -1.86926585e-01 -4.09354076e-01
-1.03251174e-01 -5.12496749e-01 -4.13624363e-01  1.80577442e-02]
[ 2.25588227e-01 -8.20210477e-01  7.14165670e-01  3.07251143e-01
2.92335128e-01 -5.55562972e-01 -7.35797893e-01  5.94110931e-01
-2.72989587e-01 -6.21479051e-01  2.82642230e-01 -7.27562577e-01
9.72894293e-03  2.15441546e-01  3.16608110e-01 -5.38879039e-01
1.94526179e-01  1.19573877e-01 -2.02771767e-01  3.22618936e-01
-1.65824491e-01 -1.21021601e+00  3.18736883e-01 -4.98013820e-01
6.09599805e-02  2.40591863e-01 -6.29610020e-04  4.11071335e-02
-2.92231409e-01 -4.30276690e-01  2.06124487e-01 -6.86948082e-02
-1.26157102e-01 -9.18024137e-02  6.81225357e-01 -3.23761738e-01
5.23813476e-01 -1.10626670e-01 -6.48965436e-01 -9.01164776e-01
2.04888768e-01 -2.66684442e-01 -2.17412828e-01 -1.33620266e-01
-1.46843946e-01 -2.11092061e-01  1.34030315e-01 -3.39439739e-01
-7.96140962e-01  2.34936057e-01 -1.86926585e-01 -4.09354076e-01
-1.03251174e-01 -5.12496749e-01 -4.13624363e-01  1.80577442e-02]


Now create a new CAR distribution with the matrix multiplication instead of theano.scan to get the mu

class CAR2(distribution.Continuous):
"""
Conditional Autoregressive (CAR) distribution

Parameters
----------
w : weight matrix
tau : precision at each location
"""

def __init__(self, w, a, tau, *args, **kwargs):
super().__init__(*args, **kwargs)
self.a = a = tt.as_tensor_variable(a)
self.w = w = tt.as_tensor_variable(w)
self.tau = tau * tt.sum(w, axis=1)
self.mode = 0.0

def logp(self, x):
tau = self.tau
w = self.w
a = self.a

mu_w = tt.sum(x * a, axis=1) / tt.sum(w, axis=1)
return tt.sum(continuous.Normal.dist(mu=mu_w, tau=tau).logp(x))

with pm.Model() as model2:
# Vague prior on intercept
beta0 = pm.Normal("beta0", mu=0.0, tau=1.0e-5)
# Vague prior on covariate effect
beta1 = pm.Normal("beta1", mu=0.0, tau=1.0e-5)

# Random effects (hierarchial) prior
tau_h = pm.Gamma("tau_h", alpha=3.2761, beta=1.81)
# Spatial clustering prior
tau_c = pm.Gamma("tau_c", alpha=1.0, beta=1.0)

# Regional random effects
theta = pm.Normal("theta", mu=0.0, tau=tau_h, shape=N)
mu_phi = CAR2("mu_phi", w=wmat2, a=amat2, tau=tau_c, shape=N)

# Zero-centre phi
phi = pm.Deterministic("phi", mu_phi - tt.mean(mu_phi))

# Mean model
mu = pm.Deterministic("mu", tt.exp(logE + beta0 + beta1 * aff + theta + phi))

# Likelihood
Yi = pm.Poisson("Yi", mu=mu, observed=O)

# Marginal SD of heterogeniety effects
sd_h = pm.Deterministic("sd_h", tt.std(theta))
# Marginal SD of clustering (spatial) effects
sd_c = pm.Deterministic("sd_c", tt.std(phi))
# Proportion sptial variance
alpha = pm.Deterministic("alpha", sd_c / (sd_h + sd_c))

infdata2 = pm.sample(
1000,
tune=500,
cores=4,
target_accept=0.9,
max_treedepth=15,
return_inferencedata=True,
)

Auto-assigning NUTS sampler...

8.24% [16487/200000 00:03<00:39 Average Loss = 202.66]
Convergence achieved at 16900
Interrupted at 16,899 [8%]: Average Loss = 338.61
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu_phi, theta, tau_c, tau_h, beta1, beta0]

100.00% [6000/6000 00:56<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 500 tune and 1_000 draw iterations (2_000 + 4_000 draws total) took 57 seconds.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.


As you can see, it is appreciably faster using matrix multiplication.

summary2 = az.summary(infdata2)
summary2[summary2["r_hat"] > 1.05]

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
mu_phi[0] 29.076 245.989 -425.818 388.013 111.959 84.513 5.0 5.0 6.0 14.0 1.86
mu_phi[1] 29.000 245.980 -425.952 387.926 111.954 84.510 5.0 5.0 6.0 14.0 1.86
mu_phi[2] 28.994 246.000 -425.650 388.224 111.966 84.519 5.0 5.0 6.0 14.0 1.86
mu_phi[3] 28.252 245.971 -426.657 387.110 111.949 84.506 5.0 5.0 6.0 14.0 1.86
mu_phi[4] 29.056 245.992 -425.765 387.904 111.960 84.514 5.0 5.0 6.0 14.0 1.86
mu_phi[5] 28.908 245.997 -425.880 388.138 111.966 84.519 5.0 5.0 6.0 14.0 1.86
mu_phi[6] 28.859 245.983 -425.961 387.830 111.957 84.512 5.0 5.0 6.0 14.0 1.86
mu_phi[7] 28.940 245.993 -425.777 387.958 111.964 84.517 5.0 5.0 6.0 14.0 1.86
mu_phi[8] 28.720 245.987 -426.006 387.682 111.958 84.512 5.0 5.0 6.0 14.0 1.86
mu_phi[9] 28.735 245.982 -426.243 387.557 111.956 84.511 5.0 5.0 6.0 14.0 1.86
mu_phi[10] 29.044 245.990 -425.548 388.004 111.960 84.514 5.0 5.0 6.0 14.0 1.86
mu_phi[11] 29.028 245.995 -425.704 388.058 111.963 84.516 5.0 5.0 6.0 14.0 1.86
mu_phi[12] 28.917 245.987 -425.954 387.818 111.957 84.512 5.0 5.0 6.0 14.0 1.86
mu_phi[13] 27.902 245.984 -426.922 386.918 111.956 84.511 5.0 5.0 6.0 14.0 1.86
mu_phi[14] 28.230 245.988 -426.431 387.230 111.960 84.515 5.0 5.0 6.0 14.0 1.86
mu_phi[15] 28.487 245.988 -426.110 387.548 111.960 84.515 5.0 5.0 6.0 14.0 1.86
mu_phi[16] 28.688 245.981 -426.016 387.680 111.955 84.510 5.0 5.0 6.0 14.0 1.86
mu_phi[17] 28.020 245.977 -426.864 386.932 111.953 84.509 5.0 5.0 6.0 14.0 1.86
mu_phi[18] 28.892 245.986 -425.857 387.955 111.957 84.512 5.0 5.0 6.0 14.0 1.86
mu_phi[19] 28.115 245.985 -426.847 386.944 111.955 84.511 5.0 5.0 6.0 14.0 1.86
mu_phi[20] 28.191 245.989 -426.539 387.026 111.963 84.517 5.0 5.0 6.0 14.0 1.86
mu_phi[21] 28.426 245.984 -426.382 387.549 111.958 84.513 5.0 5.0 6.0 14.0 1.86
mu_phi[22] 27.969 245.993 -426.554 387.104 111.961 84.515 5.0 5.0 6.0 14.0 1.86
mu_phi[23] 27.590 245.987 -427.080 386.605 111.957 84.512 5.0 5.0 6.0 14.0 1.86
mu_phi[24] 28.220 245.986 -426.470 387.350 111.961 84.515 5.0 5.0 6.0 14.0 1.86
mu_phi[25] 27.973 245.988 -426.703 387.040 111.960 84.514 5.0 5.0 6.0 14.0 1.86
mu_phi[26] 27.774 245.988 -427.010 386.806 111.955 84.510 5.0 5.0 6.0 14.0 1.86
mu_phi[27] 28.003 245.978 -426.589 387.169 111.952 84.508 5.0 5.0 6.0 14.0 1.86
mu_phi[28] 28.155 245.985 -426.398 387.266 111.958 84.513 5.0 5.0 6.0 14.0 1.86
mu_phi[29] 27.551 245.989 -426.999 386.797 111.958 84.513 5.0 5.0 6.0 14.0 1.86
mu_phi[30] 27.717 245.985 -427.012 386.761 111.955 84.510 5.0 5.0 6.0 14.0 1.86
mu_phi[31] 27.819 245.986 -426.992 386.784 111.956 84.511 5.0 5.0 6.0 14.0 1.86
mu_phi[32] 27.835 245.984 -427.027 386.634 111.957 84.512 5.0 5.0 6.0 14.0 1.86
mu_phi[33] 27.570 245.986 -427.049 386.650 111.958 84.513 5.0 5.0 6.0 14.0 1.86
mu_phi[34] 27.776 245.983 -427.043 386.654 111.955 84.510 5.0 5.0 6.0 14.0 1.86
mu_phi[35] 27.759 245.994 -426.826 386.938 111.960 84.515 5.0 5.0 6.0 14.0 1.86
mu_phi[36] 27.717 245.990 -426.968 386.638 111.960 84.514 5.0 5.0 6.0 14.0 1.86
mu_phi[37] 27.395 245.990 -427.283 386.474 111.959 84.513 5.0 5.0 6.0 14.0 1.86
mu_phi[38] 27.640 245.994 -427.011 386.533 111.961 84.515 5.0 5.0 6.0 14.0 1.86
mu_phi[39] 27.450 245.985 -427.123 386.452 111.956 84.511 5.0 5.0 6.0 14.0 1.86
mu_phi[40] 27.489 245.992 -427.026 386.447 111.959 84.514 5.0 5.0 6.0 14.0 1.86
mu_phi[41] 27.510 245.989 -427.096 386.690 111.958 84.513 5.0 5.0 6.0 14.0 1.86
mu_phi[42] 27.707 245.993 -427.099 386.600 111.962 84.516 5.0 5.0 6.0 14.0 1.86
mu_phi[43] 27.388 245.986 -427.276 386.562 111.957 84.512 5.0 5.0 6.0 14.0 1.86
mu_phi[44] 27.655 245.985 -426.991 386.699 111.958 84.513 5.0 5.0 6.0 14.0 1.86
mu_phi[45] 27.500 245.987 -427.099 386.421 111.957 84.512 5.0 5.0 6.0 14.0 1.86
mu_phi[46] 27.378 245.983 -427.165 386.493 111.956 84.511 5.0 5.0 6.0 14.0 1.86
mu_phi[47] 27.323 245.983 -427.393 386.392 111.955 84.511 5.0 5.0 6.0 14.0 1.86
mu_phi[48] 27.281 245.989 -427.254 386.360 111.958 84.512 5.0 5.0 6.0 14.0 1.86
mu_phi[49] 27.981 245.992 -426.561 386.968 111.964 84.518 5.0 5.0 6.0 14.0 1.86
mu_phi[50] 27.347 245.990 -427.411 386.545 111.959 84.514 5.0 5.0 6.0 14.0 1.86
mu_phi[51] 27.307 245.984 -427.552 386.183 111.957 84.512 5.0 5.0 6.0 14.0 1.86
mu_phi[52] 27.282 245.991 -427.218 386.431 111.960 84.514 5.0 5.0 6.0 14.0 1.86
mu_phi[53] 27.261 245.989 -427.296 386.452 111.958 84.513 5.0 5.0 6.0 14.0 1.86
mu_phi[54] 27.743 245.989 -427.031 386.763 111.958 84.512 5.0 5.0 6.0 14.0 1.86
mu_phi[55] 27.669 245.984 -426.962 386.760 111.956 84.511 5.0 5.0 6.0 14.0 1.86
az.plot_forest(
infdata2,
kind="ridgeplot",
var_names=["phi"],
combined=False,
ridgeplot_overlap=3,
ridgeplot_alpha=0.25,
colors="white",
figsize=(9, 7),
);

az.plot_trace(infdata2, var_names=["alpha", "sd_h", "sd_c"]);

az.plot_posterior(infdata2, var_names=["alpha"]);


## PyMC3 implementation using Matrix multiplication#

There are almost always multiple ways to formulate a particular model. Some approaches work better than the others under different contexts (size of your dataset, properties of the sampler, etc).

In this case, we can express the CAR prior as:

$\phi \sim \mathcal{N}(0, [D_\tau (I - \alpha B)]^{-1}).$

You can find more details in the original Stan case study. You might come across similar constructs in Gaussian Process, which result in a zero-mean Gaussian distribution conditioned on a covariance function.

In the Stan Code, matrix D is generated in the model using a transformed data{} block:

transformed data{
vector[n] zeros;
matrix<lower = 0>[n, n] D;
{
vector[n] W_rowsums;
for (i in 1:n) {
W_rowsums[i] = sum(W[i, ]);
}
D = diag_matrix(W_rowsums);
}
zeros = rep_vector(0, n);
}


We can generate the same matrix quite easily:

X = np.hstack((np.ones((N, 1)), stats.zscore(aff, ddof=1)[:, None]))
W = wmat2
D = np.diag(W.sum(axis=1))
log_offset = logE[:, None]


Then in the Stan model:

model {
phi ~ multi_normal_prec(zeros, tau * (D - alpha * W));
...
}


since the precision matrix just generated by some matrix multiplication, we can do just that in PyMC3:

with pm.Model() as model3:
# Vague prior on intercept and effect
beta = pm.Normal("beta", mu=0.0, tau=1.0, shape=(2, 1))

# Priors for spatial random effects
tau = pm.Gamma("tau", alpha=2.0, beta=2.0)
alpha = pm.Uniform("alpha", lower=0, upper=1)
phi = pm.MvNormal("phi", mu=0, tau=tau * (D - alpha * W), shape=(1, N))

# Mean model
mu = pm.Deterministic("mu", tt.exp(tt.dot(X, beta) + phi.T + log_offset))

# Likelihood
Yi = pm.Poisson("Yi", mu=mu.ravel(), observed=O)

infdata3 = pm.sample(1000, tune=2000, cores=4, target_accept=0.85, return_inferencedata=True)

Auto-assigning NUTS sampler...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [phi, alpha, tau, beta]

100.00% [12000/12000 02:10<00:00 Sampling 4 chains, 21 divergences]
Sampling 4 chains for 2_000 tune and 1_000 draw iterations (8_000 + 4_000 draws total) took 131 seconds.
There were 21 divergences after tuning. Increase target_accept or reparameterize.
The acceptance probability does not match the target. It is 0.9207219941260574, but should be close to 0.85. Try to increase the number of tuning steps.
The rhat statistic is larger than 1.05 for some parameters. This indicates slight problems during sampling.
The estimated number of effective samples is smaller than 200 for some parameters.

az.plot_trace(infdata3, var_names=["alpha", "beta", "tau"]);

az.plot_posterior(infdata3, var_names=["alpha"]);


Notice that since the model parameterization is different than in the WinBUGS model, the alpha can’t be interpreted in the same way.

## PyMC3 implementation using Sparse Matrix#

Note that in the node $$\phi \sim \mathcal{N}(0, [D_\tau (I - \alpha B)]^{-1})$$, we are computing the log-likelihood for a multivariate Gaussian distribution, which might not scale well in high-dimensions. We can take advantage of the fact that the covariance matrix here $$[D_\tau (I - \alpha B)]^{-1}$$ is sparse, and there are faster ways to compute its log-likelihood.

For example, a more efficient sparse representation of the CAR in Stan:

functions {
/**
* Return the log probability of a proper conditional autoregressive (CAR) prior
* with a sparse representation for the adjacency matrix
*
* @param phi Vector containing the parameters with a CAR prior
* @param tau Precision parameter for the CAR prior (real)
* @param alpha Dependence (usually spatial) parameter for the CAR prior (real)
* @param W_sparse Sparse representation of adjacency matrix (int array)
* @param n Length of phi (int)
* @param W_n Number of adjacent pairs (int)
* @param D_sparse Number of neighbors for each location (vector)
* @param lambda Eigenvalues of D^{-1/2}*W*D^{-1/2} (vector)
*
* @return Log probability density of CAR prior up to additive constant
*/
real sparse_car_lpdf(vector phi, real tau, real alpha,
int[,] W_sparse, vector D_sparse, vector lambda, int n, int W_n) {
row_vector[n] phit_D; // phi' * D
row_vector[n] phit_W; // phi' * W
vector[n] ldet_terms;

phit_D = (phi .* D_sparse)';
phit_W = rep_row_vector(0, n);
for (i in 1:W_n) {
phit_W[W_sparse[i, 1]] = phit_W[W_sparse[i, 1]] + phi[W_sparse[i, 2]];
phit_W[W_sparse[i, 2]] = phit_W[W_sparse[i, 2]] + phi[W_sparse[i, 1]];
}

for (i in 1:n) ldet_terms[i] = log1m(alpha * lambda[i]);
return 0.5 * (n * log(tau)
+ sum(ldet_terms)
- tau * (phit_D * phi - alpha * (phit_W * phi)));
}
}


with the data transformed in the model:

transformed data {
int W_sparse[W_n, 2];   // adjacency pairs
vector[n] D_sparse;     // diagonal of D (number of neighbors for each site)
vector[n] lambda;       // eigenvalues of invsqrtD * W * invsqrtD

{ // generate sparse representation for W
int counter;
counter = 1;
// loop over upper triangular part of W to identify neighbor pairs
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
if (W[i, j] == 1) {
W_sparse[counter, 1] = i;
W_sparse[counter, 2] = j;
counter = counter + 1;
}
}
}
}
for (i in 1:n) D_sparse[i] = sum(W[i]);
{
vector[n] invsqrtD;
for (i in 1:n) {
invsqrtD[i] = 1 / sqrt(D_sparse[i]);
}
}
}


and the likelihood:

model {
phi ~ sparse_car(tau, alpha, W_sparse, D_sparse, lambda, n, W_n);
}


This is quite a lot of code to digest, so my general approach is to compare the intermediate steps (whenever possible) with Stan. In this case, I will try to compute tau, alpha, W_sparse, D_sparse, lambda, n, W_n outside of the Stan model in R and compare with my own implementation.

Below is a Sparse CAR implementation in PyMC3 (see also here). Again, we try to avoid using any looping, as in Stan.

import scipy

class Sparse_CAR(distribution.Continuous):
"""
Sparse Conditional Autoregressive (CAR) distribution

Parameters
----------
alpha : spatial smoothing term
tau : precision at each location
"""

def __init__(self, alpha, W, tau, *args, **kwargs):
self.alpha = alpha = tt.as_tensor_variable(alpha)
self.tau = tau = tt.as_tensor_variable(tau)
D = W.sum(axis=0)
n, m = W.shape
self.n = n
self.median = self.mode = self.mean = 0
super().__init__(*args, **kwargs)

# eigenvalues of D^−1/2 * W * D^−1/2
Dinv_sqrt = np.diag(1 / np.sqrt(D))
DWD = np.matmul(np.matmul(Dinv_sqrt, W), Dinv_sqrt)
self.lam = scipy.linalg.eigvalsh(DWD)

# sparse representation of W
w_sparse = scipy.sparse.csr_matrix(W)
self.W = theano.sparse.as_sparse_variable(w_sparse)
self.D = tt.as_tensor_variable(D)

# Precision Matrix (inverse of Covariance matrix)
# d_sparse = scipy.sparse.csr_matrix(np.diag(D))
# self.D = theano.sparse.as_sparse_variable(d_sparse)
# self.Phi = self.tau * (self.D - self.alpha*self.W)

def logp(self, x):
logtau = self.n * tt.log(tau)
logdet = tt.log(1 - self.alpha * self.lam).sum()

# tau * ((phi .* D_sparse)' * phi - alpha * (phit_W * phi))
Wx = theano.sparse.dot(self.W, x)
tau_dot_x = self.D * x.T - self.alpha * Wx.ravel()
logquad = self.tau * tt.dot(x.ravel(), tau_dot_x.ravel())

# logquad = tt.dot(x.T, theano.sparse.dot(self.Phi, x)).sum()
return 0.5 * (logtau + logdet - logquad)

with pm.Model() as model4:
# Vague prior on intercept and effect
beta = pm.Normal("beta", mu=0.0, tau=1.0, shape=(2, 1))

# Priors for spatial random effects
tau = pm.Gamma("tau", alpha=2.0, beta=2.0)
alpha = pm.Uniform("alpha", lower=0, upper=1)
phi = Sparse_CAR("phi", alpha, W, tau, shape=(N, 1))

# Mean model
mu = pm.Deterministic("mu", tt.exp(tt.dot(X, beta) + phi + log_offset))

# Likelihood
Yi = pm.Poisson("Yi", mu=mu.ravel(), observed=O)

infdata4 = pm.sample(1000, tune=2000, cores=4, target_accept=0.85, return_inferencedata=True)

Auto-assigning NUTS sampler...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [phi, alpha, tau, beta]

100.00% [12000/12000 00:26<00:00 Sampling 4 chains, 1 divergences]
Sampling 4 chains for 2_000 tune and 1_000 draw iterations (8_000 + 4_000 draws total) took 27 seconds.
There was 1 divergence after tuning. Increase target_accept or reparameterize.
The estimated number of effective samples is smaller than 200 for some parameters.

az.plot_trace(infdata4, var_names=["alpha", "beta", "tau"]);

az.plot_posterior(infdata4, var_names=["alpha"])

array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f7ef1e9ebb0>],
dtype=object)


As you can see above, the sparse representation returns the same estimates, while being much faster than any other implementation.

## A few other warnings#

In Stan, there is an option to write a generated quantities block for sample generation. Doing the similar in pymc3, however, is not recommended.

Consider the following simple sample:

# Data
x = np.array([1.1, 1.9, 2.3, 1.8])
n = len(x)

with pm.Model() as model1:
# prior
mu = pm.Normal('mu', mu=0, tau=.001)
sigma = pm.Uniform('sigma', lower=0, upper=10)
# observed
xi = pm.Normal('xi', mu=mu, tau=1/(sigma**2), observed=x)
# generation
p = pm.Deterministic('p', pm.math.sigmoid(mu))
count = pm.Binomial('count', n=10, p=p, shape=10)


where we intended to use

count = pm.Binomial('count', n=10, p=p, shape=10)


to generate posterior prediction. However, if the new RV added to the model is a discrete variable it can cause weird turbulence to the trace. You can see issue #1990 for related discussion.

## Final remarks#

In this notebook, most of the parameter conventions (e.g., using tau when defining a Normal distribution) and choice of priors are strictly matched with the original code in Winbugs or Stan. However, it is important to note that merely porting the code from one probabilistic programming language to the another is not necessarily the best practice. The aim is not just to run the code in PyMC3, but to make sure the model is appropriate so that it returns correct estimates, and runs efficiently (fast sampling).

For example, as @aseyboldt pointed out here and here, non-centered parametrizations are often a better choice than the centered parametrizations. In our case here, phi is following a zero-mean Normal distribution, thus it can be left out in the beginning and used to scale the values afterwards. Often, doing this can avoid correlations in the posterior (it will be slower in some cases, however).

Another thing to keep in mind is that models can be sensitive to choices of prior distributions; for example, you can have a hard time using Normal variables with a large sd as prior. Gelman often recommends Cauchy or StudentT (i.e., weakly-informative priors). More information on prior choice can be found on the Stan wiki.

There are always ways to improve code. Since our computational graph with pm.Model() consist of theano objects, we can always do print(VAR_TO_CHECK.tag.test_value) right after the declaration or computation to check the shape. For example, in our last example, as suggested by @aseyboldt there seem to be a lot of correlation in the posterior. That probably slows down NUTS quite a bit. As a debugging tool and guide for reparametrization you can look at the singular value decomposition of the standardized samples from the trace – basically the eigenvalues of the correlation matrix. If the problem is high dimensional you can use stuff from scipy.sparse.linalg to only compute the largest singular value:

from scipy import linalg, sparse

vals = np.array([model.dict_to_array(v) for v in trace[1000:]]).T
vals[:] -= vals.mean(axis=1)[:, None]
vals[:] /= vals.std(axis=1)[:, None]

U, S, Vh = sparse.linalg.svds(vals, k=20)


Then look at plt.plot(S) to see if any principal components are obvious, and check which variables are contributing by looking at the singular vectors: plt.plot(U[:, -1] ** 2). You can get the indices by looking at model.bijection.ordering.vmap.

Another great way to check the correlations in the posterior is to do a pairplot of the posterior (if your model doesn’t contain too many parameters). You can see quite clearly if and where the the posterior parameters are correlated.

az.plot_pair(infdata1, var_names=["beta0", "beta1", "tau_h", "tau_c"], divergences=True);

az.plot_pair(infdata2, var_names=["beta0", "beta1", "tau_h", "tau_c"], divergences=True);

az.plot_pair(infdata3, var_names=["beta", "tau", "alpha"], divergences=True);

az.plot_pair(infdata4, var_names=["beta", "tau", "alpha"], divergences=True);

%load_ext watermark
%watermark -n -u -v -iv -w

arviz  0.9.0
theano 1.0.5
pymc3  3.9.3
scipy  1.5.0
pandas 1.0.5
numpy  1.18.5
last updated: Fri Aug 14 2020

CPython 3.8.3
IPython 7.16.1
watermark 2.0.2