GLM in PyMC3: Out-Of-Sample Predictions#
In this notebook I explore the glm module of PyMC3. I am particularly interested in the model definition using patsy formulas, as it makes the model evaluation loop faster (easier to include features and/or interactions). There are many good resources on this subject, but most of them evaluate the model in-sample. For many applications we require doing predictions on out-of-sample data. This experiment was motivated by the discussion of the thread “Out of sample” predictions with the GLM sub-module on the (great!) forum discourse.pymc.io/, thank you all for your input!
Resources
-
In particular check GLM: Logistic Regression
Bambi, a more complete implementation of the GLM submodule which also allows for mixed-effects models.
Prepare Notebook#
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style(style="darkgrid", rc={"axes.facecolor": ".9", "grid.color": ".8"})
sns.set_palette(palette="deep")
sns_c = sns.color_palette(palette="deep")
import arviz as az
import patsy
import pymc3 as pm
from pymc3 import glm
plt.rcParams["figure.figsize"] = [7, 6]
plt.rcParams["figure.dpi"] = 100
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
Generate Sample Data#
We want to fit a logistic regression model where there is a multiplicative interaction between two numerical features.
SEED = 42
np.random.seed(SEED)
# Number of data points.
n = 250
# Create features.
x1 = np.random.normal(loc=0.0, scale=2.0, size=n)
x2 = np.random.normal(loc=0.0, scale=2.0, size=n)
epsilon = np.random.normal(loc=0.0, scale=0.5, size=n)
# Define target variable.
intercept = -0.5
beta_x1 = 1
beta_x2 = -1
beta_interaction = 2
z = intercept + beta_x1 * x1 + beta_x2 * x2 + beta_interaction * x1 * x2
p = 1 / (1 + np.exp(-z))
y = np.random.binomial(n=1, p=p, size=n)
df = pd.DataFrame(dict(x1=x1, x2=x2, y=y))
df.head()
| x1 | x2 | y | |
|---|---|---|---|
| 0 | 0.993428 | -2.521768 | 0 |
| 1 | -0.276529 | 1.835724 | 0 |
| 2 | 1.295377 | 4.244312 | 1 |
| 3 | 3.046060 | 2.064931 | 1 |
| 4 | -0.468307 | -3.038740 | 1 |
Let us do some exploration of the data:
sns.pairplot(
data=df, kind="scatter", height=2, plot_kws={"color": sns_c[1]}, diag_kws={"color": sns_c[2]}
);
\(x_1\) and \(x_2\) are not correlated.
\(x_1\) and \(x_2\) do not seem to separate the \(y\)-classes independently.
The distribution of \(y\) is not highly unbalanced.
fig, ax = plt.subplots()
sns_c_div = sns.diverging_palette(240, 10, n=2)
sns.scatterplot(x="x1", y="x2", data=df, hue="y", palette=[sns_c_div[0], sns_c_div[-1]])
ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5))
ax.set(title="Sample Data", xlim=(-9, 9), ylim=(-9, 9));
Prepare Data for Modeling#
I wanted to use the classmethod from_formula (see documentation), but I was not able to generate out-of-sample predictions with this approach (if you find a way please let me know!). As a workaround, I created the features from a formula using patsy directly and then use class pymc3.glm.linear.GLM (this was motivated by going into the source code).
# Define model formula.
formula = "y ~ x1 * x2"
# Create features.
y, x = patsy.dmatrices(formula_like=formula, data=df)
y = np.asarray(y).flatten()
labels = x.design_info.column_names
x = np.asarray(x)
As pointed out on the thread (thank you @Nicky!), we need to keep the labels of the features in the design matrix.
print(f"labels = {labels}")
labels = ['Intercept', 'x1', 'x2', 'x1:x2']
Now we do a train-test split.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=SEED)
Define and Fit the Model#
We now specify the model in PyMC3.
with pm.Model() as model:
# Set data container.
data = pm.Data("data", x_train)
# Define GLM family.
family = pm.glm.families.Binomial()
# Set priors.
priors = {
"Intercept": pm.Normal.dist(mu=0, sd=10),
"x1": pm.Normal.dist(mu=0, sd=10),
"x2": pm.Normal.dist(mu=0, sd=10),
"x1:x2": pm.Normal.dist(mu=0, sd=10),
}
# Specify model.
glm.GLM(y=y_train, x=data, family=family, intercept=False, labels=labels, priors=priors)
# Configure sampler.
trace = pm.sample(5000, chains=5, tune=1000, target_accept=0.87, random_seed=SEED)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (5 chains in 4 jobs)
NUTS: [x1:x2, x2, x1, Intercept]
Sampling 5 chains for 1_000 tune and 5_000 draw iterations (5_000 + 25_000 draws total) took 30 seconds.
# Plot chains.
az.plot_trace(data=trace);
az.summary(trace)
| mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | -0.536 | 0.262 | -1.043 | -0.060 | 0.002 | 0.002 | 14479.0 | 16189.0 | 1.0 |
| x1 | 0.616 | 0.211 | 0.223 | 1.013 | 0.002 | 0.001 | 14573.0 | 14635.0 | 1.0 |
| x2 | -0.865 | 0.221 | -1.285 | -0.466 | 0.002 | 0.001 | 12761.0 | 14415.0 | 1.0 |
| x1:x2 | 1.790 | 0.335 | 1.162 | 2.403 | 0.003 | 0.002 | 12686.0 | 14676.0 | 1.0 |
The chains look good.
Generate Out-Of-Sample Predictions#
Now we generate predictions on the test set.
# Update data reference.
pm.set_data({"data": x_test}, model=model)
# Generate posterior samples.
ppc_test = pm.sample_posterior_predictive(trace, model=model, samples=1000)
# Compute the point prediction by taking the mean
# and defining the category via a threshold.
p_test_pred = ppc_test["y"].mean(axis=0)
y_test_pred = (p_test_pred >= 0.5).astype("int")
Evaluate Model#
First let us compute the accuracy on the test set.
from sklearn.metrics import accuracy_score
print(f"accuracy = {accuracy_score(y_true=y_test, y_pred=y_test_pred): 0.3f}")
accuracy = 0.787
Next, we plot the roc curve and compute the auc.
from sklearn.metrics import RocCurveDisplay, auc, roc_curve
fpr, tpr, thresholds = roc_curve(
y_true=y_test, y_score=p_test_pred, pos_label=1, drop_intermediate=False
)
roc_auc = auc(fpr, tpr)
fig, ax = plt.subplots()
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
roc_display = roc_display.plot(ax=ax, marker="o", color=sns_c[4], markersize=4)
ax.set(title="ROC");
The model is performing as expected (we of course know the data generating process, which is almost never the case in practical applications).
Model Decision Boundary#
Finally we will describe and plot the model decision boundary, which is the space defined as
where \(p\) denotes the probability of belonging to the class \(y=1\) output by the model. To make this set explicit, we simply write the condition in terms of the model parametrization:
which implies
Solving for \(x_2\) we get the formula
Observe that this curve is a hyperbola centered at the singularity point \(x_1 = - \beta_2 / \beta_{12}\).
Let us now plot the model decision boundary using a grid:
# Construct grid.
x1_grid = np.linspace(start=-9, stop=9, num=300)
x2_grid = x1_grid
x1_mesh, x2_mesh = np.meshgrid(x1_grid, x2_grid)
x_grid = np.stack(arrays=[x1_mesh.flatten(), x2_mesh.flatten()], axis=1)
# Create features on the grid.
x_grid_ext = patsy.dmatrix(formula_like="x1 * x2", data=dict(x1=x_grid[:, 0], x2=x_grid[:, 1]))
x_grid_ext = np.asarray(x_grid_ext)
# Generate model predictions on the grid.
pm.set_data({"data": x_grid_ext}, model=model)
ppc_grid = pm.sample_posterior_predictive(trace, model=model, samples=1000)
Now we compute the model decision boundary on the grid for visualization purposes.
numerator = -(trace["Intercept"].mean(axis=0) + trace["x1"].mean(axis=0) * x1_grid)
denominator = trace["x2"].mean(axis=0) + trace["x1:x2"].mean(axis=0) * x1_grid
bd_grid = numerator / denominator
grid_df = pd.DataFrame(x_grid, columns=["x1", "x2"])
grid_df["p"] = ppc_grid["y"].mean(axis=0)
grid_df.sort_values("p", inplace=True)
p_grid = grid_df.pivot(index="x2", columns="x1", values="p").to_numpy()
We finally get the plot and the predictions on the test set:
fig, ax = plt.subplots()
cmap = sns.diverging_palette(240, 10, n=50, as_cmap=True)
sns.scatterplot(
x=x_test[:, 1].flatten(),
y=x_test[:, 2].flatten(),
hue=y_test,
palette=[sns_c_div[0], sns_c_div[-1]],
ax=ax,
)
sns.lineplot(x=x1_grid, y=bd_grid, color="black", ax=ax)
ax.contourf(x1_grid, x2_grid, p_grid, cmap=cmap, alpha=0.3)
ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5))
ax.lines[0].set_linestyle("dotted")
ax.set(title="Model Decision Boundary", xlim=(-9, 9), ylim=(-9, 9), xlabel="x1", ylabel="x2");
Remark: Note that we have computed the model decision boundary by using the mean of the posterior samples. However, we can generate a better (and more informative!) plot if we use the complete distribution (similarly for other metrics like accuracy and auc). One way of doing this is by storing and computing it inside the model definition as a Deterministic variable as in Bayesian Analysis with Python (Second edition) - Chapter 4.
%load_ext watermark
%watermark -n -u -v -iv -w
Last updated: Tue Mar 02 2021
Python implementation: CPython
Python version : 3.7.6
IPython version : 7.15.0
patsy : 0.5.1
arviz : 0.11.2
seaborn : 0.11.0
matplotlib: 3.3.4
pandas : 1.2.2
pymc3 : 3.11.1
sys : 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:33:30)
[Clang 9.0.1 ]
numpy : 1.20.1
Watermark: 2.2.0