GLM: Robust Linear Regression#

This tutorial first appeared as a post in a small series on Bayesian GLMs:

  1. The Inference Button: Bayesian GLMs made easy with PyMC3

  2. This world is far from Normal(ly distributed): Robust Regression in PyMC3

  3. The Best Of Both Worlds: Hierarchical Linear Regression in PyMC3

In this blog post I will write about:

  • How a few outliers can strongly affect the fit of linear regression models.

  • How replacing the Normal likelihood with a Student-T distribution produces robust regression.

  • How this can easily be done with the Bambi library by passing a family object or passing the family name as an argument.

This is the second part of a series on Bayesian GLMs (see part I about linear regression). In that post I described how minimizing the squared distance of the regression line is the same as maximizing the likelihood of a Normal distribution with its mean given by the regression line. This probabilistic formulation allows us to easily build a Bayesian linear regression model.
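In symbols, with intercept a, slope b and a fixed noise scale sigma, the equivalence is:

$$
\hat{a}, \hat{b}
= \arg\min_{a, b} \sum_i \bigl(y_i - (a + b x_i)\bigr)^2
= \arg\max_{a, b} \prod_i \mathcal{N}\bigl(y_i \mid a + b x_i, \sigma^2\bigr)
$$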

This worked splendidly on simulated data. The problem with simulated data, though, is that it’s, well, simulated. In the real world things tend to get messier, and assumptions like normality are easily violated by a few outliers.

Let’s see what happens if we add some outliers to our simulated data from the last post.

Again, import our modules.

%matplotlib inline

import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import theano

print(f"Running on pymc3 v{pm.__version__}")
Running on pymc3 v3.11.2
RANDOM_SEED = 8927
rng = np.random.default_rng(RANDOM_SEED)
%config InlineBackend.figure_format = 'retina'
az.style.use("arviz-darkgrid")

Create some toy data but also add some outliers.

size = 100
true_intercept = 1
true_slope = 2

x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * x
# add noise
y = true_regression_line + rng.normal(scale=0.5, size=size)

# Add outliers
x_out = np.append(x, [0.1, 0.15, 0.2])
y_out = np.append(y, [8, 6, 9])

data = pd.DataFrame(dict(x=x_out, y=y_out))

Plot the data together with the true regression line (the three points in the upper left corner are the outliers we added).

fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, xlabel="x", ylabel="y", title="Generated data and underlying model")
ax.plot(x_out, y_out, "x", label="sampled data")
ax.plot(x, true_regression_line, label="true regression line", lw=2.0)
plt.legend(loc=0);
[Figure: generated data and underlying model, showing the sampled data, the three outliers, and the true regression line]

Robust Regression#

Let’s see what happens if we estimate our Bayesian linear regression model using Bambi. The Model class takes a formula string (parsed by the formulae library) to describe the linear model and adds a Normal likelihood by default.

model = bmb.Model("y ~ x", data)
trace = model.fit(draws=2000, cores=2)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [y_sigma, Intercept, x]
100.00% [6000/6000 00:01<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 2 seconds.

To evaluate the fit, I am plotting the posterior predictive regression lines by taking regression parameters from the posterior distribution and plotting a regression line for each (this is all done inside of plot_posterior_predictive_glm()).
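For intuition, here is roughly what that helper does under the hood (a minimal sketch, assuming trace is the ArviZ InferenceData returned by Bambi and that the coefficients are named Intercept and x, as in the sampler output above).

# Flatten chains and draws, pick a random subset of posterior samples,
# and plot one regression line per sample.
post = trace.posterior.stack(sample=("chain", "draw"))
subset = rng.choice(post["Intercept"].size, size=100, replace=False)
for i in subset:
    plt.plot(x, post["Intercept"].values[i] + post["x"].values[i] * x, color="C1", alpha=0.05)

The cell below uses the built-in helper instead.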

plt.figure(figsize=(7, 5))
plt.plot(x_out, y_out, "x", label="data")
pm.plot_posterior_predictive_glm(trace, samples=100, label="posterior predictive regression lines")
plt.plot(x, true_regression_line, label="true regression line", lw=3.0, c="y")

plt.legend(loc=0);
/home/ada/.local/lib/python3.8/site-packages/pymc3/plots/posteriorplot.py:59: DeprecationWarning: The `plot_posterior_predictive_glm` function will migrate to Arviz in a future release. 
Keep up to date with `ArviZ <https://arviz-devs.github.io/arviz/>`_ for future updates.
  warnings.warn(
[Figure: posterior predictive regression lines from the Normal-likelihood model, plotted over the data and the true regression line]

As you can see, the fit is quite skewed and we have a fair amount of uncertainty in our estimate, as indicated by the wide range of posterior predictive regression lines. Why is this? The reason is that the Normal distribution does not have much mass in its tails, and consequently an outlier affects the fit strongly.
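To put a rough number on this (a quick aside for intuition, not part of the original post), compare the log-density of a point eight standard deviations from the mean under a standard Normal and under a Student-T with nu=1 and unit precision:

# The Normal log-density at 8 is roughly -33, while the heavy-tailed
# Student-T (nu=1) gives only about -5, i.e. the point is far less "surprising".
print(pm.Normal.dist(mu=0, sigma=1).logp(8.0).eval())
print(pm.StudentT.dist(mu=0, lam=1, nu=1).logp(8.0).eval())

A single point with such a tiny Normal likelihood dominates the total log-likelihood and drags the regression line towards it.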

A frequentist would estimate a robust regression by using a non-quadratic distance measure to evaluate the fit, as sketched below.
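For comparison, such a frequentist robust fit could look roughly like the following sketch, which uses statsmodels’ RLM with a Huber norm (an illustration added here, not part of the original post).

import statsmodels.api as sm

# Iteratively reweighted least squares with a Huber norm: large residuals are
# penalized linearly instead of quadratically, so outliers are down-weighted.
X = sm.add_constant(x_out)
huber_fit = sm.RLM(y_out, X, M=sm.robust.norms.HuberT()).fit()
print(huber_fit.params)  # robust estimates of intercept and slope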

But what’s a Bayesian to do? Since the problem is the light tails of the Normal distribution, we can instead assume that our data is not normally distributed but rather follows the Student-T distribution, which has heavier tails, as shown next (I read about this trick in “The Kruschke”, aka the puppy book, but I think Gelman was the first to formulate this).

Let’s look at those two distributions to get a feel for them.

normal_dist = pm.Normal.dist(mu=0, sigma=1)
t_dist = pm.StudentT.dist(mu=0, lam=1, nu=1)
x_eval = np.linspace(-8, 8, 300)
plt.plot(x_eval, theano.tensor.exp(normal_dist.logp(x_eval)).eval(), label="Normal", lw=2.0)
plt.plot(x_eval, theano.tensor.exp(t_dist.logp(x_eval)).eval(), label="Student T", lw=2.0)
plt.xlabel("x")
plt.ylabel("Probability density")
plt.legend();
[Figure: probability densities of the Normal and Student-T distributions]

As you can see, values far away from the mean (0 in this case) are much more likely under the Student-T distribution than under the Normal distribution.

To use a Student-T likelihood in Bambi, we can pass the distribution name "t" to the family argument, which specifies that our data is Student-T distributed. Note that this is the same syntax that R and statsmodels use.

model_robust = bmb.Model("y ~ x", data, family="t")
model_robust.set_priors({"nu": bmb.Prior("Gamma", alpha=3, beta=1)})
trace_robust = model_robust.fit(draws=2000, cores=2)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [y_nu, y_lam, Intercept, x]
100.00% [6000/6000 00:02<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 3 seconds.
plt.figure(figsize=(7, 5))
plt.plot(x_out, y_out, "x")
pm.plot_posterior_predictive_glm(trace_robust, label="posterior predictive regression lines")
plt.plot(x, true_regression_line, label="true regression line", lw=3.0, c="y")
plt.legend();
/home/ada/.local/lib/python3.8/site-packages/pymc3/plots/posteriorplot.py:59: DeprecationWarning: The `plot_posterior_predictive_glm` function will migrate to Arviz in a future release. 
Keep up to date with `ArviZ <https://arviz-devs.github.io/arviz/>`_ for future updates.
  warnings.warn(
[Figure: posterior predictive regression lines from the Student-T model, plotted over the data and the true regression line]

There, much better! The outliers are barely influencing our estimate at all, because our likelihood function now considers them to be much more probable than the Normal distribution does.
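As an additional check (an aside added here, not in the original post), we can compare the posterior summaries of the intercept and slope from both models; the Student-T fit should land much closer to the true intercept of 1 and slope of 2.

# Side-by-side posterior summaries for the Normal and Student-T models.
pd.concat(
    {
        "Normal": az.summary(trace, var_names=["Intercept", "x"]),
        "Student-T": az.summary(trace_robust, var_names=["Intercept", "x"]),
    }
)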

Summary#

  • Bambi allows you to pass in a family argument that contains information about the likelihood.

  • By changing the likelihood from a Normal distribution to a Student T distribution – which has more mass in the tails – we can perform Robust Regression.

The next post will be about logistic regression in PyMC3 and what the posterior and oatmeal have in common.

Extensions:

  • The Student-T distribution has, besides the mean and variance, a third parameter called the degrees of freedom that describes how much mass should be put into the tails. In the density comparison above it was set to 1, which puts maximum mass into the tails (setting it to infinity recovers the Normal distribution!). In the robust model above we placed a Gamma prior on it rather than fixing it; a quick way to inspect its posterior is sketched right after this list.

  • T distributions can be used as priors as well. I will show this in a future post on hierarchical GLMs.

  • How do we test whether our data is normal or violates that assumption in an important way? Check out Allen Downey’s great blog post on this topic.
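As a small sketch related to the first point above (an aside, not part of the original post): the degrees-of-freedom parameter shows up as y_nu in the robust model’s sampler output, so we can inspect its posterior directly; small values indicate heavy tails.

# Posterior summary of the Student-T degrees-of-freedom parameter ("y_nu").
az.summary(trace_robust, var_names=["y_nu"])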

Author: Thomas Wiecki

%load_ext watermark
%watermark -n -u -v -iv -w -p xarray
Last updated: Thu Aug 26 2021

Python implementation: CPython
Python version       : 3.8.10
IPython version      : 7.25.0

xarray: 0.17.0

pandas    : 1.2.1
numpy     : 1.21.0
theano    : 1.1.2
matplotlib: 3.3.4
arviz     : 0.11.2
pymc3     : 3.11.2
bambi     : 0.6.0

Watermark: 2.2.0