GLM: Robust Regression using Custom Likelihood for Outlier Classification#

Using PyMC for robust regression with outlier detection, following the Hogg 2010 "Signal vs Noise" method.

Modelling concept:

  • This model uses a custom likelihood function as a mixture of two likelihoods, one for the main data-generating function (a linear model that we care about), and one for outliers.

  • The model does not marginalize over the outlier class, and thus gives us a classification of outlier-hood for each datapoint

  • The dataset is tiny and hardcoded into this Notebook. It contains errors in both x and y, but we will deal here only with errors in y.

Complementary approaches:

  • This is a complementary approach to the Student-T robust regression illustrated in the example generalized_linear_models/GLM-robust; that approach is also included here for comparison

  • See also a gist by Dan FM, published after a quick Twitter conversation: his “Hogg improved” model uses this same model structure and cleverly marginalizes over the outlier class, while still recovering the class assignments during sampling via a pm.Deterministic - this is really nice

  • The likelihood evaluation is essentially a copy of eqn 17 in “Data analysis recipes: Fitting a model to data” - Hogg et al. [2010]

  • The model is adapted specifically from Jake Vanderplas’ and Brigitta Sipocz’ implementation in the AstroML book [Željko Ivezić et al., 2014, Vanderplas et al., 2012]

Setup#

Installation Notes#

See the original project README for full details on dependencies and the environment in which the notebook was written. A summary of the environment in which this notebook was executed is available in the “Watermark” section.

Attention

This notebook uses libraries that are not PyMC dependencies and therefore need to be installed specifically to run this notebook. Open the dropdown below for extra guidance.

Extra dependencies install instructions

In order to run this notebook (either locally or on Binder) you will need not only a working PyMC installation with all optional dependencies, but also some extra dependencies. For advice on installing PyMC itself, please refer to Installation

You can install these dependencies with your preferred package manager; the pip and conda commands below are given as examples.

$ pip install seaborn

Note that if you want (or need) to install the packages from inside the notebook instead of the command line, you can do so by running a variation of the pip command:

import sys

!{sys.executable} -m pip install seaborn

You should not run !pip install directly, as it might install the package in a different environment and thus not be available from the Jupyter notebook even though installed.

Another alternative is using conda instead:

$ conda install seaborn

When installing scientific Python packages with conda, we recommend using conda-forge

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Imports#

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import seaborn as sns

from matplotlib.lines import Line2D
from scipy import stats
az.style.use("arviz-darkgrid")

Load Data#

We’ll use the Hogg 2010 data available at https://github.com/astroML/astroML/blob/master/astroML/datasets/hogg2010test.py

It’s a very small dataset, so for convenience it’s hardcoded below

# cut & pasted directly from the fetch_hogg2010test() function
# identical to the original dataset as hardcoded in the Hogg 2010 paper

dfhogg = pd.DataFrame(
    np.array(
        [
            [1, 201, 592, 61, 9, -0.84],
            [2, 244, 401, 25, 4, 0.31],
            [3, 47, 583, 38, 11, 0.64],
            [4, 287, 402, 15, 7, -0.27],
            [5, 203, 495, 21, 5, -0.33],
            [6, 58, 173, 15, 9, 0.67],
            [7, 210, 479, 27, 4, -0.02],
            [8, 202, 504, 14, 4, -0.05],
            [9, 198, 510, 30, 11, -0.84],
            [10, 158, 416, 16, 7, -0.69],
            [11, 165, 393, 14, 5, 0.30],
            [12, 201, 442, 25, 5, -0.46],
            [13, 157, 317, 52, 5, -0.03],
            [14, 131, 311, 16, 6, 0.50],
            [15, 166, 400, 34, 6, 0.73],
            [16, 160, 337, 31, 5, -0.52],
            [17, 186, 423, 42, 9, 0.90],
            [18, 125, 334, 26, 8, 0.40],
            [19, 218, 533, 16, 6, -0.78],
            [20, 146, 344, 22, 5, -0.56],
        ]
    ),
    columns=["id", "x", "y", "sigma_y", "sigma_x", "rho_xy"],
)

dfhogg["id"] = dfhogg["id"].apply(lambda x: "p{}".format(int(x)))
dfhogg.set_index("id", inplace=True)
dfhogg.head()
         x      y  sigma_y  sigma_x  rho_xy
id
p1   201.0  592.0     61.0      9.0   -0.84
p2   244.0  401.0     25.0      4.0    0.31
p3    47.0  583.0     38.0     11.0    0.64
p4   287.0  402.0     15.0      7.0   -0.27
p5   203.0  495.0     21.0      5.0   -0.33

1. Basic EDA#

Exploratory Data Analysis

Note:

  • this is very rudimentary so we can quickly get to the PyMC part

  • the dataset contains errors in both x and y, but we will deal here only with errors in y.

  • see Hogg et al. [2010] for more detail

with plt.rc_context({"figure.constrained_layout.use": False}):
    gd = sns.jointplot(
        x="x",
        y="y",
        data=dfhogg,
        kind="scatter",
        height=6,
        marginal_kws={"bins": 12, "kde": True, "kde_kws": {"cut": 1}},
        joint_kws={"edgecolor": "w", "linewidth": 1.2, "s": 80},
    )

_ = gd.ax_joint.errorbar(
    "x", "y", "sigma_y", "sigma_x", fmt="none", ecolor="#4878d0", data=dfhogg, zorder=10
)

for idx, r in dfhogg.iterrows():
    _ = gd.ax_joint.annotate(
        text=idx,
        xy=(r["x"], r["y"]),
        xycoords="data",
        xytext=(10, 10),
        textcoords="offset points",
        color="#999999",
        zorder=1,
    )


_ = gd.fig.suptitle(
    (
        "Original datapoints in Hogg 2010 dataset\n"
        + "showing marginal distributions and errors sigma_x, sigma_y"
    ),
    y=1.05,
);

Observe:

  • Even judging just by eye, you can see these observations mostly fall on / around a straight line with positive gradient

  • It looks like a few of the datapoints may be outliers from such a line

  • Measurement error (independently on x and y) varies across the observations

2. Basic Feature Engineering#

2.1 Transform and standardize dataset#

It’s common practice to standardize the input values to a linear model, because this leads to coefficients sitting in the same range and being more directly comparable, as noted e.g. in Gelman [2008]

So, following Gelman’s paper above, we’ll divide by 2 s.d. here

  • since this model is very simple, we just standardize directly, rather than using e.g. a scikit-learn FunctionTransformer

  • ignoring rho_xy for now

Additional note on scaling the output feature y and measurement error sigma_y:

  • This is unconventional - typically you wouldn’t scale an output feature

  • However, in the Hogg model we fit a custom two-part likelihood function of Normals which encourages a globally minimised log-loss by allowing outliers to fit to their own Normal distribution. This outlier distribution is specified using a stdev stated as an offset sigma_y_out from sigma_y

  • This offset value has the effect of requiring sigma_y to be restated in the same scale as the stdev of y (see the transformation below)
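
Concretely, the transformation applied in the next cell is:

\[x' = \frac{x - \bar{x}}{2\,s_{x}}, \qquad y' = \frac{y - \bar{y}}{2\,s_{y}}, \qquad \sigma_{y}' = \frac{\sigma_{y}}{2\,s_{y}}\]

where \(\bar{x}, \bar{y}\) are the sample means and \(s_{x}, s_{y}\) the sample standard deviations.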

Standardize (mean center and divide by 2 sd):

dfhoggs = (dfhogg[["x", "y"]] - dfhogg[["x", "y"]].mean(0)) / (2 * dfhogg[["x", "y"]].std(0))
dfhoggs["sigma_x"] = dfhogg["sigma_x"] / (2 * dfhogg["x"].std())
dfhoggs["sigma_y"] = dfhogg["sigma_y"] / (2 * dfhogg["y"].std())
with plt.rc_context({"figure.constrained_layout.use": False}):
    gd = sns.jointplot(
        x="x",
        y="y",
        data=dfhoggs,
        kind="scatter",
        height=6,
        marginal_kws={"bins": 12, "kde": True, "kde_kws": {"cut": 1}},
        joint_kws={"edgecolor": "w", "linewidth": 1, "s": 80},
    )
gd.ax_joint.errorbar("x", "y", "sigma_y", "sigma_x", fmt="none", ecolor="#4878d0", data=dfhoggs)
gd.fig.suptitle(
    (
        "Quick view to confirm action of\n"
        + "standardizing datapoints in Hogg 2010 dataset\n"
        + "showing marginal distributions and errors sigma_x, sigma_y"
    ),
    y=1.08,
);

3. Simple Linear Model with no Outlier Correction#

3.1 Specify Model#

Before we get more advanced, I want to demo the fit of a simple linear model with a Normal likelihood function. The priors are also Normally distributed, so this behaves like OLS with ridge regularization (an L2 penalty).

Note: the dataset also has sigma_x and rho_xy available, but for this exercise we’ve chosen to use only sigma_y

\[\hat{y} \sim \mathcal{N}(\beta^{T} \vec{x}_{i}, \sigma_{i})\]

where:

  • \(\beta\) = \(\{1, \beta_{j \in X_{j}}\}\) <— linear coefs in \(X_{j}\), in this case 1 + x

  • \(\sigma\) = error term <— in this case we set this to an unpooled \(\sigma_{i}\): the measured error sigma_y for each datapoint

coords = {"coefs": ["intercept", "slope"], "datapoint_id": dfhoggs.index}
with pm.Model(coords=coords) as mdl_ols:
    # Define weakly informative Normal priors to give Ridge regression
    beta = pm.Normal("beta", mu=0, sigma=10, dims="coefs")

    # Define linear model
    y_est = beta[0] + beta[1] * dfhoggs["x"]

    # Define Normal likelihood
    pm.Normal("y", mu=y_est, sigma=dfhoggs["sigma_y"], observed=dfhoggs["y"], dims="datapoint_id")

pm.model_to_graphviz(mdl_ols)

3.2 Fit Model#

Note that we are purposely skipping a prior predictive check step here.
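
For reference, a minimal sketch of what such a check could look like (not executed as part of this notebook):

# sketch only: draw from the priors and prior predictive of mdl_ols
with mdl_ols:
    idata_prior = pm.sample_prior_predictive(500, random_seed=42)  # 500 prior draws

# visual comparison of prior predictive draws against the observed y
# az.plot_ppc(idata_prior, group="prior");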

3.2.1 Sample Posterior#

with mdl_ols:
    trc_ols = pm.sample(
        tune=5000,
        draws=500,
        chains=4,
        cores=4,
        init="advi+adapt_diag",
        n_init=50000,
    )
100.00% [22000/22000 00:01<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 5_000 tune and 500 draw iterations (20_000 + 2_000 draws total) took 1 seconds.

3.2.2 View Diagnostics#

NOTE: We will illustrate this OLS fit and compare to the datapoints in the final comparison plot

Traceplot

_ = az.plot_trace(trc_ols, compact=False)

Plot posterior joint distribution (since the model has only 2 coeffs, we can easily view this as a 2D joint distribution)

marginal_kwargs = {"kind": "kde", "rug": True, "color": "C0"}
ax = az.plot_pair(
    trc_ols,
    var_names="beta",
    marginals=True,
    kind=["scatter", "kde"],
    scatter_kwargs={"color": "C0", "alpha": 0.4},
    marginal_kwargs=marginal_kwargs,
)
fig = ax[0, 0].get_figure()
fig.suptitle("Posterior joint distribution (mdl_ols)", y=1.02);

4. Simple Linear Model with Robust Student-T Likelihood#

I’ve added this brief section in order to directly compare against the Student-T based method demonstrated in Thomas Wiecki’s notebook in the PyMC documentation

Instead of using a Normal distribution for the likelihood, we use a Student-T which has fatter tails. In theory this allows outliers to have a smaller influence in the likelihood estimation. This method does not produce inlier / outlier flags (it marginalizes over such a classification) but it’s simpler and faster to run than the Signal Vs Noise model below, so a comparison seems worthwhile.
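
As a quick aside, the effect of the fatter tails can be seen by comparing the log density that a standard Normal and a Student-T (here with \(\nu = 1\)) assign to a point far from the mean:

# aside: log densities at a point 5 sd from the mean, Normal vs StudentT(nu=1)
print("Normal  :", stats.norm(0, 1).logpdf(5.0))
print("StudentT:", stats.t(df=1, loc=0, scale=1).logpdf(5.0))

The Student-T assigns far more log probability to such a point, so a few extreme datapoints pull the fit around much less.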

4.1 Specify Model#

In this modification, we allow the likelihood to be more robust to outliers (have fatter tails)

\[\hat{y} \sim \text{StudentT}(\beta^{T} \vec{x}_{i}, \sigma_{i}, \nu)\]

where:

  • \(\beta\) = \(\{1, \beta_{j \in X_{j}}\}\) <— linear coefs in \(X_{j}\), in this case 1 + x

  • \(\sigma\) = error term <— in this case we set this to an unpooled \(\sigma_{i}\): the measured error sigma_y for each datapoint

  • \(\nu\) = degrees of freedom <— allowing a pdf with fat tails and thus less influence from outlier datapoints

Note: the dataset also has sigma_x and rho_xy available, but for this exercise, I’ve chosen to only use sigma_y

with pm.Model(coords=coords) as mdl_studentt:
    # define weakly informative Normal priors to give Ridge regression
    beta = pm.Normal("beta", mu=0, sigma=10, dims="coefs")

    # define linear model
    y_est = beta[0] + beta[1] * dfhoggs["x"]

    # define prior for StudentT degrees of freedom
    # InverseGamma has nice properties:
    # it's continuous and has support x ∈ (0, inf)
    nu = pm.InverseGamma("nu", alpha=1, beta=1)

    # define Student T likelihood
    pm.StudentT(
        "y", mu=y_est, sigma=dfhoggs["sigma_y"], nu=nu, observed=dfhoggs["y"], dims="datapoint_id"
    )

pm.model_to_graphviz(mdl_studentt)

4.2 Fit Model#

4.2.1 Sample Posterior#

with mdl_studentt:
    trc_studentt = pm.sample(
        tune=5000,
        draws=500,
        chains=4,
        cores=4,
        init="advi+adapt_diag",
        n_init=50000,
    )
100.00% [22000/22000 00:02<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 5_000 tune and 500 draw iterations (20_000 + 2_000 draws total) took 2 seconds.

4.2.2 View Diagnostics#

NOTE: We will illustrate this StudentT fit and compare to the datapoints in the final comparison plot

Traceplot

_ = az.plot_trace(trc_studentt, compact=False);

Plot posterior joint distribution

marginal_kwargs["color"] = "C1"
ax = az.plot_pair(
    trc_studentt,
    var_names="beta",
    kind=["scatter", "kde"],
    divergences=True,
    marginals=True,
    marginal_kwargs=marginal_kwargs,
    scatter_kwargs={"color": "C1", "alpha": 0.4},
)
ax[0, 0].get_figure().suptitle("Posterior joint distribution (mdl_studentt)");

4.2.3 View the shift in posterior joint distributions from OLS to StudentT#

marginal_kwargs["rug"] = False
marginal_kwargs["color"] = "C0"
ax = az.plot_pair(
    trc_ols,
    var_names="beta",
    kind=["scatter", "kde"],
    divergences=True,
    figsize=[12, 12],
    marginals=True,
    marginal_kwargs=marginal_kwargs,
    scatter_kwargs={"color": "C0", "alpha": 0.4},
    kde_kwargs={"contour_kwargs": {"colors": "C0"}},
)

marginal_kwargs["color"] = "C1"
az.plot_pair(
    trc_studentt,
    var_names="beta",
    kind=["scatter", "kde"],
    divergences=True,
    marginals=True,
    marginal_kwargs=marginal_kwargs,
    scatter_kwargs={"color": "C1", "alpha": 0.4},
    kde_kwargs={"contour_kwargs": {"colors": "C1"}},
    ax=ax,
)

ax[0, 0].get_figure().suptitle(
    "Posterior joint distributions\n(showing general movement from OLS to StudentT)"
);

Observe:

  • Both beta parameters appear to have greater variance than in the OLS regression

  • This is due to \(\nu\) appearing to converge to a value nu ~ 1 (see the quick summary below), indicating that a fat-tailed likelihood fits better than a thin-tailed one

  • The parameter beta[intercept] has moved much closer to \(0\), which is interesting: if the theoretical relationship y ~ f(x) has no offset, then for this mean-centered dataset, the intercept should indeed be \(0\): it might easily be getting pushed off-course by outliers in the OLS model.

  • The parameter beta[slope] has accordingly become greater: perhaps moving closer to the theoretical function f(x)
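
As a quick check of the \(\nu\) point above, its posterior can be summarized directly:

# quick check of the posterior for the StudentT degrees of freedom
az.summary(trc_studentt, var_names=["nu"], kind="stats")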


5. Linear Model with Custom Likelihood to Distinguish Outliers: Hogg Method#

Please read the paper (Hogg 2010) and Jake Vanderplas’ code for more complete information about the modelling technique.

The general idea is to create a ‘mixture’ model whereby datapoints can be described by either:

  1. the proposed (linear) model (thus a datapoint is an inlier), or

  2. a second model, which for convenience we also propose to be linear, but allow it to have a different mean and variance (thus a datapoint is an outlier)

5.1 Specify Model#

The likelihood is evaluated over a mixture of two likelihoods, one for ‘inliers’, one for ‘outliers’. A Bernoulli distribution is used to randomly assign each of the N datapoints to either the inlier or outlier group, and we sample the model as usual to infer robust model parameters and inlier / outlier flags:

\[ \mathcal{logL} = \sum_{i}^{i=N} log \left[ \frac{(1 - B_{i})}{\sqrt{2 \pi \sigma_{in}^{2}}} exp \left( - \frac{(x_{i} - \mu_{in})^{2}}{2\sigma_{in}^{2}} \right) \right] + \sum_{i}^{i=N} log \left[ \frac{B_{i}}{\sqrt{2 \pi (\sigma_{in}^{2} + \sigma_{out}^{2})}} exp \left( - \frac{(x_{i}- \mu_{out})^{2}}{2(\sigma_{in}^{2} + \sigma_{out}^{2})} \right) \right] \]

where:

  • \(B_{i}\) is Bernoulli-distributed \(B_{i} \in \{0_{(inlier)},1_{(outlier)}\}\)

  • \(\mu_{in} = \beta^{T} \vec{x}_{i}\) as before for inliers, where \(\beta\) = \(\{1, \beta_{j \in X_{j}}\}\) <— linear coefs in \(X_{j}\), in this case 1 + x

  • \(\sigma_{in}\) = noise term <— in this case we set this to an unpooled \(\sigma_{i}\): the measured error sigma_y for each datapoint

  • \(\mu_{out}\) <— is a random pooled variable for outliers

  • \(\sigma_{out}\) = additional noise term <— is a random unpooled variable for outliers

This notebook uses the Potential class combined with logp to create a likelihood and build this model in which a feature is not observed - here the Bernoulli switching variable.

Usage of Potential is not discussed in depth here; other resources are worth referring to for details. A toy illustration of the idea is sketched below, before the full model.
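
This toy model (entirely separate from the Hogg model, shown only for illustration) adds an arbitrary term to the model log-probability, here a soft penalty on negative values of a parameter:

# toy illustration only: a Potential adds an arbitrary term to the model logp
with pm.Model() as toy_potential_model:
    theta = pm.Normal("theta", mu=0, sigma=1)
    # soft penalty discouraging (but not forbidding) negative theta
    pm.Potential("positivity_penalty", pm.math.switch(theta < 0, -10.0, 0.0))

With that in mind, the full Hogg model is specified below.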

with pm.Model(coords=coords) as mdl_hogg:
    # register input data as pm.ConstantData containers (the tsv_ prefix is a legacy of theano shared variables)
    tsv_x = pm.ConstantData("tsv_x", dfhoggs["x"], dims="datapoint_id")
    tsv_y = pm.ConstantData("tsv_y", dfhoggs["y"], dims="datapoint_id")
    tsv_sigma_y = pm.ConstantData("tsv_sigma_y", dfhoggs["sigma_y"], dims="datapoint_id")

    # weakly informative Normal priors (L2 ridge reg) for inliers
    beta = pm.Normal("beta", mu=0, sigma=10, dims="coefs")

    # linear model for mean for inliers
    y_est_in = beta[0] + beta[1] * tsv_x  # dims="datapoint_id"

    # very weakly informative mean for all outliers
    y_est_out = pm.Normal("y_est_out", mu=0, sigma=10, initval=pm.floatX(0.0))

    # very weakly informative prior for additional variance for outliers
    sigma_y_out = pm.HalfNormal("sigma_y_out", sigma=10, initval=pm.floatX(1.0))

    # create in/outlier distributions to get a logp evaluated on the observed y
    # this is not strictly a pymc likelihood, but behaves like one when we
    # evaluate it within a Potential (whose value is added to the model logp)
    inlier_logp = pm.logp(pm.Normal.dist(mu=y_est_in, sigma=tsv_sigma_y), tsv_y)
    outlier_logp = pm.logp(pm.Normal.dist(mu=y_est_out, sigma=tsv_sigma_y + sigma_y_out), tsv_y)

    # frac_outliers only needs to span [0, .5]
    # initval for is_outlier initialised in order to create class asymmetry
    frac_outliers = pm.Uniform("frac_outliers", lower=0.0, upper=0.5)
    is_outlier = pm.Bernoulli(
        "is_outlier",
        p=frac_outliers,
        initval=(np.random.rand(tsv_x.eval().shape[0]) < 0.4) * 1,
        dims="datapoint_id",
    )

    # non-sampled Potential evaluates the Normal.dist.logp's
    potential = pm.Potential(
        "obs",
        ((1 - is_outlier) * inlier_logp).sum() + (is_outlier * outlier_logp).sum(),
    )

5.2 Fit Model#

5.2.1 Sample Posterior#

Note that pm.sample conveniently and automatically creates the compound sampling process to:

  1. sample a Bernoulli variable (the class is_outlier) using a discrete sampler

  2. sample the continuous variables using a continuous sampler

Further note:

  • This also means we can’t initialise using ADVI, so we will init using jitter+adapt_diag

  • In order to pass kwargs to a particular stepper, wrap them in a dict addressed to the lowercased name of the stepper, e.g. nuts={'target_accept': 0.85}. Alternatively, you can construct the compound step explicitly, as sketched below.
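
A sketch of the explicit compound step assignment (not executed here; the automatic assignment made by pm.sample in the next cell is equivalent in spirit):

# sketch only: explicit compound step assignment for mdl_hogg
with mdl_hogg:
    # discrete sampler for the Bernoulli outlier flags
    step_binary = pm.BinaryGibbsMetropolis([is_outlier])
    # NUTS for the remaining continuous parameters
    step_nuts = pm.NUTS(target_accept=0.99)
    # trc_hogg_manual = pm.sample(tune=10000, draws=500, step=[step_binary, step_nuts])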

with mdl_hogg:
    trc_hogg = pm.sample(
        tune=10000,
        draws=500,
        chains=4,
        cores=4,
        init="jitter+adapt_diag",
        nuts={"target_accept": 0.99},
    )
100.00% [42000/42000 00:27<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 10_000 tune and 500 draw iterations (40_000 + 2_000 draws total) took 27 seconds.
/Users/benjamv/mambaforge/envs/pymc_env/lib/python3.10/site-packages/arviz/stats/diagnostics.py:584: RuntimeWarning: invalid value encountered in scalar divide
  (between_chain_variance / within_chain_variance + num_samples - 1) / (num_samples)

5.2.2 View Diagnostics#

We will illustrate this model fit and compare to the datapoints in the final comparison plot

rvs = ["beta", "y_est_out", "sigma_y_out", "frac_outliers"]
_ = az.plot_trace(trc_hogg, var_names=rvs, compact=False);

Observe:

  • At the default target_accept = 0.8 there are lots of divergences, indicating this is not a particularly stable model

  • However, at target_accept = 0.99 (as used above, together with increasing tune from 5000 to 10000), the traces exhibit fewer divergences and appear slightly better behaved.

  • The traces for the inlier model beta parameters, and for outlier model parameter y_est_out (the mean) look reasonably converged

  • The traces for the outlier model param sigma_y_out (the additional pooled variance) occasionally go a bit wild

  • It’s interesting that frac_outliers is so dispersed: it’s quite a flat distribution, which suggests that there are a few datapoints whose inlier/outlier status is subjective

  • Indeed, as Thomas noted in his v2.0 notebook, because we’re explicitly modeling the latent label (inlier/outlier) as a binary choice, the sampler can struggle - rewriting this model as a marginalized mixture model would be better; a minimal sketch of that idea follows below.
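
For reference, a minimal sketch (not run here; assuming the same coords and standardized data as above, with illustrative variable names) of one way such a marginalized variant could be written using pm.NormalMixture, which integrates out the latent inlier/outlier label analytically:

import pytensor.tensor as pt

with pm.Model(coords=coords) as mdl_hogg_marginal:
    x_data = pm.ConstantData("x_data", dfhoggs["x"], dims="datapoint_id")
    sigma_y_data = pm.ConstantData("sigma_y_data", dfhoggs["sigma_y"], dims="datapoint_id")

    beta = pm.Normal("beta", mu=0, sigma=10, dims="coefs")
    y_est_in = beta[0] + beta[1] * x_data

    y_est_out = pm.Normal("y_est_out", mu=0, sigma=10)
    sigma_y_out = pm.HalfNormal("sigma_y_out", sigma=10)
    frac_outliers = pm.Uniform("frac_outliers", lower=0.0, upper=0.5)

    # stack the inlier / outlier components along a trailing mixture axis
    mu = pt.stack([y_est_in, y_est_out * pt.ones_like(x_data)], axis=1)
    sigma = pt.stack([sigma_y_data, sigma_y_data + sigma_y_out], axis=1)
    w = pt.stack([1 - frac_outliers, frac_outliers])

    # the mixture marginalizes over the inlier/outlier assignment
    pm.NormalMixture("y", w=w, mu=mu, sigma=sigma, observed=dfhoggs["y"], dims="datapoint_id")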

Simple trace summary, including r_hat

az.summary(trc_hogg, var_names=rvs)
                  mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
beta[intercept]  0.016  0.031  -0.044    0.069      0.001    0.001     795.0     747.0   1.00
beta[slope]      1.239  0.068   1.124    1.365      0.003    0.002     764.0     845.0   1.01
y_est_out        0.073  0.540  -0.977    1.108      0.032    0.024     433.0     295.0   1.01
sigma_y_out      0.724  0.992   0.073    2.095      0.051    0.036     429.0     423.0   1.01
frac_outliers    0.268  0.109   0.084    0.472      0.005    0.004     554.0     656.0   1.00

Plot posterior joint distribution

(This is a particularly useful diagnostic in this case where we see a lot of divergences in the traces: maybe the model specification leads to weird behaviours)

marginal_kwargs["color"] = "C2"
marginal_kwargs["rug"] = True
ax = az.plot_pair(
    data=trc_hogg,
    var_names="beta",
    kind=["kde", "scatter"],
    divergences=True,
    marginals=True,
    marginal_kwargs=marginal_kwargs,
    scatter_kwargs={"color": "C2"},
)
ax[0, 0].get_figure().suptitle("Posterior joint distribution (mdl_hogg)");

5.2.3 View the shift in posterior joint distributions from OLS to StudentT to Hogg#

kde_kwargs = {"contour_kwargs": {"colors": "C0", "zorder": 4}, "contourf_kwargs": {"alpha": 0}}
marginal_kwargs["rug"] = False
marginal_kwargs["color"] = "C0"
ax = az.plot_pair(
    trc_ols,
    var_names="beta",
    kind="kde",
    divergences=True,
    marginals=True,
    marginal_kwargs={"color": "C0"},
    kde_kwargs=kde_kwargs,
    figsize=(8, 8),
)

marginal_kwargs["color"] = "C1"
kde_kwargs["contour_kwargs"]["colors"] = "C1"
az.plot_pair(
    trc_studentt,
    var_names="beta",
    kind="kde",
    divergences=True,
    marginals=True,
    marginal_kwargs=marginal_kwargs,
    kde_kwargs=kde_kwargs,
    ax=ax,
)

marginal_kwargs["color"] = "C2"
kde_kwargs["contour_kwargs"]["colors"] = "C2"
az.plot_pair(
    data=trc_hogg,
    var_names="beta",
    kind="kde",
    divergences=True,
    marginals=True,
    marginal_kwargs=marginal_kwargs,
    kde_kwargs=kde_kwargs,
    ax=ax,
    show=True,
)
ax[0, 0].get_figure().suptitle(
    "Posterior joint distributions" + "\nOLS, StudentT, and Hogg (inliers)"
);

Observe:

  • The hogg_inlier and studentt models converge to similar ranges for beta[intercept] and beta[slope], indicating that the (unshown) hogg_outlier model might perform a similar job to the fat tails of the studentt model: allowing greater log probability away from the main linear distribution in the datapoints

  • As expected, (since it’s a Normal) the hogg_inlier posterior has thinner tails and more probability mass concentrated about the central values

  • The hogg_inlier model also appears to have moved farther away from both the ols and studentt models on beta[intercept], suggesting that the outliers really distort that particular dimension

5.3 Declare Outliers#

5.3.1 View ranges for inliers / outlier predictions#

At each step of the trace, each datapoint may be either an inlier or an outlier. We hope that each datapoint spends an unequal amount of time in one state or the other, so let’s take a look at the simple count of states for each of the 20 datapoints.

dfm_outlier_results = trc_hogg.posterior.is_outlier.to_dataframe().reset_index()

with plt.rc_context({"figure.constrained_layout.use": False}):
    gd = sns.catplot(
        y="datapoint_id",
        x="is_outlier",
        data=dfm_outlier_results,
        kind="point",
        join=False,
        height=6,
        aspect=2,
    )
_ = gd.fig.axes[0].set(xlim=(-0.05, 1.05), xticks=np.arange(0, 1.1, 0.1))
_ = gd.fig.axes[0].axvline(x=0, color="b", linestyle=":")
_ = gd.fig.axes[0].axvline(x=1, color="r", linestyle=":")
_ = gd.fig.axes[0].yaxis.grid(True, linestyle="-", which="major", color="w", alpha=0.4)
_ = gd.fig.suptitle(
    ("For each datapoint, distribution of outlier classification " + "over the trace"),
    y=1.04,
    fontsize=16,
)

Observe:

  • The plot above shows, for each datapoint, the proportion of posterior samples in which it is classified as an outlier (the same quantity is computed numerically below).

  • 3 points [p2, p3, p4] spend >=95% of their time as outliers

  • Note the mean posterior value of frac_outliers ~ 0.27, corresponding to approx 5 or 6 of the 20 datapoints: we might investigate datapoints [p1, p12, p16] to see if they lean towards being outliers
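
The same information can also be read off numerically:

# proportion of posterior samples in which each datapoint is flagged as an outlier
trc_hogg.posterior["is_outlier"].mean(dim=("chain", "draw")).to_series().sort_values(ascending=False)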

The 95% cutoff we choose is subjective and arbitrary, but I prefer it for now, so let’s declare these 3 to be outliers and see how it looks compared to Jake Vanderplas’ outliers, which were declared in a slightly different way as points with means above 0.68.

5.3.2 Declare outliers#

Note:

  • I will declare outliers to be datapoints whose is_outlier value is 1 at the 5th-percentile cutoff, i.e. datapoints that are classified as outliers in at least 95% of posterior samples.

  • Try for yourself altering cutoff to larger values, which leads to an objective ranking of outlier-hood.

cutoff = 0.05
dfhoggs["classed_as_outlier"] = (
    trc_hogg.posterior["is_outlier"].quantile(cutoff, dim=("chain", "draw")) == 1
)
dfhoggs["classed_as_outlier"].value_counts()
False    17
True      3
Name: classed_as_outlier, dtype: int64

Also add flag for points to be investigated. Will use this to annotate final plot

dfhoggs["annotate_for_investigation"] = (
    trc_hogg.posterior["is_outlier"].quantile(0.75, dim=("chain", "draw")) == 1
)
dfhoggs["annotate_for_investigation"].value_counts()
False    15
True      5
Name: annotate_for_investigation, dtype: int64

5.4 Posterior Prediction Plots for OLS vs StudentT vs Hogg “Signal vs Noise”#

import xarray as xr

x = xr.DataArray(np.linspace(-3, 3, 10), dims="plot_dim")

# evaluate outlier posterior distribution for plotting
trc_hogg.posterior["outlier_mean"] = trc_hogg.posterior["y_est_out"].broadcast_like(x)

# evaluate model (inlier) posterior distributions for all 3 models
lm = lambda beta, x: beta.sel(coefs="intercept") + beta.sel(coefs="slope") * x

trc_ols.posterior["y_mean"] = lm(trc_ols.posterior["beta"], x)
trc_studentt.posterior["y_mean"] = lm(trc_studentt.posterior["beta"], x)
trc_hogg.posterior["y_mean"] = lm(trc_hogg.posterior["beta"], x)
with plt.rc_context({"figure.constrained_layout.use": False}):
    gd = sns.FacetGrid(
        dfhoggs,
        height=7,
        hue="classed_as_outlier",
        hue_order=[True, False],
        palette="Set1",
        legend_out=False,
    )

# plot hogg outlier posterior distribution
outlier_mean = az.extract(trc_hogg, var_names="outlier_mean", num_samples=400)
gd.ax.plot(x, outlier_mean, color="C3", linewidth=0.5, alpha=0.2, zorder=1)

# plot the 3 model (inlier) posterior distributions
y_mean = az.extract(trc_ols, var_names="y_mean", num_samples=200)
gd.ax.plot(x, y_mean, color="C0", linewidth=0.5, alpha=0.2, zorder=2)

y_mean = az.extract(trc_studentt, var_names="y_mean", num_samples=200)
gd.ax.plot(x, y_mean, color="C1", linewidth=0.5, alpha=0.2, zorder=3)

y_mean = az.extract(trc_hogg, var_names="y_mean", num_samples=200)
gd.ax.plot(x, y_mean, color="C2", linewidth=0.5, alpha=0.2, zorder=4)

# add legend for regression lines plotted above
line_legend = plt.legend(
    [
        Line2D([0], [0], color="C3"),
        Line2D([0], [0], color="C2"),
        Line2D([0], [0], color="C1"),
        Line2D([0], [0], color="C0"),
    ],
    ["Hogg Inlier", "Hogg Outlier", "Student-T", "OLS"],
    loc="lower right",
    title="Posterior Predictive",
)
gd.ax.add_artist(line_legend)

# plot points
_ = gd.map(
    plt.errorbar,
    "x",
    "y",
    "sigma_y",
    "sigma_x",
    marker="o",
    ls="",
    markeredgecolor="w",
    markersize=10,
    zorder=5,
).add_legend()
gd.ax.legend(loc="upper left", title="Outlier Classification")

# annotate the potential outliers
for idx, r in dfhoggs.loc[dfhoggs["annotate_for_investigation"]].iterrows():
    _ = gd.ax.annotate(
        text=idx,
        xy=(r["x"], r["y"]),
        xycoords="data",
        xytext=(7, 7),
        textcoords="offset points",
        color="k",
        zorder=4,
    )

## create xlims ylims for plotting
x_ptp = np.ptp(dfhoggs["x"].values) / 3.3
y_ptp = np.ptp(dfhoggs["y"].values) / 3.3
xlims = (dfhoggs["x"].min() - x_ptp, dfhoggs["x"].max() + x_ptp)
ylims = (dfhoggs["y"].min() - y_ptp, dfhoggs["y"].max() + y_ptp)
gd.ax.set(ylim=ylims, xlim=xlims)
gd.fig.suptitle(
    (
        "Standardized datapoints in Hogg 2010 dataset, showing "
        "posterior predictive fit for 3 models:\nOLS, StudentT and Hogg "
        '"Signal vs Noise" (inlier vs outlier, custom likelihood)'
    ),
    y=1.04,
    fontsize=14,
);

Observe:

The posterior predictive fit for:

  • the OLS model is shown in blue (C0) and, as expected, it doesn’t appear to fit the majority of our datapoints very well, skewed by outliers

  • the Student-T model is shown in orange (C1) and does appear to fit the ‘main axis’ of datapoints quite well, ignoring outliers

  • the Hogg Signal vs Noise model is shown in two parts:

    • green (C2) for inliers fits the ‘main axis’ of datapoints well, ignoring outliers

    • red (C3) for outliers has a very large variance and has assigned the ‘outlier’ points more log likelihood than the green inlier model

We see that the Hogg Signal vs Noise model also yields specific estimates of which datapoints are outliers:

  • 17 ‘inlier’ datapoints, in Blue and

  • 3 ‘outlier’ datapoints shown in Red.

  • From a simple visual inspection, the classification seems fair, and agrees with Jake Vanderplas’ findings.

  • I’ve annotated these red points, and the most outlier-like of the inliers, to aid visual investigation

Overall:

  • the Hogg Signal vs Noise model behaves as promised, yielding a robust regression estimate and explicit labelling of inliers / outliers, but

  • the Hogg Signal vs Noise model is quite complex, and whilst the regression seems robust, the traceplot shows many divergences at default sampler settings, and the model is potentially unstable

  • if you simply want a robust regression without inlier / outlier labelling, the Student-T model may be a good compromise, offering a simple model, quick sampling, and a very similar estimate.

References#

[1]

David W. Hogg, Jo Bovy, and Dustin Lang. Data analysis recipes: fitting a model to data. 2010. arXiv:1008.4686.

[2]

Željko Ivezić, Andrew J. Connolly, Jacob T. VanderPlas, and Alexander Gray. Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data. Princeton University Press, 2014. ISBN 9781400848911. doi:10.1515/9781400848911.

[3]

J.T. Vanderplas, A.J. Connolly, Ž. Ivezić, and A. Gray. Introduction to astroml: machine learning for astrophysics. In Conference on Intelligent Data Understanding (CIDU), 47–54. oct. 2012. doi:10.1109/CIDU.2012.6382200.

[4]

Andrew Gelman. Scaling regression inputs by dividing by two standard deviations. Statistics in medicine, 27(15):2865–2873, 2008. doi:10.1002/sim.3107.

Authors#

  • Authored and adapted for this collection by Jon Sedar (jonsedar) in December 2015. It was originally posted in jonsedar/pymc3_examples

  • Updated by Thomas Wiecki (twiecki) in July 2018

    • Restate outlier model using pm.Normal.dist().logp() and pm.Potential()

  • Updated by Jon Sedar in November 2019

    • Restate nu in StudentT model to be more efficient, drop explicit use of theano shared vars, generally improve plotting / explanations / layout

  • Updated by Jon Sedar in May 2020

    • Tidy up language, formatting, plots and warnings, and rerun with pymc=3.8, arviz=0.7

  • Updated by Raul Maldonado (CloudChaoszero) in April 2021

    • Tidy up language and formatting, convert MultiTrace objects to arviz.InferenceData objects, running on pymc=3.11, arviz=0.11.0

  • Updated by Raul Maldonado in May 2021

    • Update visualizations from explicit Matplotlib calls to ArviZ visualizations, running on pymc=3.11, arviz=0.11.0

  • Updated by Oriol Abril in November 2021

  • Updated to PyMC v5 and to use az.extract by Benjamin Vincent in February 2023 (pymc-examples#522)

Watermark#

%load_ext watermark
%watermark -n -u -v -iv -w -p theano,xarray
Last updated: Sun Feb 05 2023

Python implementation: CPython
Python version       : 3.10.9
IPython version      : 8.9.0

theano: not installed
xarray: 2023.1.0

numpy     : 1.24.1
arviz     : 0.14.0
pandas    : 1.5.3
pymc      : 5.0.1
seaborn   : 0.12.2
scipy     : 1.10.0
matplotlib: 3.6.3
xarray    : 2023.1.0

Watermark: 2.3.1

License notice#

All the notebooks in this example gallery are provided under the MIT License which allows modification, and redistribution for any use provided the copyright and license notices are preserved.

Citing PyMC examples#

To cite this notebook, use the DOI provided by Zenodo for the pymc-examples repository.

Important

Many notebooks are adapted from other sources: blogs, books… In such cases you should cite the original source as well.

Also remember to cite the relevant libraries used by your code.

Here is a citation template in BibTeX:

@incollection{citekey,
  author    = "<notebook authors, see above>",
  title     = "<notebook title>",
  editor    = "PyMC Team",
  booktitle = "PyMC examples",
  doi       = "10.5281/zenodo.5654871"
}

which once rendered could look like: