{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "(GLM-robust-with-outlier-detection)=\n", "# GLM: Robust Regression using Custom Likelihood for Outlier Classification\n", "\n", ":::{post} 17 Nov, 2021\n", ":tags: outliers, regression, robust\n", ":category: intermediate\n", ":author: Jon Sedar, Thomas Wiecki, Raul Maldonado, Oriol Abril\n", ":::\n", "\n", "Using PyMC for Robust Regression with Outlier Detection using the Hogg 2010 Signal vs Noise method. \n", "\n", "**Modelling concept:**\n", "+ This model uses a custom likelihood function as a mixture of two likelihoods, one for the main data-generating function (a linear model that we care about), and one for outliers.\n", "+ The model does not marginalize and thus gives us a classification of outlier-hood for each datapoint\n", "+ The dataset is tiny and hardcoded into this Notebook. It contains errors in both the x and y, but we will deal here with only errors in y.\n", "\n", "**Complementary approaches:**\n", "+ This is a complementary approach to the Student-T robust regression as illustrated in the example {doc}`generalized_linear_models/GLM-robust`, and that approach is also compared\n", "+ See also a [gist by Dan FM](https://gist.github.com/dfm/5250dd2f17daf60cbe582ceeeb2fd12f) that he published after a quick twitter conversation - his \"Hogg improved\" model uses this same model structure and cleverly marginalizes over the outlier class but also observes it during sampling using a `pm.Deterministic` <- this is really nice\n", "+ The likelihood evaluation is essentially a copy of eqn 17 in \"Data analysis recipes: Fitting a model to data\" - {cite:t}`hogg2010data`\n", "+ The model is adapted specifically from Jake Vanderplas' and Brigitta Sipocz' [implementation](http://www.astroml.org/book_figures/chapter8/fig_outlier_rejection.html) in the AstroML book {cite:p}`ivezić2014astroMLtext,vanderplas2012astroML`" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Installation Notes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See the original project [README](https://github.com/jonsedar/pymc3_examples/blob/master/README.md) for full details on dependencies and about the environment where the notebook was written in. A summary on the environment where this notebook was executed is available in the [\"Watermark\"](#watermark) section.\n", "\n", ":::{include} ../extra_installs.md\n", ":::" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "%config InlineBackend.figure_format = 'retina'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import arviz as az\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pymc as pm\n", "import seaborn as sns\n", "\n", "from matplotlib.lines import Line2D\n", "from scipy import stats" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "az.style.use(\"arviz-darkgrid\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use the Hogg 2010 data available at https://github.com/astroML/astroML/blob/master/astroML/datasets/hogg2010test.py\n", "\n", "It's a very small dataset so for convenience, it's hardcoded below" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | x | \n", "y | \n", "sigma_y | \n", "sigma_x | \n", "rho_xy | \n", "
---|---|---|---|---|---|
id | \n", "\n", " | \n", " | \n", " | \n", " | \n", " |
p1 | \n", "201.0 | \n", "592.0 | \n", "61.0 | \n", "9.0 | \n", "-0.84 | \n", "
p2 | \n", "244.0 | \n", "401.0 | \n", "25.0 | \n", "4.0 | \n", "0.31 | \n", "
p3 | \n", "47.0 | \n", "583.0 | \n", "38.0 | \n", "11.0 | \n", "0.64 | \n", "
p4 | \n", "287.0 | \n", "402.0 | \n", "15.0 | \n", "7.0 | \n", "-0.27 | \n", "
p5 | \n", "203.0 | \n", "495.0 | \n", "21.0 | \n", "5.0 | \n", "-0.33 | \n", "