Gaussian Process for CO2 at Mauna Loa#

This Gaussian Process (GP) example shows how to:

  • Design combinations of covariance functions

  • Use additive GPs whose individual components can be used for prediction

  • Perform maximum a posteriori (MAP) estimation

Since the late 1950s, the Mauna Loa observatory has been taking regular measurements of atmospheric CO\(_2\). It was there that Charles Keeling invented an accurate way to measure atmospheric CO\(_2\) concentration, and measurements have been recorded nearly continuously ever since. Check out the last hour's measurement result here.

Not much was known in the late 1950s about how fossil fuel burning influences the climate. The first couple of years of data collection showed that CO\(_2\) levels rose and fell following summer and winter, tracking the growth and decay of vegetation in the northern hemisphere. As the years passed, the steady upward trend came into focus. With over 70 years of collected data, the Keeling curve is one of the most important climate indicators.

For the history behind these measurements, their influence on climatology today, and other interesting reading, see:

  • http://scrippsco2.ucsd.edu/history_legacy/early_keeling_curve#

  • https://scripps.ucsd.edu/programs/keelingcurve/2016/05/23/why-has-a-drop-in-global-co2-emissions-not-caused-co2-levels-in-the-atmosphere-to-stabilize/#more-1412

Let’s load in the data, tidy it up, and have a look. The raw data set is located here. This notebook uses the Bokeh package for plots that benefit from interactivity.

Preparing the data#

Attention

This notebook uses libraries that are not PyMC dependencies and therefore need to be installed specifically to run this notebook. Open the dropdown below for extra guidance.

Extra dependencies install instructions

In order to run this notebook (either locally or on binder) you will need not only a working PyMC installation with all optional dependencies, but also some extra dependencies. For advice on installing PyMC itself, please refer to Installation.

You can install these dependencies with your preferred package manager; as an example, we provide the pip and conda commands below.

$ pip install bokeh

Note that if you want (or need) to install the packages from inside the notebook instead of the command line, you can do so by running a variation of the pip command:

import sys

!{sys.executable} -m pip install bokeh

You should not run !pip install directly, as it might install the package in a different environment, leaving it unavailable from the Jupyter notebook even though the install succeeded.

Alternatively, you can use conda:

$ conda install bokeh

When installing scientific Python packages with conda, we recommend using conda-forge.
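For example, the conda-forge channel can be selected explicitly:

$ conda install -c conda-forge bokeh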

import numpy as np
import pandas as pd
import pymc3 as pm

from bokeh.io import output_notebook
from bokeh.models import BoxAnnotation, Label, Legend, Span
from bokeh.palettes import brewer
from bokeh.plotting import figure, show

output_notebook()
%config InlineBackend.figure_format = 'retina'
RANDOM_SEED = 8927
rng = np.random.default_rng(RANDOM_SEED)
# get data
try:
    data_monthly = pd.read_csv("../data/monthly_in_situ_co2_mlo.csv", header=56)
except FileNotFoundError:
    data_monthly = pd.read_csv(pm.get_data("monthly_in_situ_co2_mlo.csv"), header=56)

# replace -99.99 with NaN
data_monthly.replace(to_replace=-99.99, value=np.nan, inplace=True)

# fix column names
cols = [
    "year",
    "month",
    "--",
    "--",
    "CO2",
    "seasonaly_adjusted",
    "fit",
    "seasonally_adjusted_fit",
    "CO2_filled",
    "seasonally_adjusted_filled",
]
data_monthly.columns = cols
cols.remove("--")
cols.remove("--")
data_monthly = data_monthly[cols]

# drop rows with nan
data_monthly.dropna(inplace=True)

# fix time index
data_monthly["day"] = 15
data_monthly.index = pd.to_datetime(data_monthly[["year", "month", "day"]])
cols.remove("year")
cols.remove("month")
data_monthly = data_monthly[cols]

data_monthly.head(5)
CO2 seasonaly_adjusted fit seasonally_adjusted_fit CO2_filled seasonally_adjusted_filled
1958-03-15 315.71 314.44 316.19 314.91 315.71 314.44
1958-04-15 317.45 315.16 317.30 314.99 317.45 315.16
1958-05-15 317.51 314.70 317.87 315.07 317.51 314.70
1958-07-15 315.86 315.20 315.86 315.22 315.86 315.20
1958-08-15 314.93 316.20 313.98 315.29 314.93 316.20
# function to convert datetimes to indexed numbers that are useful for later prediction
def dates_to_idx(timelist):
    reference_time = pd.to_datetime("1958-03-15")
    t = (timelist - reference_time) / pd.Timedelta(365, "D")
    return np.asarray(t)


t = dates_to_idx(data_monthly.index)

# normalize CO2 levels
y = data_monthly["CO2"].values
first_co2 = y[0]
std_co2 = np.std(y)
y_n = (y - first_co2) / std_co2

data_monthly = data_monthly.assign(t=t)
data_monthly = data_monthly.assign(y_n=y_n)
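
When predictions are later plotted back on the original ppm scale, this normalization has to be undone. As a minimal sketch (the helper name y_n_to_co2 is ours, for illustration), the inverse transform simply reverses the two steps above:

# invert the normalization: rescale by the standard deviation,
# then shift back by the first observed CO2 value
def y_n_to_co2(y_normed):
    return y_normed * std_co2 + first_co2

Applied to y_n, this recovers data_monthly["CO2"].values up to floating point error.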

This data might be familiar to you, since it was used as an example in the Gaussian Processes for Machine Learning book by Rasmussen and Williams [2006]. The version of the data set they use starts in the late 1950s but stops at the end of 2003. So that our PyMC3 example is somewhat comparable to theirs, we use the stretch of data from before 2004 as the "training" set, and the data from 2004 to 2022 to test our predictions.

# split into training and test set
sep_idx = data_monthly.index.searchsorted(pd.to_datetime("2003-12-15"))
data_early = data_monthly.iloc[: sep_idx + 1, :]
data_later = data_monthly.iloc[sep_idx:, :]
# plot training and test data
p = figure(
    x_axis_type="datetime",
    title="Monthly CO2 Readings from Mauna Loa",
    plot_width=550,
    plot_height=350,
)
p.yaxis.axis_label = "CO2 [ppm]"
p.xaxis.axis_label = "Date"
predict_region = BoxAnnotation(
    left=pd.to_datetime("2003-12-15"), fill_alpha=0.1, fill_color="firebrick"
)
p.add_layout(predict_region)
ppm400 = Span(location=400, dimension="width", line_color="red", line_dash="dashed", line_width=2)
p.add_layout(ppm400)

p.line(data_monthly.index, data_monthly["CO2"], line_width=2, line_color="black", alpha=0.5)
p.circle(data_monthly.index, data_monthly["CO2"], line_color="black", alpha=0.1, size=2)

train_label = Label(
    x=100,
    y=165,
    x_units="screen",
    y_units="screen",
    text="Training Set",
    render_mode="css",
    border_line_alpha=0.0,
    background_fill_alpha=0.0,
)
test_label = Label(
    x=585,
    y=80,
    x_units="screen",
    y_units="screen",
    text="Test Set",
    render_mode="css",
    border_line_alpha=0.0,
    background_fill_alpha=0.0,
)

p.add_layout(train_label)
p.add_layout(test_label)
show(p)

Bokeh plots are interactive, so panning and zooming can be done with the toolbar on the right-hand side. The seasonal rise and fall is plainly apparent, as is the upward trend. Here is a link to plots of this curve at different time scales, and in the context of historical ice core data.

The 400 ppm level is highlighted with a dashed line. In addition to fitting a descriptive model, our goal is to predict the first month in which the 400 ppm threshold is crossed, which was May 2013. In the data set above, the average CO\(_2\) reading for May 2013 was about 399.98, close enough for us to treat it as the correct target date.

Modeling the Keeling Curve using GPs#

As a starting point, we use the GP model described in Rasmussen and Williams [2006]. Instead of placing flat priors on the covariance function hyperparameters and then maximizing the marginal likelihood, as is done in the textbook, we place somewhat informative priors on the hyperparameters and use optimization to find the MAP point. We use gp.Marginal since Gaussian noise is assumed.
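
To illustrate this pattern before building the full model, here is a minimal sketch of marginal-likelihood GP regression with MAP estimation in PyMC3, using a single exponentiated quadratic kernel on the training data prepared above. The priors and the model name are illustrative placeholders, not the ones used below:

# a minimal sketch of the gp.Marginal + find_MAP pattern, not the full model
with pm.Model() as sketch_model:
    ℓ = pm.Gamma("ℓ", alpha=4, beta=2)  # lengthscale prior
    η = pm.HalfCauchy("η", beta=2)  # scale prior
    cov = η**2 * pm.gp.cov.ExpQuad(1, ls=ℓ)
    gp = pm.gp.Marginal(cov_func=cov)  # Gaussian noise is marginalized out analytically
    σ = pm.HalfNormal("σ", sigma=0.25)  # observation noise scale
    y_ = gp.marginal_likelihood(
        "y", X=data_early["t"].values[:, None], y=data_early["y_n"].values, noise=σ
    )
    mp = pm.find_MAP()  # optimize to the MAP point instead of sampling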

The R&W [Rasmussen and Williams, 2006] model is a sum of three GPs for the signal, and one GP for the noise.

  1. A long term smooth rising trend represented by an exponentiated quadratic kernel.

  2. A periodic term that decays away from exact periodicity. This is represented by the product of a Periodic and a Matern52 covariance function.

  3. Small and medium term irregularities with a rational quadratic kernel.

  4. The noise is modeled as the sum of a Matern32 and a white noise kernel.

The prior on CO\(_2\) as a function of time is,

\[ f(t) \sim \mathcal{GP}_{\text{slow}}(0,\, k_1(t, t')) + \mathcal{GP}_{\text{med}}(0,\, k_2(t, t')) + \mathcal{GP}_{\text{per}}(0,\, k_3(t, t')) + \mathcal{GP}_{\text{noise}}(0,\, k_4(t, t')) \]
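
To make this structure concrete, here is a minimal sketch of how such a sum of kernels is composed with pm.gp.cov. The numeric hyperparameter values are illustrative placeholders, not the priors or fitted values used in this notebook:

# placeholder hyperparameter values, for illustration only
k1 = 2.0**2 * pm.gp.cov.ExpQuad(1, ls=40.0)  # long term smooth rising trend
k2 = (
    1.0**2
    * pm.gp.cov.Periodic(1, period=1.0, ls=0.5)  # yearly periodicity ...
    * pm.gp.cov.Matern52(1, ls=100.0)  # ... that slowly decays away from exactness
)
k3 = 0.5**2 * pm.gp.cov.RatQuad(1, alpha=2.0, ls=1.0)  # small/medium term irregularities
k4 = 0.2**2 * pm.gp.cov.Matern32(1, ls=0.3) + pm.gp.cov.WhiteNoise(0.05)  # noise model
k_total = k1 + k2 + k3 + k4  # the covariance of the additive prior on f(t)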

Hyperparameter priors#

We use fairly uninformative priors for the scale hyperparameters of the covariance functions, and informative Gamma priors for the lengthscales. The PDFs used for the lengthscale priors are shown below:

x = np.linspace(0, 150, 5000)
priors = [
    ("ℓ_pdecay", pm.Gamma.dist(alpha=10, beta=0.075)),
    ("ℓ_psmooth", pm.Gamma.dist(alpha=4, beta=3)),
    ("period", pm.Normal.dist(mu=1.0, sigma=0.05)),
    ("ℓ_med", pm.Gamma.dist(alpha=2, beta=0.75)),
    ("α", pm.Gamma.dist(alpha=5, beta=2)),
    ("ℓ_trend", pm.Gamma.dist(alpha=4, beta=0.1)),
    ("ℓ_noise", pm.Gamma.dist(alpha=2, beta=4)),
]

colors = brewer["Paired"][7]

p = figure(
    title="Lengthscale and period priors",
    plot_width=550,
    plot_height=350,
    x_range=(-1, 8),
    y_range=(0, 2),
)
p.yaxis.axis_label = "Probability"
p.xaxis.axis_label = "Years"

for i, prior in enumerate(priors):
    p.line(
        x,
        np.exp(prior[1].logp(x).eval()),
        legend_label=prior[0],
        line_width=3,
        line_color=colors[i],
    )
show(p)
  • ℓ_pdecay: The periodic decay. The smaller this parameter is, the faster the periodicity goes away. I doubt that the seasonality of the CO\(_2\) will be going away any time soon (hopefully), and there's no evidence for that in the data. Most of the prior mass is from 60 to more than 140 years (see the numerical check after this list).

  • ℓ_psmooth: The smoothness of the periodic component. It controls how "sinusoidal" the periodicity is. The plot of the data shows that the seasonality is not an exact sine wave, but it's not terribly different from one. We use a Gamma whose mode is at one and whose variance is not too large, with most of the prior mass between around 0.5 and 2.

  • period: The period. We put a very strong prior on \(p\), centered at one. R&W fix \(p=1\), since the period is annual.

  • ℓ_med: The lengthscale for the short- to medium-term variations. This prior has most of its mass below 6 years.

  • α: The shape parameter of the rational quadratic kernel. This prior is centered at 3, since we expect somewhat more variation than a single exponentiated quadratic lengthscale could explain.

  • ℓ_trend: The lengthscale of the long term trend. It has a wide prior with mass on a decade scale; most of the mass is between 10 and 60 years.

  • ℓ_noise: The lengthscale of the noise covariance. This noise should be very rapid, on the scale of several months to at most a year or two.
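
These prior-mass statements are easy to check numerically. As a quick sanity check, here are the central 95% intervals of two of the lengthscale priors, computed with scipy (which parameterizes the Gamma by shape a and scale 1/beta):

import scipy.stats as st

# central 95% prior intervals, in years
print(st.gamma(a=10, scale=1 / 0.075).ppf([0.025, 0.975]))  # ℓ_pdecay: roughly 64 to 228
print(st.gamma(a=4, scale=1 / 0.1).ppf([0.025, 0.975]))  # ℓ_trend: roughly 11 to 88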

We know beforehand which GP components should have a larger magnitude, so we include this information in the scale parameters.

x = np.linspace(0, 4, 5000)
priors = [
    ("η_per", pm.HalfCauchy.dist(beta=2)),
    ("η_med", pm.HalfCauchy.dist(beta=1.0)),
    (
        "η_trend",
        pm.HalfCauchy.dist(beta=3),
    ),  # will use beta=2, but beta=3 is visible on plot
    ("σ", pm.HalfNormal.dist(sigma=0.25)),
    ("η_noise", pm.HalfNormal.dist(sigma=0.5)),
]

colors = brewer["Paired"][5]

p = figure(title="Scale priors", plot_width=550, plot_height=350)
p.yaxis.axis_label = "Probability"
p.xaxis.axis_label = "Years"

for i, prior in enumerate(priors):
    p.line(
        x,
        np.exp(prior[1].logp(x).eval()),
        legend_label=prior[0],
        line_width=3,
        line_color=colors[i],
    )
show(p)