{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(data_container)=\n", "# Using Data Containers\n", "\n", ":::{post} Dec 16, 2021\n", ":tags: posterior predictive, shared data \n", ":category: beginner\n", ":author: Juan Martin Loyola, Kavya Jaiswal, Oriol Abril, Jesse Grabowski\n", ":::" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running on PyMC v5.26.1\n" ] } ], "source": [ "import arviz.preview as az\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pymc as pm\n", "import xarray as xr\n", "\n", "from numpy.random import default_rng\n", "\n", "plt.rcParams[\"figure.constrained_layout.use\"] = True\n", "\n", "print(f\"Running on PyMC v{pm.__version__}\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%config InlineBackend.figure_format = 'retina'\n", "RANDOM_SEED = sum(map(ord, \"Data Containers in PyMC\"))\n", "rng = default_rng(RANDOM_SEED)\n", "az.style.use(\"arviz-variat\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "After building the statistical model of your dreams, you're going to need to feed it some data. Data is typically introduced to a PyMC model in one of two ways. Some data is used as an exogenous input, called `X` in linear regression models, where `mu = X @ beta`. Other data are \"observed\" examples of the endogenous outputs of your model, called `y` in regression models, and are used as input to the likelihood function implied by your model. These data, either exogenous or endogenous, can be included in your model as a wide variety of datatypes, including NumPy `ndarrays`, pandas `Series` and `DataFrame`, and even PyTensor `TensorVariables`. 
\n", "\n", "Although you can pass these \"raw\" datatypes to your PyMC model, the best way to introduce data into your model is to use {class}`pymc.Data` containers. These containers make it extremely easy to work with data in a PyMC model. They offer a range of benefits, including:\n", "\n", "1. Visualization of data as a component of your probabilistic graph\n", "2. Access to labeled dimensions for readability and accessibility\n", "3. Support for swapping out data for out-of-sample prediction, interpolation/extrapolation, forecasting, etc.\n", "4. Storage of all data in your {class}`arviz.InferenceData`, which is useful for plotting and reproducible workflows.\n", "\n", "This notebook will illustrate each of these benefits in turn, and show you the best way to integrate data into your PyMC modeling workflow. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{important}\n", "In past versions of PyMC, there were two types of data containers: {func}`pymc.MutableData` and {func}`pymc.ConstantData`. These have been deprecated, as all data containers are now mutable.\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Data Containers for readability and reproducibility\n", "\n", "The following example shows some of the differences between using a data container and \"raw\" data. This first model shows how raw data, in this case NumPy arrays, can be provided directly to a PyMC model." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Initializing NUTS using jitter+adapt_diag...\n", "Multiprocess sampling (4 chains in 4 jobs)\n", "NUTS: [beta, sigma]\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "1ac18752275d465cb733225574b45a28", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1 seconds.\n" ] } ], "source": [ "true_beta = 3\n", "true_std = 5\n", "n_obs = 100\n", "x = rng.normal(size=n_obs)\n", "y = rng.normal(loc=true_beta * x, scale=true_std, size=n_obs)\n", "\n", "with pm.Model() as no_data_model:\n", "    beta = pm.Normal(\"beta\")\n", "    mu = pm.Deterministic(\"mu\", beta * x)\n", "    sigma = pm.Exponential(\"sigma\", 1)\n", "    obs = pm.Normal(\"obs\", mu=mu, sigma=sigma, observed=y)\n", "    idata = pm.sample(random_seed=RANDOM_SEED)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the resulting computational graph, the `obs` node is shaded gray to indicate that it has observed data, in this case `y`. But the data itself is not shown on the graph, so there's no hint about what data has been observed. In addition, the `x` data doesn't appear in the graph anywhere, so it's not obvious that this model used exogenous data as an input." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [
"<xarray.Dataset> Size: 2kB\n",
"Dimensions: (obs_dim_0: 100)\n",
"Coordinates:\n",
" * obs_dim_0 (obs_dim_0) int64 800B 0 1 2 3 4 5 6 7 ... 93 94 95 96 97 98 99\n",
"Data variables:\n",
" obs (obs_dim_0) float64 800B -0.3966 -3.337 -7.844 ... -6.549 -0.8598\n",
"Attributes:\n",
" created_at: 2025-12-15T10:11:33.015498+00:00\n",
" arviz_version: 0.23.0.dev0\n",
" inference_library: pymc\n",
" inference_library_version: 5.26.1\n",
"<xarray.Dataset> Size: 3kB\n",
"Dimensions: (x_data_dim_0: 100, y_data_dim_0: 100)\n",
"Coordinates:\n",
" * x_data_dim_0 (x_data_dim_0) int64 800B 0 1 2 3 4 5 6 ... 94 95 96 97 98 99\n",
" * y_data_dim_0 (y_data_dim_0) int64 800B 0 1 2 3 4 5 6 ... 94 95 96 97 98 99\n",
"Data variables:\n",
" x_data (x_data_dim_0) float64 800B -1.383 -0.2725 ... -1.745 -0.5087\n",
" y_data (y_data_dim_0) float64 800B -0.3966 -3.337 ... -6.549 -0.8598\n",
"Attributes:\n",
" created_at: 2025-12-15T10:11:34.965262+00:00\n",
" arviz_version: 0.23.0.dev0\n",
" inference_library: pymc\n",
" inference_library_version: 5.26.1\n",
"| date | Berlin | San Marino | Paris |\n",
"|---|---|---|---|\n",
"| 2020-05-01 | 15.401536 | 18.817801 | 16.836690 |\n",
"| 2020-05-02 | 13.575241 | 17.441153 | 14.407089 |\n",
"| 2020-05-03 | 14.808934 | 19.890369 | 15.616649 |\n",
"| 2020-05-04 | 16.071487 | 18.407539 | 15.396678 |\n",
"| 2020-05-05 | 15.505263 | 17.621143 | 16.723544 |\n", "