Posts in beginner
Simpson’s paradox
- 04 September 2024
Simpson’s Paradox describes a situation where there might be a negative relationship between two variables within a group, but when data from multiple groups are combined, that relationship may disappear or even reverse sign. The gif below (from the Simpson’s Paradox Wikipedia page) demonstrates this very nicely.
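As a rough illustration of the reversal described above, the following sketch builds a small, noise-free data set in which every group has a negative slope while the pooled data has a positive one. The data and all variable names are hypothetical and are not taken from the notebook itself.

```python
import numpy as np

# Hypothetical data: 5 groups, each with a within-group slope of -1.
t = np.linspace(0.0, 1.0, 20)
groups = range(5)
xs = [2.0 * g + t for g in groups]  # each successive group sits further right
ys = [3.0 * g - t for g in groups]  # and higher, but slopes downward internally

# np.polyfit returns coefficients from highest degree down, so [0] is the slope.
within_slopes = [np.polyfit(x, y, deg=1)[0] for x, y in zip(xs, ys)]
pooled_slope = np.polyfit(np.concatenate(xs), np.concatenate(ys), deg=1)[0]

print("within-group slopes:", np.round(within_slopes, 2))
print("pooled slope:", round(float(pooled_slope), 2))
```

The within-group slopes are all negative, yet the single line fitted to the combined data slopes upward, because the group offsets dominate the pooled trend.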
The prevalence of malaria in the Gambia
- 24 August 2024
Categorical regression
- 04 May 2024
In this example, we will model outcomes with more than two categories.
Out-Of-Sample Predictions
- 04 December 2023
We want to fit a logistic regression model where there is a multiplicative interaction between two numerical features.
GLM: Negative Binomial Regression
- 04 September 2023
This notebook uses libraries that are not PyMC dependencies and therefore need to be installed specifically to run this notebook. Open the dropdown below for extra guidance.
Interventional distributions and graph mutation with the do-operator
- 04 July 2023
PyMC is a pivotal component of the open source Bayesian statistics ecosystem. It helps solve real problems across a wide range of industries and academic research areas every day. And it has gained this level of utility by being accessible, powerful, and practically useful at solving Bayesian statistical inference problems.
Regression Models with Ordered Categorical Outcomes
- 04 April 2023
Like many areas of statistics, the language of survey data comes with an overloaded vocabulary. When discussing survey design you will often hear about the contrast between design-based and model-based approaches to (i) sampling strategies and (ii) statistical inference on the associated data. We won’t wade into the details of different sampling strategies such as simple random sampling, cluster random sampling or stratified random sampling using population weighting schemes. The literature on each of these is vast, but in this notebook we’ll talk about when and why it’s useful to apply model-driven statistical inference to Likert-scaled survey response data and other kinds of ordered categorical data.
Multivariate Gaussian Random Walk
- 02 February 2023
This notebook shows how to fit a correlated time series using multivariate Gaussian random walks (GRWs). In particular, we perform a Bayesian regression of the time series data against a model dependent on GRWs.
GLM: Robust Linear Regression
- 10 January 2023
Bayes Factors and Marginal Likelihood
- 10 January 2023
The “Bayesian way” to compare models is to compute the marginal likelihood of each model \(p(y \mid M_k)\), i.e. the probability of the observed data \(y\) given the model \(M_k\). This quantity, the marginal likelihood, is just the normalizing constant of Bayes’ theorem. We can see this if we write Bayes’ theorem and make explicit the fact that all inferences are model-dependent.
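For reference, one standard way to write this is sketched below. The symbol \(\theta\) for the model parameters is an assumption introduced here, since the excerpt does not name them:

\[
p(\theta \mid y, M_k) = \frac{p(y \mid \theta, M_k)\, p(\theta \mid M_k)}{p(y \mid M_k)},
\qquad
p(y \mid M_k) = \int p(y \mid \theta, M_k)\, p(\theta \mid M_k)\, d\theta .
\]

The denominator does not depend on \(\theta\), which is why it acts as the normalizing constant of the posterior.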
Modeling Heteroscedasticity with BART
- 04 January 2023
In this notebook we show how to use BART to model heteroscedasticity as described in Section 4.1 of pymc-bart’s paper [Quiroga et al., 2022]. We use the marketing data set provided by the R package datarium [Kassambara, 2019]. The idea is to model a marketing channel’s contribution to sales as a function of budget.
Generalized Extreme Value Distribution
- 27 September 2022
The Generalized Extreme Value (GEV) distribution is a meta-distribution containing the Weibull, Gumbel, and Frechet families of extreme value distributions. It is used for modelling the distribution of extremes (maxima or minima) of stationary processes, such as the annual maximum wind speed, annual maximum truck weight on a bridge, and so on, without needing an a priori decision on the tail behaviour.
Bayesian regression with truncated or censored data
- 04 September 2022
The notebook provides an example of how to conduct linear regression when your outcome variable is either censored or truncated.
How to debug a model
- 02 August 2022
There are various levels on which to debug a model. One of the simplest is to just print out the values that different variables are taking on.
Conditional Autoregressive (CAR) Models for Spatial Data
- 29 July 2022
This notebook uses libraries that are not PyMC dependencies and therefore need to be installed specifically to run this notebook. Open the dropdown below for extra guidance.
Stochastic Volatility model
- 17 June 2022
Asset prices have time-varying volatility (variance of day-over-day returns). In some periods, returns are highly variable, while in others very stable. Stochastic volatility models model this with a latent volatility variable, modeled as a stochastic process. The following model is similar to the one described in the No-U-Turn Sampler paper, [Hoffman and Gelman, 2014].
Splines
- 04 June 2022
Often, the model we want to fit is not a perfect line between some \(x\) and \(y\). Instead, the parameters of the model are expected to vary over \(x\). There are multiple ways to handle this situation, one of which is to fit a spline. A spline fit is effectively a sum of multiple individual curves (piecewise polynomials), each fit to a different section of \(x\), that are tied together at their boundaries, often called knots.
Sampler Statistics
- 31 May 2022
When checking for convergence or when debugging a badly behaving sampler, it is often helpful to take a closer look at what the sampler is doing. For this purpose some samplers export statistics for each generated sample.
General API quickstart
- 31 May 2022
Models in PyMC are centered around the Model class. It has references to all random variables (RVs) and computes the model logp and its gradients. Usually, you would instantiate it as part of a with context:
Approximate Bayesian Computation
- 31 May 2022
Approximate Bayesian Computation methods (also called likelihood-free inference methods) are a group of techniques developed for inferring posterior distributions in cases where the likelihood function is intractable or costly to evaluate. This does not mean that the likelihood function is not part of the analysis; it just means that we are approximating the likelihood, hence the name of the ABC methods.
Regression discontinuity design analysis
- 04 April 2022
Quasi-experiments involve experimental interventions and quantitative measures. However, quasi-experiments do not involve random assignment of units (e.g. cells, people, companies, schools, states) to test or control groups. This inability to conduct random assignment poses problems when making causal claims, as it makes it harder to argue that any difference between a control and test group is because of an intervention and not because of a confounding factor.
Gaussian Mixture Model
- 04 April 2022
A mixture model allows us to make inferences about the component contributors to a distribution of data. More specifically, a Gaussian Mixture Model allows us to make inferences about the means and standard deviations of a specified number of underlying component Gaussian distributions.
Bayesian moderation analysis
- 04 March 2022
This notebook covers Bayesian moderation analysis. This is appropriate when we believe that one predictor variable (the moderator) may influence the linear relationship between another predictor variable and an outcome. Here we look at an example of the relationship between hours of training and muscle mass, where age (the moderating variable) may affect this relationship.
Lasso regression with block updating
- 10 February 2022
Sometimes, it is very useful to update a set of parameters together. For example, variables that are highly correlated are often good to update together. In PyMC, block updating is simple. This will be demonstrated using the step parameter of pymc.sample.
Binomial regression
- 04 February 2022
This notebook covers the logic behind Binomial regression, a specific instance of Generalized Linear Modelling. The example is kept very simple, with a single predictor variable.
Bayesian mediation analysis
- 04 February 2022
This notebook covers Bayesian mediation analysis. This is useful when we want to explore possible mediating pathways between a predictor and an outcome variable.
Bayesian Estimation Supersedes the T-Test
- 07 January 2022
Using Data Containers
- 16 December 2021
After building the statistical model of your dreams, you’re going to need to feed it some data. Data is typically introduced to a PyMC model in one of two ways. Some data is used as an exogenous input, called X in linear regression models, where mu = X @ beta. Other data are “observed” examples of the endogenous outputs of your model, called y in regression models, and are used as input to the likelihood function implied by your model. These data, either exogenous or endogenous, can be included in your model as a wide variety of datatypes, including numpy ndarrays, pandas Series and DataFrame, and even pytensor TensorVariables.
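To make the mu = X @ beta expression concrete, here is a minimal NumPy sketch of the linear predictor on its own, outside of any PyMC model. The design matrix and coefficient values are made up for illustration.

```python
import numpy as np

# Hypothetical exogenous input: an intercept column plus one predictor.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])

# Hypothetical coefficients: intercept 0.5, slope 2.0.
beta = np.array([0.5, 2.0])

# Linear predictor: one expected value per row of X.
mu = X @ beta
print(mu)
```

Each entry of mu is the intercept plus the slope times that row's predictor value. In a regression model the observed y would then be compared against mu through the likelihood.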
Sequential Monte Carlo
- 19 October 2021
Sampling from distributions with multiple peaks using standard MCMC methods can be difficult, if not impossible, as the Markov chain often gets stuck in one of the modes. A Sequential Monte Carlo (SMC) sampler is a way to ameliorate this problem.
Introduction to Bayesian A/B Testing
- 23 May 2021
This notebook demonstrates how to implement a Bayesian analysis of an A/B test. We implement the models discussed in VWO’s Bayesian A/B Testing Whitepaper [Stucchio, 2015], and discuss the effect of different prior choices for these models. This notebook does not discuss other related topics like how to choose a prior, early stopping, and power analysis.