Why can't the posterior be calculated directly?

Bayes' theorem requires dividing by the marginal likelihood, which is an integral of the prior times the likelihood over every possible parameter value. For models with more than a few parameters, or non-conjugate priors, that integral has no closed form and direct numerical integration is too slow.

What is the Metropolis-Hastings algorithm?

Metropolis-Hastings proposes a candidate next state from a proposal distribution, then accepts or rejects it based on the ratio of the unnormalized posterior at the candidate and current states. It is the original MCMC algorithm and the basis for most others.

What is Gibbs sampling?

Gibbs sampling updates each parameter in turn by drawing it from its full conditional distribution given the current values of all other parameters. Every proposal is accepted, which makes it efficient for conjugate or near-conjugate models.

What is Hamiltonian Monte Carlo (HMC)?

HMC uses the gradient of the log-posterior to simulate a physical system with position and momentum, allowing the sampler to propose distant moves along high-probability regions instead of taking small random-walk steps. This makes it far more efficient in high dimensions.

What is the No-U-Turn Sampler (NUTS)?

NUTS is an extension of HMC that automatically chooses how far each step should travel by building a tree of candidate positions and stopping when the trajectory starts to double back. It removes the need to manually tune the number of leapfrog steps and is the default sampler in Stan and PyMC.

How do you check if an MCMC chain has converged?

Run multiple chains from different starting points and check three things: trace plots should look like overlapping random noise with no trend, the Gelman-Rubin statistic (R-hat) should be close to 1, and the effective sample size should be large enough relative to the number of iterations.

What software is used to run MCMC models?

Stan and PyMC are the most widely used tools for HMC and NUTS-based sampling, while JAGS and the older WinBUGS use Gibbs and slice sampling for conjugate-style models. NumPyro and TensorFlow Probability provide GPU-accelerated NUTS implementations for Python.

What is effective sample size in MCMC?

Because MCMC samples are correlated with their neighbors, a chain of N samples carries less information than N independent draws. Effective sample size estimates how many independent samples would carry the same information, accounting for that autocorrelation.

Markov Chain Monte Carlo (MCMC): Complete Guide to Bayesian Sampling (2026)

Q: What is burn-in in MCMC?

Burn-in (also called warmup) refers to the early iterations of a chain, taken before it has reached the high-probability region of the target distribution. These samples are biased by the arbitrary starting point and are discarded before computing summaries.

What Is Markov Chain Monte Carlo (MCMC)?

Definition — Markov Chain Monte Carlo

Markov Chain Monte Carlo (MCMC) is a class of algorithms for drawing samples from a probability distribution that can't be evaluated directly. Each new sample depends only on the one before it — a Markov chain — and the rule for generating it is built so that, after enough steps, the chain's samples follow the target distribution. In statistics, that target is almost always a Bayesian posterior.

θ⁽ᵗ⁺¹⁾ ~ T(θ | θ⁽ᵗ⁾), where T's stationary distribution is π(θ)

The name combines two ideas that each predate modern computing. A Monte Carlo method approximates a quantity using random sampling instead of exact calculation — the name comes from the Monte Carlo casino, a nod to its reliance on chance. A Markov chain is a sequence of random variables where the next value depends only on the current value, not on the full history that came before it. Background on both ideas is covered in the guides on basic probability and Bayes' theorem.

What makes MCMC useful is that it never needs to know the target distribution's exact height, only its shape. Given any function proportional to the posterior, an MCMC algorithm can wander through parameter space, spending more time in regions where that function is large and less time where it's small. After enough steps, a histogram of where the chain has visited approximates the posterior distribution — and from that histogram you can read off means, standard deviations, and credible intervals, the Bayesian counterpart of the confidence interval for a mean.

⚡ Quick Reference — MCMC Key Facts

Posterior: P(θ|D) ∝ P(D|θ)P(θ) — MCMC samples from the right-hand side without ever computing the normalizing constant
Markov chain: each sample depends only on the current state, not the full history
Stationary distribution: the distribution the chain converges to — the target posterior
Burn-in / warmup: early samples discarded before the chain reaches the high-probability region
Main algorithms: Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo, NUTS
Convergence checks: trace plots, the Gelman-Rubin statistic (R-hat), and effective sample size

The Problem MCMC Solves: Bayes' Theorem's Denominator

Bayes' theorem says the posterior distribution of a parameter θ, given data D, equals the likelihood times the prior, divided by a normalizing constant:

Bayes' Theorem

P(θ|D) = P(D|θ) P(θ) / ∫ P(D|θ) P(θ) dθ

θ = parameters being estimated D = observed data P(θ) = prior distribution P(D|θ) = likelihood P(θ|D) = posterior distribution

The numerator, P(D|θ)P(θ), is usually easy to compute for any specific value of θ — it's just the prior density times the likelihood of the data under that θ. The denominator is the problem. It's the integral of that same product over every possible value of θ, and it's what turns the unnormalized shape on top into a true probability distribution that sums to one.

For a single parameter with a conjugate prior — a Beta prior paired with a Binomial likelihood, for example — that integral has a known closed-form answer, and the posterior is itself a Beta distribution. Once a model has more than two or three parameters, or the prior and likelihood aren't conjugate, the integral typically has no algebraic solution. Direct numerical integration (evaluating the function on a grid and summing) becomes impractical too: a grid with just 100 points per dimension already needs 100 evaluations for one parameter, 10,000 for two, and 10^20 for ten — a version of the curse of dimensionality.

MCMC sidesteps the integral entirely. Because every MCMC algorithm only ever needs the ratio of the posterior at two points, the denominator cancels out:

Why the Normalizing Constant Cancels

P(θ*|D) / P(θ⁽ᵗ⁾|D) = [P(D|θ*)P(θ*)] / [P(D|θ⁽ᵗ⁾)P(θ⁽ᵗ⁾)]

The denominator ∫P(D|θ)P(θ)dθ appears in both the top and bottom and divides out

That single fact is what makes the rest of this guide possible. Every algorithm below works with the unnormalized posterior — the prior times the likelihood — and never needs to know how that quantity relates to a properly scaled probability distribution.

A Short History: From Los Alamos to Stan

MCMC didn't start as a statistics tool. In 1953, a team at Los Alamos National Laboratory — Nicholas Metropolis, Arianna and Marshall Rosenbluth, and Augusta and Edward Teller — published "Equation of State Calculations by Fast Computing Machines" in the Journal of Chemical Physics. The paper described what's now called the Metropolis algorithm, used to simulate how particles in a fluid settle into equilibrium on one of the earliest electronic computers, the MANIAC.

In 1970, statistician W.K. Hastings generalized the method in Biometrika, allowing the proposal step to be asymmetric — the version now known as Metropolis-Hastings. In 1984, Stuart and Donald Geman introduced Gibbs sampling for image restoration in a paper for IEEE Transactions on Pattern Analysis and Machine Intelligence, borrowing the name from physicist Josiah Willard Gibbs.

The turning point for statistics came in 1990, when Alan Gelfand and Adrian Smith published "Sampling-Based Approaches to Calculating Marginal Densities" in the Journal of the American Statistical Association. The paper showed that Gibbs sampling could solve Bayesian computation problems that had been considered intractable, and is widely credited with starting the modern era of applied Bayesian statistics. Through the 1990s, the BUGS and WinBUGS software packages made Gibbs sampling accessible to applied researchers without a programming background, and the open-source JAGS followed as a cross-platform alternative.

The most recent shift came with Hamiltonian Monte Carlo moving from physics into mainstream statistical software. The Stan probabilistic programming language, released starting in 2012, combined HMC with automatic differentiation so that gradients of the log-posterior could be computed for almost any model written in its syntax. In 2014, Matthew Hoffman and Andrew Gelman published the No-U-Turn Sampler (NUTS) in the Journal of Machine Learning Research, an adaptive form of HMC that removed the need to hand-tune trajectory lengths. NUTS is now the default sampler in both Stan and PyMC.

Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. & Teller, E. (1953). "Equation of State Calculations by Fast Computing Machines." The Journal of Chemical Physics, 21(6), 1087–1092. Hastings, W.K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications." Biometrika, 57(1), 97–109.

How MCMC Works: The 6-Step Workflow

📋

Featured Snippet — The MCMC Workflow

Step 1: Specify the prior P(θ). Step 2: Specify the likelihood P(D|θ). Step 3: Form the unnormalized posterior P(D|θ)P(θ). Step 4: Initialize the chain and propose a candidate move. Step 5: Accept or reject the proposal based on the posterior ratio. Step 6: Discard burn-in and summarize the remaining samples.

Specify the Prior

Choose a distribution P(θ) that reflects what is known about each parameter before looking at the current data. This can be informative (based on prior studies) or deliberately vague (a wide Normal or uniform range) when little is known. The choice affects the posterior most when the data is limited.

Specify the Likelihood

Choose a probability model P(D|θ) describing how the observed data would arise for a given value of θ — a Binomial likelihood for pass/fail data, a Normal likelihood for continuous measurements, a Poisson likelihood for counts, and so on.

Form the Unnormalized Posterior

Multiply the prior by the likelihood: P(θ|D) ∝ P(D|θ)P(θ). This product is the function every MCMC algorithm evaluates. It has the same shape as the posterior but isn't guaranteed to integrate to 1 — which, as shown above, doesn't matter.

Initialize the Chain and Propose a Move

Start the chain at some value θ⁽⁰⁾, often chosen randomly or from a quick optimization. At each step t, generate a candidate value θ* from a proposal distribution q(θ*|θ⁽ᵗ⁾) — for example, θ⁽ᵗ⁾ plus a small random step drawn from a normal distribution.

Accept or Reject the Proposal

Compute an acceptance probability that compares the unnormalized posterior at θ* and θ⁽ᵗ⁾ (adjusted for any asymmetry in the proposal). Accept θ* with that probability; otherwise the chain stays at θ⁽ᵗ⁾ for another step. Repeat steps 4–5 thousands or tens of thousands of times.

Discard Burn-in and Summarize

Drop the early iterations recorded before the chain settled into the high-probability region — this is the burn-in or warmup period. Use the remaining samples to compute the posterior mean, standard deviation, and credible intervals for each parameter, and to check convergence (see Section 7).

Anatomy of the Core MCMC Algorithms

The 6-step workflow above describes the family. The algorithms below differ mainly in how step 4 (the proposal) and step 5 (the acceptance rule) are built.

Metropolis-Hastings: The Original Random Walk

Algorithm 1 — Metropolis-Hastings

Acceptance Probability

α = min( 1, [P(θ*|D) q(θ⁽ᵗ⁾|θ*)] / [P(θ⁽ᵗ⁾|D) q(θ*|θ⁽ᵗ⁾)] )

q = proposal distribution If q is symmetric, the q terms cancel and α is just the posterior ratio

The simplest version proposes a new value by adding a small random step to the current one — usually drawn from a Normal distribution centered at zero, which makes the proposal symmetric and cancels the q terms from the formula above. The acceptance probability then reduces to the ratio of the unnormalized posterior at the two points. If the proposal lands in a higher-probability region, it's always accepted; if it lands in a lower-probability region, it's accepted with a probability equal to how much lower.

The size of the proposal step controls how the chain behaves. Steps that are too small move slowly and produce highly correlated samples; steps that are too large get rejected too often and the chain barely moves. In practice, tuning the step size to give an acceptance rate somewhere around 20–50% is a common rule of thumb for random-walk Metropolis in moderate dimensions.

Gibbs Sampling: One Parameter at a Time

Algorithm 2 — Gibbs Sampling

Full Conditional Update

θᵢ⁽ᵗ⁾ ~ P(θᵢ | θ₋ᵢ⁽ᵗ⁻¹⁾, D)

θᵢ = the i-th parameter θ₋ᵢ = all other parameters at their current values

Instead of proposing a move for the whole parameter vector at once, Gibbs sampling cycles through each parameter and draws a new value for it directly from its full conditional distribution — its distribution given the data and the current values of every other parameter. When those conditionals have a known form (which happens for many conjugate Bayesian models), each draw is exact and every proposal is accepted, so there's no rejection step at all.

The trade-off is that Gibbs sampling can move slowly when parameters are highly correlated with each other, because updating one parameter at a time means the chain has to take many small zig-zagging steps to move diagonally through parameter space. It also requires the conditional distributions to be derivable in the first place, which limits it to a narrower class of models than Metropolis-Hastings.

Hamiltonian Monte Carlo: Following the Gradient

Algorithm 3 — Hamiltonian Monte Carlo (HMC)

Hamiltonian Dynamics

dθ/dt = ∂H/∂p dp/dt = −∂H/∂θ

θ = position (the parameters) p = auxiliary momentum variable H = −log-posterior + kinetic energy

HMC introduces an auxiliary "momentum" variable for each parameter and simulates a short trajectory through parameter space using the gradient of the log-posterior, much like a ball rolling across a surface whose height represents how unlikely a point is. Because the trajectory follows the gradient rather than a random direction, it can travel a long way through the posterior in a single step while staying in high-probability territory, which sharply reduces the autocorrelation between successive samples compared with a random walk.

The cost is that HMC needs the gradient of the log-posterior, so the model must be differentiable, and it requires tuning two settings: the step size used to discretize the trajectory (via the "leapfrog" integrator) and the number of steps taken per iteration. Get either wrong and the sampler either wastes computation on tiny steps or produces trajectories that loop back on themselves.

The No-U-Turn Sampler (NUTS): Automating HMC

Algorithm 4 — No-U-Turn Sampler (NUTS)

NUTS removes the second tuning problem in HMC — choosing the number of leapfrog steps — by building a binary tree of candidate positions, doubling its length at each iteration, and stopping automatically once the trajectory starts to turn back toward where it began (a "U-turn"). A position is then sampled from the tree in a way that preserves the correct target distribution. Step size is tuned automatically during warmup to hit a target acceptance rate, typically around 0.8.

Because it removes the most fragile tuning decisions in HMC while keeping its efficiency in high dimensions, NUTS has become the default sampler in both Stan and PyMC, and is also used in NumPyro and TensorFlow Probability. For models with discrete parameters, where gradients don't exist, Gibbs sampling or Metropolis-Hastings is still used instead, sometimes within the same model as NUTS for the continuous parameters.

Hoffman, M.D. & Gelman, A. (2014). "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo." Journal of Machine Learning Research, 15, 1593–1623. Neal, R.M. (2011). "MCMC Using Hamiltonian Dynamics," in Handbook of Markov Chain Monte Carlo.

MCMC vs. Variational Inference vs. Maximum Likelihood

MCMC is one of three broad approaches to fitting a statistical model. Variational inference (VI) replaces sampling with optimization, approximating the posterior with a simpler distribution (often Normal) whose parameters are tuned to be as close as possible to the true posterior. Maximum likelihood estimation (MLE) is the frequentist approach covered throughout the hypothesis testing section — it finds a single best-fitting parameter value without placing a distribution over it at all.

Feature	MCMC	Variational Inference	Maximum Likelihood
What it produces	Samples from the full posterior	An approximate posterior (e.g., Normal)	A single point estimate
Uncertainty	Direct from the sample spread	Often understated (VI tends to underestimate variance)	Requires separate calculation (e.g., standard errors)
Accuracy as resources grow	Approaches the exact posterior	Limited by the chosen approximate family	Converges to the true parameter (under regularity conditions)
Computational cost	Higher; many iterations needed	Lower; an optimization problem	Lowest; often a closed form or quick optimization
Typical use case	Complex hierarchical models, small-to-medium data	Very large datasets, deep generative models	Standard regression, hypothesis tests

In practice, the three methods aren't mutually exclusive. A common workflow is to fit a model with MLE first to find a reasonable starting point, then run MCMC for the full posterior, using VI for a fast approximate check or for models too large for MCMC to handle in a reasonable time.

Checking Convergence: Trace Plots, R-hat, and Effective Sample Size

An MCMC chain is only useful once it has converged — once it has forgotten its starting point and is sampling from the target distribution. There's no way to prove convergence with certainty, but three diagnostics, used together, catch most problems. The standard practice is to run several chains (typically 4) from different starting points and compare them.

Trace Plots

A trace plot shows the sampled value of a parameter on the y-axis against the iteration number on the x-axis, with one line per chain. A converged set of chains looks like a flat band of overlapping noise — sometimes called a "fuzzy caterpillar" — with no visible trend and no chain wandering off on its own. Warning signs include a chain that drifts slowly in one direction, chains that occupy clearly different ranges, or long flat stretches where the chain gets stuck at one value (often a sign the proposal step size is poorly tuned).

R-hat (the Gelman-Rubin Statistic)

R-hat compares the variance of a parameter within each chain to its variance between chains. If all chains have converged to the same distribution, the two variances should be similar and R-hat should be close to 1. If one or more chains are still exploring a different region, the between-chain variance will be inflated and R-hat will be noticeably above 1.

Gelman-Rubin Statistic (Simplified)

R̂ = √( (Between-chain variance + Within-chain variance) / Within-chain variance )

R̂ ≈ 1.00 → chains agree R̂ > 1.01 → run longer or reparameterize

The original 1992 threshold of R-hat < 1.1 is now considered too lenient. A 2021 paper by Aki Vehtari and colleagues in Bayesian Analysis proposed a more conservative rank-normalized, split-chain version of R-hat and recommended a threshold of 1.01 for publication-quality results.

Effective Sample Size (ESS)

Consecutive MCMC samples are correlated with each other — a sample is, after all, generated from the one before it. That autocorrelation means a chain of N samples carries less information than N independent draws from the sampling distribution would. Effective sample size (ESS) estimates how many independent samples would be equivalent.

Effective Sample Size

N_eff = N / ( 1 + 2 Σ ρₖ )

N = total number of post-warmup samples ρₖ = autocorrelation of the chain at lag k

A common rule of thumb is to aim for an ESS of at least 400 across all chains combined for each parameter of interest, and to treat a low ESS relative to the total iteration count as a sign that the sampler is mixing slowly — often solved by switching from a random-walk method to HMC or NUTS, or by reparameterizing the model.

Diagnostic	What It Measures	Healthy Value	If It Fails
Trace plot	Visual mixing and stationarity	Overlapping, flat "fuzzy caterpillar"	Increase warmup, retune step size, or reparameterize
R-hat	Agreement between chains	< 1.01	Run longer chains from more dispersed starting points
Effective sample size	Information content after autocorrelation	> 400 (combined)	Switch to HMC/NUTS, thin the chain, or reparameterize
Divergent transitions (HMC/NUTS)	Numerical breakdown of the integrator	Zero or near-zero	Reduce step size, increase target acceptance rate, reparameterize

⚠️

Diagnostics describe the chain, not the model

Good diagnostics mean the sampler has explored the posterior you specified — they say nothing about whether that posterior is a good description of reality. Model checking (posterior predictive checks, sensitivity to the prior) is a separate step that should follow, not replace, convergence diagnostics.

A Worked Example: Metropolis-Hastings for a Beta-Binomial Model

The cleanest way to learn what MCMC is doing is to apply it to a problem with a known analytical answer, then check that the sampler recovers it. The Beta-Binomial model — estimating an unknown proportion p from a series of successes and failures — is the standard choice, and is the same setup used throughout the probability rules guide.

Worked Example — Estimating a Proportion with Metropolis-Hastings

Problem: Out of 10 independent trials, 7 succeed. With a Uniform(0,1) prior on the success probability p, use Metropolis-Hastings to sample from the posterior of p, and check the result against the known analytical answer.

Unnormalized Posterior

P(p|D) ∝ p⁷(1−p)³ × 1

Likelihood: Binomial(n=10, k=7) ∝ p⁷(1−p)³ Prior: Uniform(0,1) = Beta(1,1)

Model setup: Prior P(p) = Beta(1,1) (uniform). Likelihood P(D|p) = Binomial(10, p) evaluated at 7 successes, proportional to p⁷(1−p)³. Unnormalized posterior: f(p) = p⁷(1−p)³.

Proposal: p* = p⁽ᵗ⁾ + ε, where ε ~ Normal(0, 0.1). The proposal is symmetric, so the acceptance ratio is simply f(p*) / f(p⁽ᵗ⁾). If p* falls outside [0,1], the proposal is automatically rejected (the posterior is zero there).

Initialize: Start the chain at p⁽⁰⁾ = 0.5, a neutral starting point with no information about which direction the true value lies.

One step, by hand: Suppose the proposal draws p* = 0.58.
f(0.5) = 0.5⁷ × 0.5³ = 0.5¹⁰ ≈ 0.000977
f(0.58) = 0.58⁷ × 0.42³ ≈ 0.01535 × 0.07409 ≈ 0.001137
Ratio = f(0.58) / f(0.5) ≈ 1.164

Accept or reject: Since the ratio is greater than 1, the move to p* = 0.58 is accepted automatically (α = min(1, 1.164) = 1). The chain moves to p⁽¹⁾ = 0.58 and the process repeats with a new proposal from there.

After many iterations: Running this process for 5,000 iterations and discarding the first 1,000 as burn-in produces a set of samples whose mean is approximately 0.667 and whose standard deviation is approximately 0.130.

✅ Conclusion: The analytical posterior for this model is Beta(1+7, 1+3) = Beta(8,4), which has a mean of 8/(8+4) = 0.667 and a standard deviation of √[(8×4)/((8+4)²×(8+4+1))] ≈ 0.130. The MCMC estimates match these values closely, which is exactly what should happen when a sampler is implemented correctly — the simulation recovers the same answer the algebra gives directly.

Conjugate Beta-Binomial result and notation follow Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A. & Rubin, D.B. (2013). Bayesian Data Analysis (3rd ed.). CRC Press, Chapter 2.

Try It: An Interactive Metropolis-Hastings Sampler

The simulator below runs a real random-walk Metropolis-Hastings chain in your browser, targeting either a standard Normal distribution or a bimodal mixture of two Normals. Pick a target, set the proposal step size, choose a number of iterations, and run the sampler. The trace plot shows the path the chain takes; the histogram shows where it ended up, with the true density drawn in red for comparison. The first 10% of samples are discarded as burn-in before the histogram and statistics are computed.

🎲 Metropolis-Hastings Sampler

Try a small step size on the bimodal target — notice how the chain can get stuck in one mode and may need a much larger step (or many more iterations) to cross over to the other.

Target Distribution

Proposal Step Size (σ of proposal)

1.0

Iterations

Trace Plot (sample value vs. iteration, after burn-in)

Histogram of Samples vs. True Density (red)

Click "Run Sampler" to generate a chain.

MCMC Software: Stan, PyMC, and JAGS

Modern Bayesian analysis is almost never done by writing a sampler from scratch, outside of teaching examples like the one above. Probabilistic programming languages let you describe a model declaratively and handle the sampling automatically.

Tool	Interface	Default Sampler	Best Suited For
Stan	Own modeling language, called from R, Python, Julia	NUTS (HMC)	Large hierarchical models; needs a differentiable log-posterior
PyMC	Python, built on PyTensor	NUTS (HMC)	Python-native workflows integrated with NumPy and ArviZ
JAGS / WinBUGS	Declarative BUGS-style syntax	Gibbs / slice sampling	Conjugate or near-conjugate models; teaching
NumPyro / TensorFlow Probability	Python, JAX or TensorFlow backend	NUTS	Large-scale models with GPU acceleration

The general pattern across all of them mirrors the 6-step workflow above: you write down priors and a likelihood, the software builds the unnormalized posterior automatically, and (for Stan, PyMC, NumPyro, and TensorFlow Probability) computes the gradients needed for HMC/NUTS using automatic differentiation rather than by hand. The output is a set of posterior samples in the same format produced by the simulator above, just for models with potentially hundreds of parameters.

Where MCMC Is Used

MCMC's reach extends well beyond academic statistics. Any field that needs uncertainty estimates for a model with many interacting parameters tends to end up using it.

🦠

Epidemiology

Disease transmission models estimate parameters such as the reproduction number R0 from case counts, combining mechanistic models of how infections spread with MCMC to quantify uncertainty in the estimates used for public health decisions.

🧬

Genetics & Phylogenetics

Bayesian phylogenetics software such as BEAST and MrBayes uses MCMC to sample evolutionary trees and mutation rates, producing a distribution over possible tree shapes rather than a single best tree.

📈

Finance & Econometrics

Stochastic volatility and time-varying parameter models, where a quantity like market volatility evolves randomly over time, are typically estimated with MCMC because their likelihoods involve high-dimensional latent variables that don't simplify analytically.

🔭

Astrophysics & Cosmology

Fitting models of exoplanet orbits, galaxy properties, or cosmological parameters to telescope data routinely uses MCMC (often via the widely used emcee Python package) to map out the full range of parameter values consistent with the observations.

🌱

Ecology

Population dynamics and capture-recapture models, used to estimate wildlife population sizes and survival rates from incomplete observation data, rely on MCMC when the underlying state-space models have no closed-form likelihood.

🤖

Machine Learning

Bayesian neural networks, topic models such as Latent Dirichlet Allocation, and other latent-variable models use Gibbs sampling or HMC-based methods to obtain uncertainty estimates that point-estimate training methods don't provide on their own.

MCMC Cheat Sheet

Algorithm Comparison

Algorithm	Core Mechanism	Best For	Main Drawback
Metropolis-Hastings	Random-walk proposal with accept/reject step	General-purpose, low-to-moderate dimensions	High autocorrelation; slow mixing in high dimensions
Gibbs Sampling	Sequential draws from full conditional distributions	Conjugate hierarchical models	Slow when parameters are highly correlated
Hamiltonian Monte Carlo	Gradient-guided trajectories using auxiliary momentum	High-dimensional continuous parameters	Requires gradients and manual tuning of step size and path length
NUTS	HMC with automatic, adaptive path length	Default choice for most modern Bayesian models	Higher per-iteration cost than simple Metropolis

Symbols and Terms Glossary

Term / Symbol	Notation	Meaning
Posterior	P(θ\|D)	Updated distribution of parameters after seeing the data
Prior	P(θ)	Distribution representing belief about parameters before the data
Likelihood	P(D\|θ)	Probability of the observed data given a parameter value
Markov chain	θ⁽ᵗ⁺¹⁾ ~ T(θ\|θ⁽ᵗ⁾)	Sequence where each state depends only on the previous one
Burn-in / warmup	Discard 0 to B	Early iterations removed before summarizing the chain
Acceptance rate	Accepted / Total	Share of proposed moves that were accepted
R-hat	R̂	Gelman-Rubin statistic comparing between- and within-chain variance
Effective sample size	N_eff	Number of independent samples equivalent to the (correlated) chain
Trace plot	θ⁽ᵗ⁾ vs. t	Plot of sampled values over iterations, used to assess mixing
Credible interval	e.g., 95% CI on θ	Bayesian interval containing a given probability of the posterior

Frequently Asked Questions

MCMC is a family of algorithms for drawing samples from a probability distribution that can't be evaluated directly — most often a Bayesian posterior. It constructs a sequence of dependent samples (a Markov chain) whose long-run distribution matches the target, then uses those samples to estimate means, standard deviations, and credible intervals for the parameters of interest.

Bayes' theorem requires dividing by the marginal likelihood — an integral of the prior times the likelihood over every possible parameter value. For a single parameter with a conjugate prior, this integral has a known formula. For models with several parameters or non-conjugate priors, it generally has no closed-form solution, and direct numerical integration becomes impractical as the number of parameters grows. MCMC works with the unnormalized posterior instead, where this integral cancels out.

Metropolis-Hastings is the original MCMC algorithm. At each step it proposes a candidate value from a proposal distribution, then accepts or rejects that candidate based on the ratio of the unnormalized posterior at the candidate and current values (adjusted for any asymmetry in the proposal). Most other MCMC algorithms, including Gibbs sampling and Hamiltonian Monte Carlo, can be viewed as special cases or extensions of this basic idea.

Gibbs sampling updates one parameter at a time by drawing it directly from its full conditional distribution given the data and the current values of every other parameter. Unlike Metropolis-Hastings, there's no accept/reject step — every draw is used. It requires those conditional distributions to have a known form, which is common in conjugate Bayesian models but not universal.

Hamiltonian Monte Carlo (HMC) uses the gradient of the log-posterior to simulate a short physical trajectory through parameter space, treating the parameters as the position of a particle with an auxiliary momentum. Because the trajectory follows the shape of the posterior rather than taking random steps, HMC can move large distances between samples while keeping the acceptance rate high, which produces far less autocorrelation than random-walk methods, especially when there are many parameters.

NUTS is an extension of HMC, introduced by Hoffman and Gelman in 2014, that automatically determines how long each trajectory should be by building a tree of candidate positions and stopping once the path starts to double back on itself. This removes the need to manually choose the number of leapfrog steps, and NUTS is now the default sampler in Stan, PyMC, NumPyro, and TensorFlow Probability.

Run several chains from different starting points and check three things together: trace plots should overlap and look like flat noise with no trend, the Gelman-Rubin statistic (R-hat) should be close to 1 (commonly < 1.01), and the effective sample size should be high enough relative to the total number of iterations — a common minimum target is 400 across all chains for each parameter.

Burn-in refers to the early iterations of a chain, taken before it has reached the high-probability region of the target distribution. These samples are influenced by the arbitrary starting point and don't represent the posterior, so they're discarded before computing summary statistics. In Stan and PyMC this phase is also used to automatically tune the step size for HMC/NUTS.

Stan and PyMC are the most widely used general-purpose tools, both defaulting to NUTS. JAGS and the older WinBUGS use Gibbs and slice sampling and remain common in teaching and for conjugate-style models. NumPyro and TensorFlow Probability provide GPU-accelerated NUTS implementations for Python, and field-specific tools like emcee (astrophysics) and BEAST (phylogenetics) wrap MCMC for particular model types.

Because each MCMC sample is generated from the one before it, consecutive samples are correlated, and a chain of N samples contains less information than N independent draws would. Effective sample size (ESS) estimates the equivalent number of independent samples after accounting for that autocorrelation. A low ESS relative to the chain length signals slow mixing, which usually means more iterations are needed, or a more efficient sampler such as HMC/NUTS should be used instead of a random walk.

Sources and References

This guide draws on the foundational papers that introduced each algorithm, the more recent literature on convergence diagnostics, and the documentation for the software discussed in Section 10.

Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. & Teller, E. (1953) — "Equation of State Calculations by Fast Computing Machines." The Journal of Chemical Physics, 21(6), 1087–1092.
Hastings, W.K. (1970) — "Monte Carlo Sampling Methods Using Markov Chains and Their Applications." Biometrika, 57(1), 97–109.
Geman, S. & Geman, D. (1984) — "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images." IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6), 721–741.
Gelfand, A.E. & Smith, A.F.M. (1990) — "Sampling-Based Approaches to Calculating Marginal Densities." Journal of the American Statistical Association, 85(410), 398–409.
Hoffman, M.D. & Gelman, A. (2014) — "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo." Journal of Machine Learning Research, 15, 1593–1623.
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B. & Bürkner, P.C. (2021) — "Rank-Normalization, Folding, and Localization: An Improved R̂ for Assessing Convergence of MCMC." Bayesian Analysis, 16(2), 667–718.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A. & Rubin, D.B. (2013) — Bayesian Data Analysis (3rd ed.). CRC Press.
Stan Development Team — Stan Reference Manual and User's Guide. mc-stan.org
PyMC Documentation — PyMC Development Team. pymc.io