What Is Markov Chain Monte Carlo (MCMC)?
The name combines two ideas that each predate modern computing. A Monte Carlo method approximates a quantity using random sampling instead of exact calculation — the name comes from the Monte Carlo casino, a nod to its reliance on chance. A Markov chain is a sequence of random variables where the next value depends only on the current value, not on the full history that came before it. Background on both ideas is covered in the guides on basic probability and Bayes' theorem.
What makes MCMC useful is that it never needs to know the target distribution's exact height, only its shape. Given any function proportional to the posterior, an MCMC algorithm can wander through parameter space, spending more time in regions where that function is large and less time where it's small. After enough steps, a histogram of where the chain has visited approximates the posterior distribution — and from that histogram you can read off means, standard deviations, and credible intervals, the Bayesian counterpart of the confidence interval for a mean.
- Posterior: P(θ|D) ∝ P(D|θ)P(θ) — MCMC samples from the right-hand side without ever computing the normalizing constant
- Markov chain: each sample depends only on the current state, not the full history
- Stationary distribution: the distribution the chain converges to — the target posterior
- Burn-in / warmup: early samples discarded before the chain reaches the high-probability region
- Main algorithms: Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo, NUTS
- Convergence checks: trace plots, the Gelman-Rubin statistic (R-hat), and effective sample size
The Problem MCMC Solves: Bayes' Theorem's Denominator
Bayes' theorem says the posterior distribution of a parameter θ, given data D, equals the likelihood times the prior, divided by a normalizing constant:
θ = parameters being estimated
D = observed data
P(θ) = prior distribution
P(D|θ) = likelihood
P(θ|D) = posterior distribution
The numerator, P(D|θ)P(θ), is usually easy to compute for any specific value of θ — it's just the prior density times the likelihood of the data under that θ. The denominator is the problem. It's the integral of that same product over every possible value of θ, and it's what turns the unnormalized shape on top into a true probability distribution that sums to one.
For a single parameter with a conjugate prior — a Beta prior paired with a Binomial likelihood, for example — that integral has a known closed-form answer, and the posterior is itself a Beta distribution. Once a model has more than two or three parameters, or the prior and likelihood aren't conjugate, the integral typically has no algebraic solution. Direct numerical integration (evaluating the function on a grid and summing) becomes impractical too: a grid with just 100 points per dimension already needs 100 evaluations for one parameter, 10,000 for two, and 10^20 for ten — a version of the curse of dimensionality.
MCMC sidesteps the integral entirely. Because every MCMC algorithm only ever needs the ratio of the posterior at two points, the denominator cancels out:
That single fact is what makes the rest of this guide possible. Every algorithm below works with the unnormalized posterior — the prior times the likelihood — and never needs to know how that quantity relates to a properly scaled probability distribution.
A Short History: From Los Alamos to Stan
MCMC didn't start as a statistics tool. In 1953, a team at Los Alamos National Laboratory — Nicholas Metropolis, Arianna and Marshall Rosenbluth, and Augusta and Edward Teller — published "Equation of State Calculations by Fast Computing Machines" in the Journal of Chemical Physics. The paper described what's now called the Metropolis algorithm, used to simulate how particles in a fluid settle into equilibrium on one of the earliest electronic computers, the MANIAC.
In 1970, statistician W.K. Hastings generalized the method in Biometrika, allowing the proposal step to be asymmetric — the version now known as Metropolis-Hastings. In 1984, Stuart and Donald Geman introduced Gibbs sampling for image restoration in a paper for IEEE Transactions on Pattern Analysis and Machine Intelligence, borrowing the name from physicist Josiah Willard Gibbs.
The turning point for statistics came in 1990, when Alan Gelfand and Adrian Smith published "Sampling-Based Approaches to Calculating Marginal Densities" in the Journal of the American Statistical Association. The paper showed that Gibbs sampling could solve Bayesian computation problems that had been considered intractable, and is widely credited with starting the modern era of applied Bayesian statistics. Through the 1990s, the BUGS and WinBUGS software packages made Gibbs sampling accessible to applied researchers without a programming background, and the open-source JAGS followed as a cross-platform alternative.
The most recent shift came with Hamiltonian Monte Carlo moving from physics into mainstream statistical software. The Stan probabilistic programming language, released starting in 2012, combined HMC with automatic differentiation so that gradients of the log-posterior could be computed for almost any model written in its syntax. In 2014, Matthew Hoffman and Andrew Gelman published the No-U-Turn Sampler (NUTS) in the Journal of Machine Learning Research, an adaptive form of HMC that removed the need to hand-tune trajectory lengths. NUTS is now the default sampler in both Stan and PyMC.
How MCMC Works: The 6-Step Workflow
Step 1: Specify the prior P(θ). Step 2: Specify the likelihood P(D|θ). Step 3: Form the unnormalized posterior P(D|θ)P(θ). Step 4: Initialize the chain and propose a candidate move. Step 5: Accept or reject the proposal based on the posterior ratio. Step 6: Discard burn-in and summarize the remaining samples.
Specify the Prior
Choose a distribution P(θ) that reflects what is known about each parameter before looking at the current data. This can be informative (based on prior studies) or deliberately vague (a wide Normal or uniform range) when little is known. The choice affects the posterior most when the data is limited.
Specify the Likelihood
Choose a probability model P(D|θ) describing how the observed data would arise for a given value of θ — a Binomial likelihood for pass/fail data, a Normal likelihood for continuous measurements, a Poisson likelihood for counts, and so on.
Form the Unnormalized Posterior
Multiply the prior by the likelihood: P(θ|D) ∝ P(D|θ)P(θ). This product is the function every MCMC algorithm evaluates. It has the same shape as the posterior but isn't guaranteed to integrate to 1 — which, as shown above, doesn't matter.
Initialize the Chain and Propose a Move
Start the chain at some value θ⁽⁰⁾, often chosen randomly or from a quick optimization. At each step t, generate a candidate value θ* from a proposal distribution q(θ*|θ⁽ᵗ⁾) — for example, θ⁽ᵗ⁾ plus a small random step drawn from a normal distribution.
Accept or Reject the Proposal
Compute an acceptance probability that compares the unnormalized posterior at θ* and θ⁽ᵗ⁾ (adjusted for any asymmetry in the proposal). Accept θ* with that probability; otherwise the chain stays at θ⁽ᵗ⁾ for another step. Repeat steps 4–5 thousands or tens of thousands of times.
Discard Burn-in and Summarize
Drop the early iterations recorded before the chain settled into the high-probability region — this is the burn-in or warmup period. Use the remaining samples to compute the posterior mean, standard deviation, and credible intervals for each parameter, and to check convergence (see Section 7).
Anatomy of the Core MCMC Algorithms
The 6-step workflow above describes the family. The algorithms below differ mainly in how step 4 (the proposal) and step 5 (the acceptance rule) are built.
Metropolis-Hastings: The Original Random Walk
q = proposal distribution
If q is symmetric, the q terms cancel and α is just the posterior ratio
The simplest version proposes a new value by adding a small random step to the current one — usually drawn from a Normal distribution centered at zero, which makes the proposal symmetric and cancels the q terms from the formula above. The acceptance probability then reduces to the ratio of the unnormalized posterior at the two points. If the proposal lands in a higher-probability region, it's always accepted; if it lands in a lower-probability region, it's accepted with a probability equal to how much lower.
The size of the proposal step controls how the chain behaves. Steps that are too small move slowly and produce highly correlated samples; steps that are too large get rejected too often and the chain barely moves. In practice, tuning the step size to give an acceptance rate somewhere around 20–50% is a common rule of thumb for random-walk Metropolis in moderate dimensions.
Gibbs Sampling: One Parameter at a Time
θᵢ = the i-th parameter
θ₋ᵢ = all other parameters at their current values
Instead of proposing a move for the whole parameter vector at once, Gibbs sampling cycles through each parameter and draws a new value for it directly from its full conditional distribution — its distribution given the data and the current values of every other parameter. When those conditionals have a known form (which happens for many conjugate Bayesian models), each draw is exact and every proposal is accepted, so there's no rejection step at all.
The trade-off is that Gibbs sampling can move slowly when parameters are highly correlated with each other, because updating one parameter at a time means the chain has to take many small zig-zagging steps to move diagonally through parameter space. It also requires the conditional distributions to be derivable in the first place, which limits it to a narrower class of models than Metropolis-Hastings.
Hamiltonian Monte Carlo: Following the Gradient
θ = position (the parameters)
p = auxiliary momentum variable
H = −log-posterior + kinetic energy
HMC introduces an auxiliary "momentum" variable for each parameter and simulates a short trajectory through parameter space using the gradient of the log-posterior, much like a ball rolling across a surface whose height represents how unlikely a point is. Because the trajectory follows the gradient rather than a random direction, it can travel a long way through the posterior in a single step while staying in high-probability territory, which sharply reduces the autocorrelation between successive samples compared with a random walk.
The cost is that HMC needs the gradient of the log-posterior, so the model must be differentiable, and it requires tuning two settings: the step size used to discretize the trajectory (via the "leapfrog" integrator) and the number of steps taken per iteration. Get either wrong and the sampler either wastes computation on tiny steps or produces trajectories that loop back on themselves.
The No-U-Turn Sampler (NUTS): Automating HMC
NUTS removes the second tuning problem in HMC — choosing the number of leapfrog steps — by building a binary tree of candidate positions, doubling its length at each iteration, and stopping automatically once the trajectory starts to turn back toward where it began (a "U-turn"). A position is then sampled from the tree in a way that preserves the correct target distribution. Step size is tuned automatically during warmup to hit a target acceptance rate, typically around 0.8.
Because it removes the most fragile tuning decisions in HMC while keeping its efficiency in high dimensions, NUTS has become the default sampler in both Stan and PyMC, and is also used in NumPyro and TensorFlow Probability. For models with discrete parameters, where gradients don't exist, Gibbs sampling or Metropolis-Hastings is still used instead, sometimes within the same model as NUTS for the continuous parameters.
MCMC vs. Variational Inference vs. Maximum Likelihood
MCMC is one of three broad approaches to fitting a statistical model. Variational inference (VI) replaces sampling with optimization, approximating the posterior with a simpler distribution (often Normal) whose parameters are tuned to be as close as possible to the true posterior. Maximum likelihood estimation (MLE) is the frequentist approach covered throughout the hypothesis testing section — it finds a single best-fitting parameter value without placing a distribution over it at all.
| Feature | MCMC | Variational Inference | Maximum Likelihood |
|---|---|---|---|
| What it produces | Samples from the full posterior | An approximate posterior (e.g., Normal) | A single point estimate |
| Uncertainty | Direct from the sample spread | Often understated (VI tends to underestimate variance) | Requires separate calculation (e.g., standard errors) |
| Accuracy as resources grow | Approaches the exact posterior | Limited by the chosen approximate family | Converges to the true parameter (under regularity conditions) |
| Computational cost | Higher; many iterations needed | Lower; an optimization problem | Lowest; often a closed form or quick optimization |
| Typical use case | Complex hierarchical models, small-to-medium data | Very large datasets, deep generative models | Standard regression, hypothesis tests |
In practice, the three methods aren't mutually exclusive. A common workflow is to fit a model with MLE first to find a reasonable starting point, then run MCMC for the full posterior, using VI for a fast approximate check or for models too large for MCMC to handle in a reasonable time.
Checking Convergence: Trace Plots, R-hat, and Effective Sample Size
An MCMC chain is only useful once it has converged — once it has forgotten its starting point and is sampling from the target distribution. There's no way to prove convergence with certainty, but three diagnostics, used together, catch most problems. The standard practice is to run several chains (typically 4) from different starting points and compare them.
Trace Plots
A trace plot shows the sampled value of a parameter on the y-axis against the iteration number on the x-axis, with one line per chain. A converged set of chains looks like a flat band of overlapping noise — sometimes called a "fuzzy caterpillar" — with no visible trend and no chain wandering off on its own. Warning signs include a chain that drifts slowly in one direction, chains that occupy clearly different ranges, or long flat stretches where the chain gets stuck at one value (often a sign the proposal step size is poorly tuned).
R-hat (the Gelman-Rubin Statistic)
R-hat compares the variance of a parameter within each chain to its variance between chains. If all chains have converged to the same distribution, the two variances should be similar and R-hat should be close to 1. If one or more chains are still exploring a different region, the between-chain variance will be inflated and R-hat will be noticeably above 1.
The original 1992 threshold of R-hat < 1.1 is now considered too lenient. A 2021 paper by Aki Vehtari and colleagues in Bayesian Analysis proposed a more conservative rank-normalized, split-chain version of R-hat and recommended a threshold of 1.01 for publication-quality results.
Effective Sample Size (ESS)
Consecutive MCMC samples are correlated with each other — a sample is, after all, generated from the one before it. That autocorrelation means a chain of N samples carries less information than N independent draws from the sampling distribution would. Effective sample size (ESS) estimates how many independent samples would be equivalent.
N = total number of post-warmup samples
ρₖ = autocorrelation of the chain at lag k
A common rule of thumb is to aim for an ESS of at least 400 across all chains combined for each parameter of interest, and to treat a low ESS relative to the total iteration count as a sign that the sampler is mixing slowly — often solved by switching from a random-walk method to HMC or NUTS, or by reparameterizing the model.
| Diagnostic | What It Measures | Healthy Value | If It Fails |
|---|---|---|---|
| Trace plot | Visual mixing and stationarity | Overlapping, flat "fuzzy caterpillar" | Increase warmup, retune step size, or reparameterize |
| R-hat | Agreement between chains | < 1.01 | Run longer chains from more dispersed starting points |
| Effective sample size | Information content after autocorrelation | > 400 (combined) | Switch to HMC/NUTS, thin the chain, or reparameterize |
| Divergent transitions (HMC/NUTS) | Numerical breakdown of the integrator | Zero or near-zero | Reduce step size, increase target acceptance rate, reparameterize |
Good diagnostics mean the sampler has explored the posterior you specified — they say nothing about whether that posterior is a good description of reality. Model checking (posterior predictive checks, sensitivity to the prior) is a separate step that should follow, not replace, convergence diagnostics.
A Worked Example: Metropolis-Hastings for a Beta-Binomial Model
The cleanest way to learn what MCMC is doing is to apply it to a problem with a known analytical answer, then check that the sampler recovers it. The Beta-Binomial model — estimating an unknown proportion p from a series of successes and failures — is the standard choice, and is the same setup used throughout the probability rules guide.
Problem: Out of 10 independent trials, 7 succeed. With a Uniform(0,1) prior on the success probability p, use Metropolis-Hastings to sample from the posterior of p, and check the result against the known analytical answer.
Model setup: Prior P(p) = Beta(1,1) (uniform). Likelihood P(D|p) = Binomial(10, p) evaluated at 7 successes, proportional to p⁷(1−p)³. Unnormalized posterior: f(p) = p⁷(1−p)³.
Proposal: p* = p⁽ᵗ⁾ + ε, where ε ~ Normal(0, 0.1). The proposal is symmetric, so the acceptance ratio is simply f(p*) / f(p⁽ᵗ⁾). If p* falls outside [0,1], the proposal is automatically rejected (the posterior is zero there).
Initialize: Start the chain at p⁽⁰⁾ = 0.5, a neutral starting point with no information about which direction the true value lies.
One step, by hand: Suppose the proposal draws p* = 0.58.
f(0.5) = 0.5⁷ × 0.5³ = 0.5¹⁰ ≈ 0.000977
f(0.58) = 0.58⁷ × 0.42³ ≈ 0.01535 × 0.07409 ≈ 0.001137
Ratio = f(0.58) / f(0.5) ≈ 1.164
Accept or reject: Since the ratio is greater than 1, the move to p* = 0.58 is accepted automatically (α = min(1, 1.164) = 1). The chain moves to p⁽¹⁾ = 0.58 and the process repeats with a new proposal from there.
After many iterations: Running this process for 5,000 iterations and discarding the first 1,000 as burn-in produces a set of samples whose mean is approximately 0.667 and whose standard deviation is approximately 0.130.
✅ Conclusion: The analytical posterior for this model is Beta(1+7, 1+3) = Beta(8,4), which has a mean of 8/(8+4) = 0.667 and a standard deviation of √[(8×4)/((8+4)²×(8+4+1))] ≈ 0.130. The MCMC estimates match these values closely, which is exactly what should happen when a sampler is implemented correctly — the simulation recovers the same answer the algebra gives directly.
Try It: An Interactive Metropolis-Hastings Sampler
The simulator below runs a real random-walk Metropolis-Hastings chain in your browser, targeting either a standard Normal distribution or a bimodal mixture of two Normals. Pick a target, set the proposal step size, choose a number of iterations, and run the sampler. The trace plot shows the path the chain takes; the histogram shows where it ended up, with the true density drawn in red for comparison. The first 10% of samples are discarded as burn-in before the histogram and statistics are computed.
🎲 Metropolis-Hastings Sampler
Try a small step size on the bimodal target — notice how the chain can get stuck in one mode and may need a much larger step (or many more iterations) to cross over to the other.
MCMC Software: Stan, PyMC, and JAGS
Modern Bayesian analysis is almost never done by writing a sampler from scratch, outside of teaching examples like the one above. Probabilistic programming languages let you describe a model declaratively and handle the sampling automatically.
| Tool | Interface | Default Sampler | Best Suited For |
|---|---|---|---|
| Stan | Own modeling language, called from R, Python, Julia | NUTS (HMC) | Large hierarchical models; needs a differentiable log-posterior |
| PyMC | Python, built on PyTensor | NUTS (HMC) | Python-native workflows integrated with NumPy and ArviZ |
| JAGS / WinBUGS | Declarative BUGS-style syntax | Gibbs / slice sampling | Conjugate or near-conjugate models; teaching |
| NumPyro / TensorFlow Probability | Python, JAX or TensorFlow backend | NUTS | Large-scale models with GPU acceleration |
The general pattern across all of them mirrors the 6-step workflow above: you write down priors and a likelihood, the software builds the unnormalized posterior automatically, and (for Stan, PyMC, NumPyro, and TensorFlow Probability) computes the gradients needed for HMC/NUTS using automatic differentiation rather than by hand. The output is a set of posterior samples in the same format produced by the simulator above, just for models with potentially hundreds of parameters.
Where MCMC Is Used
MCMC's reach extends well beyond academic statistics. Any field that needs uncertainty estimates for a model with many interacting parameters tends to end up using it.
Epidemiology
Disease transmission models estimate parameters such as the reproduction number R0 from case counts, combining mechanistic models of how infections spread with MCMC to quantify uncertainty in the estimates used for public health decisions.
Genetics & Phylogenetics
Bayesian phylogenetics software such as BEAST and MrBayes uses MCMC to sample evolutionary trees and mutation rates, producing a distribution over possible tree shapes rather than a single best tree.
Finance & Econometrics
Stochastic volatility and time-varying parameter models, where a quantity like market volatility evolves randomly over time, are typically estimated with MCMC because their likelihoods involve high-dimensional latent variables that don't simplify analytically.
Astrophysics & Cosmology
Fitting models of exoplanet orbits, galaxy properties, or cosmological parameters to telescope data routinely uses MCMC (often via the widely used emcee Python package) to map out the full range of parameter values consistent with the observations.
Ecology
Population dynamics and capture-recapture models, used to estimate wildlife population sizes and survival rates from incomplete observation data, rely on MCMC when the underlying state-space models have no closed-form likelihood.
Machine Learning
Bayesian neural networks, topic models such as Latent Dirichlet Allocation, and other latent-variable models use Gibbs sampling or HMC-based methods to obtain uncertainty estimates that point-estimate training methods don't provide on their own.
MCMC Cheat Sheet
Algorithm Comparison
| Algorithm | Core Mechanism | Best For | Main Drawback |
|---|---|---|---|
| Metropolis-Hastings | Random-walk proposal with accept/reject step | General-purpose, low-to-moderate dimensions | High autocorrelation; slow mixing in high dimensions |
| Gibbs Sampling | Sequential draws from full conditional distributions | Conjugate hierarchical models | Slow when parameters are highly correlated |
| Hamiltonian Monte Carlo | Gradient-guided trajectories using auxiliary momentum | High-dimensional continuous parameters | Requires gradients and manual tuning of step size and path length |
| NUTS | HMC with automatic, adaptive path length | Default choice for most modern Bayesian models | Higher per-iteration cost than simple Metropolis |
Symbols and Terms Glossary
| Term / Symbol | Notation | Meaning |
|---|---|---|
| Posterior | P(θ|D) | Updated distribution of parameters after seeing the data |
| Prior | P(θ) | Distribution representing belief about parameters before the data |
| Likelihood | P(D|θ) | Probability of the observed data given a parameter value |
| Markov chain | θ⁽ᵗ⁺¹⁾ ~ T(θ|θ⁽ᵗ⁾) | Sequence where each state depends only on the previous one |
| Burn-in / warmup | Discard 0 to B | Early iterations removed before summarizing the chain |
| Acceptance rate | Accepted / Total | Share of proposed moves that were accepted |
| R-hat | R̂ | Gelman-Rubin statistic comparing between- and within-chain variance |
| Effective sample size | N_eff | Number of independent samples equivalent to the (correlated) chain |
| Trace plot | θ⁽ᵗ⁾ vs. t | Plot of sampled values over iterations, used to assess mixing |
| Credible interval | e.g., 95% CI on θ | Bayesian interval containing a given probability of the posterior |
Frequently Asked Questions
MCMC is a family of algorithms for drawing samples from a probability distribution that can't be evaluated directly — most often a Bayesian posterior. It constructs a sequence of dependent samples (a Markov chain) whose long-run distribution matches the target, then uses those samples to estimate means, standard deviations, and credible intervals for the parameters of interest.
Bayes' theorem requires dividing by the marginal likelihood — an integral of the prior times the likelihood over every possible parameter value. For a single parameter with a conjugate prior, this integral has a known formula. For models with several parameters or non-conjugate priors, it generally has no closed-form solution, and direct numerical integration becomes impractical as the number of parameters grows. MCMC works with the unnormalized posterior instead, where this integral cancels out.
Metropolis-Hastings is the original MCMC algorithm. At each step it proposes a candidate value from a proposal distribution, then accepts or rejects that candidate based on the ratio of the unnormalized posterior at the candidate and current values (adjusted for any asymmetry in the proposal). Most other MCMC algorithms, including Gibbs sampling and Hamiltonian Monte Carlo, can be viewed as special cases or extensions of this basic idea.
Gibbs sampling updates one parameter at a time by drawing it directly from its full conditional distribution given the data and the current values of every other parameter. Unlike Metropolis-Hastings, there's no accept/reject step — every draw is used. It requires those conditional distributions to have a known form, which is common in conjugate Bayesian models but not universal.
Hamiltonian Monte Carlo (HMC) uses the gradient of the log-posterior to simulate a short physical trajectory through parameter space, treating the parameters as the position of a particle with an auxiliary momentum. Because the trajectory follows the shape of the posterior rather than taking random steps, HMC can move large distances between samples while keeping the acceptance rate high, which produces far less autocorrelation than random-walk methods, especially when there are many parameters.
NUTS is an extension of HMC, introduced by Hoffman and Gelman in 2014, that automatically determines how long each trajectory should be by building a tree of candidate positions and stopping once the path starts to double back on itself. This removes the need to manually choose the number of leapfrog steps, and NUTS is now the default sampler in Stan, PyMC, NumPyro, and TensorFlow Probability.
Run several chains from different starting points and check three things together: trace plots should overlap and look like flat noise with no trend, the Gelman-Rubin statistic (R-hat) should be close to 1 (commonly < 1.01), and the effective sample size should be high enough relative to the total number of iterations — a common minimum target is 400 across all chains for each parameter.
Burn-in refers to the early iterations of a chain, taken before it has reached the high-probability region of the target distribution. These samples are influenced by the arbitrary starting point and don't represent the posterior, so they're discarded before computing summary statistics. In Stan and PyMC this phase is also used to automatically tune the step size for HMC/NUTS.
Stan and PyMC are the most widely used general-purpose tools, both defaulting to NUTS. JAGS and the older WinBUGS use Gibbs and slice sampling and remain common in teaching and for conjugate-style models. NumPyro and TensorFlow Probability provide GPU-accelerated NUTS implementations for Python, and field-specific tools like emcee (astrophysics) and BEAST (phylogenetics) wrap MCMC for particular model types.
Because each MCMC sample is generated from the one before it, consecutive samples are correlated, and a chain of N samples contains less information than N independent draws would. Effective sample size (ESS) estimates the equivalent number of independent samples after accounting for that autocorrelation. A low ESS relative to the chain length signals slow mixing, which usually means more iterations are needed, or a more efficient sampler such as HMC/NUTS should be used instead of a random walk.
Sources and References
This guide draws on the foundational papers that introduced each algorithm, the more recent literature on convergence diagnostics, and the documentation for the software discussed in Section 10.
- Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. & Teller, E. (1953) — "Equation of State Calculations by Fast Computing Machines." The Journal of Chemical Physics, 21(6), 1087–1092.
- Hastings, W.K. (1970) — "Monte Carlo Sampling Methods Using Markov Chains and Their Applications." Biometrika, 57(1), 97–109.
- Geman, S. & Geman, D. (1984) — "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images." IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6), 721–741.
- Gelfand, A.E. & Smith, A.F.M. (1990) — "Sampling-Based Approaches to Calculating Marginal Densities." Journal of the American Statistical Association, 85(410), 398–409.
- Hoffman, M.D. & Gelman, A. (2014) — "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo." Journal of Machine Learning Research, 15, 1593–1623.
- Vehtari, A., Gelman, A., Simpson, D., Carpenter, B. & Bürkner, P.C. (2021) — "Rank-Normalization, Folding, and Localization: An Improved R̂ for Assessing Convergence of MCMC." Bayesian Analysis, 16(2), 667–718.
- Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A. & Rubin, D.B. (2013) — Bayesian Data Analysis (3rd ed.). CRC Press.
- Stan Development Team — Stan Reference Manual and User's Guide. mc-stan.org
- PyMC Documentation — PyMC Development Team. pymc.io