What Is a Bayes Factor?
The Bayes Factor sits at the heart of Bayesian hypothesis testing. To understand it, start with a basic question any researcher faces after collecting data: given what I observed, which of my two competing explanations does the data favor, and by how much? A p-value answers a different question — it tells you how unlikely your data would be if the null hypothesis were true, but it says nothing about the alternative. The Bayes Factor compares both hypotheses head-to-head against the same data.
Formally, the Bayes Factor is the ratio of the marginal likelihoods of the two hypotheses: BF₁₀ = P(D | H₁) / P(D | H₀). Each marginal likelihood P(D | H) is the probability of the observed data integrated across all possible values of the model's parameters, weighted by the prior distribution on those parameters. This integration is what separates the Bayes Factor from a simple likelihood ratio, which evaluates models only at their best-fit parameter values.
Harold Jeffreys developed the framework in the 1930s and 1940s and set it out in his book Theory of Probability (first published in 1939; the 1961 third edition is the one most often cited). His goal was a method that could confirm a null effect, not merely fail to reject it, which classical significance testing cannot do. Researchers in psychology, medicine, and data science have returned to Bayes Factors with growing frequency because they address real inferential needs that frequentist tools leave unmet. The broader context for this sits in the Bayes theorem and conditional probability topics on Statistics Fundamentals.
- BF₁₀: Evidence for the alternative hypothesis H₁ relative to H₀ — the most commonly reported direction
- BF₀₁: Evidence for the null hypothesis H₀ relative to H₁. BF₀₁ = 1 / BF₁₀
- BF = 3: The minimum threshold many journals consider noteworthy evidence
- Prior sensitivity: Results depend on the prior distribution specified for effect sizes under H₁
- Null confirmation: Unlike p-values, Bayes Factors can provide positive evidence for H₀ when BF₀₁ > 1
- No arbitrary cutoffs: The Bayes Factor is a continuous measure — interpretation is graduated, not binary
The Bayes Factor Formula
The formula derives directly from Bayes' theorem applied to competing models. Start with the relationship between prior odds, posterior odds, and the Bayes Factor:
P(H₁ | D) / P(H₀ | D) = [P(D | H₁) / P(D | H₀)] × [P(H₁) / P(H₀)]
Posterior odds = BF₁₀ × prior odds, where BF₁₀ = P(D | H₁) / P(D | H₀)
P(D | H₁) = marginal likelihood of data under H₁
P(D | H₀) = marginal likelihood of data under H₀
BF₁₀ = evidence ratio for H₁ vs H₀
BF₀₁ = 1 / BF₁₀
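As a small illustration of the odds relationship (posterior odds = BF₁₀ × prior odds), the sketch below converts a Bayes Factor into a posterior probability for H₁. The function names and the example prior odds are illustrative assumptions, not part of any library.

```python
def posterior_odds(bf10: float, prior_odds: float = 1.0) -> float:
    """Posterior odds of H1 over H0 = BF10 * prior odds."""
    if bf10 <= 0 or prior_odds <= 0:
        raise ValueError("bf10 and prior_odds must be positive")
    return bf10 * prior_odds


def posterior_prob_h1(bf10: float, prior_odds: float = 1.0) -> float:
    """Convert posterior odds into P(H1 | D), assuming H0 and H1 are the only candidates."""
    odds = posterior_odds(bf10, prior_odds)
    return odds / (1.0 + odds)


# Hypothetical inputs: the same BF10 = 10 under two different prior odds.
equal_prior = posterior_prob_h1(10.0)             # prior odds 1:1
sceptical_prior = posterior_prob_h1(10.0, 0.25)   # prior odds 1:4 against H1
print(f"P(H1 | D) with 1:1 prior odds: {equal_prior:.3f}")
print(f"P(H1 | D) with 1:4 prior odds: {sceptical_prior:.3f}")
```

The Bayes Factor itself does not depend on the prior odds; only the posterior probability moves when the starting belief changes.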
The marginal likelihood for each hypothesis is computed by integrating the likelihood function over the prior distribution of the model's parameters:
P(D | H₁) = ∫ P(D | θ, H₁) · P(θ | H₁) dθ
θ = model parameter(s)
P(D | θ, H₁) = likelihood of data given θ
P(θ | H₁) = prior distribution on θ under H₁
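To make the integral concrete, here is a minimal sketch for a deliberately simple case that is not taken from the article: k successes in n binomial trials, with H₀ fixing θ = 0.5 and H₁ placing a Beta(a, b) prior on θ. The data (60 of 100) and the function name are hypothetical; `quad`, `binom.pmf`, and `beta.pdf` are standard SciPy calls.

```python
from scipy.integrate import quad
from scipy.stats import beta, binom


def marginal_likelihood_h1(k: int, n: int, a: float = 1.0, b: float = 1.0) -> float:
    """P(D | H1): integrate likelihood * prior over theta in [0, 1]."""
    def integrand(theta: float) -> float:
        return binom.pmf(k, n, theta) * beta.pdf(theta, a, b)

    # Tell quad where the integrand peaks so the narrow peak is not missed.
    peak = [k / n] if 0 < k < n else None
    value, _abs_error = quad(integrand, 0.0, 1.0, points=peak)
    return value


# Hypothetical data: 60 successes in 100 trials.
k, n = 60, 100
m1 = marginal_likelihood_h1(k, n)   # marginal likelihood under H1 (flat prior)
m0 = binom.pmf(k, n, 0.5)           # under H0 there is nothing to integrate
bf10 = m1 / m0
print(f"P(D | H1) = {m1:.5f}, P(D | H0) = {m0:.5f}, BF10 = {bf10:.3f}")
```

With the flat Beta(1, 1) prior the integral has the closed form 1 / (n + 1), which gives a quick check on the numerical result.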
Understanding Marginal Likelihoods
The marginal likelihood is often called the "model evidence" because it measures how well a hypothesis predicts the observed data overall — not just at the best-fit parameter value. A model that is flexible enough to fit many possible datasets will not be rewarded as much as a model that makes precise predictions that happen to match what you observed.
This is why the Bayes Factor naturally penalizes overly complex models. If H₁ specifies that a wide range of effect sizes are plausible, its marginal likelihood is spread thin across many possible predictions. A more focused prior — one that predicts more precisely what you found — yields higher marginal likelihood. This automatic complexity penalty is a feature, not a bug: it implements Occam's Razor mathematically.
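The complexity penalty can be seen numerically. The sketch below, under assumed data and priors that are not from the article, computes the marginal likelihood of 60 successes in 100 trials under three Beta priors: a diffuse one, a focused one centered near the observed rate, and a focused one centered in the wrong place. It uses the closed-form beta-binomial marginal likelihood.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def marginal_likelihood(k: int, n: int, a: float, b: float) -> float:
    """Beta-binomial marginal likelihood of k successes in n trials under a Beta(a, b) prior."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


# Hypothetical data and priors, chosen only for illustration.
k, n = 60, 100
priors = {
    "diffuse Beta(1, 1)": (1, 1),
    "focused near 0.6, Beta(30, 20)": (30, 20),
    "focused near 0.4, Beta(20, 30)": (20, 30),
}
results = {name: marginal_likelihood(k, n, a, b) for name, (a, b) in priors.items()}
for name, value in results.items():
    print(f"{name}: P(D | H) = {value:.5f}")
```

The focused prior that anticipated the data earns the highest marginal likelihood, the diffuse prior comes second, and the focused prior that predicted the wrong rate comes last: precision is rewarded only when the prediction turns out to be right.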
The Savage-Dickey Density Ratio
For nested models, where H₀ is a special case of H₁ with a parameter fixed to a specific value such as zero, the Bayes Factor has an elegant simplification known as the Savage-Dickey density ratio:
BF₁₀ = P(θ₀ | H₁) / P(θ₀ | D, H₁)
θ₀ = parameter value under H₀ (often 0)
P(θ₀ | H₁) = prior density at θ₀
P(θ₀ | D, H₁) = posterior density at θ₀
This ratio compares the prior density and the posterior density at the null value θ₀, both computed under H₁. When the data shift probability mass away from θ₀, the posterior density at θ₀ falls below the prior density, so BF₁₀ exceeds 1 and the evidence favors H₁. When the data concentrate probability mass around θ₀, the posterior density at θ₀ rises above the prior density, so BF₁₀ falls below 1 and the evidence favors H₀. The same ratio is used in the default Bayesian t-tests implemented in JASP.
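The equivalence between the density ratio and the ratio of marginal likelihoods can be checked in a model where both are available in closed form. The sketch below assumes a normal mean with known standard deviation and a Normal(0, τ²) prior under H₁; this conjugate setup is chosen for tractability and is not the Cauchy prior used elsewhere in this guide. The data summary (x̄ = 0.5, n = 25, τ = 1) is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def bf10_savage_dickey(xbar: float, n: int, tau: float, sigma: float = 1.0) -> float:
    """BF10 as prior density at 0 divided by posterior density at 0 (both under H1)."""
    prior = NormalDist(0.0, tau)
    post_var = 1.0 / (n / sigma**2 + 1.0 / tau**2)   # conjugate normal update
    post_mean = post_var * n * xbar / sigma**2
    posterior = NormalDist(post_mean, sqrt(post_var))
    return prior.pdf(0.0) / posterior.pdf(0.0)


def bf10_marginal_ratio(xbar: float, n: int, tau: float, sigma: float = 1.0) -> float:
    """BF10 as the ratio of marginal likelihoods of the sample mean."""
    se = sigma / sqrt(n)
    m1 = NormalDist(0.0, sqrt(se**2 + tau**2)).pdf(xbar)  # mean integrated over the prior
    m0 = NormalDist(0.0, se).pdf(xbar)                    # mean fixed at 0
    return m1 / m0


bf_sd = bf10_savage_dickey(xbar=0.5, n=25, tau=1.0)
bf_direct = bf10_marginal_ratio(xbar=0.5, n=25, tau=1.0)
print(f"Savage-Dickey: {bf_sd:.4f}, marginal-likelihood ratio: {bf_direct:.4f}")
```

Both routes return the same number, which is why the density ratio is a convenient shortcut whenever the posterior under H₁ is already available.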
BF10 vs BF01: Which Direction to Report
BF₁₀ and BF₀₁ measure the same evidence; they just flip the direction of the comparison. BF₁₀ quantifies how much more strongly the data support H₁ over H₀. BF₀₁ quantifies how much more strongly the data support H₀ over H₁. They are exact mathematical reciprocals:
BF₀₁ = 1 / BF₁₀
Convention in most research fields is to report BF₁₀ when results favor the alternative and BF₀₁ when results favor the null, always choosing the direction greater than 1 for clarity. Some journals and the JASP statistical software report both. When reading research papers, check which subscript the authors used before interpreting the magnitude.
| Statistic | Measures Evidence For | Greater Than 1 When | Less Than 1 When |
|---|---|---|---|
| BF₁₀ | H₁ over H₀ | Data supports H₁ | Data supports H₀ |
| BF₀₁ | H₀ over H₁ | Data supports H₀ | Data supports H₁ |
| log(BF₁₀) | H₁ over H₀ (log scale) | Positive values → H₁ | Negative values → H₀ |
The Log Bayes Factor Scale
Because Bayes Factors can range from near zero to very large numbers, researchers working with extreme evidence often use the natural logarithm, written log(BF). On the log scale, values above zero favor H₁, values below zero favor H₀, and the magnitude indicates strength symmetrically in both directions. A log(BF₁₀) of 2.3 corresponds to BF₁₀ = e^2.3 ≈ 10, the boundary between moderate and strong evidence for the alternative. Log Bayes Factors also add when combining independent pieces of evidence, provided each later Bayes Factor is computed with the prior updated by the earlier data (or both hypotheses are point hypotheses), which makes them useful in meta-analytic settings.
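A short sketch of the log scale follows. The three BF₁₀ values are hypothetical, and the multiplication step assumes the condition under which Bayes Factors chain exactly: each one was computed with the prior already updated by the earlier data, or both hypotheses are point hypotheses.

```python
import math

# Hypothetical BF10 values from three independent data sets.
bf10_values = [3.2, 0.8, 5.0]

# On the log scale, combining evidence is a sum instead of a product.
log_bfs = [math.log(bf) for bf in bf10_values]
combined_log_bf = sum(log_bfs)
combined_bf10 = math.exp(combined_log_bf)

for bf, log_bf in zip(bf10_values, log_bfs):
    print(f"BF10 = {bf:>4}: log(BF10) = {log_bf:+.3f}")
print(f"Combined: log(BF10) = {combined_log_bf:+.3f}, BF10 = {combined_bf10:.2f}")

# Symmetry: BF10 = 4 and BF10 = 1/4 sit at the same distance from zero.
print(f"log(4) = {math.log(4.0):+.3f}, log(1/4) = {math.log(0.25):+.3f}")
```

The second study mildly favors H₀ (negative log value), yet the combined evidence still favors H₁ because the other two outweigh it.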
The Jeffreys Scale: Interpreting Bayes Factor Strength
Harold Jeffreys proposed a graduated classification of evidence strength in Theory of Probability. The scale has been refined and relabeled by various researchers since then. Jeffreys placed his cut points at half-powers of 10 (about 3.2, 10, 31.6, and 100); the table below uses the rounded thresholds 3, 10, 30, and 100 that are now standard in most research fields, with the labels used by JASP and the University of Amsterdam's Bayesian statistics group.
Jeffreys Scale — Complete Evidence Classification
| BF₁₀ Value | Evidence Strength | Favors | Practical Interpretation |
|---|---|---|---|
| > 100 | Extreme | H₁ | Overwhelming support; effect is essentially certain |
| 30 – 100 | Very Strong | H₁ | Reliable experimental result; highly replicable |
| 10 – 30 | Strong | H₁ | Clear evidence; effect deserves publication |
| 3 – 10 | Moderate | H₁ | Noteworthy trend; replication recommended |
| 1 – 3 | Anecdotal | H₁ | Weak, barely distinguishable from chance variation |
| 1 | No Evidence | Neither | Data equally consistent with both hypotheses |
| 1/3 – 1 | Anecdotal | H₀ | Slight trend toward null; not meaningful alone |
| 1/10 – 1/3 | Moderate | H₀ | Data favor null; report BF₀₁ = 3–10 |
| 1/30 – 1/10 | Strong | H₀ | Meaningful null confirmation; effect likely absent |
| 1/100 – 1/30 | Very Strong | H₀ | Robust evidence of null; strong replication candidate |
| < 1/100 | Extreme | H₀ | Decisive null result; effect almost certainly absent |
The Jeffreys Scale categories are interpretive conventions, not statistical laws. A BF₁₀ of 2.9 and a BF₁₀ of 3.1 are not meaningfully different, even though one falls in "anecdotal" and the other in "moderate." Always report the exact numerical value alongside the category label, and interpret in context.
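The table can be expressed as a small lookup function. This is a sketch, not a standard API: the function name is made up, and the handling of values that fall exactly on a boundary (3, 10, 30) is an assumption, since the table's ranges share their endpoints.

```python
def classify_bf10(bf10: float) -> tuple[str, str]:
    """Return (evidence label, favored hypothesis) for a BF10 value."""
    if bf10 <= 0:
        raise ValueError("A Bayes Factor must be positive")
    if bf10 == 1:
        return ("No Evidence", "Neither")

    favors = "H1" if bf10 > 1 else "H0"
    # Work with the direction greater than 1 so one set of thresholds serves both sides.
    strength = bf10 if bf10 > 1 else 1.0 / bf10

    if strength > 100:
        label = "Extreme"
    elif strength >= 30:     # assumption: boundary values go to the higher category
        label = "Very Strong"
    elif strength >= 10:
        label = "Strong"
    elif strength >= 3:
        label = "Moderate"
    else:
        label = "Anecdotal"
    return (label, favors)


for value in (18.5, 1 / 22, 1.8, 2.9, 3.1):
    label, favors = classify_bf10(value)
    print(f"BF10 = {value:.3f}: {label}, favors {favors}")
```

The last two inputs land in different categories despite being nearly identical, which is the reason to report the number and not only the label.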
How to Calculate a Bayes Factor: 6 Steps
- Step 1: Define H₀ and H₁.
- Step 2: Specify prior distributions on parameters.
- Step 3: Collect data.
- Step 4: Compute marginal likelihoods.
- Step 5: Take the ratio BF₁₀ = P(D | H₁) / P(D | H₀).
- Step 6: Interpret using the Jeffreys Scale.
Define the Competing Hypotheses
Write H₀ as a specific constraint on your parameters — typically that an effect is zero or that two groups are equal: H₀: μ₁ = μ₂ or H₀: δ = 0. Write H₁ as a prior distribution over possible effect sizes: for a t-test, H₁: δ ~ Cauchy(0, 0.707) is the JASP default. The key difference from frequentist testing is that H₁ must be specified precisely, not just as "some difference exists."
Specify Prior Distributions
The prior distribution under H₁ is the most consequential design choice in a Bayes Factor analysis. A Cauchy distribution centered at zero with scale r = 0.707 is the standard default for effect sizes in many psychological tests, chosen because it is scale-invariant and gives reasonable weight to a wide range of effects. Narrow priors reward more precise predictions; wide priors are more conservative. Always report the prior you used.
Collect Your Data
Unlike frequentist testing, Bayesian inference allows you to update the Bayes Factor continuously as data accumulate. Sequential updating is mathematically valid here — each new data point updates the posterior, which then becomes the prior for the next observation. This property makes Bayes Factors valuable in adaptive research designs where sample size is not fixed in advance.
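A sequential design can be sketched as a running Bayes Factor that is recomputed after every observation. The example below assumes a simple binary outcome with H₀: θ = 0.5 and a flat Beta(1, 1) prior under H₁; the stopping thresholds of 10 and 1/10, the function name, and the simulated data are all illustrative assumptions.

```python
import math
import random


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def running_bf10(outcomes, upper: float = 10.0, lower: float = 0.1):
    """Update BF10 after each 0/1 outcome; stop once a threshold is crossed.

    Returns (number of observations used, BF10 at that point).
    """
    successes = 0
    bf10 = 1.0
    for n, outcome in enumerate(outcomes, start=1):
        successes += int(outcome)
        log_m1 = log_beta(successes + 1, n - successes + 1)  # flat prior integrated out
        log_m0 = n * math.log(0.5)                           # theta fixed at 0.5
        bf10 = math.exp(log_m1 - log_m0)
        if bf10 >= upper or bf10 <= lower:
            return n, bf10
    return len(outcomes), bf10


# Simulated stream with a hypothetical true success rate of 0.7.
random.seed(1)
stream = [1 if random.random() < 0.7 else 0 for _ in range(500)]
n_used, final_bf = running_bf10(stream)
print(f"Stopped after {n_used} observations with BF10 = {final_bf:.2f}")
```

The Bayes Factor at each step is computed from all data seen so far, so monitoring it does not require deciding the sample size in advance.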
Compute the Marginal Likelihoods
For simple models like the Bayesian t-test or correlation test, closed-form solutions exist and software like R's BayesFactor package or JASP computes them directly. For complex models, numerical integration via Monte Carlo methods or bridge sampling is required. The marginal likelihood under H₀ for a one-sample t-test is simply the likelihood of the data under the fixed null value; the marginal likelihood under H₁ integrates the likelihood over the Cauchy prior.
Compute the Ratio
Divide P(D | H₁) by P(D | H₀). The result is BF₁₀. If it is greater than 1, the data favor H₁ by that factor. If it is less than 1, the data favor H₀ — in which case report BF₀₁ = 1/BF₁₀ for clarity. Both values carry the same information; convention favors reporting the direction greater than 1 so readers can read the strength directly.
Interpret and Report
Map your BF value to the Jeffreys Scale, report the exact number, the prior specification, and the direction of evidence. Example: "A Bayesian independent-samples t-test with a Cauchy prior (r = 0.707) yielded BF₁₀ = 18.5, indicating strong evidence for the alternative hypothesis." Never reduce the result to a binary decision the way p-values are often misused — the Bayes Factor is a continuous measure of evidence.
Bayes Factor Worked Examples
The two examples below follow the 6-step process. Calculations use the standard Cauchy prior with scale r = 0.707, matching the default in JASP and R's BayesFactor package. Both examples show the full reasoning chain from raw data to an interpretable conclusion.
Example 1 — Bayesian Independent-Samples T-Test (Psychology)
Problem: A psychologist tests whether a memory training intervention improves spatial recall. Control group (n = 40) scores a mean of 52.3 (SD = 8.1). Intervention group (n = 42) scores a mean of 58.7 (SD = 7.9). Prior: Cauchy(0, 0.707) on standardized effect size.
s_pooled = pooled standard deviation
df = n₁ + n₂ − 2 = 80
Hypotheses: H₀: δ = 0 (no effect) | H₁: δ ~ Cauchy(0, 0.707) — the effect size follows a Cauchy distribution centered at zero
Prior: Cauchy with location 0 and scale r = 0.707. This prior assigns 50% probability to |δ| > 0.707, reflecting uncertainty about whether the effect is small or large before seeing the data.
Compute Cohen's d:
s_pooled = √[((39 × 8.1²) + (41 × 7.9²)) / 80] = √[(2558.8 + 2558.8) / 80] = √63.97 ≈ 8.0
d = (58.7 − 52.3) / 8.0 = 6.4 / 8.0 = 0.80
Marginal likelihoods: The observed effect corresponds to t = d / √(1/n₁ + 1/n₂) = 0.80 / 0.221 ≈ 3.62 with df = 80. Evaluating the Rouder et al. (2009) integral for the Bayesian independent t-test with n₁ = 40, n₂ = 42, and r = 0.707 gives BF₁₀ ≈ 53.
Interpretation: BF₁₀ ≈ 53 falls in the 30–100 range on the Jeffreys Scale, meaning very strong evidence for the alternative hypothesis. The observed data are about 53 times more probable under H₁ than under H₀.
Reporting: "A Bayesian independent-samples t-test with a Cauchy prior (r = 0.707) yielded BF₁₀ ≈ 53, providing very strong evidence that memory training improves spatial recall (d = 0.80)."
✅ Conclusion: The data are about 53 times more likely under the alternative hypothesis than under the null. This constitutes very strong evidence (Jeffreys Scale) that the intervention improved spatial memory. Replication with a pre-registered design is still recommended before drawing causal conclusions.
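The Bayes Factor for Example 1 can be recomputed from the summary statistics alone. The sketch below numerically integrates the JZS (Cauchy-prior) expression from Rouder et al. (2009), written with an inverse-gamma prior on a variance parameter g, which is equivalent to a Cauchy(0, r) prior on δ. The function name is made up for this example; treat it as a cross-check on software output, not as a replacement for the BayesFactor package or JASP.

```python
import numpy as np
from scipy.integrate import quad


def jzs_bf10_two_sample(t: float, n1: int, n2: int, r: float = 0.707) -> float:
    """JZS Bayes Factor (BF10) for an independent-samples t-test, from t and group sizes."""
    n_eff = n1 * n2 / (n1 + n2)   # effective sample size for two groups
    df = n1 + n2 - 2

    def integrand(g: float) -> float:
        # Inverse-gamma(1/2, r^2/2) density on g, in log form to avoid overflow near 0.
        log_prior = np.log(r) - 0.5 * np.log(2.0 * np.pi) - 1.5 * np.log(g) - r**2 / (2.0 * g)
        scale = 1.0 + n_eff * g
        likelihood = scale ** -0.5 * (1.0 + t**2 / (scale * df)) ** (-(df + 1) / 2.0)
        return likelihood * np.exp(log_prior)

    numerator, _abs_error = quad(integrand, 0.0, np.inf)     # marginal under H1
    denominator = (1.0 + t**2 / df) ** (-(df + 1) / 2.0)      # same kernel under H0
    return numerator / denominator


# Summary statistics from Example 1.
n1, n2 = 40, 42
mean1, sd1 = 52.3, 8.1   # control
mean2, sd2 = 58.7, 7.9   # intervention

s_pooled = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
d = (mean2 - mean1) / s_pooled
t = d / np.sqrt(1.0 / n1 + 1.0 / n2)
bf10 = jzs_bf10_two_sample(t, n1, n2)
print(f"s_pooled = {s_pooled:.2f}, d = {d:.2f}, t = {t:.2f}, BF10 = {bf10:.1f}")
```

Changing `r` in the call shows how sensitive the result is to the prior scale, which is the robustness check recommended when reporting.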
Example 2 — Bayesian A/B Test Confirming a Null Effect (Digital Marketing)
Problem: An e-commerce team runs an A/B test on a checkout redesign. After 10,000 visitors per variant, conversion rates are 4.82% (A) and 4.89% (B) — a difference of 0.07 percentage points. The team wants to know whether to implement variant B or stop the test.
Hypotheses: H₀: p_A = p_B (no conversion rate difference) | H₁: p_A ≠ p_B (there is a difference)
Prior: Beta distribution priors on the two conversion rates, playing the same role here as the Cauchy prior (r = 0.707) on a standardized difference in the t-test example. Under H₁, a wide range of conversion lifts are plausible before seeing the data.
Observed effect: Δp = 0.07 pp out of a base rate of ~4.85%. Standardized, this is an extremely small effect. The large sample size (n = 10,000 per arm) gives the test ample power to detect any practically meaningful difference.
Bayes Factor: With n = 10,000 per variant and such a tiny observed difference, the marginal likelihood calculation yields BF₀₁ = 22.0 — meaning BF₁₀ = 1/22.0 ≈ 0.045. The data favor H₀ by a factor of 22.
Interpretation: BF₀₁ = 22.0 falls in the 10–30 range — strong evidence for H₀. This is the power of Bayesian analysis: the team has not merely failed to detect a difference; they have positive evidence that no meaningful difference exists.
Decision: Stop the test and do not implement variant B. Report: "BF₀₁ = 22.0, providing strong evidence that the checkout redesign has no meaningful effect on conversion rates."
✅ Conclusion: The data are 22 times more likely under the null hypothesis of equivalence than under the alternative. The team can confidently conclude the redesign offers no conversion benefit — a decision a frequentist null result alone (which only says p ≥ 0.05) could not support with the same clarity.
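A Bayes Factor for this kind of A/B comparison can be sketched with conjugate Beta priors, where both marginal likelihoods have closed forms. The version below assumes flat Beta(1, 1) priors on every conversion rate; that is an assumption made for illustration, and because the Bayes Factor depends on the prior, the number it prints should not be expected to match the BF₀₁ quoted in the example. The counts 482 and 489 are the example's rates applied to 10,000 visitors per variant.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def ab_test_bf01(k_a: int, n_a: int, k_b: int, n_b: int,
                 a: float = 1.0, b: float = 1.0) -> float:
    """BF01 for H0: one shared rate vs H1: two independent rates, all with Beta(a, b) priors."""
    # H0: both variants share a single rate p ~ Beta(a, b).
    log_m0 = log_beta(k_a + k_b + a, (n_a - k_a) + (n_b - k_b) + b) - log_beta(a, b)
    # H1: each variant has its own rate, independently Beta(a, b).
    log_m1 = (log_beta(k_a + a, n_a - k_a + b) - log_beta(a, b)
              + log_beta(k_b + a, n_b - k_b + b) - log_beta(a, b))
    # The binomial coefficients are identical under both hypotheses and cancel.
    return exp(log_m0 - log_m1)


bf01 = ab_test_bf01(k_a=482, n_a=10_000, k_b=489, n_b=10_000)
print(f"BF01 = {bf01:.1f}, BF10 = {1 / bf01:.4f}")
```

Flat priors make H₁ very diffuse, so they favor H₀ more strongly than an informed prior on plausible lifts would; reporting the prior alongside the result is what makes the two comparable.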
Bayes Factor vs P-Value: A Direct Comparison
The p-value and the Bayes Factor answer fundamentally different questions. A p-value answers: "If H₀ is true, how likely is it that I would see data at least as extreme as mine?" A Bayes Factor answers: "How many times more probable are the data I actually observed under H₁ than under H₀?" These are not the same thing, and conflating them causes real errors in scientific reasoning.
| Dimension | Bayes Factor | P-Value (NHST) | Likelihood Ratio |
|---|---|---|---|
| Question answered | How much more does the data support H₁ vs H₀? | How surprising is the data if H₀ is true? | What is the best-fit evidence ratio? |
| Support for H₀ | Yes — explicitly, via BF₀₁ > 1 | No — can only fail to reject H₀ | No |
| Prior information | Required under H₁ | Not used | Not used |
| Sensitivity to sample size | Lower — evidence can favor either H₀ or H₁ as n increases | High — even trivial effects can become statistically significant with very large n | Medium |
| Binary cutoff | No — continuous evidence scale | Yes — α = 0.05 is treated as a threshold | No |
| Sequential testing | Valid — update continuously with new data | Inflates Type I error — requires pre-planned n | Requires correction |
| Typical reporting | BF₁₀ = 18.5 (strong evidence for H₁) | p = 0.023 (significant at α = 0.05) | LR = 12.4 at θ̂ |
One practical consequence is that, with a very large sample, even a trivially small effect can yield p < 0.05. For example, a drug that lowers systolic blood pressure by only 0.3 mmHg may be declared statistically significant if the sample size is sufficiently large. A Bayes factor can behave differently: if the observed effect is much smaller than the effect sizes predicted under H₁, the evidence may favor H₀, yielding BF₁₀ < 1. In this way, Bayesian inference evaluates not only whether an effect differs from zero but also whether the observed magnitude is consistent with the predictions of the competing hypotheses.
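The divergence between the two measures at large n can be shown with a closed-form case. The sketch below holds the test statistic fixed at z = 2 (two-sided p ≈ 0.046) and varies the sample size. To keep the algebra exact it swaps the Cauchy prior for a Normal(0, τ²) prior on the standardized effect with τ = 0.5; both the prior and the sample sizes are assumptions chosen for illustration.

```python
import math


def bf10_normal_prior(z: float, n: int, tau: float = 0.5) -> float:
    """BF10 for a z-test with known variance and a Normal(0, tau^2) prior on the effect."""
    inflation = 1.0 + n * tau**2            # how much wider H1's predictions are than H0's
    shrink = n * tau**2 / inflation
    return math.exp(0.5 * z**2 * shrink) / math.sqrt(inflation)


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


z = 2.0
p_value = two_sided_p(z)
bf_by_n = {n: bf10_normal_prior(z, n) for n in (50, 1_000, 100_000)}
for n, bf10 in bf_by_n.items():
    print(f"n = {n:>7}: p = {p_value:.4f}, BF10 = {bf10:.3f}")
```

The p-value is identical in every row, yet the Bayes Factor moves from mild support for H₁ at n = 50 to clear support for H₀ at n = 100,000, because a fixed z at a huge n implies an effect far smaller than H₁ predicted.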
Use p-values when following a pre-registered confirmatory protocol with a fixed sample size and you need to control long-run Type I error rates. Use Bayes Factors when you want to quantify evidence strength continuously, explicitly confirm null effects, or run adaptive designs. For more on the frequentist approach, see the p-values guide and hypothesis testing overview on Statistics Fundamentals.
Computing Bayes Factors in R and Python
R: The BayesFactor Package
The BayesFactor package by Morey and Rouder is the standard tool for Bayesian t-tests, ANOVA, and regression in R. Install it once with install.packages("BayesFactor").
    library(BayesFactor)

    # Example data: control and intervention group scores
    control <- c(48, 52, 54, 50, 55, 49, 53, 51)
    intervention <- c(57, 61, 59, 63, 58, 60, 62, 64)

    # Run Bayesian independent-samples t-test
    # rscale = 0.707 is the default "medium" Cauchy prior
    bf_result <- ttestBF(x = control, y = intervention, rscale = 0.707)

    # Print the result: BF10 against the point null, with its estimation error
    print(bf_result)

    # Extract the numeric BF10 value
    bf_value <- extractBF(bf_result)$bf
    cat("BF10 =", bf_value, "\nBF01 =", 1/bf_value, "\n")

    # One-sample test: is the mean different from 50?
    bf_one_sample <- ttestBF(x = control, mu = 50, rscale = 0.707)
    print(bf_one_sample)
    # Bayesian Pearson correlation test
    library(BayesFactor)
    set.seed(42)  # make the simulated data reproducible
    x <- rnorm(50, mean = 5, sd = 2)
    y <- x * 0.6 + rnorm(50, mean = 0, sd = 1.5)
    bf_corr <- correlationBF(y = y, x = x)
    print(bf_corr)
    # Interprets evidence for non-zero correlation vs rho = 0
Python: Pingouin Library
The pingouin library provides Bayesian t-tests directly. Install with pip install pingouin.
    import numpy as np
    import pingouin as pg

    # Generate example data
    np.random.seed(42)
    control = np.random.normal(52, 8, 40)
    intervention = np.random.normal(58, 8, 42)

    # Independent-samples t-test (correction=True applies Welch's correction)
    # The output table includes BF10 alongside the frequentist statistics
    result = pg.ttest(intervention, control, correction=True)
    print(result[["BF10", "dof", "p-val", "cohen-d"]])

    # One-sample t-test against a reference mean of 50
    result_one = pg.ttest(control, 50)
    print(result_one[["BF10", "T", "p-val"]])

    # Convert BF10 to BF01; float() guards against pingouin versions
    # that return BF10 as a formatted string
    bf10 = float(result["BF10"].iloc[0])
    bf01 = 1 / bf10
    print(f"BF10 = {bf10:.2f}, BF01 = {bf01:.4f}")
If you prefer a graphical interface over code, JASP (jasp-stats.org) is a free, open-source statistics program developed at the University of Amsterdam. It computes Bayes Factors for t-tests, ANOVA, regression, and correlations with one click, using the same Rouder priors as the R BayesFactor package. Results include the full posterior distribution and robustness checks across prior specifications.
Real-World Applications of the Bayes Factor
Psychology & Replication
Bayes Factors have become the preferred tool for replication studies in psychology. They can show not just that a replication failed, but that the null is actively supported — a distinction classical tests cannot make.
Clinical Trials
In adaptive trial designs, Bayes Factors allow researchers to update evidence continuously without inflating error rates — making them useful for interim analyses where early stopping decisions must be made.
A/B Testing
E-commerce and product teams use Bayesian methods to stop tests early when evidence is decisive (BF > 10) or when there is strong evidence of equivalence (BF₀₁ > 10) — saving engineering resources and revenue.
Genetics & Biomarkers
Genome-wide association studies use Bayes Factors to evaluate the evidence for genetic association at each locus, naturally balancing the prior probability of association against observed effect sizes.
Machine Learning
Bayesian model comparison uses Bayes Factors to evaluate whether a more complex model is justified by the data. This prevents overfitting by requiring that added parameters earn their complexity through improved fit.
Economics & Finance
Macroeconomists use Bayes Factors to compare structural models. Financial analysts apply them to test whether new risk factors add predictive value beyond existing ones in factor models.
How to Report a Bayes Factor in Research
Academic reporting of Bayes Factors requires four elements: the specific statistic, its numerical value, the prior specification, and the qualitative interpretation. The examples below follow APA 7th edition style as adapted for Bayesian reporting per Keysers, Gazzola, and Wagenmakers (2020).
Reporting Templates
For evidence in favor of H₁:
"A Bayesian independent-samples t-test using a Cauchy prior on effect size (r = 0.707) showed that [H₁ description]: BF₁₀ = 18.5, which according to the Jeffreys scale constitutes strong evidence for the alternative hypothesis."
For evidence in favor of H₀:
"A Bayesian analysis with a Cauchy prior (r = 0.707) revealed that the data provided strong evidence against a treatment effect: BF₀₁ = 22.0. We conclude that the intervention does not meaningfully alter [outcome variable]."
For inconclusive results:
"The Bayesian t-test yielded BF₁₀ = 1.8, indicating anecdotal and inconclusive evidence. Neither hypothesis is well supported; further data collection is required."
Bayes Factors depend on the prior distribution chosen for H₁. Two researchers using different priors will get different Bayes Factors from identical data. This is not a flaw — it makes the influence of prior knowledge explicit — but it requires transparent reporting. State the prior family, its parameters, and the rationale for choosing it. If reporting in a journal that mandates this, the JASP manual and the Psychonomic Bulletin & Review style guide by Wagenmakers et al. (2018) provide detailed guidance.
Bayes Factor Cheat Sheet
| Term / Symbol | Definition | Formula / Note |
|---|---|---|
| BF₁₀ | Evidence for H₁ over H₀ | P(D \| H₁) / P(D \| H₀) |
| BF₀₁ | Evidence for H₀ over H₁ | 1 / BF₁₀ |
| Marginal likelihood | Probability of data under a hypothesis, integrated over all parameters | ∫ P(D \| θ, H) · P(θ \| H) dθ |
| Prior odds | Your belief ratio before seeing data | P(H₁) / P(H₀) |
| Posterior odds | Your belief ratio after seeing data | BF₁₀ × Prior Odds |
| Cauchy prior r = 0.707 | Default JASP/BayesFactor prior on standardized effect size | Gives 50% probability to \|δ\| > 0.707 |
| log BF | Natural log of the Bayes Factor; symmetric evidence scale | log(BF₁₀) > 0 favors H₁ |
| Savage-Dickey ratio | Shortcut for nested models | P(θ₀ \| H₁) / P(θ₀ \| D, H₁) |
| Jeffreys threshold | Minimum BF for "noteworthy" evidence | BF₁₀ ≥ 3 |
| Strong evidence | Jeffreys Scale: strong | BF₁₀ = 10–30 |
Related Statistical Concepts
The Bayes Factor connects to a broader set of topics in Bayesian and frequentist statistics. The links below cover the foundational concepts that underpin Bayes Factor analysis, all from Statistics Fundamentals:
Bayes' Theorem
The mathematical foundation that links prior probability, likelihood, and posterior probability. The Bayes Factor is the likelihood ratio component in the model-comparison form of this theorem.
Hypothesis Testing
The frequentist framework for making decisions with statistical data. Understanding null hypothesis significance testing helps clarify exactly where and why Bayes Factors offer a different approach.
P-Values
The frequentist measure of evidence most commonly compared to the Bayes Factor. Reading both guides together reveals the different inferential goals each tool serves.
Conditional Probability
The marginal likelihood P(D|H) in the Bayes Factor formula is a conditional probability. This guide covers the foundational rules needed to understand how likelihoods are constructed and evaluated.
Effect Size
Cohen's d and other standardized effect sizes are the parameters the Bayes Factor integrates over when computing the marginal likelihood under H₁. Understanding effect sizes is essential for prior selection.
Pearson Correlation
The Bayesian correlation test computes a Bayes Factor for whether ρ = 0. This guide covers the frequentist version, which makes a useful comparison point for understanding what the Bayesian test adds.