What is the Bayes' Theorem formula?

The core formula is P(A|B) = [P(B|A) × P(A)] / P(B). The denominator P(B) is often expanded using the Law of Total Probability as P(B) = P(B|A)·P(A) + P(B|Aᶜ)·P(Aᶜ), where Aᶜ is the complement of A.

What is posterior probability in Bayes' Theorem?

The posterior probability, written as P(A|B), is the updated probability of hypothesis A after incorporating the evidence B. It is the output of Bayes' Theorem. Before seeing the evidence, we have the prior P(A); after seeing it, the posterior P(A|B) is our revised belief.

What is the base rate fallacy?

The base rate fallacy occurs when people ignore the prior probability (base rate) of an event and focus only on the likelihood. A classic example: a medical test with 99% sensitivity and 99% specificity still yields mostly false positives if the disease is very rare (e.g., 0.1% prevalence). Bayes' Theorem corrects for this by weighting the likelihood by the prior and normalizing by the total probability of the evidence.

How is Bayes' Theorem used in machine learning?

Bayes' Theorem is the foundation of the Naive Bayes classifier, which assigns each data point to the class that maximizes the posterior probability. It is also the basis of Bayesian inference in probabilistic graphical models, Bayesian neural networks, and MCMC sampling methods used throughout statistical machine learning.

What is the difference between Bayesian and frequentist statistics?

Frequentist statistics treats probability as the long-run frequency of events in repeated experiments; parameters are fixed unknowns and data is random. Bayesian statistics treats probability as a degree of belief; parameters have their own probability distributions that are updated with data via Bayes' Theorem. In frequentist inference you compute p-values and confidence intervals; in Bayesian inference you compute posterior distributions and credible intervals.

Who invented Bayes' Theorem?

The theorem is named after the Reverend Thomas Bayes (1702–1761), whose unpublished essay was edited and presented to the Royal Society of London by Richard Price in 1763, two years after Bayes' death. Pierre-Simon Laplace independently developed the same result in a more general form in 1812.

What is prior probability?

Prior probability, written as P(A), is the probability assigned to a hypothesis before observing any new evidence. It encodes existing knowledge or a base rate — for example, the known prevalence of a disease in the general population before a diagnostic test is administered.

What is the Law of Total Probability?

The Law of Total Probability states that for any event B and a partition of the sample space into the mutually exclusive and exhaustive events A and Aᶜ: P(B) = P(B|A)·P(A) + P(B|Aᶜ)·P(Aᶜ). This formula gives the denominator in Bayes' Theorem when P(B) is not directly known.

Why does a positive test not always mean you have a disease?

Because the posterior probability P(Disease | Positive Test) depends not only on the test's sensitivity (true positive rate) but also on the disease's base rate. When a disease is rare, the sheer number of healthy people tested means even a small false positive rate generates more false positives than true positives, so most positive results come from healthy people — this is the false positive paradox, corrected by Bayes' Theorem.

Bayes' Theorem: Formula, Proof & Real-World Examples (2026)

Q: What is Bayes' Theorem?

Bayes' Theorem is a formula in probability theory that calculates the probability of a hypothesis given observed evidence, by combining a prior belief about the hypothesis with the likelihood of the evidence under that hypothesis. Written formally: P(A|B) = [P(B|A) × P(A)] / P(B), where P(A|B) is the posterior probability, P(B|A) is the likelihood, P(A) is the prior probability, and P(B) is the marginal likelihood (the total probability of the evidence).

What Is Bayes' Theorem? (Definition + Formula)

Definition — Bayes' Theorem (Bayes' Rule)

Bayes' Theorem gives the conditional probability of a hypothesis A being true given that evidence B has been observed. It works by multiplying the prior probability of the hypothesis by the likelihood of the evidence under that hypothesis, then dividing by the total probability of the evidence. The result, called the posterior probability, is the rational update of your belief after accounting for the data.

P(A|B) = [ P(B|A) × P(A) ] / P(B)

Think of it as the mathematics of being a good detective. Before you examine a crime scene you have a prior belief about each suspect based on motive and opportunity. Each new clue — a fingerprint, a receipt, a timeline — updates that belief. Bayes' Theorem is the exact arithmetic procedure for doing those updates correctly. Starting with a prior P(A), observing evidence B, and computing the posterior P(A|B) is the formal version of how rational minds revise beliefs.

The theorem connects three quantities you can often measure directly — the prior, the likelihood, and the false positive rate — to produce a fourth quantity, the posterior, that is genuinely difficult to estimate by intuition alone. This gap between what intuition guesses and what the math produces is the source of the base rate fallacy, covered in detail below.

⚡ Quick Reference — Bayes' Theorem Key Facts

Core formula: P(A|B) = [P(B|A) × P(A)] / P(B)
Extended denominator: P(B) = P(B|A)·P(A) + P(B|Aᶜ)·P(Aᶜ)
Posterior P(A|B): updated probability of hypothesis A after observing evidence B
Prior P(A): probability of hypothesis A before observing evidence (the base rate)
Likelihood P(B|A): probability of observing evidence B if hypothesis A is true
Marginal likelihood P(B): total probability of evidence B across all hypotheses
Named after: Reverend Thomas Bayes (1702–1761); formalized by Laplace (1812)

The Complete Formula Breakdown

The formula P(A|B) = [P(B|A) × P(A)] / P(B) has four named components, and reading each one clearly is the first skill to develop. The notation P(A|B) is read "the probability of A given B" — the vertical bar means "conditional on" or "given that."

Bayes' Theorem — Core Form

P(A|B) = [ P(B|A) × P(A) ] / P(B)

Extended Form — Law of Total Probability Denominator

P(A|B) = P(B|A)·P(A) / [ P(B|A)·P(A) + P(B|Aᶜ)·P(Aᶜ) ]

P(A|B) — Posterior probability

P(B|A) — Likelihood

P(A) — Prior probability

P(B) — Marginal likelihood (Evidence)

Aᶜ — Complement of A (A is false)
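
The core form follows in a few lines from the definition of conditional probability. A short derivation, assuming P(A) > 0 and P(B) > 0 so that both conditional probabilities are defined:

```latex
\begin{align*}
P(A \mid B) &= \frac{P(A \cap B)}{P(B)}, \qquad
P(B \mid A) = \frac{P(A \cap B)}{P(A)} \\
\Rightarrow\quad P(A \cap B) &= P(B \mid A)\, P(A) \\
\Rightarrow\quad P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)}
\end{align*}
```

Replacing P(B) with P(B|A)·P(A) + P(B|Aᶜ)·P(Aᶜ) then gives the extended form.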

The Four Named Components

Output

Posterior Probability

P(A|B)

The probability that hypothesis A is true after observing evidence B. This is what you want to know. In a medical context: "given a positive test result, what is the probability the patient actually has the disease?"

Input 1

Prior Probability

P(A)

The unconditional probability of hypothesis A before any evidence is considered. Often the base rate — for example, the prevalence of a disease in the population. The prior is where existing knowledge enters the calculation.

Input 2

Likelihood

P(B|A)

The probability that the evidence B would be observed if the hypothesis A were true. In testing, this is called sensitivity or the true positive rate. It answers: "How often does this test fire when the condition exists?"

Normalizer

Marginal Likelihood

P(B)

The total probability of observing evidence B under all competing hypotheses. It acts as a normalizing constant — it ensures the posterior is a valid probability between 0 and 1. Most often computed via the Law of Total Probability.

The Law of Total Probability (Expanding the Denominator)

The denominator P(B) is often the part students find hardest to compute directly. When you do not know P(B) but you do know how B behaves under each possible state of the world, you use the Law of Total Probability. For two mutually exclusive and exhaustive cases — A is true, or A is false (Aᶜ) — the law gives:

Law of Total Probability

P(B) = P(B|A)·P(A) + P(B|Aᶜ)·P(Aᶜ)

P(B|A) — Probability of evidence if hypothesis is true

P(B|Aᶜ) — Probability of evidence if hypothesis is false (false positive rate)

P(Aᶜ) = 1 − P(A) — Complement of the prior

The first term P(B|A)·P(A) accounts for true positives — the evidence appearing because the hypothesis is correct. The second term P(B|Aᶜ)·P(Aᶜ) accounts for false positives — the evidence appearing even when the hypothesis is false. Summing both gives the total probability of observing the evidence across the entire population. For deeper grounding in conditional probability basics, see our foundational guide.
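
The two-case law extends to any set of mutually exclusive and exhaustive hypotheses A₁…Aₙ, with P(B) equal to the sum of P(B|Aᵢ)·P(Aᵢ) over all i. The Python sketch below illustrates that general case. The function name, the three suspects, and all of the numbers are hypothetical and chosen only for illustration.

```python
def posterior_over_hypotheses(priors, likelihoods):
    """Return P(A_i | B) for each hypothesis A_i.

    priors      -- dict mapping hypothesis name to P(A_i); must sum to 1
    likelihoods -- dict mapping hypothesis name to P(B | A_i)
    """
    if abs(sum(priors.values()) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1 (hypotheses must be exhaustive)")
    # Joint probability of each path: P(B | A_i) * P(A_i)
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    # Law of Total Probability: P(B) is the sum over all paths
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("evidence has zero probability under every hypothesis")
    return {h: joint[h] / evidence for h in joint}


# Hypothetical numbers, for illustration only
priors = {"suspect_1": 0.5, "suspect_2": 0.3, "suspect_3": 0.2}
likelihoods = {"suspect_1": 0.1, "suspect_2": 0.6, "suspect_3": 0.3}
posterior = posterior_over_hypotheses(priors, likelihoods)
for name, p in posterior.items():
    print(f"P({name} | clue) = {p:.3f}")
```

Dividing every joint term by the same sum is what makes the posteriors add up to 1, which is the normalizing role of P(B).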

A Brief History of Bayes' Rule

The Reverend Thomas Bayes (1702–1761) was an English minister and mathematician who never published his result during his lifetime. His paper, An Essay towards Solving a Problem in the Doctrine of Chances, was communicated to the Royal Society of London in 1763 by his friend Richard Price, two years after Bayes' death. The essay addressed a specific inverse probability question: given a known number of successes in repeated trials, what can be inferred about the underlying probability of success?

Pierre-Simon Laplace independently derived the same result in a substantially more general form in his 1812 work Théorie analytique des probabilités. For over a century the rule was mostly a mathematical curiosity. Its practical adoption accelerated during World War II, when Alan Turing's team at Bletchley Park used Bayesian-style reasoning to break the German Enigma cipher — one of the earliest large-scale applications of sequential belief updating. The rise of computational statistics in the 1980s and 1990s made Bayesian methods tractable for complex models, and they now appear throughout machine learning, clinical trial design, and natural language processing.

Historical Citation

The Original Source

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370–418. Price, R. (ed.). The essay is freely accessible via the Royal Society's archives and JSTOR at jstor.org/stable/105741.

How to Apply Bayes' Theorem: 4 Steps

Every Bayesian calculation follows the same four-step procedure regardless of the domain. The steps below use a generic hypothesis A and evidence B as placeholders — the worked examples in the next section substitute real values.

Method

Four-Step Bayesian Update

State the prior P(A). Identify the base rate — the probability of the hypothesis before observing evidence. This comes from population data, historical records, or domain knowledge. For a disease, it is prevalence.

Identify the likelihood P(B|A). Determine how probable the observed evidence is assuming the hypothesis is true. For a diagnostic test, this is sensitivity (the true positive rate). For a classifier, it is the detection rate.

Compute the marginal likelihood P(B). Apply the Law of Total Probability: P(B) = P(B|A)·P(A) + P(B|Aᶜ)·(1−P(A)). You need the false positive rate P(B|Aᶜ) — how often the evidence appears when the hypothesis is false.

Divide to get the posterior P(A|B). P(A|B) = [P(B|A) × P(A)] / P(B). This result is your updated belief in the hypothesis after accounting for all the evidence.
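
The four steps can be sketched as a single Python function. This is a minimal illustration, not a reference implementation: the function and parameter names are assumptions, and the inputs in the two calls are the figures used in the medical screening and spam filter examples in this guide.

```python
def bayes_posterior(prior, sensitivity, false_positive_rate):
    """Posterior P(A|B) from the prior P(A), P(B|A) and P(B|A^c)."""
    for name, value in (("prior", prior),
                        ("sensitivity", sensitivity),
                        ("false_positive_rate", false_positive_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    # Step 3: marginal likelihood via the Law of Total Probability
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("P(B) is zero; the posterior is undefined")
    # Step 4: divide the true-positive path by all paths that produce B
    return sensitivity * prior / evidence


print(round(bayes_posterior(0.01, 0.95, 0.05), 3))  # medical screening inputs
print(round(bayes_posterior(0.40, 0.60, 0.10), 3))  # spam filter inputs
```

Steps 1 and 2 are the arguments themselves: the caller supplies the prior and the likelihood, and the function handles only the arithmetic of steps 3 and 4.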

Three Fully Worked Examples

Example 1 — Medical Screening (The False Positive Paradox)

This is the canonical Bayesian example because the result surprises virtually everyone the first time. A disease affects 1% of the population. A test for it has 95% sensitivity (it correctly detects 95% of true cases) and a 5% false positive rate (it incorrectly flags 5% of healthy people). A patient tests positive. What is the probability she actually has the disease?

Disease prevalence P(D): 1%
Sensitivity P(+|D): 95%
False positive rate P(+|¬D): 5%
Posterior P(D|+): ≈ 16.1%

Worked Example — Medical Test

What is P(Disease | Positive Test)?

Prior: P(Disease) = 0.01. This is the base rate — 1% of the population has the disease.

Likelihood: P(Positive | Disease) = 0.95. The test detects 95% of true cases.

Marginal likelihood:
P(Positive) = P(+|Disease)·P(Disease) + P(+|Healthy)·P(Healthy)
= (0.95 × 0.01) + (0.05 × 0.99)
= 0.0095 + 0.0495
= 0.059

Posterior:
P(Disease | Positive) = (0.95 × 0.01) / 0.059 = 0.0095 / 0.059 ≈ 0.161

✓ P(Disease | Positive Test) ≈ 16.1% — not 95%. Despite a 95%-sensitive test, a positive result means there is roughly a 1-in-6 chance the patient is ill. The other ~84% of positive tests are false positives coming from the large healthy majority of the population.

⚠️

Base Rate Fallacy — The Classic Trap

Most people, including physicians, answer "~95%" to this question because they focus on the test's accuracy and mentally ignore the prior. This is the base rate fallacy. Kahneman and Tversky documented this cognitive bias extensively. Research published by Harvard Medical School faculty (Casscells, Schoenberger, and Graboys, 1978, New England Journal of Medicine) found that fewer than 20% of Harvard medical students and staff answered an equivalent problem correctly. The prior P(A) is not optional — it changes everything.

Visualizing 1,000 People Tested

Each icon represents one person. This shows why most positive tests are false positives when prevalence is low.

Sick + Positive test (True Positive ≈ 10)

Sick + Negative test (False Negative ≈ 0)

Healthy + Positive test (False Positive ≈ 50)

Healthy + Negative test (True Negative ≈ 940)

Visualization based on a population of 1,000. Disease prevalence = 1%, sensitivity = 95%, false positive rate = 5%. Counts rounded for display. Method consistent with Gigerenzer et al. (2011), Journal of Clinical Epidemiology — natural frequency visualization for medical decision making.

Example 2 — Spam Filtering and Naive Bayes

Email spam filters use a direct application of Bayes' Theorem. A Naive Bayes classifier asks: given that an email contains the word "free," what is the probability it is spam? The "naive" assumption is that each word's presence is independent of every other word given the class — an approximation that nonetheless performs well in practice and remains a standard baseline in text classification.

Suppose historical data shows: 40% of all incoming email is spam, the word "free" appears in 60% of spam emails, and "free" appears in 10% of legitimate (ham) emails.

Worked Example — Spam Filter

P(Spam | email contains "free")?

Prior: P(Spam) = 0.40. 40% of received emails are spam based on training data.

Likelihood: P("free" | Spam) = 0.60. The word "free" appears in 60% of known spam.

Marginal likelihood:
P("free") = P("free"|Spam)·P(Spam) + P("free"|Ham)·P(Ham)
= (0.60 × 0.40) + (0.10 × 0.60)
= 0.24 + 0.06
= 0.30

Posterior:
P(Spam | "free") = (0.60 × 0.40) / 0.30 = 0.24 / 0.30 = 0.80

✓ P(Spam | "free") = 80%. The word "free" alone pushes the probability of spam from the prior 40% to an updated 80%. A production filter chains this update across every word in the message — computing P(Spam | word₁, word₂, …, wordₙ) iteratively.

💡

From Bayes to Naive Bayes Classifier

In a Naive Bayes classifier used in machine learning, the posterior over class C given features x₁…xₙ is: P(C|x₁…xₙ) ∝ P(C) × ∏ P(xᵢ|C). The prior P(C) is estimated from class frequency in training data; each conditional P(xᵢ|C) is estimated from feature frequency within that class. The class with the highest posterior is the prediction. This approach forms the basis of many text classification, sentiment analysis, and document categorization systems. For more on classification methods see our logistic regression guide.

Example 3 — Legal Evidence (The Defendant's Fallacy)

Courts regularly encounter probabilistic evidence — DNA match statistics, blood type matches, fiber analysis. Bayes' Theorem determines how much such evidence should move the probability of guilt. The defendant's fallacy is the mirror-image error of the prosecutor's fallacy: the defense argues that because 1 in 1,000,000 people share a DNA profile, and there are millions of people in the country, the evidence means nothing. Both are wrong in opposite directions; the correct approach is Bayesian.

Suppose a DNA profile matching the defendant's occurs in 1 in 500,000 people. Before DNA evidence, the probability of guilt based on other evidence is estimated at 1 in 10,000 (0.01%). After a DNA match, what is the posterior probability of guilt?

Worked Example — Legal Evidence

P(Guilty | DNA Match)?

Prior: P(Guilty) = 0.0001 (1 in 10,000 based on pre-DNA evidence).

Likelihood: P(DNA Match | Guilty) = 1.0 (if guilty, the DNA always matches).

False match rate: P(DNA Match | Innocent) = 1/500,000 = 0.000002.
P(Match) = (1.0 × 0.0001) + (0.000002 × 0.9999) ≈ 0.0001 + 0.000002 = 0.000102

Posterior:
P(Guilty | Match) = (1.0 × 0.0001) / 0.000102 ≈ 0.980 (98%)

✓ A DNA match that occurs by chance in 1 per 500,000 people moves the prior probability of guilt from 0.01% to approximately 98%: a dramatic but correctly quantified update. Note how sensitive this is to the prior: if the prior were 1 in 500,000, the posterior would be about 50%, and if the suspect were chosen from a city of 5 million with no other evidence (a prior of 1 in 5,000,000), the posterior would fall to about 9%, because roughly 10 innocent residents would be expected to match by chance alongside the one guilty person. The prior is not a technicality; it is mathematically fundamental.
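
The same update can be written in odds form: posterior odds = prior odds × likelihood ratio. That form makes it easy to tabulate how the posterior responds to different priors. The sketch below reuses the DNA figures from this example; the function name and the two alternative priors are illustrative assumptions.

```python
def posterior_from_odds(prior, likelihood_ratio):
    """Update a prior probability using the odds form of Bayes' Theorem."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    # Convert odds back to a probability
    return posterior_odds / (1.0 + posterior_odds)


# Likelihood ratio for the DNA match: P(match | guilty) / P(match | innocent)
likelihood_ratio = 1.0 / (1 / 500_000)

# The example's prior of 1 in 10,000, followed by two weaker priors
for pool in (10_000, 500_000, 5_000_000):
    prior = 1 / pool
    post = posterior_from_odds(prior, likelihood_ratio)
    print(f"prior 1 in {pool:>9,}: posterior = {post:.3f}")
```

With these inputs the posterior is about 0.98 for a prior of 1 in 10,000, about 0.50 for 1 in 500,000, and about 0.09 for 1 in 5,000,000.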

⚠️

The Prosecutor's Fallacy

The prosecutor's fallacy confuses P(Evidence | Innocent) with P(Innocent | Evidence). Just because a DNA profile occurs in 1 in 500,000 people does not mean there is only a 1-in-500,000 chance the defendant is innocent — that ignores the prior. The UK Court of Appeal has explicitly ruled that presenting statistics this way to juries is improper (R v Deen, 1993; R v Adams, 1996). The correct framework is Bayesian. For background, see the Alan Turing Institute's work on statistics in legal proceedings.

Entity & Formula Glossary

The table below maps every variable in Bayes' Theorem to its technical name, notation, plain-language description, and typical source of the value in practice. Use this as a reference when reading probability textbooks, machine learning papers, or clinical study reports.

Term	Notation	Plain-language meaning	How it is typically obtained
Posterior Probability	P(A|B)	Updated probability of hypothesis A after observing evidence B. The output of Bayes' Theorem.	Computed via Bayes' formula; not directly observed.
Prior Probability	P(A)	Initial probability of hypothesis A before any evidence. Encodes existing knowledge or the base rate.	Population data, historical frequencies, literature prevalence rates.
Likelihood	P(B|A)	Probability of observing evidence B if the hypothesis A is true.	Clinical trial data, test validation studies, detection rates.
Marginal Likelihood / Evidence	P(B)	Total probability of observing evidence B across all states of the world. Normalizes the posterior.	Law of Total Probability: P(B|A)·P(A) + P(B|Aᶜ)·P(Aᶜ).
False Positive Rate	P(B|Aᶜ)	Probability of the evidence appearing when the hypothesis is false. Also called 1 − specificity.	Test validation studies; complement of specificity.
Base Rate	P(A)	The unconditional frequency of an event in the broader population. Identical to the prior in most applied problems.	Epidemiological databases, census data, historical records.
Complement Event	Aᶜ or ¬A	The event "A does not occur." Always P(Aᶜ) = 1 − P(A).	Algebraically derived from the prior.
Sensitivity (Recall)	P(B|A)	In diagnostic testing: the proportion of true cases that produce a positive test. In ML: true positive rate.	Clinical validation studies; computed as TP / (TP + FN).
Specificity	1 − P(B|Aᶜ)	The proportion of negative cases correctly identified as negative. Complement of the false positive rate.	Clinical validation; computed as TN / (TN + FP).

The Base Rate Fallacy — Why Intuition Fails

The base rate fallacy is the systematic error of updating probability based on the likelihood alone while ignoring the prior. It appears in clinical diagnosis, judicial reasoning, security screening, and everyday decision-making.

🚨

Caution — The Base Rate Fallacy in Action

A disease test has 99% sensitivity and 99% specificity. The disease prevalence is 0.1% (1 in 1,000). A positive test result means the posterior probability of disease is approximately 9% — not 99%. The 99% accuracy figure applies to the test's behavior, not to what a positive result means for an individual from a low-prevalence population. Researchers at the Harding Center for Risk Literacy (associated with the Max Planck Institute) document how natural frequency framing — imagining 1,000 people rather than percentages — dramatically reduces this reasoning error in clinical settings.
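
The 9% figure above follows directly from the extended formula. With sensitivity 0.99, false positive rate 1 − 0.99 = 0.01, and prior 0.001, the working is:

```latex
\begin{align*}
P(D \mid +) &= \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} \\
            &= \frac{0.00099}{0.00099 + 0.00999}
             = \frac{0.00099}{0.01098} \approx 0.090
\end{align*}
```

The false positive term 0.00999 is about ten times the true positive term 0.00099, which is why roughly ten of every eleven positive results come from healthy people.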

The root cause is intuitive probability matching: people tend to match their estimate of P(A|B) to the value of P(B|A), treating the two as symmetric. They are not symmetric. The relationship between them is precisely what Bayes' Theorem encodes: P(A|B) = P(B|A) × P(A) / P(B). When P(A) is small, P(A|B) will be small even if P(B|A) is large. Understanding conditional probability at a basic level before approaching Bayes reduces this confusion significantly.

Bayesian vs Frequentist Statistics

Two schools of thought have defined modern statistics, and the tension between them sits at the philosophical heart of what Bayes' Theorem means. The debate is not merely semantic — it produces different tools, different inferences, and sometimes different conclusions from the same data.

Dimension	Bayesian Statistics	Frequentist Statistics
Definition of probability	A degree of belief or certainty that can apply to any proposition, including one-time events.	The long-run relative frequency of an event in repeated identical experiments.
Parameters	Parameters are random variables with probability distributions.	Parameters are fixed but unknown constants.
Data	Data is fixed once observed; the prior is updated to a posterior.	Data is a random sample from a hypothetical infinite set of experiments.
Prior knowledge	Explicitly incorporated through the prior distribution.	Not formally incorporated; only the current data is analyzed.
Inference output	Posterior distribution over parameters; credible intervals with direct probability interpretation.	Point estimates, p-values, and confidence intervals (which do not directly state probability of the parameter).
Hypothesis testing	Posterior odds and Bayes factors compare hypotheses directly.	Null hypothesis significance testing (NHST); reject or fail to reject H₀ at a significance level α.
Computational burden	Often requires MCMC or variational inference for complex models.	Often closed-form solutions; computationally simpler for standard tests.
Use in ML	Bayesian networks, Gaussian processes, variational autoencoders, Bayesian optimization.	Maximum likelihood estimation, ordinary least squares, regularized regression.

Neither framework is universally superior. Frequentist methods are computationally efficient and well-understood when data are abundant and priors are contentious. Bayesian methods are natural when data are scarce, when prior knowledge is reliable and should not be discarded, or when you want a posterior distribution rather than a point estimate. For a more detailed treatment of frequentist hypothesis testing see our hypothesis testing guide.

Academic Reference

Gelman et al. — The Definitive Bayesian Reference

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). CRC Press. This is the standard graduate-level text used in Bayesian statistics courses at MIT, Stanford, and Columbia. The first chapter provides a rigorous treatment of the Bayesian paradigm versus frequentist alternatives. Free lecture notes are available via Gelman's BDA3 resource page.

Bayes' Theorem in Machine Learning

Bayesian reasoning runs through machine learning at multiple levels. The Naive Bayes classifier is the most explicit application, but the concept of a posterior over model parameters underlies Bayesian neural networks, Gaussian processes, and the entire subfield of probabilistic programming.

The Naive Bayes Classifier

A Naive Bayes classifier assigns a data point to the class C* that maximizes the posterior probability:

Naive Bayes — Classification Rule

C* = argmax_C [ P(C) × ∏ P(xᵢ | C) ]

P(C) — Class prior (estimated from training data)

P(xᵢ|C) — Feature likelihood per class

Naive assumption: features are conditionally independent given class

Despite its independence assumption almost always being false in practice, Naive Bayes is competitive on text classification benchmarks and is still deployed in production spam filters, news categorizers, and sentiment analyzers. Its speed and interpretability make it a useful baseline. The Stanford NLP Group's foundational text by Jurafsky and Martin (Speech and Language Processing) dedicates a full chapter to Naive Bayes as a gateway to statistical NLP — accessible at web.stanford.edu/~jurafsky/slp3/.
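
A minimal sketch of the classification rule in Python, working in log space so that multiplying many small probabilities does not underflow. The class priors and the likelihoods for "free" come from the spam example in this guide. The likelihoods for "meeting" are hypothetical placeholders, and skipping words that are missing from the table is a simplifying assumption rather than standard practice.

```python
import math

# Class priors P(C) and per-class word likelihoods P(word | C).
# "free" uses the spam example's figures; "meeting" values are hypothetical.
PRIORS = {"spam": 0.40, "ham": 0.60}
LIKELIHOODS = {
    "spam": {"free": 0.60, "meeting": 0.05},
    "ham": {"free": 0.10, "meeting": 0.30},
}


def classify(words):
    """Return (best_class, posteriors) for the words present in a message."""
    log_scores = {}
    for cls, prior in PRIORS.items():
        # log P(C) + sum of log P(x_i | C), using the naive independence assumption
        score = math.log(prior)
        for word in words:
            if word in LIKELIHOODS[cls]:  # words outside the table are skipped
                score += math.log(LIKELIHOODS[cls][word])
        log_scores[cls] = score
    # Normalize so the scores become posterior probabilities that sum to 1
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    posteriors = {c: u / total for c, u in unnorm.items()}
    return max(posteriors, key=posteriors.get), posteriors


print(classify(["free"]))
print(classify(["free", "meeting"]))
```

Under these assumed likelihoods, "free" alone favors spam, while adding "meeting" tips the prediction to ham, showing how each feature shifts the posterior.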

Bayesian Inference in Model Training

In full Bayesian inference, model parameters θ are treated as random variables with a prior distribution P(θ). After observing training data D, the posterior over parameters is:

Bayesian Parameter Inference

P(θ | D) = P(D | θ) × P(θ) / P(D)

P(θ|D) — Posterior over parameters

P(D|θ) — Likelihood of data given parameters

P(θ) — Prior over parameters

P(D) — Model evidence (normalizing constant)

Computing P(D) exactly requires integrating over all possible parameter values, which is intractable for most real models. This is why Markov Chain Monte Carlo (MCMC) sampling and variational inference methods exist — they approximate the posterior without computing the normalizing constant directly. For practical applications, see the logistic regression article on maximum likelihood estimation as the frequentist counterpart.
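
For a single parameter, the normalizing constant can be approximated by brute force on a grid, which makes the role of P(D) concrete. The sketch below is a toy illustration and not a method described in this guide: it assumes a coin with unknown bias θ, a flat prior, and hypothetical data of 7 heads in 10 flips.

```python
def grid_posterior(heads, flips, grid_size=1001):
    """Approximate P(theta | D) for a coin's bias on an evenly spaced grid."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size  # flat prior over theta (an assumption)
    # Binomial likelihood up to a constant factor that cancels on normalizing
    likelihood = [t ** heads * (1 - t) ** (flips - heads) for t in thetas]
    unnormalized = [lik * p for lik, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)  # grid stand-in for the integral P(D)
    posterior = [u / evidence for u in unnormalized]
    return thetas, posterior


thetas, posterior = grid_posterior(heads=7, flips=10)
posterior_mean = sum(t * p for t, p in zip(thetas, posterior))
print(f"posterior mean of theta ≈ {posterior_mean:.3f}")
```

A grid works here because there is only one parameter. The number of grid points grows exponentially with the number of parameters, which is the practical reason sampling and variational methods are used for larger models.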

Bayesian Tree Diagram

A probability tree diagram makes the two-branch structure of Bayes' Theorem concrete. The diagram branches first on the hypothesis (A true or A false), then on the evidence (B observed or not). Each path through the tree carries a joint probability — the product of the probabilities along that path. The posterior is the proportion of the "B observed" paths that went through the "A true" branch.

Bayesian Tree — Medical Test Example (1% prevalence, 95% sensitivity, 5% FPR)

Population (1000 people)
│
├─── Disease (1%) ──→ 10 people
│    │
│    ├─── Positive test (95%) ──→ 9.5 people  [TRUE POSITIVE]
│    │
│    └─── Negative test (5%)  ──→ 0.5 people  [false negative]
│
└─── Healthy (99%) ──→ 990 people
     │
     ├─── Positive test (5%)  ──→ 49.5 people [FALSE POSITIVE]
     │
     └─── Negative test (95%) ──→ 940.5 people [true negative]

Total positives: 9.5 + 49.5 = 59 people
P(Disease | Positive) = 9.5 / 59 ≈ 16.1%

Numbers shown for a hypothetical group of 1,000 people. Natural frequency framing makes the base rate fallacy immediately visible: of 59 positive tests, only 9.5 are true.
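
The tree can be reproduced for any population size and test characteristics. A small Python sketch, assuming the function name and dictionary keys below (they are illustrative, not part of the original diagram):

```python
def natural_frequencies(population, prevalence, sensitivity, false_positive_rate):
    """Expected head-counts for each leaf of the two-level probability tree."""
    sick = population * prevalence
    healthy = population - sick
    counts = {
        "true_positive": sick * sensitivity,
        "false_negative": sick * (1 - sensitivity),
        "false_positive": healthy * false_positive_rate,
        "true_negative": healthy * (1 - false_positive_rate),
    }
    positives = counts["true_positive"] + counts["false_positive"]
    # Posterior = share of all positive tests that come from sick people
    counts["posterior"] = counts["true_positive"] / positives
    return counts


for leaf, value in natural_frequencies(1000, 0.01, 0.95, 0.05).items():
    print(f"{leaf:>15}: {value:.3f}")
```

The counts are expected values, so fractional people such as 9.5 are normal; scaling the population to 10,000 gives whole numbers without changing the posterior.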

Bayes' Theorem Cheat Sheet

The table below is a copy-pasteable reference mapping the equation to notation, name, plain meaning, and a medical-context example for each component.

Component Name	Notation	Plain Meaning	In the Medical Test Example
Posterior Probability	P(A|B)	Probability of hypothesis A given evidence B has been observed. The answer you seek.	P(Disease | Positive Test) = 16.1%
Prior Probability	P(A)	Probability of A before any evidence. The base rate.	P(Disease) = 1% = 0.01
Likelihood	P(B|A)	How probable is the evidence if the hypothesis is true? Sensitivity for tests.	P(Positive | Disease) = 95% = 0.95
False Positive Rate	P(B|Aᶜ)	How often does evidence appear even when the hypothesis is false?	P(Positive | Healthy) = 5% = 0.05
Complement Prior	P(Aᶜ) = 1 − P(A)	Probability the hypothesis is false.	P(Healthy) = 99% = 0.99
Marginal Likelihood	P(B)	Total probability of the evidence across all hypotheses.	P(Positive) = (0.95×0.01) + (0.05×0.99) = 0.059
Full Formula	P(A|B) = P(B|A)·P(A) / P(B)	Posterior = (Likelihood × Prior) / Evidence	= (0.95 × 0.01) / 0.059 ≈ 0.161
Extended Denominator	P(B) = P(B|A)·P(A) + P(B|Aᶜ)·P(Aᶜ)	Law of Total Probability expansion for two mutually exclusive cases.	= (0.95×0.01) + (0.05×0.99) = 0.059

Frequently Asked Questions

FAQ 1

What is Bayes' Theorem in simple terms?

Bayes' Theorem is a formula for updating a probability when new information arrives. You start with a prior probability — your best estimate before any evidence — and multiply it by how likely that evidence is under your hypothesis. Then you divide by the total probability of the evidence across all possibilities. The result is your posterior: a revised, evidence-informed probability. The detective analogy holds: start with a suspect list, gather clues, and Bayes tells you exactly how much each clue should change your suspicion of each suspect.

FAQ 2

When should you use the extended form of Bayes' Theorem?

Use the extended form — P(A|B) = P(B|A)·P(A) / [P(B|A)·P(A) + P(B|Aᶜ)·P(Aᶜ)] — whenever P(B) is not directly known from data. In practice this means almost always. You typically know the sensitivity P(B|A) and the false positive rate P(B|Aᶜ) from test validation studies, and you know P(A) from population data. The extended denominator lets you compute P(B) from those three ingredients.

FAQ 3

What is the difference between the prior and the likelihood?

The prior P(A) is about the hypothesis in the absence of evidence — it is determined before the experiment. The likelihood P(B|A) is about the evidence given the hypothesis — it is determined by the measurement process (a test, a sensor, an observation). Confusing them produces the prosecutor's fallacy: treating P(Evidence | Innocent) as though it equals P(Innocent | Evidence). They are related by Bayes' Theorem, but they are not equal and are rarely even close.

FAQ 4

How do you choose a prior in Bayesian statistics?

Prior selection is one of the most discussed topics in Bayesian statistics. Informative priors encode domain knowledge — for example, using published disease prevalence data as P(Disease). Weakly informative priors (e.g., wide Gaussian distributions on parameters) constrain values to be reasonable without strongly pulling the posterior. Uninformative or "flat" priors assign equal probability to all parameter values and let data dominate. Jeffreys priors are invariant to reparameterization. The choice matters most when data are scarce; with large datasets, the likelihood overwhelms any reasonable prior.

FAQ 5

Can Bayes' Theorem be applied repeatedly as new evidence arrives?

Yes — and sequential updating is one of the most useful properties of the Bayesian framework. Once you compute the posterior from the first piece of evidence, that posterior becomes the new prior for the next piece of evidence. This sequential Bayesian updating is the basis of the Kalman filter (used in GPS and robotics), Bayesian online learning, and sequential clinical trial designs where treatment decisions update continuously as patient data accumulates.
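
Sequential updating can be sketched in a few lines of Python. The example reuses the medical screening figures from this guide (1% prevalence, 95% sensitivity, 5% false positive rate) and assumes two positive results that are independent given the patient's true disease status. That independence is an assumption made for illustration; repeat tests on the same patient are often correlated in practice.

```python
def update(prior, p_evidence_if_true, p_evidence_if_false):
    """One Bayesian update: returns P(A | evidence)."""
    numerator = p_evidence_if_true * prior
    evidence = numerator + p_evidence_if_false * (1.0 - prior)
    return numerator / evidence


belief = 0.01  # starting prior: 1% prevalence from the medical example
history = [belief]
for _ in range(2):  # two positive tests, assumed independent given disease status
    belief = update(belief, 0.95, 0.05)  # the previous posterior is the new prior
    history.append(belief)

print([round(b, 3) for b in history])
```

With these inputs, one positive result gives a posterior of about 16%, and a second independent positive raises it to about 78%.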

Sources & Further Reading

📚

Academic Citations

The following sources were consulted in preparing this reference.

Source	Relevance
Bayes, T. (1763). An Essay Towards Solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society, 53, 370–418.	Original theorem. JSTOR
Gelman, A. et al. (2013). Bayesian Data Analysis (3rd ed.). CRC Press. Chapters 1–3 cover prior specification, likelihood, and posterior computation.	Graduate-level Bayesian statistics textbook (MIT, Stanford, Columbia curricula).
Casscells, W., Schoenberger, A., & Graboys, T. (1978). Interpretation by physicians of clinical laboratory results. New England Journal of Medicine, 299(18), 999–1001. doi:10.1056/NEJM197811022991808	Empirical study showing base rate neglect among Harvard physicians — foundational reference for the base rate fallacy.
Gigerenzer, G., Gaissmaier, W., Kurz-Milcke, E., Schwartz, L. M., & Woloshin, S. (2007). Helping Doctors and Patients Make Sense of Health Statistics. Psychological Science in the Public Interest, 8(2), 53–96.	Definitive study on natural frequency framing as a corrective for base rate fallacy in medical settings.
Jurafsky, D., & Martin, J. H. (2024). Speech and Language Processing (3rd ed., draft). Stanford NLP. Chapter 4: Naive Bayes and Sentiment Classification. web.stanford.edu/~jurafsky/slp3/	Standard NLP textbook; Chapter 4 derives Naive Bayes from first principles for text classification.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185(4157), 1124–1131.	Foundational paper documenting base rate neglect and related probability heuristics.
MIT OpenCourseWare. 18.650 — Fundamentals of Statistics. ocw.mit.edu	MIT course covering Bayesian vs frequentist inference; freely accessible lecture notes and problem sets.
Stanford Encyclopedia of Philosophy. Bayes' Theorem. plato.stanford.edu/entries/bayes-theorem/	Philosophical and mathematical treatment; covers conditional probability axioms and interpretations.

Bayes' Theorem builds directly on conditional probability and connects outward to hypothesis testing, regression, and machine learning. The pages below form the natural learning path before and after this topic on Statistics Fundamentals:

Prerequisite