What Is Likelihood in Statistics?
The core idea is a directional flip. In probability, you ask: "If the parameter is θ, what is the chance of seeing data x?" In likelihood, you ask: "Given that I observed data x, how plausible is parameter θ?" The math uses the same joint probability expression — but the roles of data and parameter have swapped.
Formally, given an observed data vector x and a statistical model parameterized by θ, the likelihood function L(θ | x) is numerically equal to the joint probability distribution f(x | θ), but it is treated as a function of θ with x held fixed. This is not a probability distribution over θ — it does not integrate to 1 over the parameter space, and individual likelihood values cannot be read as probabilities.
The concept was formalized by Ronald Fisher in his 1922 paper "On the Mathematical Foundations of Theoretical Statistics," published in the Philosophical Transactions of the Royal Society. Fisher deliberately separated likelihood from inverse probability to create a frequentist framework for parameter estimation that did not require a subjective prior distribution. His work laid the groundwork for much of modern statistical modeling and is covered in depth in the statistics and probability section of Statistics Fundamentals.
- Notation: L(θ | x) — read as "the likelihood of θ given observed data x"
- Numerically equals: f(x | θ), the joint probability of the data, but viewed as a function of θ
- Not a PDF: Does not integrate to 1 over the parameter space
- Interpreted relatively: Compare L(θ₁ | x) to L(θ₂ | x), not as absolute probabilities
- Used for: Maximum Likelihood Estimation, hypothesis testing, and Bayesian updating
- In ML: Minimizing negative log-likelihood is equivalent to maximizing likelihood
Likelihood vs. Probability: The Core Difference
The distinction between likelihood and probability trips up even experienced analysts. Both involve the same mathematical object — f(x | θ) — but they ask different questions about it.
Probability fixes θ, asks about x. Likelihood fixes x, asks about θ.
Data Space vs. Parameter Space
Probability operates on the data space. You know the parameters (a fair coin has p = 0.5), and you compute the chance of each possible outcome (HH, HT, TT, etc.). The probabilities of all outcomes sum to 1.
Likelihood operates on the parameter space. You observed data (say, 7 heads in 10 flips), and you compute the likelihood across different candidate values of p — is p = 0.5 or p = 0.7 more consistent with what you saw? The likelihoods across all parameter values do not sum to 1.
This matters in practice because a likelihood function can take any non-negative value and comparison between two likelihood values is what carries information — the ratio L(θ₁ | x) / L(θ₂ | x) tells you how much more the data favors θ₁ over θ₂.
Why Likelihood Is Not a Probability Distribution
A probability density function satisfies the normalization constraint over the data domain:
$$\int_{\text{data space}} f(x \mid \theta)\,dx = 1$$
Parameters θ are held fixed
When the same expression is treated as a function of θ with x fixed, the integral over the parameter space has no such constraint:
$$\int_{\text{parameter space}} L(\theta \mid x)\,d\theta \neq 1 \text{ in general}$$
Data x is held fixed
This is why a likelihood value of 0.8 does not mean "80% probable." It is simply the height of the likelihood function at a particular parameter value. What matters is how this height compares to the height at other parameter values.
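As a rough numerical illustration of the two integrals, the sketch below reuses the coin example (7 heads in 10 flips). The helper name `binom_pmf` and the midpoint-rule integration are choices made here for illustration, not part of the original text. Summing the PMF over every outcome gives 1, while integrating the same expression over p gives about 0.0909 (that is, 1/11).

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial PMF: probability of k successes in n trials with success prob p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, k_obs = 10, 7

# Probability view: fix p = 0.7 and sum over every possible outcome k = 0..n.
total_over_data = sum(binom_pmf(k, n, 0.7) for k in range(n + 1))

# Likelihood view: fix k = 7 and integrate over p in [0, 1] (midpoint rule).
steps = 100_000
width = 1.0 / steps
area_over_params = sum(
    binom_pmf(k_obs, n, (i + 0.5) * width) * width for i in range(steps)
)

print(f"sum over data outcomes:  {total_over_data:.6f}")
print(f"integral over parameter: {area_over_params:.6f}")
```

The second number has no probabilistic meaning on its own; it only shows that nothing forces the likelihood curve to enclose unit area.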
Probability vs. Likelihood Comparison Table
| Feature | Probability | Likelihood |
|---|---|---|
| Mathematical notation | P(X = x \| θ) | L(θ \| x) |
| Unknown variable | Observed data (x) | Model parameters (θ) |
| Fixed component | Parameters (θ) | Observed data (x) |
| Sums / integrates to 1? | Yes — over all data outcomes | No — not over the parameter space |
| Direction of inference | Deductive: known θ → predict x | Inductive: observed x → estimate θ |
| Primary use | Predict future outcomes | Estimate unknown parameters |
| Interpretation | Absolute — a single value has meaning | Relative — only ratios carry information |
| Example question | "Given p = 0.6, what's the chance of 7 heads?" | "Given 7 heads, how plausible is p = 0.6 vs. p = 0.7?" |
The Likelihood Function and Formula
General Likelihood Formula
For a dataset x = {x₁, x₂, …, xₙ} of independent and identically distributed (i.i.d.) observations drawn from a distribution f(x | θ), the likelihood function is the joint probability of all observations treated as a function of θ:
$$L(\theta \mid x) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
θ = parameter(s) to estimate
x = observed data vector
f(xᵢ | θ) = PDF or PMF at each observation
∏ = product over all n observations
Because individual probabilities are often small numbers (between 0 and 1), their product across many observations can become extremely small — small enough to cause numerical underflow in a computer. This is why statisticians almost always work with the log-likelihood instead.
Log-Likelihood and Negative Log-Likelihood
The log-likelihood is the natural logarithm of the likelihood function. Because logarithm is a monotonically increasing function, the parameter value that maximizes L(θ | x) also maximizes ln L(θ | x). The two optimization problems have the same answer.
$$\ell(\theta \mid x) = \ln L(\theta \mid x) = \sum_{i=1}^{n} \ln f(x_i \mid \theta)$$
ℓ = log-likelihood (lowercase L)
Σ = sum replaces product
ln = natural logarithm
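The underflow problem is easy to reproduce. The sketch below uses a made-up dataset of 2,000 observations, each assigned probability 0.5 by the model (an assumption chosen only to keep the arithmetic obvious), and compares the raw product with the sum of logs.

```python
import math

# Hypothetical data: 2,000 observations, each with probability 0.5 under the model.
probs = [0.5] * 2000

# The direct product underflows: 0.5**2000 is far below the smallest positive float.
likelihood = math.prod(probs)

# Summing logs keeps the same information on a representable scale.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)               # 0.0
print(round(log_likelihood, 2)) # -1386.29
```

The product collapses to exactly 0.0, so every parameter value would look equally bad, whereas the log-likelihood remains a usable finite number.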
In machine learning, optimization algorithms (gradient descent and its variants) minimize functions rather than maximize them. So the log-likelihood is flipped in sign to produce the negative log-likelihood (NLL):
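$$\mathrm{NLL}(\theta) = -\ell(\theta \mid x) = -\sum_{i=1}^{n} \ln f(x_i \mid \theta)$$
Minimizing NLL = maximizing likelihood
Used as loss function in neural networks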
Cross-entropy loss in classification neural networks is precisely the negative log-likelihood of a categorical distribution. Minimizing cross-entropy is equivalent to finding maximum likelihood estimates for the model weights.
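A small check of this equivalence, under assumptions: the predicted probabilities and labels below are invented, and both functions return a sum over examples (many libraries report the mean instead, which differs only by a factor of 1/n).

```python
import math

# Hypothetical predicted class probabilities for 3 examples over 3 classes.
predicted = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
labels = [0, 1, 2]  # index of the true class for each example

def categorical_nll(probs, targets):
    """Negative log-likelihood of the true labels under a categorical model."""
    return -sum(math.log(row[t]) for row, t in zip(probs, targets))

def cross_entropy(probs, targets):
    """Summed cross-entropy between one-hot targets and predicted distributions."""
    total = 0.0
    for row, t in zip(probs, targets):
        one_hot = [1.0 if j == t else 0.0 for j in range(len(row))]
        total -= sum(y * math.log(q) for y, q in zip(one_hot, row))
    return total

nll = categorical_nll(predicted, labels)
ce = cross_entropy(predicted, labels)
print(f"NLL = {nll:.4f}, cross-entropy = {ce:.4f}")
```

Because the one-hot target zeroes out every term except the true class, the two quantities coincide.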
Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation is the procedure of finding the parameter values θ̂ that make the observed data as probable as possible under the chosen model. It is the most common approach to parameter estimation in both classical statistics and modern machine learning.
$$\hat{\theta} = \arg\max_{\theta} L(\theta \mid x)$$
θ̂ = MLE estimate (theta-hat)
arg max = parameter value achieving the maximum
How to Calculate MLE: Step-by-Step
Collect and Structure the Data
Gather your observed data points x = {x₁, x₂, …, xₙ} and verify the independence assumption. Each observation should be drawn independently from the same underlying distribution.
Choose the Statistical Model
Select a probability distribution — Normal, Binomial, Poisson, Exponential — that matches the data-generating mechanism. The likelihood function is built from this distribution's PDF or PMF. See the normal distribution and binomial distribution guides for distribution details.
Write the Joint Likelihood Function
Multiply together the individual probabilities for each observation: L(θ | x) = ∏ f(xᵢ | θ). For continuous distributions, f is the PDF; for discrete distributions, it is the PMF.
Take the Log-Likelihood
Convert the product to a sum: ℓ(θ | x) = Σ ln f(xᵢ | θ). This is numerically more stable and analytically easier to differentiate.
Differentiate and Set Equal to Zero
Take the partial derivative of ℓ(θ | x) with respect to each parameter, set each derivative to zero, and solve the resulting equations (called the score equations or likelihood equations).
Verify the Maximum
Check that the second derivative of the log-likelihood is negative at the solution, which confirms a maximum rather than a minimum or saddle point. With multiple parameters, the equivalent check is that the Hessian matrix of the log-likelihood is negative definite.
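The differentiate-and-verify steps can be sketched for an exponential model, where the score equation n/λ − Σxᵢ = 0 has the closed-form solution λ̂ = n / Σxᵢ. The waiting times below are hypothetical, and the finite-difference step size `h` is an arbitrary choice for this illustration.

```python
import math

# Hypothetical waiting times (any positive values would do).
data = [0.8, 1.9, 0.4, 2.7, 1.2]
n = len(data)

def log_likelihood(rate: float) -> float:
    """Exponential log-likelihood: sum over i of ln(rate * exp(-rate * x_i))."""
    return n * math.log(rate) - rate * sum(data)

# Solving d(ell)/d(rate) = n/rate - sum(x) = 0 gives the closed-form MLE.
rate_hat = n / sum(data)

# Central finite differences approximate the first and second derivatives.
h = 1e-4
score = (log_likelihood(rate_hat + h) - log_likelihood(rate_hat - h)) / (2 * h)
curvature = (
    log_likelihood(rate_hat + h)
    - 2 * log_likelihood(rate_hat)
    + log_likelihood(rate_hat - h)
) / h**2

print(f"rate_hat  = {rate_hat:.4f}")
print(f"score     = {score:.2e}  (should be near 0)")
print(f"curvature = {curvature:.3f} (negative at a maximum)")
```

For models without a closed-form solution, the same two checks still apply, but the root of the score equation has to be found numerically.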
Worked Examples: Calculating Likelihood
Example 1: Binomial Distribution (Coin Toss)
Problem: You flip a coin 10 times and observe k = 7 heads. Using the binomial distribution, find the likelihood of the success probability p at three candidate values: p = 0.5, p = 0.7, and p = 0.9. Which parameter value does the data most support?
$$L(p \mid n, k) = \binom{n}{k}\, p^{k} (1 - p)^{n - k}$$
n = 10 (total trials)
k = 7 (observed successes)
p = candidate success probability
C(10,7) = 120
Set up: n = 10, k = 7, C(10,7) = 120. We evaluate L(p | 10, 7) at three candidate values of p.
At p = 0.5:
L = 120 × (0.5)⁷ × (0.5)³ = 120 × 0.0078125 × 0.125 = 120 × 0.000977 = 0.1172
At p = 0.7:
L = 120 × (0.7)⁷ × (0.3)³ = 120 × 0.08235 × 0.027 = 120 × 0.002224 = 0.2668
At p = 0.9:
L = 120 × (0.9)⁷ × (0.1)³ = 120 × 0.4783 × 0.001 = 120 × 0.0004783 = 0.0574
Interpret: The MLE is θ̂ = k/n = 7/10 = 0.7. At this value L(0.7) = 0.2668, which is the peak. The likelihood ratio L(0.7)/L(0.5) ≈ 2.28, meaning the data is 2.28× more consistent with p = 0.7 than with p = 0.5.
✅ Result: The MLE is p̂ = 0.7. The data is 2.28 times more consistent with p = 0.7 than a fair coin (p = 0.5), and 4.65 times more consistent than p = 0.9. Likelihood comparisons, not individual values, drive inference.
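The arithmetic in this example can be reproduced in a few lines. The function name `binomial_likelihood` is just a label chosen for this sketch.

```python
from math import comb

def binomial_likelihood(p: float, n: int = 10, k: int = 7) -> float:
    """Likelihood of success probability p after k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

candidates = [0.5, 0.7, 0.9]
values = {p: binomial_likelihood(p) for p in candidates}

for p, value in values.items():
    print(f"L({p}) = {value:.4f}")

# Ratios, not raw heights, are what carry the evidence.
ratio_vs_fair = values[0.7] / values[0.5]
ratio_vs_09 = values[0.7] / values[0.9]
print(f"L(0.7)/L(0.5) = {ratio_vs_fair:.2f}")
print(f"L(0.7)/L(0.9) = {ratio_vs_09:.2f}")
```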
Example 2: Normal Distribution Parameter Estimation
Problem: A quality engineer measures the diameter of 5 ball bearings (in mm): x = {10.1, 9.9, 10.2, 10.0, 10.3}. Using a normal distribution model, derive the MLE estimates for the mean μ and variance σ².
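$$\ell(\mu, \sigma^2 \mid x) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$
n = sample size
μ = population mean
σ² = population variance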
Data: x = {10.1, 9.9, 10.2, 10.0, 10.3}, n = 5.
MLE for μ: Differentiate ℓ with respect to μ and set to zero. The score equation gives μ̂_MLE = x̄ = (10.1 + 9.9 + 10.2 + 10.0 + 10.3) / 5 = 10.1 mm. The MLE of the mean is always the sample mean for normal data.
MLE for σ²: Differentiate ℓ with respect to σ² and set to zero. This yields σ̂²_MLE = Σ(xᵢ − x̄)² / n (divides by n, not n−1).
Compute: Squared deviations from 10.1: (0.0)², (−0.2)², (0.1)², (−0.1)², (0.2)² = 0.00, 0.04, 0.01, 0.01, 0.04. Sum = 0.10. σ̂²_MLE = 0.10/5 = 0.02 mm². σ̂_MLE = √0.02 ≈ 0.141 mm.
Note on bias: The MLE variance estimator divides by n, not n−1. It is slightly biased for small samples. The unbiased sample variance s² = Σ(xᵢ − x̄)² / (n−1) = 0.10/4 = 0.025 mm² is preferred for inference. See the variance guide for details.
✅ Results: μ̂_MLE = 10.1 mm, σ̂²_MLE = 0.02 mm² (biased), s² = 0.025 mm² (unbiased). For normal data, the MLE mean equals the sample mean — a direct, intuitive result from the likelihood framework.
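These estimates can be checked with Python's standard `statistics` module, where `pvariance` divides by n (matching the MLE) and `variance` divides by n−1. This is only a verification sketch of the numbers above.

```python
import statistics

diameters = [10.1, 9.9, 10.2, 10.0, 10.3]  # mm, from the example

mu_hat = statistics.fmean(diameters)          # MLE of the mean = sample mean
sigma2_mle = statistics.pvariance(diameters)  # divides by n   (MLE, biased)
s2_unbiased = statistics.variance(diameters)  # divides by n-1 (unbiased)
sigma_mle = sigma2_mle ** 0.5

print(f"mu_hat      = {mu_hat:.3f} mm")
print(f"sigma2_mle  = {sigma2_mle:.4f} mm^2")
print(f"s2_unbiased = {s2_unbiased:.4f} mm^2")
print(f"sigma_mle   = {sigma_mle:.3f} mm")
```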
Example 3: Logistic Regression and Binary Classification
Problem: A simple logistic regression model predicts y ∈ {0, 1} from a single input x. With weights β₀ = −3 and β₁ = 2, compute the log-likelihood for three training observations: (x=1, y=1), (x=2, y=1), (x=0, y=0).
ℓ = Σ [yᵢ ln(p̂ᵢ) + (1−yᵢ)ln(1−p̂ᵢ)], where p̂ᵢ = 1/(1+e^(−(β₀+β₁xᵢ)))
Obs 1 (x=1, y=1): p̂ = 1/(1+e^(−(−3+2))) = 1/(1+e^1) = 1/(1+2.718) = 0.269. Contribution: 1×ln(0.269) = −1.313
Obs 2 (x=2, y=1): p̂ = 1/(1+e^(−(−3+4))) = 1/(1+e^(−1)) = 1/(1+0.368) = 0.731. Contribution: 1×ln(0.731) = −0.313
Obs 3 (x=0, y=0): p̂ = 1/(1+e^(−(−3+0))) = 1/(1+e^3) = 1/(1+20.09) = 0.0474. Contribution: (1−0)×ln(1−0.0474) = ln(0.9526) = −0.049
Total log-likelihood: ℓ = −1.313 + (−0.313) + (−0.049) = −1.675. During training, gradient descent would adjust β₀ and β₁ to bring this value closer to 0 (maximize log-likelihood = minimize NLL).
✅ Log-likelihood = −1.675. The model fits poorly for Observation 1 (p̂ = 0.269 for an actual y=1 event) but well for Observations 2 and 3. Training adjusts β₀ and β₁ to maximize ℓ, which is what gradient-based training does when a binary classifier is fit with the log loss. See the logistic regression guide for the full training procedure.
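The same log-likelihood can be computed directly. The sketch below carries full floating-point precision rather than rounded intermediate values, so the total comes out at about −1.675; the helper names are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

beta0, beta1 = -3.0, 2.0
observations = [(1, 1), (2, 1), (0, 0)]  # (x, y) pairs from the example

def log_likelihood(b0: float, b1: float, data) -> float:
    """Bernoulli log-likelihood of the labels under a logistic model."""
    total = 0.0
    for x, y in data:
        p_hat = sigmoid(b0 + b1 * x)
        total += y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat)
    return total

ll = log_likelihood(beta0, beta1, observations)
print(f"log-likelihood = {ll:.3f}")
print(f"NLL            = {-ll:.3f}")
```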
The Likelihood Ratio and Likelihood Ratio Test
What Is a Likelihood Ratio?
A likelihood ratio compares two competing parameter values or models by dividing one likelihood by the other. It answers the question: "How many times more consistent is the data with θ₁ than with θ₀?"
$$\Lambda = \frac{L(\theta_1 \mid x)}{L(\theta_0 \mid x)}$$
Λ > 1 means θ₁ fits data better than θ₀
Λ < 1 means θ₀ fits data better than θ₁
Λ = 1 means equal support from the data
The Likelihood Ratio Test (LRT)
The Likelihood Ratio Test compares two nested statistical models — a simpler null model (H₀, with fewer parameters) and a more complex alternative model (H₁, with additional parameters). Neyman and Pearson showed that the LRT is the most powerful test for comparing two simple hypotheses, a result known as the Neyman-Pearson lemma.
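$$\lambda = -2 \ln \frac{L(\hat{\theta}_0 \mid x)}{L(\hat{\theta}_1 \mid x)} = 2\left[\ell(\hat{\theta}_1 \mid x) - \ell(\hat{\theta}_0 \mid x)\right]$$
θ̂₀, θ̂₁ = maximum likelihood estimates under H₀ and H₁
λ approximately follows a χ² distribution under H₀
df = difference in number of free parameters
Reject H₀ if λ > χ²_α,df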
Under H₀ and with a sufficiently large sample, the test statistic λ is approximately chi-square distributed with degrees of freedom equal to the difference in the number of free parameters between the two models (a result known as Wilks' theorem). This lets you use the chi-square table to find the critical value for your chosen significance level. For a formal treatment, see the hypothesis testing guide.
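As a minimal sketch, the test can be applied to the earlier coin data (7 heads in 10 flips), with H₀: p = 0.5 against H₁: p unrestricted, so df = 1. Two assumptions are worth flagging: the p-value uses the identity that, for one degree of freedom, the chi-square tail probability equals erfc(√(λ/2)), and n = 10 is small enough that the chi-square approximation is only rough.

```python
import math
from math import comb

n, k = 10, 7

def log_lik(p: float) -> float:
    """Binomial log-likelihood for k successes in n trials."""
    return math.log(comb(n, k)) + k * math.log(p) + (n - k) * math.log(1 - p)

ll_null = log_lik(0.5)    # H0: p fixed at 0.5 (no free parameters)
ll_alt = log_lik(k / n)   # H1: p free, evaluated at its MLE (one free parameter)

lrt_stat = 2 * (ll_alt - ll_null)

# For df = 1 the chi-square survival function has the closed form erfc(sqrt(x/2)).
p_value = math.erfc(math.sqrt(lrt_stat / 2))

critical_value = 3.841    # chi-square 95th percentile for df = 1
reject_null = lrt_stat > critical_value

print(f"lambda = {lrt_stat:.3f}, p-value = {p_value:.3f}, reject H0: {reject_null}")
```

With λ ≈ 1.65 below the 3.841 cutoff, these ten flips do not provide enough evidence to reject a fair coin at the 5% level.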
Likelihood in Bayesian Inference
In Bayesian statistics, likelihood plays a specific structural role in Bayes' theorem. It is the mechanism that converts prior beliefs about parameters into posterior beliefs after observing data.
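$$P(\theta \mid x) = \frac{L(\theta \mid x)\,P(\theta)}{P(x)} \propto L(\theta \mid x)\,P(\theta)$$
P(θ | x) = posterior distribution
L(θ | x) = likelihood (data evidence)
P(θ) = prior distribution
P(x) = marginal probability of the data (normalizing constant)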
The prior P(θ) represents what you believed about the parameter before seeing the data. The likelihood L(θ | x) = P(x | θ) is the evidence from the data. Multiplying them gives a quantity proportional to the posterior — your updated belief after observing x. A detailed treatment of this updating process is in the Bayes' theorem guide.
In frequentist statistics (MLE), likelihood is used alone — no prior is involved. In Bayesian statistics, likelihood is combined with a prior. The likelihood function is the same mathematical object in both frameworks; only what you do with it differs.
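A grid approximation makes the prior-times-likelihood update concrete. The sketch below reuses the coin data (7 heads in 10 flips) and assumes a flat prior on p, which is a choice made for this illustration only; the grid resolution is likewise arbitrary.

```python
from math import comb

n, k = 10, 7

# Midpoints of 1,000 equal-width bins covering p in (0, 1).
grid = [(i + 0.5) / 1000 for i in range(1000)]

def likelihood(p: float) -> float:
    """Binomial likelihood of p for k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

prior = [1.0 for _ in grid]  # flat prior (an assumption for this sketch)

# Posterior is proportional to likelihood times prior; normalize on the grid.
unnormalized = [likelihood(p) * w for p, w in zip(grid, prior)]
evidence = sum(unnormalized)
posterior = [u / evidence for u in unnormalized]

posterior_mean = sum(p * w for p, w in zip(grid, posterior))
posterior_mode = grid[posterior.index(max(posterior))]

print(f"posterior mean = {posterior_mean:.3f}")
print(f"posterior mode = {posterior_mode:.3f}")
```

With a flat prior the posterior mode sits next to the MLE of 0.7, while the posterior mean is pulled slightly toward 0.5 (about 0.667); an informative prior would shift both.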
How to Interpret Likelihood Values
Likelihood values are interpreted by comparison, not in isolation. The table below covers the most common scenarios you will encounter when reading or reporting likelihood-based analyses.
| Situation | What It Means | Practical Conclusion |
|---|---|---|
| L(θ₁ \| x) > L(θ₂ \| x) | θ₁ explains the data better than θ₂ | Prefer θ₁ as a parameter estimate |
| Λ = L(θ₁)/L(θ₂) = 3 | Data is 3× more consistent with θ₁ than θ₂ | Moderate evidence favoring θ₁ |
| ℓ(θ \| x) → 0 (discrete data) | Joint probability of data approaches 1 | Near-perfect model fit |
| ℓ(θ \| x) → −∞ | Model cannot explain the observed data | Poor fit: wrong distribution or parameters |
| NLL decreasing during training | Model is fitting the training data better | Training is progressing correctly |
| LRT statistic λ > χ²_α,df | Adding parameters significantly improves fit | Reject the simpler (null) model |
Real-World Applications of Likelihood
Clinical Trials & Survival Analysis
Cox proportional hazards models use partial likelihood to estimate how covariates (treatment assignment, age, dosage) affect patient survival time without specifying the baseline hazard function.
Financial Risk Modeling
ARCH and GARCH models estimate volatility clustering in financial returns using MLE. The likelihood function captures the conditional variance structure across time.
Natural Language Processing
Autoregressive language models (GPT-style) are trained by maximizing the conditional log-likelihood of each next token given prior context — the product of all these conditional probabilities is the sequence likelihood.
Genetics & Phylogenetics
Maximum likelihood methods reconstruct evolutionary trees by finding the tree topology and branch lengths that make the observed gene sequences most probable under a substitution model.
Quality Control
Engineers use MLE to estimate the parameters of failure-time distributions (Weibull, exponential) from censored lifetime data, predicting product reliability and warranty costs.
Bayesian Machine Learning
Variational autoencoders (VAEs) optimize the evidence lower bound (ELBO), which is a tractable lower bound on the log-likelihood of the observed data under the generative model.
Binomial Log-Likelihood Calculator
Enter the number of trials, observed successes, and candidate success probability p. The calculator returns the likelihood L(p | n, k) and log-likelihood ℓ(p | n, k) for that parameter value, along with the MLE estimate p̂ = k/n.
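The same calculation can be sketched as a small function. The name `binomial_likelihood_report` and its dictionary return format are hypothetical; the log of the binomial coefficient is computed with `math.lgamma` so that large n does not require huge intermediate integers.

```python
import math

def binomial_likelihood_report(n: int, k: int, p: float) -> dict:
    """Return likelihood, log-likelihood, and the MLE k/n for a binomial model."""
    if n <= 0 or not (0 <= k <= n):
        raise ValueError("need n > 0 and 0 <= k <= n")
    if not (0.0 < p < 1.0):
        raise ValueError("p must lie strictly between 0 and 1")
    # ln C(n, k) via log-gamma: ln n! - ln k! - ln (n-k)!
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_lik = log_choose + k * math.log(p) + (n - k) * math.log(1 - p)
    return {
        "likelihood": math.exp(log_lik),
        "log_likelihood": log_lik,
        "mle": k / n,
    }

report = binomial_likelihood_report(n=10, k=7, p=0.7)
print(report)
```

Working in log space first and exponentiating only at the end keeps the result stable when n is in the thousands, where the raw likelihood itself may underflow to zero.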
Quick-Reference Formula Tables
Likelihood Metrics Summary
| Concept | Formula | Primary Purpose |
|---|---|---|
| Likelihood Function | L(θ \| x) = ∏ f(xᵢ \| θ) | Quantifies relative support for parameter values |
| Log-Likelihood | ℓ(θ \| x) = Σ ln f(xᵢ \| θ) | Numerically stable form for optimization |
| Negative Log-Likelihood | NLL = −ℓ(θ \| x) | Standard minimization loss function in ML |
| MLE | θ̂ = arg max_θ L(θ \| x) | Finds the best-fitting parameter values |
| Likelihood Ratio | Λ = L(θ₁ \| x) / L(θ₀ \| x) | Compares support for two parameter values |
| LRT Statistic | λ = 2 [ℓ(θ̂₁ \| x) − ℓ(θ̂₀ \| x)] | Formal hypothesis test between nested models |
| Bayesian Posterior | P(θ \| x) ∝ L(θ \| x) × P(θ) | Combines likelihood with prior belief |
MLE Formulas for Common Distributions
| Distribution | Parameter | MLE Formula | Equals |
|---|---|---|---|
| Normal | Mean μ | μ̂ = (Σxᵢ) / n | Sample mean x̄ |
| Normal | Variance σ² | σ̂² = Σ(xᵢ − x̄)² / n | Biased variance (divides by n) |
| Binomial | Success prob p | p̂ = k / n | Sample proportion |
| Poisson | Rate λ | λ̂ = x̄ = (Σxᵢ) / n | Sample mean |
| Exponential | Rate λ | λ̂ = n / (Σxᵢ) | Reciprocal of sample mean |
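Closed-form results like these can be sanity-checked numerically. The sketch below does so for the Poisson row only, using invented count data and a brute-force grid search over candidate rates; both the data and the grid spacing are assumptions made for illustration.

```python
import math

counts = [2, 4, 3, 5, 1, 3]  # hypothetical event counts

def poisson_log_lik(rate: float) -> float:
    """Poisson log-likelihood: sum of x*ln(rate) - rate - ln(x!)."""
    return sum(x * math.log(rate) - rate - math.lgamma(x + 1) for x in counts)

# Closed form from the table: the sample mean.
closed_form = sum(counts) / len(counts)

# Brute-force check: evaluate the log-likelihood on a grid of candidate rates.
grid = [i / 1000 for i in range(1, 10001)]  # 0.001, 0.002, ..., 10.0
grid_best = max(grid, key=poisson_log_lik)

print(f"closed-form MLE = {closed_form:.3f}")
print(f"grid-search MLE = {grid_best:.3f}")
```

The grid maximizer agrees with the sample mean to within the grid spacing, which is the behavior the formula predicts.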
FAQs
What is likelihood in statistics?
Likelihood in statistics is a function that measures how plausible a set of parameter values is given observed data. Written as L(θ | x), it reverses the usual direction of probability by treating data as fixed and parameters as variables. It is the foundation of maximum likelihood estimation and plays a central role in Bayesian inference.
What is a likelihood function?
A likelihood function, L(θ | x), is the joint probability of the observed data expressed as a function of unknown parameters. For independent observations, it is written as the product of individual densities: L(θ | x) = ∏ f(xᵢ | θ). The data remain fixed while θ varies across possible values.
What is the difference between probability and likelihood?
Probability describes how probable future data are given fixed parameters. Likelihood evaluates how well different parameter values explain observed data. Both use the same mathematical form f(x | θ), but probability treats x as variable, while likelihood treats θ as variable.
How do you calculate likelihood?
Choose a statistical model, plug in parameter values, and compute the joint probability of the observed data. For independent observations, multiply probabilities across all data points: L(θ | x) = ∏ f(xᵢ | θ). In practice, the log-likelihood is used: ℓ(θ | x) = Σ ln f(xᵢ | θ).
How do you interpret likelihood values?
Likelihood is interpreted comparatively, not absolutely. A higher likelihood means a parameter value better explains the data than another. Ratios of likelihoods are meaningful, while raw likelihood values alone do not represent probabilities.
What is maximum likelihood estimation?
Maximum likelihood estimation (MLE) finds the parameter values that maximize the likelihood function. These values make the observed data most probable under the model. In practice, the log-likelihood is maximized because it is mathematically simpler and numerically stable.
What is log-likelihood?
Log-likelihood is the natural logarithm of the likelihood function. It converts products into sums, making calculations easier and more stable. Maximizing log-likelihood gives the same result as maximizing likelihood.
How is likelihood used in machine learning?
Many machine learning models are trained by maximizing likelihood or minimizing negative log-likelihood. This includes logistic regression, neural networks, and probabilistic models. Cross-entropy loss is a form of negative log-likelihood for classification tasks.
What is the likelihood ratio test?
The likelihood ratio test compares two nested models by evaluating their likelihoods. It measures whether a more complex model significantly improves fit over a simpler one. The test statistic is based on the ratio of maximized likelihoods and, in large samples, approximately follows a chi-square distribution under the null hypothesis.
What is the likelihood principle?
The likelihood principle states that all information about a parameter is contained in the likelihood function. Different experiments that produce proportional likelihoods should lead to the same inference, regardless of how the data were collected.
Entity and Formula Glossary
| Term | Formula | Definition |
|---|---|---|
| Likelihood | L(θ \| x) | Plausibility measure for parameter values given observed data |
| Likelihood Function | L(θ \| x) = ∏ f(xᵢ \| θ) | Joint probability of data, treated as a function of the parameters |
| Log-Likelihood | ℓ(θ \| x) = Σ ln f(xᵢ \| θ) | Natural log of the likelihood, used for numerical stability |
| Negative Log-Likelihood | NLL = −ℓ(θ \| x) | Loss function minimized during machine learning training |
| MLE | θ̂ = arg max_θ L(θ \| x) | Parameter values making the observed data most probable |
| Likelihood Ratio | Λ = L(θ₁ \| x) / L(θ₀ \| x) | Relative support for two competing parameter values |
| Score Function | s(θ) = ∂ℓ / ∂θ | First derivative of log-likelihood; zero at the MLE |
| Fisher Information | I(θ) = −E[∂²ℓ / ∂θ²] | Curvature of the log-likelihood; governs MLE variance via Cramér-Rao |
| Prior Distribution | P(θ) | Beliefs about θ before observing data; used in Bayesian inference |
| Posterior Distribution | P(θ \| x) ∝ L(θ \| x) × P(θ) | Updated beliefs about θ after observing data |