What is the difference between likelihood and probability?

Probability calculates the chance of future unobserved data outcomes occurring based on known, fixed parameters. Likelihood evaluates past, observed data outcomes to estimate or compare unknown parameter values. Probability sums or integrates to 1 over the data domain; likelihood has no such constraint over the parameter space.

What is maximum likelihood estimation?

Maximum Likelihood Estimation (MLE) is a method of estimating the parameters of a statistical model by finding the parameter values θ-hat that maximize the likelihood function L(θ | x), thereby making the observed data as probable as possible under the chosen model.

What is log likelihood?

Log likelihood is the natural logarithm of the likelihood function, expressed as ln L(θ | x). It converts the computationally complex multiplication of probabilities into a mathematically tractable sum of logarithms, simplifying optimization without changing the location of the maximum.

How do you interpret a likelihood value?

Likelihood is interpreted relatively, not absolutely. A single value of 0.8 does not mean an 80% chance. If L(θ₁|x) / L(θ₂|x) = 3, it means parameter value θ₁ is three times more supported by the observed data than θ₂.

Likelihood in Statistics: Definition, Formula & Complete Guide (2026)

Q: What is likelihood in statistics?

Likelihood in statistics is a function that measures the plausibility of a set of statistical parameters given a specific set of observed data. It reverses the directional lens of probability by holding the data fixed and treating the model parameters as variables. Written L(θ | x), it equals the joint probability of the data evaluated across the parameter space.

What Is Likelihood in Statistics?

Definition — Likelihood

Likelihood is a statistical measure that quantifies how plausible a specific set of parameters is, given an observed set of data. While probability predicts unknown outcomes from known parameters, likelihood reverses this process: it evaluates varying hypothetical parameter values against data that has already been observed.

L(θ | x) = f(x | θ) evaluated as a function of θ

The core idea is a directional flip. In probability, you ask: "If the parameter is θ, what is the chance of seeing data x?" In likelihood, you ask: "Given that I observed data x, how plausible is parameter θ?" The math uses the same joint probability expression — but the roles of data and parameter have swapped.

Formally, given an observed data vector x and a statistical model parameterized by θ, the likelihood function L(θ | x) is numerically equal to the joint probability distribution f(x | θ), but it is treated as a function of θ with x held fixed. This is not a probability distribution over θ — it does not integrate to 1 over the parameter space, and individual likelihood values cannot be read as probabilities.

The concept was formalized by Ronald Fisher in his 1922 paper "On the Mathematical Foundations of Theoretical Statistics," published in the Philosophical Transactions of the Royal Society. Fisher deliberately separated likelihood from inverse probability to create a frequentist framework for parameter estimation that did not require a subjective prior distribution. His work laid the groundwork for essentially all of modern statistical modeling and is covered in depth in the statistics and probability section of Statistics Fundamentals.

⚡ Quick Reference — Likelihood Key Facts

Notation: L(θ | x) — read as "the likelihood of θ given observed data x"
Numerically equals: f(x | θ), the joint probability of the data, but viewed as a function of θ
Not a PDF: Does not integrate to 1 over the parameter space
Interpreted relatively: Compare L(θ₁ | x) to L(θ₂ | x), not as absolute probabilities
Used for: Maximum Likelihood Estimation, hypothesis testing, and Bayesian updating
In ML: Minimizing negative log-likelihood is equivalent to maximizing likelihood

Likelihood vs. Probability: The Core Difference

The distinction between likelihood and probability trips up even experienced analysts. Both involve the same mathematical object — f(x | θ) — but they ask different questions about it.

The One-Line Contrast

Probability fixes θ, asks about x. Likelihood fixes x, asks about θ.

Data Space vs. Parameter Space

Probability operates on the data space. You know the parameters (a fair coin has p = 0.5), and you compute the chance of each possible outcome (HH, HT, TT, etc.). The probabilities of all outcomes sum to 1.

Likelihood operates on the parameter space. You observed data (say, 7 heads in 10 flips), and you compute the likelihood across different candidate values of p — is p = 0.5 or p = 0.7 more consistent with what you saw? The likelihoods across all parameter values do not sum to 1.

This matters in practice because a likelihood function can take any non-negative value, so a single value carries little information on its own. The comparison between two values is what matters: the ratio L(θ₁ | x) / L(θ₂ | x) tells you how much more the data favors θ₁ over θ₂.

Why Likelihood Is Not a Probability Distribution

A probability density function satisfies the normalization constraint over the data domain:

Probability Normalization Constraint

∫ f(x | θ) dx = 1

Integration is over the data space; the parameters θ are held fixed.

When the same expression is treated as a function of θ with x fixed, this integral over the parameter space has no such constraint:

Likelihood — No Normalization Constraint

∫ L(θ | x) dθ ≠ 1 (in general)

Integration is over the parameter space; the data x is held fixed.

This is why a likelihood value of 0.8 does not mean "80% probable." It is simply the height of the likelihood function at a particular parameter value. What matters is how this height compares to the height at other parameter values.
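The contrast between the two normalization statements can be checked numerically. The sketch below reuses the 7-heads-in-10-flips coin example from earlier in this section; the function name `binomial_likelihood` and the midpoint-rule integration are illustrative choices, not part of any particular library.

```python
from math import comb


def binomial_likelihood(p: float, n: int, k: int) -> float:
    """Binomial PMF C(n,k) p^k (1-p)^(n-k), usable as a function of k or of p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, k = 10, 7

# Data space: fix p = 0.7 and sum over every possible outcome k = 0..n.
total_over_data = sum(binomial_likelihood(0.7, n, j) for j in range(n + 1))

# Parameter space: fix k = 7 and integrate over p in [0, 1]
# using a midpoint rule with 10,000 slices.
steps = 10_000
area_over_p = sum(
    binomial_likelihood((i + 0.5) / steps, n, k) for i in range(steps)
) / steps

print(f"sum over data outcomes: {total_over_data:.6f}")
print(f"integral over p:        {area_over_p:.6f}")
```

For this binomial case the integral over p works out analytically to 1/(n + 1), about 0.0909 for n = 10, which is nowhere near 1.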

Probability vs. Likelihood Comparison Table

Feature	Probability	Likelihood
Mathematical notation	P(X = x | θ)	L(θ | x)
Unknown variable	Observed data (x)	Model parameters (θ)
Fixed component	Parameters (θ)	Observed data (x)
Sums / integrates to 1?	Yes — over all data outcomes	No — not over the parameter space
Direction of inference	Deductive: known θ → predict x	Inductive: observed x → estimate θ
Primary use	Predict future outcomes	Estimate unknown parameters
Interpretation	Absolute — a single value has meaning	Relative — only ratios carry information
Example question	"Given p = 0.6, what's the chance of 7 heads?"	"Given 7 heads, how plausible is p = 0.6 vs. p = 0.7?"

The Likelihood Function and Formula

General Likelihood Formula

For a dataset x = {x₁, x₂, …, xₙ} of independent and identically distributed (i.i.d.) observations drawn from a distribution f(x | θ), the likelihood function is the joint probability of all observations treated as a function of θ:

Likelihood Function — i.i.d. Data

L(θ | x) = ∏ᵢ₌₁ⁿ f(xᵢ | θ)

θ = parameter(s) to estimate; x = observed data vector; f(xᵢ | θ) = PDF or PMF at each observation; ∏ = product over all n observations

Because individual probabilities are often small numbers (between 0 and 1), their product across many observations can become extremely small — small enough to cause numerical underflow in a computer. This is why statisticians almost always work with the log-likelihood instead.

Log-Likelihood and Negative Log-Likelihood

The log-likelihood is the natural logarithm of the likelihood function. Because logarithm is a monotonically increasing function, the parameter value that maximizes L(θ | x) also maximizes ln L(θ | x). The two optimization problems have the same answer.

Log-Likelihood Formula

ℓ(θ | x) = ln L(θ | x) = Σᵢ₌₁ⁿ ln f(xᵢ | θ)

ℓ = log-likelihood (lowercase L); Σ = sum replaces product; ln = natural logarithm
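The underflow problem is easy to reproduce. The following sketch simulates 2,000 standard normal observations (the sample size, seed, and helper name `normal_pdf` are assumptions chosen for illustration) and compares the raw product of densities with the sum of log-densities.

```python
import math
import random


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a Normal(mu, sigma^2) distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


random.seed(0)  # fixed seed so the run is repeatable
data = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Likelihood as a direct product: every factor is below 0.4,
# so the running product shrinks past the smallest representable float.
likelihood = 1.0
for x in data:
    likelihood *= normal_pdf(x, 0.0, 1.0)

# Log-likelihood as a sum of log-densities: stays finite.
log_likelihood = sum(math.log(normal_pdf(x, 0.0, 1.0)) for x in data)

print("product of densities:", likelihood)
print("sum of log-densities:", log_likelihood)
```

The product collapses to exactly 0.0 in double precision, while the sum remains an ordinary negative number that an optimizer can work with.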

In machine learning, optimization algorithms (gradient descent and its variants) minimize functions rather than maximize them. So the log-likelihood is flipped in sign to produce the negative log-likelihood (NLL):

Negative Log-Likelihood (NLL)

NLL = −ℓ(θ | x) = −Σᵢ₌₁ⁿ ln f(xᵢ | θ)

Minimizing NLL = maximizing likelihood; used as a loss function in neural networks

Cross-Entropy Connection

Cross-entropy loss in classification neural networks is precisely the negative log-likelihood of a categorical distribution. Minimizing cross-entropy is equivalent to finding maximum likelihood estimates for the model weights.
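A small numerical check makes the equivalence concrete. The sketch below uses made-up predicted probabilities for three examples and three classes, and it assumes one-hot targets, which is the setting in which cross-entropy and categorical NLL coincide term by term.

```python
import math

# Hypothetical predicted class probabilities (each row sums to 1).
predicted = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
labels = [0, 1, 2]  # index of the true class for each example

# Negative log-likelihood of a categorical model:
# minus the sum of ln P(true class) over examples.
nll = -sum(math.log(row[y]) for row, y in zip(predicted, labels))


def one_hot(index: int, size: int) -> list[float]:
    """Target vector with 1.0 at the true class and 0.0 elsewhere."""
    return [1.0 if c == index else 0.0 for c in range(size)]


# Cross-entropy with one-hot targets:
# minus the sum over examples and classes of target * ln(prediction).
cross_entropy = -sum(
    t * math.log(p)
    for row, y in zip(predicted, labels)
    for t, p in zip(one_hot(y, len(row)), row)
)

print(nll, cross_entropy)
```

The zero entries of each one-hot target remove every term except the log-probability of the true class, which is why the two totals agree.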

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation is the procedure of finding the parameter values θ̂ that make the observed data as probable as possible under the chosen model. It is the most common approach to parameter estimation in both classical statistics and modern machine learning.

MLE — The Objective

θ̂_MLE = arg max_θ L(θ | x) = arg max_θ ℓ(θ | x)

θ̂ = MLE estimate (theta-hat); arg max = parameter value achieving the maximum

How to Calculate MLE: Step-by-Step

Collect and Structure the Data

Gather your observed data points x = {x₁, x₂, …, xₙ} and verify the independence assumption. Each observation should be drawn independently from the same underlying distribution.

Choose the Statistical Model

Select a probability distribution — Normal, Binomial, Poisson, Exponential — that matches the data-generating mechanism. The likelihood function is built from this distribution's PDF or PMF. See the normal distribution and binomial distribution guides for distribution details.

Write the Joint Likelihood Function

Multiply together the individual probabilities for each observation: L(θ | x) = ∏ f(xᵢ | θ). For continuous distributions, f is the PDF; for discrete distributions, it is the PMF.

Take the Log-Likelihood

Convert the product to a sum: ℓ(θ | x) = Σ ln f(xᵢ | θ). This is numerically more stable and analytically easier to differentiate.

Differentiate and Set Equal to Zero

Take the partial derivative of ℓ(θ | x) with respect to each parameter, set each derivative to zero, and solve the resulting equations (called the score equations or likelihood equations).

Verify the Maximum

Check that the solution is a maximum rather than a minimum or saddle point. For a single parameter, the second derivative of the log-likelihood must be negative at the solution; for multiple parameters, the Hessian matrix of second derivatives must be negative definite there.
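The six steps can be traced end to end on a model with a closed-form answer. The sketch below assumes an exponential model f(x | λ) = λe^(−λx) and five hypothetical waiting times; the data values and function names are illustrative only. It solves the score equation, checks the second derivative, and confirms the result with a brute-force grid search.

```python
import math

# Steps 1-2: hypothetical i.i.d. waiting times, modeled as Exponential(rate).
data = [0.8, 1.5, 0.3, 2.2, 1.2]
n, total = len(data), sum(data)


# Steps 3-4: log-likelihood l(rate) = n ln(rate) - rate * sum(x).
def log_likelihood(rate: float) -> float:
    return n * math.log(rate) - rate * total


# Step 5: score (first derivative) and its root.
def score(rate: float) -> float:
    return n / rate - total


rate_hat = n / total  # solves score(rate) = 0


# Step 6: second derivative, negative for every positive rate.
def second_derivative(rate: float) -> float:
    return -n / rate**2


# Sanity check: a grid search should land next to the closed-form answer.
grid = [i / 1000 for i in range(1, 5001)]
grid_best = max(grid, key=log_likelihood)

print(f"closed-form MLE: {rate_hat:.4f}")
print(f"grid-search MLE: {grid_best:.3f}")
print(f"second derivative at MLE: {second_derivative(rate_hat):.3f}")
```

When the score equations have no closed-form solution, the same grid or a numerical optimizer applied to the log-likelihood takes the place of step 5.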

Worked Examples: Calculating Likelihood

Example 1: Binomial Distribution (Coin Toss)

Worked Example 1 — Binomial Likelihood

Problem: You flip a coin 10 times and observe k = 7 heads. Using the binomial distribution, find the likelihood of the success probability p at three candidate values: p = 0.5, p = 0.7, and p = 0.9. Which parameter value does the data most support?

Binomial Likelihood Function

L(p | n, k) = C(n,k) · pᵏ · (1−p)ⁿ⁻ᵏ

n = 10 (total trials); k = 7 (observed successes); p = candidate success probability; C(10,7) = 120

Set up: n = 10, k = 7, C(10,7) = 120. We evaluate L(p | 10, 7) at three candidate values of p.

At p = 0.5:
L = 120 × (0.5)⁷ × (0.5)³ = 120 × 0.0078125 × 0.125 = 120 × 0.000977 = 0.1172

At p = 0.7:
L = 120 × (0.7)⁷ × (0.3)³ = 120 × 0.08235 × 0.027 = 120 × 0.002224 = 0.2668

At p = 0.9:
L = 120 × (0.9)⁷ × (0.1)³ = 120 × 0.4783 × 0.001 = 120 × 0.0004783 = 0.0574

Interpret: The MLE is p̂ = k/n = 7/10 = 0.7. At this value L(0.7) = 0.2668, which is the peak of the likelihood function. The likelihood ratio L(0.7)/L(0.5) ≈ 2.28 means the data is 2.28× more consistent with p = 0.7 than with p = 0.5.

✅ Result: The MLE is p̂ = 0.7. The data is 2.28 times more consistent with p = 0.7 than a fair coin (p = 0.5), and 4.65 times more consistent than p = 0.9. Likelihood comparisons, not individual values, drive inference.
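The arithmetic in this example can be reproduced in a few lines. This is a minimal sketch using only the standard library; the function name `binomial_likelihood` is an illustrative choice.

```python
from math import comb


def binomial_likelihood(p: float, n: int = 10, k: int = 7) -> float:
    """L(p | n, k) = C(n,k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


candidates = [0.5, 0.7, 0.9]
likelihoods = {p: binomial_likelihood(p) for p in candidates}

for p, value in likelihoods.items():
    print(f"L({p}) = {value:.4f}")

# Likelihood ratios relative to the best-supported candidate, p = 0.7.
ratio_vs_fair = likelihoods[0.7] / likelihoods[0.5]
ratio_vs_09 = likelihoods[0.7] / likelihoods[0.9]
print(f"L(0.7)/L(0.5) = {ratio_vs_fair:.2f}")
print(f"L(0.7)/L(0.9) = {ratio_vs_09:.2f}")
```

Note that the binomial coefficient C(10,7) cancels in every ratio, so the comparison between candidate values of p depends only on the pᵏ(1−p)ⁿ⁻ᵏ term.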

Binomial distribution formula: NIST Engineering Statistics Handbook, Section 1.3.6.6.18. Fisher's likelihood theory: Fisher, R.A. (1922). On the Mathematical Foundations of Theoretical Statistics. Phil. Trans. R. Soc. London A, 222, 309–368.

Example 2: Normal Distribution Parameter Estimation

Worked Example 2 — Normal Likelihood and MLE

Problem: A quality engineer measures the diameter of 5 ball bearings (in mm): x = {10.1, 9.9, 10.2, 10.0, 10.3}. Using a normal distribution model, derive the MLE estimates for the mean μ and variance σ².

Normal Log-Likelihood

ℓ(μ,σ² | x) = −(n/2)ln(2πσ²) − Σ(xᵢ−μ)² / (2σ²)

n = sample size; μ = population mean; σ² = population variance

Data: x = {10.1, 9.9, 10.2, 10.0, 10.3}, n = 5.

MLE for μ: Differentiate ℓ with respect to μ and set to zero. The score equation gives μ̂_MLE = x̄ = (10.1 + 9.9 + 10.2 + 10.0 + 10.3) / 5 = 10.1 mm. The MLE of the mean is always the sample mean for normal data.

MLE for σ²: Differentiate ℓ with respect to σ² and set to zero. This yields σ̂²_MLE = Σ(xᵢ − x̄)² / n (divides by n, not n−1).

Compute: Squared deviations from 10.1: (0.0)², (−0.2)², (0.1)², (−0.1)², (0.2)² = 0.00, 0.04, 0.01, 0.01, 0.04. Sum = 0.10. σ̂²_MLE = 0.10/5 = 0.02 mm². σ̂_MLE = √0.02 ≈ 0.141 mm.

Note on bias: The MLE variance estimator divides by n, not n−1. It is biased downward, and the bias is most noticeable in small samples. The unbiased sample variance s² = Σ(xᵢ − x̄)² / (n−1) = 0.10/4 = 0.025 mm² is preferred for inference. See the variance guide for details.

✅ Results: μ̂_MLE = 10.1 mm, σ̂²_MLE = 0.02 mm² (biased), s² = 0.025 mm² (unbiased). For normal data, the MLE mean equals the sample mean — a direct, intuitive result from the likelihood framework.
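The same estimates can be computed directly from the five measurements. The sketch below also evaluates the normal log-likelihood at the MLE and at two nearby parameter settings, as an informal check that the closed-form answer really is the peak; the variable names are illustrative.

```python
import math

diameters = [10.1, 9.9, 10.2, 10.0, 10.3]  # ball bearing diameters in mm
n = len(diameters)

mu_hat = sum(diameters) / n
sum_sq = sum((x - mu_hat) ** 2 for x in diameters)
var_mle = sum_sq / n             # MLE: divides by n
var_unbiased = sum_sq / (n - 1)  # sample variance: divides by n - 1


def log_likelihood(mu: float, var: float) -> float:
    """Normal log-likelihood: -(n/2) ln(2 pi var) - sum((x - mu)^2) / (2 var)."""
    squared_error = sum((x - mu) ** 2 for x in diameters)
    return -0.5 * n * math.log(2 * math.pi * var) - squared_error / (2 * var)


print(f"mu_hat       = {mu_hat:.3f} mm")
print(f"var_mle      = {var_mle:.4f} mm^2")
print(f"var_unbiased = {var_unbiased:.4f} mm^2")
print(f"l at MLE            = {log_likelihood(mu_hat, var_mle):.4f}")
print(f"l with s^2 instead  = {log_likelihood(mu_hat, var_unbiased):.4f}")
print(f"l with mu = 10.0    = {log_likelihood(10.0, var_mle):.4f}")
```

The log-likelihood is highest at (μ̂, σ̂²_MLE). Plugging in the unbiased variance lowers it slightly, which illustrates that "unbiased" and "maximum likelihood" are different criteria.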

Example 3: Logistic Regression and Binary Classification

Worked Example 3 — Logistic Regression Likelihood

Problem: A simple logistic regression model predicts y ∈ {0, 1} from a single input x. With weights β₀ = −3 and β₁ = 2, compute the log-likelihood for three training observations: (x=1, y=1), (x=2, y=1), (x=0, y=0).

Logistic Regression Likelihood

p̂ᵢ = 1 / (1 + e^−(β₀ + β₁xᵢ))

ℓ = Σ [yᵢ ln(p̂ᵢ) + (1−yᵢ)ln(1−p̂ᵢ)]

Obs 1 (x=1, y=1): p̂ = 1/(1+e^(−(−3+2))) = 1/(1+e^1) = 1/(1+2.718) = 0.269. Contribution: 1×ln(0.269) = −1.313

Obs 2 (x=2, y=1): p̂ = 1/(1+e^(−(−3+4))) = 1/(1+e^(−1)) = 1/(1+0.368) = 0.731. Contribution: 1×ln(0.731) = −0.313

Obs 3 (x=0, y=0): p̂ = 1/(1+e^(−(−3+0))) = 1/(1+e^3) = 1/(1+20.09) = 0.0474. Contribution: (1−0)×ln(1−0.0474) = ln(0.9526) = −0.049

Total log-likelihood: ℓ = −1.313 + (−0.313) + (−0.049) = −1.675. During training, gradient descent would adjust β₀ and β₁ to bring this value closer to 0 (maximize log-likelihood = minimize NLL).

✅ Log-likelihood = −1.675. The model fits poorly for Observation 1 (p̂ = 0.269 for an actual y=1 event) but well for Observations 2 and 3. Training adjusts β₀ and β₁ to maximize ℓ, which is the objective that gradient descent on the NLL (cross-entropy) loss optimizes in a binary classifier. See the logistic regression guide for the full training procedure.
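The per-observation contributions can be recomputed without rounding the intermediate values. The sketch below uses the weights and the three observations from the problem statement; the helper name `sigmoid` is an illustrative choice. Carried at full precision, the total comes out at about −1.675; rounding intermediate values by hand can shift the third decimal place.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


beta0, beta1 = -3.0, 2.0
observations = [(1, 1), (2, 1), (0, 0)]  # (x, y) pairs

total = 0.0
for x, y in observations:
    p_hat = sigmoid(beta0 + beta1 * x)
    # Bernoulli log-likelihood: y ln(p) + (1 - y) ln(1 - p).
    contribution = y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat)
    total += contribution
    print(f"x={x}, y={y}: p_hat={p_hat:.4f}, contribution={contribution:.4f}")

print(f"total log-likelihood = {total:.4f}")
```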

The Likelihood Ratio and Likelihood Ratio Test

What Is a Likelihood Ratio?

A likelihood ratio compares two competing parameter values or models by dividing one likelihood by the other. It answers the question: "How many times more consistent is the data with θ₁ than with θ₀?"

Likelihood Ratio

Λ = L(θ₁ | x) / L(θ₀ | x)

Λ > 1 means θ₁ fits the data better than θ₀; Λ < 1 means θ₀ fits the data better than θ₁; Λ = 1 means equal support from the data

The Likelihood Ratio Test (LRT)

The Likelihood Ratio Test compares two nested statistical models — a simpler null model (H₀, with fewer parameters) and a more complex alternative model (H₁, with additional parameters). Neyman and Pearson showed that the LRT is the most powerful test for comparing two simple hypotheses, a result known as the Neyman-Pearson lemma.

Likelihood Ratio Test Statistic

λ = −2 ln [L(θ̂₀ | x) / L(θ̂₁ | x)] = −2 [ℓ(θ̂₀) − ℓ(θ̂₁)]

Under H₀, λ approximately follows a χ² distribution; df = difference in number of free parameters; reject H₀ if λ > χ²_α,df

Under H₀, the test statistic λ is approximately chi-square distributed with degrees of freedom equal to the difference in the number of free parameters between the two models. This lets you use the chi-square table to find the critical value for your chosen significance level. For a formal treatment, see the hypothesis testing guide.
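As an illustration, the test can be applied to the coin example from earlier in this guide: H₀ fixes p = 0.5 (no free parameters) and H₁ leaves p free (one free parameter), so df = 1. The sketch below is a hand-rolled calculation using the standard library only. It relies on the identity P(χ²₁ > s) = erfc(√(s/2)), which holds for one degree of freedom, and hard-codes the familiar 5% critical value of 3.841.

```python
import math


def binomial_log_likelihood(p: float, n: int, k: int) -> float:
    """ln of the binomial PMF, viewed as a function of p."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1 - p)


n, k = 10, 7

ll_null = binomial_log_likelihood(0.5, n, k)   # H0: p fixed at 0.5
ll_alt = binomial_log_likelihood(k / n, n, k)  # H1: p at its MLE, k/n

# LRT statistic: -2 [l(null) - l(alternative)].
lrt_stat = -2.0 * (ll_null - ll_alt)

# Upper-tail probability of a chi-square variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(lrt_stat / 2.0))

critical_value = 3.841  # chi-square critical value for alpha = 0.05, df = 1
reject_null = lrt_stat > critical_value

print(f"LRT statistic = {lrt_stat:.4f}")
print(f"p-value       = {p_value:.4f}")
print(f"reject H0 at 5%? {reject_null}")
```

With only 10 flips the chi-square approximation is rough, so the p-value here is best read as indicative. The statistic falls below the critical value, so seven heads in ten flips is not enough to reject a fair coin at the 5% level.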

Likelihood in Bayesian Inference

In Bayesian statistics, likelihood plays a specific structural role in Bayes' theorem. It is the mechanism that converts prior beliefs about parameters into posterior beliefs after observing data.

Bayes' Theorem

P(θ | x) ∝ L(θ | x) × P(θ)

P(θ | x) = posterior distribution; L(θ | x) = likelihood (data evidence); P(θ) = prior distribution

The prior P(θ) represents what you believed about the parameter before seeing the data. The likelihood L(θ | x) = P(x | θ) is the evidence from the data. Multiplying them gives a quantity proportional to the posterior — your updated belief after observing x. A detailed treatment of this updating process is in the Bayes' theorem guide.
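The update can be carried out numerically on a grid of parameter values. The sketch below applies it to the 7-heads-in-10-flips coin example and assumes a flat prior on p, chosen purely for illustration; the grid size and variable names are likewise arbitrary.

```python
from math import comb

n, k = 10, 7

# Grid of candidate p values (midpoints of 1,000 equal slices of [0, 1]).
grid = [(i + 0.5) / 1000 for i in range(1000)]
width = 1.0 / len(grid)

prior = [1.0 for _ in grid]  # flat prior density (an assumption)
likelihood = [comb(n, k) * p**k * (1 - p) ** (n - k) for p in grid]

# Posterior is proportional to likelihood x prior; normalize so it integrates to 1.
unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
evidence = sum(unnormalized) * width
posterior = [u / evidence for u in unnormalized]

posterior_mean = sum(p * d for p, d in zip(grid, posterior)) * width
posterior_mode = grid[max(range(len(grid)), key=posterior.__getitem__)]

print(f"posterior mean = {posterior_mean:.4f}")
print(f"posterior mode = {posterior_mode:.4f}")
```

With a flat prior the posterior has the same shape as the likelihood, so its mode sits at the MLE of 0.7. The posterior mean is pulled slightly toward 0.5, to about 0.667, because the posterior is a proper distribution over p and averages over the whole curve.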

Frequentist vs. Bayesian Use of Likelihood

In frequentist statistics (MLE), likelihood is used alone — no prior is involved. In Bayesian statistics, likelihood is combined with a prior. The likelihood function is the same mathematical object in both frameworks; only what you do with it differs.

How to Interpret Likelihood Values

Likelihood values are interpreted by comparison, not in isolation. The table below covers the most common scenarios you will encounter when reading or reporting likelihood-based analyses.

Situation	What It Means	Practical Conclusion
L(θ₁ | x) > L(θ₂ | x)	θ₁ explains the data better than θ₂	Prefer θ₁ as a parameter estimate
Λ = L(θ₁)/L(θ₂) = 3	Data is 3× more consistent with θ₁ than θ₂	Moderate evidence favoring θ₁
ℓ(θ | x) → 0 (discrete data)	Joint probability of the data approaches 1	Near-perfect model fit (for continuous data, densities can exceed 1, so ℓ can be positive)
ℓ(θ | x) → −∞	Model cannot explain the observed data	Poor fit: wrong distribution or parameters
NLL decreasing during training	Model is fitting the training data better	Training is progressing correctly
LRT statistic λ > χ²_α,df	Adding parameters significantly improves fit	Reject the simpler (null) model

Real-World Applications of Likelihood

Clinical Trials & Survival Analysis

Cox proportional hazards models use partial likelihood to estimate how covariates (treatment assignment, age, dosage) affect patient survival time without specifying the baseline hazard function.

Financial Risk Modeling

ARCH and GARCH models estimate volatility clustering in financial returns using MLE. The likelihood function captures the conditional variance structure across time.

Natural Language Processing

Autoregressive language models (GPT-style) are trained by maximizing the conditional log-likelihood of each next token given prior context — the product of all these conditional probabilities is the sequence likelihood.

Genetics & Phylogenetics

Maximum likelihood methods reconstruct evolutionary trees by finding the tree topology and branch lengths that make the observed gene sequences most probable under a substitution model.

Quality Control

Engineers use MLE to estimate the parameters of failure-time distributions (Weibull, exponential) from censored lifetime data, predicting product reliability and warranty costs.

Bayesian Machine Learning

Variational autoencoders (VAEs) optimize the evidence lower bound (ELBO), which is a tractable lower bound on the log-likelihood of the observed data under the generative model.

Quick-Reference Formula Tables

Likelihood Metrics Summary

Concept	Formula	Primary Purpose
Likelihood Function	L(θ | x) = ∏ f(xᵢ | θ)	Quantifies relative support for parameter values
Log-Likelihood	ℓ(θ | x) = Σ ln f(xᵢ | θ)	Numerically stable form for optimization
Negative Log-Likelihood	NLL = −ℓ(θ | x)	Standard minimization loss function in ML
MLE	θ̂ = arg max_θ L(θ | x)	Finds the best-fitting parameter values
Likelihood Ratio	Λ = L(θ₁ | x) / L(θ₀ | x)	Compares support for two parameter values
LRT Statistic	λ = −2 ln Λ	Formal hypothesis test between nested models
Bayesian Posterior	P(θ | x) ∝ L(θ | x) × P(θ)	Combines likelihood with prior belief

MLE Formulas for Common Distributions

Distribution	Parameter	MLE Formula	Equals
Normal	Mean μ	μ̂ = (Σxᵢ) / n	Sample mean x̄
Normal	Variance σ²	σ̂² = Σ(xᵢ − x̄)² / n	Biased variance (divides by n)
Binomial	Success prob p	p̂ = k / n	Sample proportion
Poisson	Rate λ	λ̂ = x̄ = (Σxᵢ) / n	Sample mean
Exponential	Rate λ	λ̂ = n / (Σxᵢ)	Reciprocal of sample mean
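The closed-form entries in this table translate directly into code. The sketch below implements each one as a small function and, for the Poisson case, cross-checks the formula against a brute-force search over the log-likelihood. The function names and sample data are hypothetical.

```python
import math


def mle_normal(xs: list[float]) -> tuple[float, float]:
    """Return (mean, variance) MLEs; the variance divides by n."""
    n = len(xs)
    mean = sum(xs) / n
    return mean, sum((x - mean) ** 2 for x in xs) / n


def mle_binomial(k: int, n: int) -> float:
    """Sample proportion of successes."""
    return k / n


def mle_poisson(counts: list[int]) -> float:
    """Sample mean of the counts."""
    return sum(counts) / len(counts)


def mle_exponential(times: list[float]) -> float:
    """Reciprocal of the sample mean."""
    return len(times) / sum(times)


# Cross-check the Poisson formula by maximizing the log-likelihood on a grid.
counts = [2, 4, 3, 5, 1]  # hypothetical event counts


def poisson_log_likelihood(rate: float) -> float:
    # ln f(x | rate) = x ln(rate) - rate - ln(x!), with ln(x!) = lgamma(x + 1).
    return sum(x * math.log(rate) - rate - math.lgamma(x + 1) for x in counts)


grid = [i / 100 for i in range(1, 1001)]
grid_best = max(grid, key=poisson_log_likelihood)

print("Poisson closed form:", mle_poisson(counts))
print("Poisson grid search:", grid_best)
print("Binomial:", mle_binomial(7, 10))
print("Exponential:", mle_exponential([0.5, 1.5, 1.0]))
```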

FAQs

Likelihood in statistics is a function that measures how plausible a set of parameter values is given observed data. Written as L(θ | x), it reverses the usual direction of probability by treating data as fixed and parameters as variables. It is the foundation of maximum likelihood estimation and plays a central role in Bayesian inference.

A likelihood function, L(θ | x), is the joint probability of the observed data expressed as a function of unknown parameters. For independent observations, it is written as the product of individual densities: L(θ | x) = ∏ f(xᵢ | θ). The data remain fixed while θ varies across possible values.

Probability gives the chance of future data given fixed parameters. Likelihood evaluates how well different parameter values explain observed data. Both use the same mathematical form f(x | θ), but probability treats x as variable, while likelihood treats θ as variable.

Choose a statistical model, plug in parameter values, and compute the joint probability of the observed data. For independent observations, multiply probabilities across all data points: L(θ | x) = ∏ f(xᵢ | θ). In practice, the log-likelihood is used: ℓ(θ | x) = Σ ln f(xᵢ | θ).

Likelihood is interpreted comparatively, not absolutely. A higher likelihood means a parameter value better explains the data than another. Ratios of likelihoods are meaningful, while raw likelihood values alone do not represent probabilities.

Maximum likelihood estimation (MLE) finds the parameter values that maximize the likelihood function. These values make the observed data most probable under the model. In practice, the log-likelihood is maximized because it is mathematically simpler and numerically stable.

Log likelihood is the natural logarithm of the likelihood function. It converts products into sums, making calculations easier and more stable. Maximizing log likelihood gives the same result as maximizing likelihood.

Many machine learning models are trained by maximizing likelihood or minimizing negative log likelihood. This includes logistic regression, neural networks, and probabilistic models. Cross-entropy loss is a form of negative log likelihood for classification tasks.

The likelihood ratio test compares two nested models by evaluating their likelihoods. It measures whether a more complex model significantly improves fit over a simpler one. The test statistic is based on the ratio of maximized likelihoods and, in large samples, approximately follows a chi-square distribution under the null hypothesis.

The likelihood principle states that all information about a parameter is contained in the likelihood function. Different experiments that produce proportional likelihoods should lead to the same inference, regardless of how the data were collected.

Entity and Formula Glossary

Term	Formula	Definition
Likelihood	L(θ | x)	Plausibility measure for parameter values given observed data
Likelihood Function	L(θ | x) = ∏ f(xᵢ | θ)	Joint probability of data, treated as a function of the parameters
Log-Likelihood	ℓ(θ | x) = Σ ln f(xᵢ | θ)	Natural log of the likelihood, used for numerical stability
Negative Log-Likelihood	NLL = −ℓ(θ | x)	Loss function minimized during machine learning training
MLE	θ̂ = arg max_θ L(θ | x)	Parameter values making the observed data most probable
Likelihood Ratio	Λ = L(θ₁ | x) / L(θ₀ | x)	Relative support for two competing parameter values
Score Function	s(θ) = ∂ℓ / ∂θ	First derivative of log-likelihood; zero at the MLE
Fisher Information	I(θ) = −E[∂²ℓ / ∂θ²]	Curvature of the log-likelihood; governs MLE variance via Cramér-Rao
Prior Distribution	P(θ)	Beliefs about θ before observing data; used in Bayesian inference
Posterior Distribution	P(θ | x) ∝ L(θ | x) × P(θ)	Updated beliefs about θ after observing data
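The score function and Fisher information entries can be made concrete with the binomial coin example (n = 10, k = 7). The sketch below writes out the first and second derivatives of the binomial log-likelihood by hand and uses the standard large-sample approximation that the standard error of the MLE is about 1/√I(θ̂). The function names are illustrative, and with n = 10 the approximation is only rough.

```python
import math

n, k = 10, 7


def score(p: float) -> float:
    """First derivative of l(p) = const + k ln(p) + (n - k) ln(1 - p)."""
    return k / p - (n - k) / (1 - p)


def observed_information(p: float) -> float:
    """Negative second derivative of the log-likelihood at p."""
    return k / p**2 + (n - k) / (1 - p) ** 2


def fisher_information(p: float) -> float:
    """Expected information for a binomial with n trials: n / (p (1 - p))."""
    return n / (p * (1 - p))


p_hat = k / n
standard_error = 1.0 / math.sqrt(fisher_information(p_hat))

print(f"score at p_hat        = {score(p_hat):.2e}")
print(f"observed information  = {observed_information(p_hat):.3f}")
print(f"Fisher information    = {fisher_information(p_hat):.3f}")
print(f"approx standard error = {standard_error:.4f}")
```

A sharply curved log-likelihood (large information) means the data pin the parameter down tightly. A flat one means many parameter values are nearly equally supported.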

Sources and Further Reading

Fisher, R.A. (1922). On the Mathematical Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society of London. Series A, 222, 309–368. doi:10.1098/rsta.1922.0009

Casella, G. and Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury Press. ISBN 978-0534243128. The standard graduate-level reference for likelihood theory, MLE, and hypothesis testing.

NIST/SEMATECH e-Handbook of Statistical Methods. Maximum Likelihood. itl.nist.gov

Neyman, J. and Pearson, E.S. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society A, 231, 289–337. — Original paper establishing the Neyman-Pearson lemma and the likelihood ratio test.

Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. probml.github.io — Comprehensive treatment of likelihood in the machine learning context.

Stanford Encyclopedia of Philosophy: Philosophy of Statistics. plato.stanford.edu/entries/statistics/. Covers philosophical foundations including the likelihood principle.