What Are Type I and Type II Errors?
Every hypothesis test results in one of four outcomes. Either the null hypothesis H₀ is true or it is false, and either you reject it or you do not. Two of those four outcomes are correct decisions. The other two are errors — and those errors have names.
These two error types were formalized by statisticians Jerzy Neyman and Egon Pearson in their landmark 1933 paper as part of the Neyman–Pearson decision framework. Their insight was that a test should be designed not just to detect effects, but to control the rates of both kinds of mistakes. The complete context lives in the hypothesis testing reference on Statistics Fundamentals.
Type I Error — Definition
A Type I error occurs when the data leads you to reject H₀, but H₀ is actually true. You have detected an effect that does not exist. In diagnostic language, this is a false positive.
The probability of a Type I error is exactly equal to the significance level α that you set before running the test. If you set α = 0.05, then in repeated testing under a true null hypothesis, 5% of your tests will produce a false positive purely by chance. Choosing a smaller α (say, 0.01) makes Type I errors rarer but does not eliminate them.
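The claim that α is the long-run false positive rate can be checked by simulation. The sketch below is illustrative only: it assumes a two-sided one-sample z-test with known σ = 1 and data drawn from N(0, 1), so the null hypothesis is true by construction. The function name and its defaults are hypothetical.

```python
import random
from statistics import NormalDist

def type_i_error_rate(alpha=0.05, n=30, n_tests=5000, seed=0):
    """Fraction of two-sided z-tests that reject H0 when H0 (mu = 0) is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    rejections = 0
    for _ in range(n_tests):
        # Draw a sample from N(0, 1), so the null hypothesis really holds.
        sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z = sample_mean / (1.0 / n ** 0.5)  # known sigma = 1 (assumption)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_tests

rate = type_i_error_rate()
print(f"Observed false positive rate: {rate:.3f}")  # should land near alpha
```

Rerunning with `alpha=0.01` should push the observed rate toward 0.01, but never to zero.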
Type II Error — Definition
A Type II error occurs when the data does not give enough evidence to reject H₀, but H₀ is false — a real effect exists and the test missed it. In diagnostic language, this is a false negative.
The probability of a Type II error is denoted β (beta). It depends on the significance level, the sample size, the actual effect size, and the variability in the data. Unlike α, β is not set directly; it is determined by these factors combined. The complement of β is statistical power: Power = 1 − β.
- Type I error (false positive): Reject H₀ when H₀ is true. Probability = α
- Type II error (false negative): Fail to reject H₀ when H₀ is false. Probability = β
- Correct rejection: Reject H₀ when H₀ is false. Probability = 1 − β = Power
- Correct retention: Fail to reject H₀ when H₀ is true. Probability = 1 − α = Specificity
- Memory rule: Type I = "false alarm." Type II = "missed detection."
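To make the dependence of β on α, n, effect size, and variability concrete, here is one closed form. It assumes a one-sided (upper-tailed) one-sample z-test with known σ, a hypothesised mean μ₀, and a true mean μ₁ > μ₀. These are assumptions chosen for illustration, and other tests have different formulas.

```latex
\[
\beta
  = P(\text{fail to reject } H_0 \mid \mu = \mu_1)
  = \Phi\!\left( z_{1-\alpha} - \frac{\mu_1 - \mu_0}{\sigma / \sqrt{n}} \right),
\qquad
\text{Power} = 1 - \beta
\]
```

Here Φ is the standard normal CDF and z₁₋α is the critical value. A smaller α raises z₁₋α and so raises β, while a larger n, a larger difference μ₁ − μ₀, or a smaller σ all lower β.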
The 2×2 Decision Matrix
Every statistical decision falls into one of four cells in this matrix. The rows represent your decision; the columns represent reality. Two cells are correct outcomes and two cells are errors.
| | H₀ Is True (No real effect) | H₀ Is False (Real effect exists) |
|---|---|---|
| Reject H₀ | ❌ Type I Error (False Positive), P = α | ✅ Correct Decision (True Positive), P = 1 − β = Power |
| Fail to Reject H₀ | ✅ Correct Decision (True Negative), P = 1 − α | ⚠️ Type II Error (False Negative), P = β |
Reading the table: the left column shows what happens when there truly is no effect. The right column shows what happens when a real effect exists. In the left column, you want to "fail to reject H₀" (bottom cell). In the right column, you want to "reject H₀" (top cell). The two error cells are the ones where your decision does not match reality.
- α = P(Reject H₀ | H₀ true)
- β = P(Keep H₀ | H₀ false)
- Power = 1 − β = P(Reject H₀ | H₀ false)
- 1 − α = P(Keep H₀ | H₀ true)
Type I vs Type II Errors — Comparison
| Feature | Type I Error | Type II Error |
|---|---|---|
| Alternative name | False positive | False negative |
| What happens | Reject a true null hypothesis | Fail to reject a false null hypothesis |
| Symbol | α (alpha) | β (beta) |
| Directly controlled by | The chosen significance level | Sample size, effect size, α, variability |
| Typical default value | α = 0.05 | β = 0.20 (Power = 0.80) |
| Reduced by | Lowering α | Increasing n, increasing α, larger effect size |
| Analogy | Convicting an innocent person | Acquitting a guilty person |
| In medicine (screening) | Diagnosing healthy patient as sick | Missing a disease in a sick patient |
| Relationship to power | — | Power = 1 − β |
| Effect of increasing α | Increases | Decreases |
| Effect of larger n | Unchanged (still equals α) | Decreases β (raises power) |
Alpha, Beta, and Statistical Power
The three quantities α, β, and power are not independent. Once you fix the significance level, the sample size, and the effect size, the value of β — and therefore power — is determined. Understanding how they interact is the foundation of research design.
Significance Level (α) — Type I Error Rate
α is the pre-set threshold you compare the p-value to. It is also the probability that a true null hypothesis will be rejected by chance. Setting α = 0.05 means you accept a 5% rate of false positives across repeated tests under a true null. Fields where a false positive is especially costly use far stricter thresholds: particle physics conventionally requires about 5σ (a one-sided p-value near 3 × 10⁻⁷) for a discovery claim, and genome-wide association studies use 5 × 10⁻⁸. Exploratory research sometimes accepts α = 0.10.
α = 0.05: conventional threshold in most research
α = 0.01: conservative (medicine, safety)
α = 0.001 or smaller: very strict (genomics, physics)
Beta (β) — Type II Error Rate
β is the probability that a test fails to detect a real effect. A β of 0.20 means a 20% chance of missing an effect that genuinely exists. This is the conventional maximum, which corresponds to a power of 0.80 — the benchmark set by Jacob Cohen in his foundational 1988 book Statistical Power Analysis for the Behavioral Sciences.
β = 0.20 — conventional maximum (Power = 0.80)
β = 0.10 — high-stakes research (Power = 0.90)
β depends on n, effect size, α, and σ
Statistical Power (1 − β)
Power is the probability of correctly rejecting a false null hypothesis. A test with power = 0.80 will detect a true effect 80% of the time across repeated studies. Power analysis before data collection calculates the sample size needed to reach your target power. The key inputs are α, the expected effect size, the population variance, and your target β.
Power ↑ when n increases
Power ↑ when α increases
Power ↑ when effect size increases
Power ↓ when σ (noise) increases
For a fixed sample size, lowering α (reducing Type I errors) raises the rejection threshold, which also makes it harder to detect real effects, so β increases and power falls. With the effect size and noise level taken as given, the only way to reduce both errors simultaneously is to collect more data. This is why power analysis belongs before a study begins, not after.
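The tradeoff can be tabulated directly. The following is a minimal sketch assuming a two-sided one-sample z-test with known σ. The effect of 0.4 and σ of 1.0 are hypothetical values picked only to make the pattern visible.

```python
from statistics import NormalDist

def beta_two_sided(alpha, n, effect, sigma):
    """Type II error rate of a two-sided one-sample z-test with known sigma."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect / (sigma / n ** 0.5)  # true effect in standard-error units
    # Probability the z statistic stays inside (-z_crit, z_crit) when H0 is false.
    return std_normal.cdf(z_crit - shift) - std_normal.cdf(-z_crit - shift)

# Hypothetical setting: true effect 0.4, sigma 1.0.
for n in (30, 120):
    for alpha in (0.10, 0.05, 0.01):
        beta = beta_two_sided(alpha, n, effect=0.4, sigma=1.0)
        print(f"n={n:3d}  alpha={alpha:.2f}  beta={beta:.3f}  power={1 - beta:.3f}")
```

Within each n, β rises as α falls. Moving from n = 30 to n = 120 lowers β at every α, which is the sense in which more data eases the tradeoff.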
Type I and Type II Errors — 6 Worked Examples
Each example below identifies the null hypothesis, describes what each type of error would mean in context, and explains which error carries greater cost. These are the kinds of scenarios that appear on exams and in research practice.
Example 1 — Medical Screening (Cancer Test)
A hospital uses a blood test to screen patients for a rare cancer. The test has a 5% false positive rate and a 15% false negative rate.
Null hypothesis: The patient does not have cancer (H₀: no disease).
Type I error (false positive): The test says the patient has cancer, but they are healthy. The patient undergoes unnecessary follow-up procedures, anxiety, and possibly harmful treatment. α = 0.05 here.
Type II error (false negative): The test says the patient is healthy, but they actually have cancer. The disease goes untreated and may progress to a life-threatening stage. β = 0.15 here.
⚠️ Verdict: In cancer screening, a Type II error (missed cancer) is typically far more serious than a Type I error. This is why screening tests are often designed with low β even at the cost of more false positives — the follow-up tests filter those out.
Example 2 — Criminal Justice (Trial Analogy)
A defendant is on trial. The jury must decide: guilty or not guilty.
Null hypothesis: The defendant is innocent (H₀: not guilty).
Type I error: The jury convicts an innocent person. This is considered the graver error in criminal law — "better that ten guilty persons escape than one innocent suffer" (Blackstone's ratio).
Type II error: The jury acquits a guilty person. The guilty party goes free. This is bad but considered less catastrophic than punishing the innocent.
✅ The criminal justice system sets α very low (high standard of proof — "beyond reasonable doubt") to minimize Type I errors, accepting a higher β as the cost. This is a deliberate societal choice about which error type to prioritize.
Example 3 — Drug Approval (Clinical Trial)
A pharmaceutical company tests whether a new antidepressant reduces depression scores more than a placebo. The trial uses α = 0.05.
Null hypothesis: The drug has no effect compared to the placebo (H₀: μ_drug = μ_placebo).
Type I error: The trial concludes the drug works, but it does not. The FDA approves an ineffective drug. Patients pay for a useless treatment and miss out on effective alternatives. Real-world cost: very high.
Type II error: The trial finds no significant result, but the drug genuinely works. An effective treatment never reaches patients. Confirmatory trials are conventionally designed for power of at least 0.80, and often 0.90, precisely to limit this error.
✅ Both errors matter here. Regulators conventionally expect α = 0.05 as the Type I guard and a prospective sample-size justification to control β. An underpowered trial that fails to detect a real benefit also harms patients, because an effective treatment never reaches them.
Example 4 — Manufacturing Quality Control
A factory produces bolts with a target diameter of 10 mm. Quality control samples 30 bolts per batch and tests H₀: μ = 10 mm at α = 0.01.
Null hypothesis: The machine is calibrated correctly and producing 10 mm bolts (H₀: μ = 10).
Type I error: The test rejects H₀ — the machine is shut down for recalibration — but the machine was working fine. Result: lost production time and labor cost with no defect problem.
Type II error: The test fails to reject H₀, but the machine has drifted. Defective bolts continue to ship. In safety-critical applications (aerospace, automotive), this can cause product failures.
✅ The factory uses α = 0.01 to minimize unnecessary shutdowns (Type I), but also ensures n = 30 gives adequate power. The appropriate balance depends on the cost of defects vs the cost of downtime.
Example 5 — A/B Testing (Digital Product)
An e-commerce company tests whether a new checkout button color increases conversion rate. They run an A/B test for two weeks with α = 0.05 and target power = 0.80.
Null hypothesis: The new button color has no effect on conversion rate (H₀: p_new = p_control).
Type I error: The test concludes the new color converts better, but the result was random noise. The company ships the change, wasting engineering resources on an ineffective update and possibly disrupting the user experience.
Type II error: The test finds no significant difference, but the new button genuinely converts better by 2%. The company misses revenue. This often happens when tests are stopped too early before reaching the planned sample size.
✅ A/B testing in product development commonly suffers from "peeking" — checking results before the planned sample size is reached. Stopping early inflates Type I error rates and prevents reaching the statistical power needed to detect small but real effects.
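The inflation caused by peeking can be seen in a small simulation. This is a simplified sketch, not a model of a real A/B platform: it assumes a one-sample z-test on N(0, 1) noise standing in for the difference between variants, with no true effect, and ten equally spaced looks. The function name, batch size, and number of looks are hypothetical.

```python
import random
from statistics import NormalDist

def false_positive_rates(alpha=0.05, looks=10, batch=50, n_sims=2000, seed=1):
    """Compare Type I error with one final test vs. testing at every look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_sims):
        total, count, rejected_at_any_look = 0.0, 0, False
        for _ in range(looks):
            # Each batch has no true effect: observations are pure N(0, 1) noise.
            total += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            count += batch
            z = total / count ** 0.5  # z statistic for mean 0, known sigma 1
            if abs(z) > z_crit:
                rejected_at_any_look = True  # a peeker would stop and ship here
        peeking_hits += rejected_at_any_look
        final_hits += abs(z) > z_crit  # single test at the planned sample size
    return peeking_hits / n_sims, final_hits / n_sims

peeking, planned = false_positive_rates()
print(f"Reject at any look: {peeking:.3f}   Reject only at the end: {planned:.3f}")
```

The single final test should stay near α, while rejecting at any look produces a noticeably higher false positive rate even though nothing changed about the data.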
Example 6 — Psychology Research
A researcher tests whether a mindfulness intervention reduces anxiety scores. n = 40, α = 0.05, power analysis suggests power ≈ 0.65 — below the 0.80 standard.
Null hypothesis: Mindfulness training does not reduce anxiety (H₀: no treatment effect).
Type I error (α = 0.05): The study reports a significant anxiety reduction, but the effect was a fluke. This contributes to the replication crisis in psychology if the finding is published and other labs try and fail to replicate it.
Type II error (β ≈ 0.35): With power = 0.65, there is a 35% chance of missing a real treatment effect. The underpowered study may conclude "no effect" and shelve a genuinely helpful intervention.
⚠️ This study is underpowered. The researcher should either increase n to roughly 60 (under a normal approximation with the assumed effect size held fixed, reaching power = 0.80 takes about 1.4 times the current sample) or report the study as preliminary and interpret a null result cautiously. Publishing underpowered null results as "no effect" findings is a methodological error.
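As a rough cross-check on the sample size needed here, the sketch below rescales n from the stated power. It assumes a two-sided test, a normal approximation, and an unchanged true effect size. The helper name is hypothetical, and an exact t-test calculation or a different design would shift the answer somewhat.

```python
import math
from statistics import NormalDist

def n_for_target_power(n_now, power_now, power_target, alpha=0.05):
    """Rescale n so a two-sided z-test reaches power_target (same effect size)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    # The noncentrality grows with sqrt(n), so n scales with the squared ratio.
    ratio = (z_alpha + std_normal.inv_cdf(power_target)) / (
        z_alpha + std_normal.inv_cdf(power_now)
    )
    return math.ceil(n_now * ratio ** 2)

# Values from the example: n = 40 with power of about 0.65, target power 0.80.
print(n_for_target_power(n_now=40, power_now=0.65, power_target=0.80))
```

Under these assumptions the result lands in the high 50s, since the required n grows by a factor of roughly 1.4.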
The Tradeoff Between Type I and Type II Errors
For a fixed dataset, reducing one type of error increases the other. This inverse relationship follows directly from the mechanics of hypothesis testing: the significance threshold determines both the boundary for rejection and how sensitive the test is to real effects.
If you lower α to reduce false positives, you move the rejection region further out — which means real effects near the boundary are no longer detected, and β increases. If you raise α to catch more real effects, you also catch more false ones. Given fixed n, you cannot minimize both simultaneously.
How to Reduce Both Errors
The way out of the tradeoff is larger samples. With more data, the sampling distribution of the test statistic narrows, which means the rejection region can be placed further out (lower α) while still overlapping more with the distribution under H₁ — so β falls and power rises. This is why power analysis begins with your target α and desired power and solves for the required sample size.
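Solving for the sample size can be sketched in a few lines. The example assumes a two-sided one-sample z-test with known σ and uses the usual normal-approximation formula n = ((z₁₋α/₂ + z₁₋β) · σ / effect)². The function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def required_n(effect, sigma, alpha=0.05, power=0.80):
    """Smallest n for a two-sided one-sample z-test (normal approximation)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    n = ((z_alpha + z_power) * sigma / effect) ** 2
    return math.ceil(n)

# Hypothetical inputs: detect a 5-point shift when sigma = 10 (Cohen's d = 0.5).
print(required_n(effect=5, sigma=10))  # 32
```

Tightening α or raising the target power increases the required n, and halving the effect size roughly quadruples it.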
Lower α (e.g., 0.05 → 0.01)
Fewer Type I errors. Type II errors increase. Power falls. Use when false positives are costly.
Increase Sample Size n
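Lowers β at a fixed α by narrowing sampling variability; the Type I error rate itself stays at α. A larger n also lets you choose a smaller α without giving up power, which is how both error rates can be reduced together. The primary lever in power analysis.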
Increase Effect Size
Larger effects are easier to detect, so β falls. Achieved by stronger treatments, purer populations, or better measurement instruments.
Reduce Variability (σ)
Tighter measurement and controlled conditions reduce noise, improving the signal-to-noise ratio and lowering β.
Power and Beta Calculator
Statistical power (1 − β) and the Type II error rate (β) for a one-sample z-test follow from four inputs: the significance level, the sample size, the expected effect size, and the population standard deviation. The resulting power is the probability of detecting a real effect of the given magnitude.
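A minimal sketch of that calculation follows, assuming a one-sample z-test with known σ. The function name, the `two_sided` flag, and the example inputs are hypothetical and are not taken from any particular tool.

```python
from statistics import NormalDist

def power_and_beta(alpha, n, effect, sigma, two_sided=True):
    """Return (power, beta) for a one-sample z-test with known sigma."""
    std_normal = NormalDist()
    shift = abs(effect) / (sigma / n ** 0.5)  # noncentrality: effect in SE units
    if two_sided:
        z_crit = std_normal.inv_cdf(1 - alpha / 2)
        # Reject when the z statistic falls in either tail.
        power = (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)
    else:
        # One-sided test, assumed to point in the direction of the true effect.
        z_crit = std_normal.inv_cdf(1 - alpha)
        power = 1 - std_normal.cdf(z_crit - shift)
    return power, 1 - power

# Hypothetical inputs: alpha = 0.05, n = 50, true shift of 2 units, sigma = 5.
power, beta = power_and_beta(alpha=0.05, n=50, effect=2.0, sigma=5.0)
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

With these inputs the power comes out at about 0.81, just above the conventional 0.80 benchmark.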
Which Error Is Worse — Type I or Type II?
There is no universal answer. The relative severity of each error depends entirely on the consequences of being wrong in a particular direction. The researcher — not the statistician — makes this judgment before choosing α.
| Domain | Type I Error | Type II Error | Which Is Worse? |
|---|---|---|---|
| Cancer screening | Treat a healthy patient | Miss a real cancer | Type II (life-threatening) |
| Criminal law | Convict the innocent | Acquit the guilty | Type I (legal principle) |
| Drug approval | Approve an ineffective drug | Reject an effective drug | Context-dependent |
| Spam filtering | Block a legitimate email | Let spam through | Type I (miss important mail) |
| A/B testing | Ship a useless change | Miss a real improvement | Depends on cost of shipping |
| Nuclear plant safety | Shut down a safe plant | Miss a real fault | Type II (safety critical) |
| Pregnancy test | False positive (says pregnant) | False negative (misses pregnancy) | Context-dependent |
How to Remember Type I and Type II Errors
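The boy who cried wolf: Type I error = crying wolf when there is no wolf (a false alarm). Type II error = missing the real wolf: when the wolf actually comes, no one believes the shepherd, and the real threat goes undetected.
The fire alarm: Type I = the alarm sounds when there is no fire (a false positive). Type II = no alarm when the building is on fire, so the real danger is missed (a false negative).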
Type I = False alarm (you falsely declared something significant). Type II = Missed detection (you missed a real signal). Or: think of Type I as the "eager" error (too quick to reject) and Type II as the "lazy" error (not sensitive enough to detect).
Complete Reference Table — All Key Formulas
| Term | Symbol | Formula / Value | Plain-Language Meaning |
|---|---|---|---|
| Type I Error | α | P(Reject H₀ \| H₀ true) | Rate of false positives; set by the researcher |
| Type II Error | β | P(Fail to reject H₀ \| H₀ false) | Rate of false negatives; depends on n, effect, σ |
| Statistical Power | 1 − β | P(Reject H₀ \| H₀ false) | Probability of detecting a real effect |
| Specificity | 1 − α | P(Fail to reject H₀ \| H₀ true) | Probability of a true negative result |
| Significance level | α | Typically 0.05 | Pre-set threshold for p-value comparison |
| p-value | p | P(result at least as extreme as observed \| H₀ true) | Evidence against H₀; reject if p < α |
| Effect size (Cohen's d) | d | (μ₁ − μ₀) / σ | Standardized magnitude of the true difference |
| Standard Error | SE | σ / √n | Precision of the sample mean; falls with larger n |
| Critical value (z, two-tailed, α=0.05) | z* | ±1.96 | Boundary of the rejection region |
| Non-centrality parameter | δ | (μ₁ − μ₀) / (σ/√n) | How far the true effect is from H₀ in SE units |
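The formulas in the table chain together. The worked example below uses hypothetical values (μ₀ = 100, μ₁ = 105, σ = 15, n = 36) and a two-sided z-test at α = 0.05. It ignores the negligible probability of rejecting in the wrong tail.

```latex
\[
\begin{aligned}
d &= \frac{\mu_1 - \mu_0}{\sigma} = \frac{105 - 100}{15} \approx 0.33, \\
SE &= \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{36}} = 2.5, \\
\delta &= \frac{\mu_1 - \mu_0}{\sigma / \sqrt{n}} = \frac{5}{2.5} = 2.0, \\
\text{Power} &\approx \Phi(\delta - z^{*}) = \Phi(2.0 - 1.96) = \Phi(0.04) \approx 0.52, \\
\beta &= 1 - \text{Power} \approx 0.48 .
\end{aligned}
\]
```

Under these assumed numbers the test would miss the real 5-point difference almost half the time, which is why n = 36 would be considered underpowered for this effect.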
Real-World Applications
The same logic of balancing false positives against false negatives applies across every domain that uses statistical inference. Recognizing which error carries greater harm in your context is the first step to designing an appropriate test.
Medical Diagnostics
Sensitivity corresponds to power (1 − β), and specificity corresponds to 1 − α, the complement of the significance level. Diagnostic tests are designed with known trade-offs between the two.
Clinical Trials
Phase III drug trials typically target power ≥ 0.80 at α = 0.05. Regulators such as the FDA review both the Type I error control and the sample-size justification, so that ineffective drugs are not approved and effective ones are not missed.
Spam Detection
Spam filters face the same tradeoff: block too aggressively (Type I: legitimate mail marked as spam) or too loosely (Type II: spam gets through). Most systems let users adjust the threshold.
Genomics / GWAS
Genome-wide association studies test millions of variants simultaneously, requiring α = 5×10⁻⁸ to control family-wise Type I error. This demands very large samples to maintain power.
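The 5×10⁻⁸ threshold can be read as a Bonferroni correction. The sketch below assumes about one million independent tests and a target family-wise error rate of 0.05. Both figures are conventional assumptions rather than properties of any specific study, and real variants are correlated.

```python
def familywise_error(alpha_per_test, n_tests):
    """P(at least one false positive) across independent tests, all under H0."""
    return 1 - (1 - alpha_per_test) ** n_tests

n_tests = 1_000_000                 # assumed number of independent tests
bonferroni_alpha = 0.05 / n_tests   # per-test threshold, about 5e-08

print(f"{bonferroni_alpha:.1e}")
print(f"{familywise_error(0.05, 20):.2f}")                # only 20 uncorrected tests
print(f"{familywise_error(bonferroni_alpha, n_tests):.3f}")  # corrected, 1e6 tests
```

Even 20 uncorrected tests at α = 0.05 give a family-wise error rate of about 0.64, while the corrected threshold keeps it just under 0.05 across a million tests.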
Machine Learning
In binary classifiers, Type I error rate = false positive rate (1 − specificity) and Type II error rate = false negative rate (1 − recall/sensitivity). The ROC curve plots the full tradeoff.
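For a classifier, both rates come straight from confusion-matrix counts. The counts below are hypothetical, chosen so the rates mirror the 5% and 15% figures used in the screening example.

```python
def error_rates(tp, fp, tn, fn):
    """False positive rate and false negative rate from confusion-matrix counts."""
    fpr = fp / (fp + tn)  # Type I rate: true negatives wrongly flagged positive
    fnr = fn / (fn + tp)  # Type II rate: true positives the classifier missed
    return fpr, fnr

# Hypothetical counts for a classifier evaluated on 1,000 labelled cases.
fpr, fnr = error_rates(tp=85, fp=45, tn=855, fn=15)
print(f"FPR = {fpr:.2f}, FNR = {fnr:.2f}")  # FPR = 0.05, FNR = 0.15
```

Moving the classification threshold trades one rate against the other, which is the movement an ROC curve traces.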
Process Control (SPC)
Control charts define warning limits (Type I) and detection ability (Type II). Tighter control limits catch more real shifts but also trigger more false alarms.
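The effect of tightening control limits can be quantified. This sketch assumes a normally distributed charted statistic with limits at ±k standard errors and a hypothetical sustained mean shift of 1.5 standard errors. It reports per-point probabilities only and does not model run rules.

```python
from statistics import NormalDist

def chart_rates(k, shift):
    """False alarm and detection probability per point for +/- k sigma limits."""
    std_normal = NormalDist()
    false_alarm = 2 * (1 - std_normal.cdf(k))  # in-control point outside limits
    # Probability a point falls outside the limits after the mean has shifted.
    detect = (1 - std_normal.cdf(k - shift)) + std_normal.cdf(-k - shift)
    return false_alarm, detect

# Hypothetical shift of 1.5 standard errors in the plotted statistic.
for k in (3.0, 2.0):
    false_alarm, detect = chart_rates(k, shift=1.5)
    print(f"k={k}: false alarm={false_alarm:.4f}, detection={detect:.3f}")
```

Moving from 3-sigma to 2-sigma limits raises the per-point detection probability for this shift from about 0.07 to about 0.31, and raises the false alarm probability from about 0.003 to about 0.046.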
FAQs
What is a Type I error?
A Type I error occurs when a true null hypothesis is rejected. The result is a false positive: the test concludes an effect exists when none does. The probability of this error equals α, the significance level set before data collection. At α = 0.05, about 1 in 20 tests on average will produce a Type I error when H₀ is true, purely by chance.
What is a Type II error?
A Type II error occurs when a false null hypothesis is not rejected, which is a false negative. The test fails to detect a real effect that exists in the population. The probability is β (beta). Standard practice targets β ≤ 0.20, meaning a power of at least 80%. β decreases when you increase sample size, use a larger α, or study a bigger effect.
What is the difference between Type I and Type II errors?
A Type I error (α) is a false positive: you reject a correct null hypothesis. A Type II error (β) is a false negative: you fail to reject an incorrect null hypothesis. Both are mistakes, but in opposite directions. Reducing one increases the other when sample size is fixed. Increasing the sample size is the main way to reduce both.
How is statistical power related to Type II error?
Power = 1 − β. A test with high power has a low Type II error rate and detects real effects more reliably. Power increases with larger sample size, larger α, or stronger effects. Lowering α, by contrast, reduces Type I errors but also reduces power (increases β). Power analysis helps determine the required sample size.
Which error is worse, Type I or Type II?
It depends on context. In criminal justice, Type I error (convicting an innocent person) is more serious. In medical screening, Type II error (missing a disease) is often more dangerous. In research and industry, the cost of each error depends on consequences, not statistics alone.
How can you reduce both Type I and Type II errors?
Increasing sample size is the main way to reduce both. More data reduces variability, allowing better separation between null and alternative hypotheses. Better measurement quality and study design also help reduce both errors.
Can a single test produce both a Type I and a Type II error?
No. A single test can produce at most one type of error: a Type I error is possible only when the null hypothesis is true, and a Type II error only when it is false. Across multiple tests, both types can occur in different instances.
What symbols are used for Type I and Type II errors?
Type I error is α (alpha). Type II error is β (beta). Statistical power is 1 − β. These are standard symbols in hypothesis testing and decision theory.
How do Type I and Type II errors apply in machine learning?
In classification problems, Type I error is a false positive and Type II error is a false negative. These correspond to false positive rate and false negative rate. ROC curves visualize the tradeoff between them across thresholds.
Sources and Further Reading
- Neyman, J. & Pearson, E.S. (1933). "On the Problem of the Most Efficient Tests of Statistical Hypotheses." Philosophical Transactions of the Royal Society A, 231, 289–337. Foundational paper defining the two error types.
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum. Established the convention of Power = 0.80.
- NIST/SEMATECH (2012). e-Handbook of Statistical Methods. National Institute of Standards and Technology. NIST Handbook §7.4, Type I and Type II Errors.
- FDA (2019). Adaptive Designs for Clinical Trials of Drugs and Biologics. U.S. Food and Drug Administration Guidance for Industry. fda.gov.
- UCLA Statistics Consulting Group. Power Analysis for Research. UCLA Institute for Digital Research and Education. stats.oarc.ucla.edu.