What is the power of a test in statistics?

The power of a test is the probability of correctly rejecting a false null hypothesis. It equals 1 minus the Type II error rate (β). A power of 0.80 means an 80% chance of detecting a real effect when one exists.

What is the formula for statistical power?

Statistical power = 1 − β, where β is the probability of a Type II error (failing to reject a false null hypothesis). If β = 0.20, power = 0.80.

What is a good statistical power value?

The conventional minimum is 0.80 (80%). Cohen (1988) established this threshold for most behavioral research. Clinical and safety studies often require 0.90 or higher to reduce the risk of missing a true effect.

How does sample size affect the power of a test?

Larger samples reduce standard error, making it easier to detect a true effect. As sample size increases, statistical power increases. Quadrupling the sample size halves the standard error, which doubles the non-centrality parameter and substantially raises power.

What is the relationship between power and Type II error?

Power and Type II error (β) are complementary: Power = 1 − β. If β = 0.20 (a 20% chance of missing a real effect), then power = 0.80. Reducing β automatically increases power.

What is power analysis?

Power analysis is the process of determining the sample size needed to achieve a desired power level, or of estimating the power a study actually has given its sample size, effect size, and significance level. It is performed before data collection (a priori) to plan adequate studies.

Power of a Test (Statistical Power): Complete Reference Guide (2026)

What Is the Power of a Test?

Definition — Power of a Test (Statistical Power)

The power of a test is the probability that a hypothesis test correctly rejects the null hypothesis when the null hypothesis is actually false. In other words, it measures a test's ability to detect a real effect.

Power = 1 − β = P(Reject H₀ | H₀ is false)

Where β (beta) is the probability of a Type II error — failing to detect a real effect when one genuinely exists. Power and β are two sides of the same coin: a test with 80% power has a 20% chance of missing a true effect.

The concept was formalized by Jerzy Neyman and Egon Pearson as part of their framework for decision-making in hypothesis testing. Jacob Cohen later popularized power analysis in behavioral science through his book Statistical Power Analysis for the Behavioral Sciences (first published 1969; 2nd ed. 1988), establishing the 0.80 threshold now standard across research disciplines. For the broader context of how power fits into hypothesis testing, see the hypothesis testing guide at Statistics Fundamentals.

Power Formula: 1 − β

Minimum Recommended Power: 0.80

Type II Error Rate: β

Significance Level (Type I Error): α

The Intuitive Meaning of Statistical Power

Picture a metal detector at an airport. Its "power" is the probability it beeps when someone actually carries metal. A weak detector (low power) misses threats — that is a Type II error. A sensitive detector (high power) catches them reliably.

In statistics, the "metal" is a real effect in your data. The detector is your hypothesis test. Low power means your study design is too weak to reliably find effects that genuinely exist. High power means you will almost certainly detect them.

This matters because a non-significant result from a low-power study is uninformative. It cannot distinguish "there is no effect" from "our study was too small to see one." That ambiguity is why power analysis belongs at the design stage, before data collection begins.

💡

Key Insight

A statistically non-significant result only means something if the study had sufficient power. An underpowered study that finds nothing has not shown the null hypothesis is true — it has shown the researchers could not afford a proper test.

Statistical Power Formula

The power formula follows directly from the definition. Since β is the probability of failing to reject a false H₀, power is its complement:

Core Statistical Power Formula

Power = 1 − β

Power = P(Reject H₀ | H₀ is false)
β = P(Fail to Reject H₀ | H₀ is false) = Type II error rate
α = Significance level = Type I error rate

For a one-sample z-test with a two-tailed alternative, the power calculation works out to:

Power of a One-Sample Z-Test (Two-Tailed)

Power = Φ(|μ₁ − μ₀|/σₓ̄ − z_α/2) + Φ(−|μ₁ − μ₀|/σₓ̄ − z_α/2)

Φ = standard normal CDF
μ₁ = true population mean
μ₀ = null hypothesis mean
σₓ̄ = σ/√n = standard error
z_α/2 = critical value for α

In practice, the second term is tiny and is often dropped. The simplified working formula becomes:

Simplified Power Formula (One-Tailed or Dominant Term)

Power ≈ Φ(δ/σₓ̄ − z_α)

δ = |μ₁ − μ₀| = effect size (raw)
z_α = 1.645 for α = 0.05 one-tailed; use z_α/2 = 1.96 for two-tailed
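The two formulas above can be compared numerically. The following is a minimal sketch using only Python's standard library; the function name and the input values (μ₀ = 100, μ₁ = 105, σ = 15, n = 50) are illustrative assumptions, not taken from the guide.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def z_test_power_two_tailed(mu0, mu1, sigma, n, alpha=0.05):
    """Two-tailed one-sample z-test power, returned with both of its terms."""
    se = sigma / n ** 0.5                        # standard error of the mean
    shift = abs(mu1 - mu0) / se                  # |mu1 - mu0| in SE units
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    near_tail = STD_NORMAL.cdf(shift - z_crit)   # dominant term
    far_tail = STD_NORMAL.cdf(-shift - z_crit)   # term that is usually dropped
    return near_tail + far_tail, near_tail, far_tail


# Hypothetical inputs chosen only to show the size of each term.
total, near, far = z_test_power_two_tailed(mu0=100, mu1=105, sigma=15, n=50)
print(f"full formula: {total:.4f}")
print(f"dominant term only: {near:.4f}")
print(f"dropped term: {far:.2e}")
```

For any effect large enough to be worth detecting, the dropped term is orders of magnitude smaller than the dominant one, which is why the simplified formula is usually sufficient.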

The Non-Centrality Parameter

The quantity δ/σₓ̄ appears in each of the power formulas above. Statisticians call it the non-centrality parameter, usually written λ (some texts write δ, which this guide reserves for the raw effect). It measures how far the true distribution is from the null hypothesis distribution in units of standard error. Larger λ means more separation between the two distributions and therefore greater power. This connection explains why effect size and sample size are the two most controllable levers on power.

Formula derivation follows Neyman, J. & Pearson, E.S. (1933). "On the problem of the most efficient tests of statistical hypotheses." Philosophical Transactions of the Royal Society A, 231, 289–337. Standardization per Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Power, Type I Error, and Type II Error

Every hypothesis test produces one of four outcomes. Two are correct decisions; two are errors. Power occupies the most important cell of this table:

	H₀ Is True	H₀ Is False (Effect Exists)
Reject H₀	Type I Error (α) False Positive	✅ Correct Rejection = Power = 1 − β
Fail to Reject H₀	✅ Correct Retention = 1 − α	Type II Error (β) False Negative = Miss

The key relationship to internalize: α and β are separate probabilities, and the researcher can set α directly (usually 0.05) but can only influence β through study design choices. Reducing α (making the test stricter) actually increases β and therefore decreases power, all else equal. With the effect size and variability fixed, the only way to reduce both simultaneously is to increase sample size.

⚡ Power vs Error Rates — Quick Reference

α (alpha): Probability of a false positive (Type I error) — rejecting a true H₀. Set by the researcher (usually 0.05)
β (beta): Probability of a false negative (Type II error) — missing a real effect. Controlled by study design
Power = 1 − β: Probability of detecting a real effect. Target ≥ 0.80
Inverse relationship: Lowering α increases β (reduces power), unless sample size increases
Both improve together only when sample size grows or effect size is larger
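The trade-off between α and β can be seen by holding the design fixed and varying only α. This is a rough sketch for a two-tailed z-test; the function name and the assumed separation of 2.5 standard errors between μ₀ and the true mean are illustrative choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_and_beta(shift_in_se, alpha):
    """Two-tailed z-test power and beta for a true shift given in SE units."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    power = (STD_NORMAL.cdf(shift_in_se - z_crit)
             + STD_NORMAL.cdf(-shift_in_se - z_crit))
    return power, 1 - power


# Illustrative design: the true mean sits 2.5 standard errors from mu0.
for alpha in (0.10, 0.05, 0.01):
    power, beta = power_and_beta(2.5, alpha)
    print(f"alpha={alpha:.2f}  power={power:.3f}  beta={beta:.3f}")
```

Each stricter α pushes the critical value outward, so β rises and power falls while n, σ, and the effect stay the same.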

Factors That Affect Statistical Power

Four variables determine the power of a test. Understanding each one clarifies why power analysis is a sample-size planning tool.

1. Sample Size (n)

Sample size is the most direct lever on power and the one researchers control most easily. Larger samples reduce the standard error (σ/√n), tightening the sampling distribution around the true mean. This makes it easier to distinguish a real effect from sampling noise. Doubling the sample size does not double power, but it substantially raises it — the exact relationship depends on effect size and α.

Sample Size per Group (n)	Effect Size d = 0.2 (Small)	Effect Size d = 0.5 (Medium)	Effect Size d = 0.8 (Large)
20	0.09	0.34	0.69
40	0.14	0.60	0.94
80	0.24	0.88	> 0.99
100	0.29	0.94	> 0.99
200	0.51	> 0.99	> 0.99

Power values for a two-sample t-test with n observations per group, two-tailed, α = 0.05. Cohen's d = (μ₁ − μ₂)/σ. Computed using the non-central t distribution.
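How power grows with n at each effect size can be sketched with the normal approximation Φ(d√(n/2) − z_α/2) plus its far-tail term. This is an approximation, not the non-central t calculation; it tends to run slightly high at small n. The function name is a hypothetical helper, and n is taken as the size of each group.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sample_power_approx(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-tailed, equal-n two-sample test."""
    ncp = d * (n_per_group / 2) ** 0.5            # non-centrality parameter
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(ncp - z_crit) + STD_NORMAL.cdf(-ncp - z_crit)


print("n/group   d=0.2   d=0.5   d=0.8")
for n in (20, 40, 80, 100, 200):
    row = [two_sample_power_approx(d, n) for d in (0.2, 0.5, 0.8)]
    print(f"{n:>7}   " + "   ".join(f"{p:.2f}" for p in row))
```

Reading down a column shows the diminishing returns of added participants; reading across a row shows how strongly the effect size drives power.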

2. Effect Size

Effect size measures how large the true difference or relationship is. A large effect is easier to detect with a smaller sample than a small effect. The most common standardized measures are Cohen's d (for means), Pearson's r (for correlations), and Cohen's f² (for regression). Cohen's benchmarks for d are: small = 0.2, medium = 0.5, large = 0.8. For a more detailed treatment of these measures, the Pearson correlation guide covers r extensively.

⚠️

Common Mistake

Estimating effect size from a pilot study of n = 10–20 gives extremely imprecise estimates. Use published literature or theoretical minimums when planning sample size — pilot estimates are unreliable input to power calculations.

3. Significance Level (α)

A higher significance level (less strict threshold) makes it easier to reject H₀, which increases power. Setting α = 0.10 instead of α = 0.05 raises power for the same sample size and effect size. However, it also raises the Type I error rate. Researchers in exploratory work sometimes accept α = 0.10; some confirmatory settings adopt α = 0.01 or stricter, demanding larger samples to compensate for the power cost.

4. Population Variability (σ)

Lower variability in the outcome variable means less overlap between the null and alternative distributions, making the effect easier to detect. This is why measurement precision matters: more reliable instruments reduce within-group variance, effectively boosting power without adding participants. Designs that reduce noise — matched pairs, repeated measures, covariate adjustment — are power-increasing strategies.

Factor	How to Change It	Effect on Power
Sample size (n)	Recruit more participants	⬆ Increases
Effect size (δ)	Target a more sensitive outcome; choose a stronger intervention	⬆ Increases with larger δ
Significance level (α)	Raise α (e.g., 0.05 → 0.10)	⬆ Increases (at cost of Type I error)
Population variance (σ²)	Use more reliable measurement; homogeneous samples	⬆ Increases with lower σ²
One-tailed vs two-tailed	Use a one-tailed test when direction is known a priori	⬆ Slightly increases
Test selection	Use a more powerful test (parametric over nonparametric when assumptions hold)	⬆ Can increase substantially

Power Thresholds: What Is Good Statistical Power?

Jacob Cohen's 1988 recommendation of 0.80 as a minimum acceptable power level has become the field standard. His reasoning: with α = 0.05 and β = 0.20, the β:α ratio is 4:1, which treats a Type I error as four times as serious as a Type II error and reflects the relative costs of falsely claiming an effect versus missing one.

Statistical Power	Interpretation	Typical Context
< 0.50	Very Low: a real effect is more likely to be missed than detected	Severely underpowered; results uninformative
0.50 – 0.59	Low	Preliminary/exploratory only
0.60 – 0.79	Moderate	Acceptable for exploratory research
0.80 – 0.89	Recommended minimum (Cohen's benchmark)	Most behavioral and social science research
0.90 – 0.94	High	Clinical trials, safety research
≥ 0.95	Very High	Confirmatory trials; regulatory submissions

The 0.80 threshold is a guideline, not a law. The appropriate power depends on the costs of Type II errors in a given context. Missing a harmful drug interaction in a Phase III trial is far more costly than missing a small effect in a preliminary social psychology experiment. Researchers should document and justify their chosen power target.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates. Chapter 1, pp. 55–56 establishes the 0.80 benchmark. For free power calculations, the G*Power software from Heinrich Heine University Düsseldorf is a widely used tool.

How to Calculate the Power of a Test

The calculation procedure uses the normal (or t) distribution to find the probability of landing in the rejection region when the alternative hypothesis is true. The steps are the same regardless of whether you compute by hand, use a table, or use software.

1. State the hypotheses and specify μ₁

Define H₀: μ = μ₀ and H₁: μ ≠ μ₀ (or directional). Specify the effect you want to detect: the true mean μ₁ under the alternative hypothesis. This is the minimum meaningful effect size in the context of your study.

2. Set α and identify the critical value(s)

Choose your significance level. For α = 0.05, two-tailed, the critical z-values are ±1.96. For a one-tailed test, the critical value is z = 1.645. For t-tests, look up the critical t from the t-distribution table.

3. Calculate the standard error

For a one-sample test: SE = σ/√n. For a two-sample test: SE = √(σ₁²/n₁ + σ₂²/n₂). You need either the known population standard deviation or a reasonable estimate from prior literature.

4. Find the critical value on the alternative distribution

Express the rejection boundary in terms of the alternative distribution. The critical boundaries (in raw units) are μ₀ ± z_α/2 × SE. Then compute how far μ₁ lies beyond the nearer boundary, in SE units: z* = |μ₁ − μ₀|/SE − z_α/2. For a one-tailed test, use z_α in place of z_α/2.

5. Look up the power from the standard normal table

Power = Φ(z*) for a one-tailed test, where Φ is the standard normal CDF. For a two-tailed test, add the far-tail term Φ(−|μ₁ − μ₀|/SE − z_α/2), which is usually negligible. Use the z-table to find Φ.

6. Interpret and report

State: "The study has [X]% power to detect an effect of size δ = [value] at α = [value] with n = [value]." If power is below 0.80, increase n and repeat until the target is met. Report β = 1 − Power alongside the power value.
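The procedure above can also be followed literally in raw units: build the rejection boundaries under H₀, then ask how often the sample mean falls outside them when it is really centered on μ₁. This is a minimal sketch for the two-tailed one-sample z-test; the function name and the example inputs are illustrative assumptions.

```python
from statistics import NormalDist


def power_by_rejection_region(mu0, mu1, sigma, n, alpha=0.05):
    """Two-tailed one-sample z-test power via the raw-unit rejection region."""
    # Calculate the standard error of the sample mean.
    se = sigma / n ** 0.5
    # Identify the critical value for a two-tailed test at level alpha.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Rejection boundaries in raw units, fixed under H0.
    lower, upper = mu0 - z_crit * se, mu0 + z_crit * se
    # Probability that the sample mean lands outside the boundaries
    # when its sampling distribution is actually centered on mu1.
    alternative = NormalDist(mu1, se)
    power = alternative.cdf(lower) + (1 - alternative.cdf(upper))
    return power, (lower, upper)


# Hypothetical study: mu0 = 50, true mean 52, sigma = 6, n = 36.
power, (lower, upper) = power_by_rejection_region(50, 52, 6, 36)
print(f"reject if the sample mean is below {lower:.2f} or above {upper:.2f}")
print(f"power = {power:.3f}, beta = {1 - power:.3f}")
```

Working in raw units gives the same number as the standardized formula, and it makes the rejection region explicit, which helps when reporting the design.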

Worked Examples — Power of a Test

Example 1 — One-Sample Z-Test

Worked Example 1 — One-Sample Z-Test Power

Problem: A manufacturer claims battery life is μ₀ = 500 hours (σ = 40 hours). Engineers want to detect if the true mean is only μ₁ = 488 hours. With n = 64 batteries and α = 0.05 (two-tailed), what is the power of the test?

Power of a One-Sample Z-Test

Power = Φ(|μ₁ − μ₀|/SE − z_α/2)

SE = σ/√n
z_α/2 = 1.96 for α = 0.05

Hypotheses: H₀: μ = 500 | H₁: μ ≠ 500. True mean to detect: μ₁ = 488.

Standard error: SE = 40/√64 = 40/8 = 5.0

Non-centrality: |μ₁ − μ₀|/SE = |488 − 500|/5 = 12/5 = 2.4

Power calculation: z* = 2.4 − 1.96 = 0.44. Power = Φ(0.44) ≈ 0.67

β = 1 − 0.67 = 0.33. Type II error rate is 33%.

✅ Conclusion: Power = 0.67 (67%). This is below the 0.80 threshold. To reach 80% power, the engineers need a larger sample: solving the same formula for n gives 87.2, so n = 88 batteries is the smallest sample that reaches 0.80 power for this effect (n = 87 falls just short at about 0.799).
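The arithmetic in Example 1 can be reproduced in a few lines, along with a search for the smallest sample that reaches 0.80. This sketch uses the same dominant-term formula as the example; the function name is a hypothetical helper.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def dominant_term_power(mu0, mu1, sigma, n, alpha=0.05):
    """Phi(|mu1 - mu0|/SE - z_{alpha/2}), the formula used in the example."""
    se = sigma / n ** 0.5
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(abs(mu1 - mu0) / se - z_crit)


power_64 = dominant_term_power(500, 488, 40, 64)

# Step n upward until power first reaches 0.80 under the same formula.
n_needed = 64
while dominant_term_power(500, 488, 40, n_needed) < 0.80:
    n_needed += 1

print(f"power at n = 64: {power_64:.2f}")
print(f"smallest n with power >= 0.80: {n_needed}")
```

The search stops at the first whole n at or above the target; n = 87 lands just under 0.80 (about 0.799), so the loop returns 88.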

Example 2 — Two-Sample T-Test

Worked Example 2 — Two-Sample T-Test Power

Problem: A clinical researcher compares a new drug (expected to lower systolic blood pressure by 8 mmHg) against placebo. Population SD is σ = 20 mmHg. Each group has n = 40 patients. α = 0.05, two-tailed. What is the power?

Cohen's d and Two-Sample Power

d = δ/σ
Power = Φ(d√(n/2) − z_α/2)

Effect size: d = 8/20 = 0.40 (between small and medium)

Non-centrality: d × √(n/2) = 0.40 × √20 = 0.40 × 4.472 = 1.789

Power: z* = 1.789 − 1.96 = −0.171. Power = Φ(−0.171) ≈ 0.43

Interpretation: β = 0.57. There is a 57% chance of missing this clinically meaningful effect.

✅ Conclusion: Power = 0.43, well below the 0.80 minimum. To detect d = 0.40 at α = 0.05 with 80% power, each group needs approximately n = 100 participants (total N = 200). A study run at n = 40 per group would be more likely to miss the effect than to detect it. See the two-sample t-test guide for the full test procedure.
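Example 2 can be checked the same way. The sketch below applies the example's formula at n = 40 per group and again at n = 100 per group; the function name is an illustrative assumption, and the calculation is the normal approximation rather than an exact t-based one.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sample_power(delta, sigma, n_per_group, alpha=0.05):
    """Dominant-term power Phi(d * sqrt(n/2) - z_{alpha/2}) for equal groups."""
    d = delta / sigma                              # Cohen's d
    ncp = d * (n_per_group / 2) ** 0.5             # non-centrality parameter
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return d, ncp, STD_NORMAL.cdf(ncp - z_crit)


d, ncp, power_40 = two_sample_power(delta=8, sigma=20, n_per_group=40)
_, _, power_100 = two_sample_power(delta=8, sigma=20, n_per_group=100)
print(f"d = {d:.2f}, non-centrality = {ncp:.3f}")
print(f"power at n = 40 per group:  {power_40:.2f}")
print(f"power at n = 100 per group: {power_100:.2f}")
```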

Example 3 — Proportion Test

Worked Example 3 — Proportion Hypothesis Test Power

Problem: A website conversion rate is claimed to be 5% (p₀ = 0.05). A marketing team wants to detect if a new landing page raises it to 8% (p₁ = 0.08). With n = 300 visitors and α = 0.05 (one-tailed), what is the power?

Power for One-Proportion Z-Test (One-Tailed)

Power = Φ((p₁ − p₀)/SE − z_α)

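SE = √(p₀(1−p₀)/n), the standard error under H₀. Using it for the alternative distribution as well is a simplification: the spread under p₁ = 0.08 is slightly larger, so this formula runs a little optimistic.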

SE: √(0.05 × 0.95 / 300) = √(0.0001583) = 0.01258

Effect / SE: (0.08 − 0.05) / 0.01258 = 0.03 / 0.01258 = 2.384

Power: z* = 2.384 − 1.645 = 0.739. Power = Φ(0.739) ≈ 0.77

✅ Conclusion: Power = 0.77 (77%). Close to the target but slightly short. Increasing to n = 340 visitors pushes power above 0.80 for this 3-percentage-point improvement. See the proportion hypothesis testing guide for the full test.
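The sketch below reproduces Example 3 and, as an addition that is not part of the original example, also evaluates a refined version that keeps the rejection boundary from H₀ but uses the standard error under p₁ for the alternative distribution. The function name and the refined variant are assumptions made for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def proportion_power(p0, p1, n, alpha=0.05):
    """Upper one-tailed one-proportion z-test power, computed two ways."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    se0 = (p0 * (1 - p0) / n) ** 0.5    # standard error under H0
    se1 = (p1 * (1 - p1) / n) ** 0.5    # standard error under the alternative
    # Formula from the example: the H0 standard error is used throughout.
    simplified = STD_NORMAL.cdf((p1 - p0) / se0 - z_alpha)
    # Refined: reject when p_hat > p0 + z_alpha * se0, evaluated under p1.
    refined = STD_NORMAL.cdf((p1 - p0 - z_alpha * se0) / se1)
    return simplified, refined


simplified_300, refined_300 = proportion_power(0.05, 0.08, 300)
simplified_340, _ = proportion_power(0.05, 0.08, 340)
print(f"n = 300: simplified {simplified_300:.2f}, refined {refined_300:.2f}")
print(f"n = 340: simplified {simplified_340:.2f}")
```

The refined figure comes out lower because the sampling distribution of the observed proportion is wider at p₁ = 0.08 than at p₀ = 0.05, so a sample size planned from the simplified formula leaves less margin than it appears to.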

Example calculations follow procedures in Rosner, B. (2015). Fundamentals of Biostatistics (8th ed.). Cengage Learning, Chapter 8. Z-table lookups from Statistics Fundamentals Z-Table.

Power of a Test Calculator

The calculator uses the standard normal approximation, which is accurate when σ is known or n ≥ 30. For t-test power with small samples, the result is a close approximation that runs slightly high.

📊 Statistical Power Calculator

Inputs: Null Hypothesis Mean (μ₀), True Mean to Detect (μ₁), Population SD (σ), Sample Size (n), Significance Level (α), Test Type

Output: power on a 0% to 100% scale, with 50% marked as low and 80% as the target
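A calculator with these inputs can be approximated offline. The following is a minimal sketch, assuming a one-sample test of a mean under the normal approximation; the function name, the `test_type` strings, and the labels it returns are hypothetical choices rather than the page's actual implementation.

```python
from statistics import NormalDist


def power_calculator(mu0, mu1, sigma, n, alpha=0.05, test_type="two-tailed"):
    """Normal-approximation power for a one-sample test of a mean."""
    std = NormalDist()
    shift = abs(mu1 - mu0) / (sigma / n ** 0.5)   # effect in SE units
    if test_type == "two-tailed":
        z_crit = std.inv_cdf(1 - alpha / 2)
        power = std.cdf(shift - z_crit) + std.cdf(-shift - z_crit)
    elif test_type == "one-tailed":
        # Assumes the alternative points in the direction of mu1.
        power = std.cdf(shift - std.inv_cdf(1 - alpha))
    else:
        raise ValueError("test_type must be 'two-tailed' or 'one-tailed'")
    if power >= 0.80:
        label = "target met"
    elif power < 0.50:
        label = "low"
    else:
        label = "below target"
    return power, label


# Battery example inputs: mu0 = 500, mu1 = 488, sigma = 40, n = 64.
for kind in ("two-tailed", "one-tailed"):
    power, label = power_calculator(500, 488, 40, 64, test_type=kind)
    print(f"{kind}: power = {power:.2f} ({label})")
```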

How to Increase the Power of a Test

When a power analysis reveals insufficient power, researchers have several routes to correct it before collecting data. Each involves a trade-off.

Increase Sample Size

The most reliable and widely applicable solution. Required sample size scales roughly with the square of the ratio (z_α + z_β)/d. Doubling the detectable effect size reduces needed sample size by a factor of four. The sample size calculator computes n directly from power, α, and effect size inputs.
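The scaling rule can be turned into a rough sample-size function. This sketch assumes a two-tailed, equal-n two-sample comparison, for which the normal-approximation requirement is n = 2((z_α/2 + z_β)/d)² per group; the function name is a hypothetical helper, and exact t-based software typically returns a value one or two higher.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group n for a two-tailed, equal-n two-sample test."""
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha / 2)   # critical value for the test
    z_beta = std.inv_cdf(power)            # quantile matching the target power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


for d in (0.2, 0.4, 0.8):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

Each doubling of d in the loop cuts the requirement to roughly a quarter, which is the inverse-square relationship described above.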

Raise the Significance Level

In exploratory research, using α = 0.10 instead of α = 0.05 increases power without changing n. This is justified when Type II errors are more costly than Type I errors — for example, when screening candidates for follow-up study and false negatives (missed leads) matter more than false positives.

Reduce Measurement Variance

Using more precise instruments, standardizing procedures, training data collectors, or choosing a more homogeneous sample reduces σ and increases power without adding participants. Matched-pairs designs and multiple regression with covariates are statistical techniques that reduce residual variance, effectively boosting power.

Use a One-Tailed Test When Justified

When prior theory or evidence strongly supports a directional prediction, a one-tailed test concentrates the rejection region in one tail, lowering the critical value and increasing power for effects in that direction. This choice must be made before seeing the data and justified in advance, not selected after observing results.

Choose a More Powerful Test

Parametric tests (t-test, ANOVA) are more powerful than nonparametric equivalents (Mann-Whitney, Kruskal-Wallis) when normality assumptions hold. The statistical test selector helps identify the most powerful applicable test for a given data structure.

Power Analysis: Planning Studies Around Power

Power analysis is the use of power calculations in study design. Three types exist, differing in which quantity they solve for:

🔬

A Priori Power Analysis

Performed before data collection. Inputs: desired power, α, effect size. Output: required sample size. This is the standard, correct approach — journals and grant agencies expect it.

📋

Post Hoc Power Analysis

Calculated after a non-significant result using the observed effect size. Widely criticized: "observed power" is a direct function of the p-value, so a non-significant result always yields low observed power (about 0.50 or less) and adds no information.

🎯

Sensitivity Analysis

Inputs: achieved sample size, desired power, α. Output: minimum detectable effect size (MDE). Useful when sample size is fixed by external constraints (budget, population size).
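A sensitivity analysis inverts the sample-size relationship to solve for the effect. The sketch below assumes a two-tailed, equal-n two-sample comparison under the normal approximation, where the minimum detectable standardized effect is (z_α/2 + z_β)·√(2/n); the function name is an illustrative assumption.

```python
from statistics import NormalDist


def minimum_detectable_d(n_per_group, power=0.80, alpha=0.05):
    """Smallest Cohen's d detectable at the given power (normal approximation)."""
    std = NormalDist()
    z_sum = std.inv_cdf(1 - alpha / 2) + std.inv_cdf(power)
    return z_sum * (2 / n_per_group) ** 0.5


for n in (40, 64, 160):
    print(f"n = {n} per group: minimum detectable d is about {minimum_detectable_d(n):.2f}")
```

Quadrupling the per-group sample halves the minimum detectable effect, so a fixed budget translates directly into a floor on the effects the study can credibly address.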

The recommended software for power analysis is G*Power (free, by Franz Faul at University of Kiel), which handles over 50 test types. In R, the pwr package (written by Stéphane Champely and maintained by Helios De Rosario) covers the most common cases. Python users can use statsmodels.stats.power.

For worked power calculations in R, see the pwr package vignette (CRAN). For G*Power tutorials, see Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). "G*Power 3: A flexible statistical power analysis program." Behavior Research Methods, 39(2), 175–191.

Real-World Applications of Statistical Power

🏥

Clinical Trials

Regulatory agencies (FDA, EMA) expect trial protocols to justify the sample size with a prespecified power, conventionally at least 0.80 and often 0.90. Underpowered trials waste resources and expose patients to risk without yielding useful evidence.

💻

A/B Testing (Digital)

E-commerce and SaaS teams run A/B tests to detect conversion rate differences. Stopping a test before the planned sample size is reached — "peeking" — inflates Type I error and reduces effective power.

🎓

Educational Research

Studies evaluating teaching interventions often involve small effect sizes (d = 0.2–0.3). Achieving 0.80 power for effects this small requires large samples (roughly 175 to 400 per group for a two-group comparison), which many school-based studies cannot reach.

📊

Quality Control

Manufacturing processes use control charts and acceptance sampling. The power of these tests determines their ability to detect process shifts — directly affecting defect rates that reach customers.

🧬

Genomics/GWAS

Genome-wide association studies test millions of SNPs at α = 5×10⁻⁸. Detecting small genetic effects (OR ~1.1) with this strict threshold requires sample sizes in the hundreds of thousands.

📈

Economics & Finance

Testing whether a trading strategy beats a benchmark requires accounting for the high variability of returns. Many backtests are underpowered — apparent outperformance may be noise.

Power vs Related Statistical Concepts

Concept	Definition	Relationship to Power
Statistical Power (1 − β)	P(Reject H₀ | H₀ false)	The concept itself
Type II Error (β)	P(Fail to Reject H₀ | H₀ false)	Power = 1 − β; they sum to 1
Significance Level (α)	P(Reject H₀ | H₀ true)	Raising α increases power but also Type I error
Confidence Interval	Range of plausible parameter values	A 95% CI corresponds to α = 0.05; wider CI = less power
Effect Size (d, r, f²)	Standardized magnitude of the effect	Larger effect → higher power for same n
p-value	P(data at least as extreme as observed | H₀ true)	A p-value below α signals significance; power is the probability that p < α when the effect is real
Sample Size (n)	Number of observations	Larger n → smaller SE → higher power

Frequently Asked Questions

What does 80% statistical power mean?

80% power means that if a real effect of the specified size exists, the test will correctly detect it and reject the null hypothesis in 80 out of every 100 independent replications of the study. Equivalently, β = 0.20: there is a 20% chance of a Type II error — missing the effect — in any single study.
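The replication reading of power can be checked by simulation. This is a minimal sketch that repeats a two-tailed one-sample z-test many times on data drawn from the alternative; the function name, the seed, and the design (μ₀ = 0, μ₁ = 0.5, σ = 1, n = 32, which gives analytic power of about 0.81) are illustrative assumptions.

```python
import random
from statistics import NormalDist, fmean


def simulated_power(mu0, mu1, sigma, n, alpha=0.05, replications=4000, seed=1):
    """Share of simulated studies that reject H0 when the true mean is mu1."""
    rng = random.Random(seed)                    # fixed seed for repeatability
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma / n ** 0.5
    rejections = 0
    for _ in range(replications):
        sample = [rng.gauss(mu1, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / se           # z statistic for this study
        if abs(z) > z_crit:
            rejections += 1
    return rejections / replications


rate = simulated_power(mu0=0, mu1=0.5, sigma=1, n=32)
print(f"rejected H0 in {rate:.1%} of simulated studies")
```

The rejection rate hovers near the analytic power, while any single simulated study either detects the effect or misses it.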

Can the power of a test exceed 1?

No. Power is a probability, so it is bounded between 0 and 1 (0% to 100%). As sample size grows toward infinity, power approaches 1 asymptotically. Numerical software can sometimes return values like 0.99999 which round to 1 — that is expected behavior, not an error.

Is the power of a test related to which error?

Power is directly related to the Type II error (β). Specifically, Power = 1 − β. It has an indirect relationship with the Type I error (α): raising α increases power, but at the cost of a higher false-positive rate. Power is not the complement of α; 1 − α is the confidence level.

What sample size gives 80% power?

It depends entirely on the effect size and α. For a two-sample t-test at α = 0.05 two-tailed: a small effect (d = 0.2) requires n ≈ 394 per group; a medium effect (d = 0.5) requires n ≈ 64 per group; a large effect (d = 0.8) requires n ≈ 26 per group. Use the calculator on this page or G*Power for exact values.

How does the level of significance affect the power of a test?

Higher significance levels (larger α) correspond to less strict rejection thresholds, making it easier to reject H₀ and thus increasing power. For example, for a two-tailed two-sample t-test at d = 0.5 and n = 50 per group: α = 0.10 gives power ≈ 0.80; α = 0.05 gives power ≈ 0.70; α = 0.01 gives power ≈ 0.45. The power gain from raising α must be weighed against the increased Type I error rate.

How do you calculate power of a test in R?

Use the pwr package: install it with install.packages("pwr") and load it with library(pwr). For a two-sample t-test: pwr.t.test(d=0.5, n=64, sig.level=0.05, type="two.sample", alternative="two.sided"). For a one-sample z-test: pwr.norm.test(d=0.5, n=64, sig.level=0.05, alternative="two.sided").

What is the power of a test in AP Stats?

In AP Statistics, the power of a test is defined as the probability of correctly rejecting the null hypothesis when it is false. AP Stats specifically tests understanding that power increases with larger sample size, larger effect size, and higher α. Students should know the formula Power = 1 − β and be able to reason about what changes would increase power without necessarily computing numerical values.

Statistical power sits within the broader framework of hypothesis testing. Concepts that connect directly to power include:

Null and alternative hypothesis — the two competing claims that define the power calculation
p-values — the observed significance from a completed test; power predicts how likely it is that p < α
Type I and Type II errors — the two error types that bound power
Significance level — α, which trades off against power
Cohen's d — the most common standardized effect size for power calculations
Sample size calculator — solve for the n that achieves your desired power
Confidence interval for the mean — the interval-estimation complement to power-based inference
Central Limit Theorem — the result that justifies the normal approximation underlying most power formulas