What Is the Power of a Test?
The power of a test is the probability that it correctly rejects the null hypothesis when a real effect exists:
Power = 1 − β
Here β (beta) is the probability of a Type II error: failing to detect a real effect when one genuinely exists. Power and β are complements: a test with 80% power has a 20% chance of missing a true effect.
The concept was formalized by Jerzy Neyman and Egon Pearson as part of their framework for decision-making in hypothesis testing. Jacob Cohen later popularized power analysis in behavioral science through his landmark 1988 book, establishing the 0.80 threshold now standard across research disciplines. For the broader context of how power fits into hypothesis testing, see the hypothesis testing guide at Statistics Fundamentals.
The Intuitive Meaning of Statistical Power
Picture a metal detector at an airport. Its "power" is the probability it beeps when someone actually carries metal. A weak detector (low power) misses threats — that is a Type II error. A sensitive detector (high power) catches them reliably.
In statistics, the "metal" is a real effect in your data. The detector is your hypothesis test. Low power means your study design is too weak to reliably find effects that genuinely exist. High power means you will almost certainly detect them.
This matters because a non-significant result from a low-power study is uninformative. It cannot distinguish "there is no effect" from "our study was too small to see one." That ambiguity is why power analysis belongs at the design stage, before data collection begins.
A statistically non-significant result only means something if the study had sufficient power. An underpowered study that finds nothing has not shown the null hypothesis is true — it has shown the researchers could not afford a proper test.
Statistical Power Formula
The power formula follows directly from the definition. Since β is the probability of failing to reject a false H₀, power is its complement:
Power = 1 − β = P(Reject H₀ | H₀ is false)
β = P(Fail to Reject H₀ | H₀ is false) = Type II error rate
α = P(Reject H₀ | H₀ is true) = significance level = Type I error rate
For a one-sample z-test with a two-tailed alternative, the power calculation works out to:
Power = Φ(|μ₁ − μ₀|/σₓ̄ − zα/2) + Φ(−|μ₁ − μ₀|/σₓ̄ − zα/2)
Φ = standard normal CDF
μ₁ = true population mean
μ₀ = null hypothesis mean
σₓ̄ = σ/√n = standard error
zα/2 = two-tailed critical value for significance level α
In practice, the second term is tiny and is often dropped. The simplified working formula becomes:
Power ≈ Φ(δ/σₓ̄ − zα)
δ = |μ₁ − μ₀| = effect size (raw)
zα = critical value: 1.645 for α = 0.05 one-tailed; for a two-tailed test use zα/2 = 1.96
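As a rough check on the formula, the sketch below computes z-test power with Python's standard-library `statistics.NormalDist`. The function name `z_test_power` and the input values are illustrative assumptions, and the code assumes a normal model with known σ.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def z_test_power(mu0, mu1, sigma, n, alpha=0.05, two_tailed=True):
    """Power of a one-sample z-test under a normal model with known sigma."""
    se = sigma / n ** 0.5                 # standard error of the mean
    shift = abs(mu1 - mu0) / se           # distance from null to true mean, in SE units
    if two_tailed:
        z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
        near_tail = _STD_NORMAL.cdf(shift - z_crit)    # tail on the side of the true mean
        far_tail = _STD_NORMAL.cdf(-shift - z_crit)    # opposite tail, usually negligible
        return near_tail + far_tail
    # One-tailed test in the direction of the true effect
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha)
    return _STD_NORMAL.cdf(shift - z_crit)


# Illustrative inputs (assumed): null mean 500, true mean 488, sigma 40, n 64
exact = z_test_power(mu0=500, mu1=488, sigma=40, n=64)
print(f"two-tailed power = {exact:.3f}")
```

With these inputs the far-tail term is far below 0.001, which is why dropping it does not change the result at two decimal places.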
The Non-Centrality Parameter
The quantity δ/σₓ̄ appears in every power calculation. Statisticians call it the non-centrality parameter, usually written λ (some texts write δ, which this page reserves for the raw effect size). It measures how far the true distribution is from the null hypothesis distribution in units of standard error. Larger λ means more separation between the two distributions and therefore greater power. Because λ = δ√n/σ, this connection explains why effect size and sample size are the two most controllable levers on power.
Power, Type I Error, and Type II Error
Every hypothesis test produces one of four outcomes. Two are correct decisions; two are errors. Power occupies the most important cell of this table:
| | H₀ Is True | H₀ Is False (Effect Exists) |
|---|---|---|
| Reject H₀ | Type I Error (α): False Positive | ✅ Correct Rejection = Power = 1 − β |
| Fail to Reject H₀ | ✅ Correct Retention = 1 − α | Type II Error (β): False Negative (Miss) |
The key relationship to internalize: α and β are separate probabilities, and the researcher can set α directly (usually 0.05) but can only influence β through study design choices. Reducing α (making the test stricter) increases β and therefore decreases power, all else equal. For a fixed effect size and a fixed level of variability, the only way to reduce both simultaneously is to increase sample size.
- α (alpha): Probability of a false positive (Type I error) — rejecting a true H₀. Set by the researcher (usually 0.05)
- β (beta): Probability of a false negative (Type II error) — missing a real effect. Controlled by study design
- Power = 1 − β: Probability of detecting a real effect. Target ≥ 0.80
- Inverse relationship: Lowering α increases β (reduces power), unless sample size increases
- Both improve together only when sample size grows or effect size is larger
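The trade-off in the list above can be made concrete with a short sketch. It assumes a two-tailed one-sample z-test with known σ; the function name `two_tailed_power` and the numbers (δ = 12, σ = 40) are illustrative assumptions.

```python
from statistics import NormalDist

std_normal = NormalDist()


def two_tailed_power(delta, sigma, n, alpha):
    """Two-tailed one-sample z-test power for raw effect delta (known sigma)."""
    shift = delta / (sigma / n ** 0.5)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Fixed n: a stricter alpha raises beta and lowers power
for alpha in (0.10, 0.05, 0.01):
    power = two_tailed_power(delta=12, sigma=40, n=64, alpha=alpha)
    print(f"n=64   alpha={alpha:.2f}  beta={1 - power:.2f}  power={power:.2f}")

# Larger n: alpha and beta can both be lower than in the n=64, alpha=0.05 case
power = two_tailed_power(delta=12, sigma=40, n=128, alpha=0.01)
print(f"n=128  alpha=0.01  beta={1 - power:.2f}  power={power:.2f}")
```

Doubling n in this sketch leaves room to tighten α from 0.05 to 0.01 while still ending up with a smaller β.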
Factors That Affect Statistical Power
Four variables determine the power of a test. Understanding each one clarifies why power analysis is a sample-size planning tool.
1. Sample Size (n)
Sample size is the most direct lever on power and the one researchers control most easily. Larger samples reduce the standard error (σ/√n), tightening the sampling distribution around the true mean. This makes it easier to distinguish a real effect from sampling noise. Doubling the sample size does not double power, but it substantially raises it — the exact relationship depends on effect size and α.
| Sample Size per Group (n) | Effect Size d = 0.2 (Small) | Effect Size d = 0.5 (Medium) | Effect Size d = 0.8 (Large) |
|---|---|---|---|
| 20 | 0.09 | 0.34 | 0.69 |
| 40 | 0.14 | 0.60 | 0.94 |
| 80 | 0.24 | 0.88 | >0.99 |
| 100 | 0.29 | 0.94 | >0.99 |
| 200 | 0.51 | >0.99 | >0.99 |
Power values for a two-sample t-test with n participants per group, two-tailed, α = 0.05. Cohen's d = (μ₁ − μ₂)/σ. Computed using the non-central t distribution.
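To see how power climbs with sample size for a fixed effect, here is a minimal sketch using a normal approximation to the two-sample t-test. It assumes equal group sizes and treats n as the per-group count; the function name `two_sample_power_approx` is hypothetical. Because it replaces the non-central t distribution with a normal one, its values run slightly above exact t-based figures when n is small.

```python
from statistics import NormalDist

std_normal = NormalDist()


def two_sample_power_approx(d, n_per_group, alpha=0.05):
    """Normal approximation to two-tailed, two-sample t-test power (equal groups)."""
    shift = d * (n_per_group / 2) ** 0.5       # approximate non-centrality parameter
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# One row per sample size, one column per effect size (small, medium, large)
for n in (20, 40, 80, 100, 200):
    row = "  ".join(f"{two_sample_power_approx(d, n):.2f}" for d in (0.2, 0.5, 0.8))
    print(f"n={n:>3}  {row}")
```

The non-centrality term grows with √n, so doubling n multiplies it by about 1.41 rather than 2, which is why power rises with sample size but not in proportion to it.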
2. Effect Size
Effect size measures how large the true difference or relationship is. A large effect is easier to detect with a smaller sample than a small effect. The most common standardized measures are Cohen's d (for means), Pearson's r (for correlations), and Cohen's f² (for regression). Cohen's benchmarks for d are: small = 0.2, medium = 0.5, large = 0.8. For a more detailed treatment of these measures, the Pearson correlation guide covers r extensively.
Estimating effect size from a pilot study of n = 10–20 gives extremely imprecise estimates. Use published literature or theoretical minimums when planning sample size — pilot estimates are unreliable input to power calculations.
3. Significance Level (α)
A higher significance level (less strict threshold) makes it easier to reject H₀, which increases power. Setting α = 0.10 instead of α = 0.05 raises power for the same sample size and effect size. However, it also raises the Type I error rate. Researchers in exploratory work sometimes accept α = 0.10; confirmatory clinical trials often require α = 0.01 or stricter, demanding larger samples to compensate for the power cost.
4. Population Variability (σ)
Lower variability in the outcome variable means less overlap between the null and alternative distributions, making the effect easier to detect. This is why measurement precision matters: more reliable instruments reduce within-group variance, effectively boosting power without adding participants. Designs that reduce noise — matched pairs, repeated measures, covariate adjustment — are power-increasing strategies.
| Factor | How to Change It | Effect on Power |
|---|---|---|
| Sample size (n) | Recruit more participants | ⬆ Increases |
| Effect size (δ) | Target a more sensitive outcome; choose a stronger intervention | ⬆ Increases with larger δ |
| Significance level (α) | Raise α (e.g., 0.05 → 0.10) | ⬆ Increases (at cost of Type I error) |
| Population variance (σ²) | Use more reliable measurement; homogeneous samples | ⬆ Increases with lower σ² |
| One-tailed vs two-tailed | Use a one-tailed test when direction is known a priori | ⬆ Slightly increases |
| Test selection | Use a more powerful test (parametric over nonparametric when assumptions hold) | ⬆ Can increase substantially |
Power Thresholds: What Is Good Statistical Power?
Jacob Cohen's 1988 recommendation of 0.80 as a minimum acceptable power level has become the field standard. His reasoning: with α = 0.05 and β = 0.20, the β:α ratio is 4:1, which treats a Type I error as four times as serious as a Type II error. The ratio reflects the relative costs of falsely claiming an effect versus missing one.
| Statistical Power | Interpretation | Typical Context |
|---|---|---|
| < 0.50 | Very Low: the test is more likely to miss a real effect than to detect it | Severely underpowered; results uninformative |
| 0.50 – 0.59 | Low | Preliminary/exploratory only |
| 0.60 – 0.79 | Moderate | Acceptable for exploratory research |
| 0.80 – 0.89 | Recommended minimum (Cohen's benchmark) | Most behavioral and social science research |
| 0.90 – 0.94 | High | Clinical trials, safety research |
| ≥ 0.95 | Very High | Confirmatory trials; regulatory submissions |
The 0.80 threshold is a guideline, not a law. The appropriate power depends on the costs of Type II errors in a given context. Missing a harmful drug interaction in a Phase III trial is far more costly than missing a small effect in a preliminary social psychology experiment. Researchers should document and justify their chosen power target.
How to Calculate the Power of a Test
The calculation procedure uses the normal (or t) distribution to find the probability of landing in the rejection region when the alternative hypothesis is true. The steps are the same regardless of whether you compute by hand, use a table, or use software.
1. State the hypotheses and specify μ₁
Define H₀: μ = μ₀ and H₁: μ ≠ μ₀ (or a directional alternative). Specify the effect you want to detect: the true mean μ₁ under the alternative hypothesis. This should be the minimum meaningful effect size in the context of your study.
2. Set α and identify the critical value(s)
Choose your significance level. For α = 0.05, two-tailed, the critical z-values are ±1.96. For a one-tailed test, the critical value is z = 1.645. For t-tests, look up the critical t from the t-distribution table.
3. Calculate the standard error
For a one-sample test: SE = σ/√n. For a two-sample test: SE = √(σ₁²/n₁ + σ₂²/n₂). You need either the known population standard deviation or a reasonable estimate from prior literature.
4. Find the critical value on the alternative distribution
Express the rejection boundaries in terms of the alternative distribution. The boundaries (in raw units) are μ₀ ± zα/2 × SE. Then compute how many SEs each boundary sits away from μ₁: z_lower = (μ₀ − zα/2 × SE − μ₁)/SE and z_upper = (μ₀ + zα/2 × SE − μ₁)/SE.
5. Look up the power from the standard normal table
Power is the probability, under the alternative, of landing beyond a rejection boundary. For a two-tailed test, Power = Φ(z_lower) + [1 − Φ(z_upper)], where Φ is the standard normal CDF; the tail on the far side of μ₁ is usually negligible. For a one-tailed test, use zα in place of zα/2 and keep only the tail in the hypothesized direction: Φ(z_lower) for a lower-tailed test, or 1 − Φ(z_upper) for an upper-tailed test. Use the z-table to find Φ.
6. Interpret and report
State: "The study has [X]% power to detect an effect of size δ = [value] at α = [value] with n = [value]." If power is below 0.80, increase n and repeat until the target is met. Report β = 1 − Power alongside the power value.
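The six steps above can be sketched as a short script. The input values (μ₀ = 100, μ₁ = 105, σ = 15, n = 50) are illustrative assumptions, and the sketch assumes a two-tailed z-test with known σ.

```python
from statistics import NormalDist

std_normal = NormalDist()

# Step 1: hypotheses and the alternative mean to detect (illustrative values)
mu0, mu1 = 100.0, 105.0
sigma, n = 15.0, 50

# Step 2: significance level and two-tailed critical value
alpha = 0.05
z_crit = std_normal.inv_cdf(1 - alpha / 2)

# Step 3: standard error
se = sigma / n ** 0.5

# Step 4: rejection boundaries in raw units, re-expressed relative to mu1
lower_bound = mu0 - z_crit * se
upper_bound = mu0 + z_crit * se
z_lower = (lower_bound - mu1) / se
z_upper = (upper_bound - mu1) / se

# Step 5: probability of landing in either rejection region when the true mean is mu1
power = std_normal.cdf(z_lower) + (1 - std_normal.cdf(z_upper))

# Step 6: report
print(f"Power = {power:.2f}, beta = {1 - power:.2f} "
      f"for delta = {abs(mu1 - mu0)}, alpha = {alpha}, n = {n}")
```

Working with raw-unit boundaries first, as in step 4, makes it easy to see which tail contributes: here μ₁ lies above μ₀, so nearly all of the power comes from the upper rejection region.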
Worked Examples — Power of a Test
Example 1 — One-Sample Z-Test
Problem: A manufacturer claims battery life is μ₀ = 500 hours (σ = 40 hours). Engineers want to detect if the true mean is only μ₁ = 488 hours. With n = 64 batteries and α = 0.05 (two-tailed), what is the power of the test?
Formula: Power ≈ Φ(|μ₁ − μ₀|/SE − zα/2), where SE = σ/√n and zα/2 = 1.96 for α = 0.05.
Hypotheses: H₀: μ = 500 | H₁: μ ≠ 500. True mean to detect: μ₁ = 488.
Standard error: SE = 40/√64 = 40/8 = 5.0
Non-centrality: |μ₁ − μ₀|/SE = |488 − 500|/5 = 12/5 = 2.4
Power calculation: z* = 2.4 − 1.96 = 0.44. Power = Φ(0.44) ≈ 0.67
β = 1 − 0.67 = 0.33. Type II error rate is 33%.
✅ Conclusion: Power = 0.67 (67%). This is below the 0.80 threshold. To reach 80% power, the engineers need a larger sample. Solving n = ((zα/2 + zβ) × σ/δ)² = ((1.96 + 0.84) × 40/12)² ≈ 87.1 and rounding up gives n = 88 batteries for this effect.
Example 2 — Two-Sample T-Test
Problem: A clinical researcher compares a new drug (expected to lower systolic blood pressure by 8 mmHg) against placebo. Population SD is σ = 20 mmHg. Each group has n = 40 patients. α = 0.05, two-tailed. What is the power?
Effect size: d = 8/20 = 0.40 (between small and medium)
Non-centrality: d × √(n/2) = 0.40 × √20 = 0.40 × 4.472 = 1.789
Power (normal approximation to the t-test): z* = 1.789 − 1.96 = −0.171. Power ≈ Φ(−0.171) ≈ 0.43
Interpretation: β = 0.57. There is a 57% chance of missing this clinically meaningful effect.
✅ Conclusion: Power = 0.43, well below the 0.80 minimum. To detect d = 0.40 at α = 0.05 with 80% power, each group needs approximately n = 100 participants (total N = 200). As designed, the study is more likely to miss the effect than to find it, so the sample size should be revised before data collection. See the two-sample t-test guide for the full test procedure.
Example 3 — Proportion Test
Problem: A website conversion rate is claimed to be 5% (p₀ = 0.05). A marketing team wants to detect if a new landing page raises it to 8% (p₁ = 0.08). With n = 300 visitors and α = 0.05 (one-tailed), what is the power?
SE = √(p₀(1−p₀)/n)
SE: √(0.05 × 0.95 / 300) = √(0.0001583) = 0.01258
Effect / SE: (0.08 − 0.05) / 0.01258 = 0.03 / 0.01258 = 2.384
Power: z* = 2.384 − 1.645 = 0.739. Power = Φ(0.739) ≈ 0.77
✅ Conclusion: Power = 0.77 (77%). Close to the target but slightly short. Increasing to n = 340 visitors pushes power above 0.80 for this 3-percentage-point improvement. See the proportion hypothesis testing guide for the full test.
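The proportion example can be reproduced with the sketch below. The worked calculation uses the null-hypothesis standard error throughout; the optional `use_alt_se` flag (a hypothetical name, like the function itself) instead measures the spread under p₁, which is a common refinement and gives a somewhat lower figure. Both are normal approximations for an upper-tailed test.

```python
from statistics import NormalDist

std_normal = NormalDist()


def proportion_power(p0, p1, n, alpha=0.05, use_alt_se=False):
    """Upper-tailed one-sample proportion z-test power (normal approximation)."""
    se0 = (p0 * (1 - p0) / n) ** 0.5           # standard error under H0
    se1 = (p1 * (1 - p1) / n) ** 0.5           # standard error under the alternative
    boundary = p0 + std_normal.inv_cdf(1 - alpha) * se0   # reject when p-hat exceeds this
    spread = se1 if use_alt_se else se0
    return 1 - std_normal.cdf((boundary - p1) / spread)


simple = proportion_power(0.05, 0.08, 300)                    # matches the worked example
refined = proportion_power(0.05, 0.08, 300, use_alt_se=True)  # spread taken under p1
print(f"null-SE version: {simple:.2f}   alternative-SE version: {refined:.2f}")
```

The gap between the two versions is a reminder that, for proportions, the variance depends on the true rate, so a conservative plan may want a few more visitors than the simplified formula suggests.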
How to Increase the Power of a Test
When a power analysis reveals insufficient power, researchers have several routes to correct it before collecting data. Each involves a trade-off.
Increase Sample Size
The most reliable and widely applicable solution. Required sample size scales roughly with the square of the ratio (zα + zβ)/d. Doubling the detectable effect size reduces needed sample size by a factor of four. The sample size calculator computes n directly from power, α, and effect size inputs.
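The scaling rule can be checked with a closed-form sketch. It assumes a two-tailed z-test on a standardized effect d and ignores the negligible far-tail term; the function name `required_n` is hypothetical, and exact t-test software typically returns a value one or two participants higher.

```python
from math import ceil
from statistics import NormalDist

std_normal = NormalDist()


def required_n(d, power=0.80, alpha=0.05, two_sample=False):
    """Approximate n (per group if two_sample) for a two-tailed z-test on effect d."""
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    n = ((z_alpha + z_beta) / d) ** 2
    # A two-sample comparison needs twice as many per group as a one-sample test
    return ceil(2 * n) if two_sample else ceil(n)


print(required_n(0.3))                    # one-sample, d = 0.3
print(required_n(0.6))                    # doubling d cuts n to about a quarter
print(required_n(0.4, two_sample=True))   # per-group n for a two-sample design
```

Rounding up with `ceil` is a deliberate choice: rounding to the nearest integer can leave the design just short of the target power.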
Raise the Significance Level
In exploratory research, using α = 0.10 instead of α = 0.05 increases power without changing n. This is justified when Type II errors are more costly than Type I errors — for example, when screening candidates for follow-up study and false negatives (missed leads) matter more than false positives.
Reduce Measurement Variance
Using more precise instruments, standardizing procedures, training data collectors, or choosing a more homogeneous sample reduces σ and increases power without adding participants. Matched-pairs designs and multiple regression with covariates are statistical techniques that reduce residual variance, effectively boosting power.
Use a One-Tailed Test When Justified
When prior theory or evidence strongly supports a directional prediction, a one-tailed test concentrates the rejection region in one tail, lowering the critical value and increasing power for effects in that direction. This choice must be made before seeing the data and justified in advance, not selected after observing results.
Choose a More Powerful Test
Parametric tests (t-test, ANOVA) are more powerful than nonparametric equivalents (Mann-Whitney, Kruskal-Wallis) when normality assumptions hold. The statistical test selector helps identify the most powerful applicable test for a given data structure.
Power Analysis: Planning Studies Around Power
Power analysis is the use of power calculations in study design. Three types exist, differing in which quantity they solve for:
A Priori Power Analysis
Performed before data collection. Inputs: desired power, α, effect size. Output: required sample size. This is the standard, correct approach — journals and grant agencies expect it.
Post Hoc Power Analysis
Calculated after a non-significant result using the observed effect size. Widely criticized: "observed power" is a direct function of the p-value, so a non-significant result always implies low observed power (roughly 50% or less whenever p ≥ α), giving no additional information.
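A small sketch shows why post hoc power adds nothing beyond the p-value. It assumes a two-tailed z-test and plugs the observed effect in as if it were the true effect; the function name `observed_power` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()


def observed_power(p_value, alpha=0.05):
    """'Observed power' of a two-tailed z-test, treating the observed effect as true."""
    z_obs = std_normal.inv_cdf(1 - p_value / 2)    # |z| implied by the two-tailed p-value
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(z_obs - z_crit) + std_normal.cdf(-z_obs - z_crit)


for p in (0.05, 0.10, 0.30, 0.60):
    print(f"p = {p:.2f}  ->  observed power = {observed_power(p):.2f}")
```

The only input is the p-value itself, so observed power is just the p-value restated on a different scale, and it falls as p rises.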
Sensitivity Analysis
Inputs: achieved sample size, desired power, α. Output: minimum detectable effect size (MDE). Useful when sample size is fixed by external constraints (budget, population size).
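For a one-sample z-test, the minimum detectable effect follows from rearranging the sample-size relationship. This is a minimal sketch assuming known σ and a two-tailed test; the function name is hypothetical and the inputs reuse the battery example (σ = 40, n = 64).

```python
from statistics import NormalDist

std_normal = NormalDist()


def minimum_detectable_effect(sigma, n, power=0.80, alpha=0.05):
    """Smallest raw mean shift a two-tailed one-sample z-test detects with the given power."""
    se = sigma / n ** 0.5
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    return (z_alpha + z_beta) * se


mde = minimum_detectable_effect(sigma=40, n=64)
print(f"MDE at 80% power: {mde:.1f} hours")
```

The result is a little larger than the 12-hour shift in the battery example, which is consistent with that design falling short of 80% power.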
The recommended software for power analysis is G*Power (free, by Franz Faul at University of Kiel), which handles over 50 test types. In R, the pwr package (created by Stéphane Champely and maintained by Helios De Rosario) covers the most common cases. Python users can use statsmodels.stats.power.
Real-World Applications of Statistical Power
Clinical Trials
Regulatory agencies (FDA, EMA) expect confirmatory trial protocols to prespecify power, conventionally at least 0.80 and often 0.90. Underpowered trials waste resources and expose patients to risk without yielding useful evidence.
A/B Testing (Digital)
E-commerce and SaaS teams run A/B tests to detect conversion rate differences. Stopping a test before the planned sample size is reached — "peeking" — inflates Type I error and reduces effective power.
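The effect of peeking on the false positive rate can be illustrated with a small simulation. This is a simplified sketch rather than a model of any particular A/B testing platform: it draws standard normal observations with no true effect, runs a z-test at each interim look, and stops at the first significant result. The function name, look schedule, and simulation sizes are all assumptions.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-tailed critical value for alpha = 0.05


def false_positive_rate(n_sims, looks, final_n, seed=0):
    """Share of null experiments declared significant at any of the given looks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for i in range(1, final_n + 1):
            total += rng.gauss(0.0, 1.0)           # H0 is true: mean 0, sd 1
            if i in looks and abs(total / i ** 0.5) > Z_CRIT:
                hits += 1                          # stop at the first "significant" look
                break
    return hits / n_sims


single_look = false_positive_rate(2000, looks={200}, final_n=200)
five_looks = false_positive_rate(2000, looks={40, 80, 120, 160, 200}, final_n=200)
print(f"one look: {single_look:.3f}   five looks: {five_looks:.3f}")
```

Testing once at the planned sample size keeps the false positive rate near the nominal 5%, while checking five times and stopping on the first significant result pushes it well above that.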
Educational Research
Studies evaluating teaching interventions often involve small effect sizes (d = 0.2–0.3). Achieving 0.80 power for effects this small requires large samples (roughly 175 to 400 per group in a two-sample comparison at α = 0.05), which many school-based studies cannot reach.
Quality Control
Manufacturing processes use control charts and acceptance sampling. The power of these tests determines their ability to detect process shifts — directly affecting defect rates that reach customers.
Genomics/GWAS
Genome-wide association studies test millions of SNPs at α = 5×10⁻⁸. Detecting small genetic effects (OR ~1.1) with this strict threshold requires sample sizes in the hundreds of thousands.
Economics & Finance
Testing whether a trading strategy beats a benchmark requires accounting for the high variability of returns. Many backtests are underpowered — apparent outperformance may be noise.
Power vs Related Statistical Concepts
| Concept | Definition | Relationship to Power |
|---|---|---|
| Statistical Power (1 − β) | P(Reject H₀ \| H₀ false) | The concept itself |
| Type II Error (β) | P(Fail to Reject H₀ \| H₀ false) | Power = 1 − β; they sum to 1 |
| Significance Level (α) | P(Reject H₀ \| H₀ true) | Raising α increases power but also Type I error |
| Confidence Interval | Range of plausible parameter values | A 95% CI corresponds to α = 0.05; a wider CI reflects a larger SE and therefore less power |
| Effect Size (d, r, f²) | Standardized magnitude of the effect | Larger effect → higher power for same n |
| p-value | P(result at least as extreme as observed \| H₀ true) | A p-value below α is declared significant; power is the probability that p falls below α when the effect is real |
| Sample Size (n) | Number of observations | Larger n → smaller SE → higher power |
Frequently Asked Questions
In R, use the pwr package. For a two-sample t-test: pwr.t.test(d=0.5, n=64, sig.level=0.05, type="two.sample", alternative="two.sided"). For a one-sample z-test: pwr.norm.test(d=0.5, n=64, sig.level=0.05, alternative="two.sided"). Install with install.packages("pwr").
Related Topics in Hypothesis Testing
Statistical power sits within the broader framework of hypothesis testing. Concepts that connect directly to power include:
- Null and alternative hypothesis — the two competing claims that define the power calculation
- p-values — the observed significance from a completed test; power predicts how likely it is that p < α
- Type I and Type II errors — the two error types that bound power
- Significance level — α, which trades off against power
- Cohen's d — the most common standardized effect size for power calculations
- Sample size calculator — solve for the n that achieves your desired power
- Confidence interval for the mean — the interval-estimation complement to power-based inference
- Central Limit Theorem — the result that justifies the normal approximation underlying most power formulas