Bonferroni-Adjusted Alpha & Critical Value Calculator
What Is a Bonferroni-Adjusted Critical Value?
When you run m hypothesis tests using the same dataset or within the same study, the probability of getting at least one false positive grows with each additional test — even when all null hypotheses are true. The Bonferroni correction addresses this by lowering the significance threshold for each individual test.
The adjusted alpha is simply αadj = αfamily / m. The adjusted critical value is the two-tailed z score corresponding to this stricter threshold. A test statistic must exceed this higher bar to be counted as significant under the corrected framework.
Key distinction: αadj is what you compare each test's p-value against. The Bonferroni zcrit is what you compare each test statistic against. Both lead to identical decisions. The choice between them is a matter of what your software reports.
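The equivalence of the two rules can be checked numerically. The snippet below is a minimal sketch using only Python's standard library; the values of `alpha_family`, `m`, and `z_obs` are hypothetical and chosen purely for illustration.

```python
from statistics import NormalDist

# Hypothetical inputs, chosen only for illustration.
alpha_family = 0.05
m = 6
z_obs = 2.79

std_normal = NormalDist()  # standard normal: mean 0, sd 1

alpha_adj = alpha_family / m                     # per-test threshold
z_crit = std_normal.inv_cdf(1 - alpha_adj / 2)   # two-tailed critical value
p_raw = 2 * (1 - std_normal.cdf(abs(z_obs)))     # two-tailed raw p-value

reject_by_z = abs(z_obs) >= z_crit   # compare the statistic to z_crit
reject_by_p = p_raw <= alpha_adj     # compare the p-value to alpha_adj

print(f"alpha_adj = {alpha_adj:.5f}, z_crit = {z_crit:.3f}, p_raw = {p_raw:.4f}")
print(f"reject by z: {reject_by_z}, reject by p: {reject_by_p}")
```

Both flags agree because the p-value and the critical value are derived from the same standard normal distribution.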
Bonferroni-Adjusted Critical Value Table
Each row gives αadj = αfamily/m and the two-tailed z critical value for that adjusted threshold.
Two-tailed zcrit = InvNorm(1 − αadj/2). Reject H₀ for comparison i if |zi| ≥ zcrit, or equivalently if pi,raw ≤ αadj. Values verified against the standard normal distribution. m = total number of simultaneous hypothesis tests in the family.
How to Use the Bonferroni Critical Value Table
The procedure is direct. Each step has a single statistical decision attached to it.
Step 1 — Set Family-Wise α Before Data Collection
Decide on αfamily as part of your study design, not after seeing the results. The standard in most fields is α = 0.05. Clinical trials often use α = 0.01 for secondary endpoints; genomic studies sometimes use α = 0.05/genome-wide SNP count ≈ 5 × 10⁻⁸.
Step 2 — Count Every Planned Comparison (m)
List each null hypothesis you will test. Count them. If you have k groups in a one-way ANOVA and want all pairwise tests, m = k(k−1)/2:
k=4 groups → m = 4(4−1)/2 = 6 pairs
k=5 groups → m = 5(5−1)/2 = 10 pairs
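The pairwise count can also be confirmed by enumerating the pairs directly. This is a small sketch; the group labels and the helper name `pairwise_count` are hypothetical.

```python
from itertools import combinations
from math import comb

def pairwise_count(k: int) -> int:
    """Number of distinct pairs among k groups: k(k-1)/2."""
    return k * (k - 1) // 2

groups = ["A", "B", "C", "D"]            # hypothetical group labels
pairs = list(combinations(groups, 2))    # every unordered pair of groups

print(pairs)
print(len(pairs), pairwise_count(len(groups)), comb(len(groups), 2))
```

Listing the pairs explicitly is a useful check that m matches the comparisons actually planned.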
Step 3 — Read αadj and zcrit from the Table
Find the row for your m in the table above. The second column gives αadj = α/m. The third column gives the two-tailed z critical value corresponding to that adjusted threshold. If your exact m is not listed, use the next larger m (conservative) or enter your exact m in the calculator.
Step 4 — Apply the Decision Rule
If |zobs| ≥ zcrit → Reject H₀ → Significant after correction
If |zobs| < zcrit → Fail to reject H₀ → Not significant after correction
Equivalently: reject H₀ if praw ≤ αadj. The two rules are identical because praw = P(|Z| ≥ |zobs|) under H₀.
Step 5 — Report Results with Correction Stated
Always state m, αfamily, and αadj in your methods section. APA format example: "To control the family-wise error rate across m = 6 pairwise comparisons, a Bonferroni correction was applied (αadj = 0.008). Comparison 2 reached the corrected threshold: z = 2.79, p = 0.005 < 0.008."
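The five steps can be sketched end to end for a batch of test statistics. This is an illustrative outline, not a definitive implementation: the function name `bonferroni_report` and the z values are hypothetical, and the normal approximation is assumed to be appropriate.

```python
from statistics import NormalDist

def bonferroni_report(z_values, alpha_family=0.05):
    """Apply the two-tailed Bonferroni decision rule to each z statistic."""
    std_normal = NormalDist()
    m = len(z_values)                                # Step 2: count comparisons
    alpha_adj = alpha_family / m                     # Step 3: adjusted alpha
    z_crit = std_normal.inv_cdf(1 - alpha_adj / 2)   # Step 3: critical value
    rows = []
    for i, z in enumerate(z_values, start=1):
        p_raw = 2 * (1 - std_normal.cdf(abs(z)))     # two-tailed raw p-value
        rows.append((i, z, p_raw, abs(z) >= z_crit)) # Step 4: decision
    return m, alpha_adj, z_crit, rows

# Hypothetical z statistics for six planned pairwise comparisons.
z_values = [1.10, 2.79, 0.45, 2.20, 1.95, 0.30]
m, alpha_adj, z_crit, rows = bonferroni_report(z_values)

# Step 5: report m, alpha_family, and alpha_adj alongside each result.
print(f"m = {m}, alpha_family = 0.05, alpha_adj = {alpha_adj:.4f}, z_crit = {z_crit:.3f}")
for i, z, p_raw, significant in rows:
    verdict = "significant" if significant else "not significant"
    print(f"Comparison {i}: z = {z:.2f}, p = {p_raw:.4f} ({verdict} after correction)")
```

Note that the fourth and fifth hypothetical comparisons have raw p-values near or below 0.05 but do not clear the corrected threshold.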
The Multiple Comparisons Problem & Bonferroni Formula
Running m independent tests at the same α level means the chance of at least one false positive climbs sharply. The Bonferroni correction counteracts this inflation by tightening the per-test threshold.
Type I Error Inflation Without Correction
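For m independent tests each at nominal α = 0.05, the family-wise error rate is:
FWER = 1 − (1 − α)^m
With α = 0.05 this gives about 9.8% for m = 2, 22.6% for m = 5, and 40.1% for m = 10.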
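A minimal sketch of how quickly this rate grows, assuming independent tests and using the complement rule (the probability of no false positive across m tests is (1 − α)^m). The function names are hypothetical.

```python
def fwer_uncorrected(alpha: float, m: int) -> float:
    """FWER for m independent tests, each run at level alpha."""
    return 1 - (1 - alpha) ** m

def fwer_bonferroni(alpha_family: float, m: int) -> float:
    """FWER for m independent tests, each run at alpha_family / m."""
    return 1 - (1 - alpha_family / m) ** m

for m in (1, 2, 5, 10, 20):
    print(f"m = {m:2d}: uncorrected = {fwer_uncorrected(0.05, m):.4f}, "
          f"Bonferroni = {fwer_bonferroni(0.05, m):.4f}")
```

Under independence the Bonferroni-corrected rate stays just below the nominal 0.05 for every m, while the uncorrected rate climbs toward 1.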
The Bonferroni Adjustment Formula
By Boole's inequality, FWER ≤ m × αadj for any test configuration (independent or dependent). Setting m × αadj = αfamily gives:
αadj = αfamily / m
The corrected two-tailed z critical value is then: zcrit = InvNorm(1 − αadj/2). For one-tailed tests: zcrit = InvNorm(1 − αadj).
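The argument can be written out as a short derivation. The notation below is an assumption introduced for illustration: V_i denotes the event that test i falsely rejects a true null hypothesis, and the worst case in which all m nulls are true is taken.

```latex
\begin{align}
\mathrm{FWER}
  &= P\Big(\bigcup_{i=1}^{m} V_i\Big)
   \le \sum_{i=1}^{m} P(V_i)
   \le m\,\alpha_{\mathrm{adj}}
   && \text{(Boole's inequality)} \\
m\,\alpha_{\mathrm{adj}} = \alpha_{\mathrm{family}}
  &\;\Longrightarrow\;
   \alpha_{\mathrm{adj}} = \frac{\alpha_{\mathrm{family}}}{m} \\
z_{\mathrm{crit}}^{\text{two-tailed}}
  &= \Phi^{-1}\!\Big(1 - \frac{\alpha_{\mathrm{adj}}}{2}\Big),
  \qquad
  z_{\mathrm{crit}}^{\text{one-tailed}}
   = \Phi^{-1}\big(1 - \alpha_{\mathrm{adj}}\big)
\end{align}
```

No independence assumption enters at any step, which is why the bound holds for correlated tests as well.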
Bonferroni-Adjusted p-Values
Some software (including R's p.adjust()) reports Bonferroni-adjusted p-values rather than adjusted alphas. The two approaches are equivalent:
padj = min(1, m × praw)
Compare padj against αfamily (e.g. 0.05)
Dividing α by m (threshold approach) and multiplying p by m (adjusted p-value approach) always produce the same significance decision.
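A quick sketch comparing the two approaches on a small set of raw p-values (the same six values used in the software examples on this page). Variable names are illustrative.

```python
raw_p = [0.004, 0.012, 0.031, 0.045, 0.121, 0.550]
alpha_family = 0.05
m = len(raw_p)

# Threshold approach: shrink alpha, leave p-values alone.
alpha_adj = alpha_family / m
by_threshold = [p <= alpha_adj for p in raw_p]

# Adjusted p-value approach: inflate p-values (capped at 1), leave alpha alone.
adj_p = [min(1.0, p * m) for p in raw_p]
by_adjusted = [q <= alpha_family for q in adj_p]

print("adjusted p-values:", [round(q, 3) for q in adj_p])
print("threshold approach:", by_threshold)
print("adjusted-p approach:", by_adjusted)
```

Both lists of decisions match: only the first test (p = 0.004) survives the correction.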
Multi-Alpha Reference Matrix
Adjusted alpha values and two-tailed z critical values across all three standard family-wise significance levels for the most commonly used comparison counts.
| m | αadj (α=0.10) | zcrit (α=0.10) | αadj (α=0.05) | zcrit (α=0.05) | αadj (α=0.01) | zcrit (α=0.01) |
|---|---|---|---|---|---|---|
Two-tailed zcrit. αadj shown to six decimal places. zcrit = InvNorm(1 − αadj/2).
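Rows in this layout can be generated from the stated formula. The sketch below uses only the standard library; the list of m values in `M_VALUES` is an assumption for illustration, not the page's original selection, and `matrix_row` is a hypothetical helper.

```python
from statistics import NormalDist

ALPHAS = (0.10, 0.05, 0.01)                      # family-wise levels in column order
M_VALUES = (1, 2, 3, 4, 5, 6, 8, 10, 15, 20)     # assumed comparison counts

def matrix_row(m: int) -> str:
    """Format one table row: m, then (alpha_adj, z_crit) for each family-wise alpha."""
    cells = [str(m)]
    for alpha_family in ALPHAS:
        alpha_adj = alpha_family / m
        z_crit = NormalDist().inv_cdf(1 - alpha_adj / 2)  # two-tailed
        cells += [f"{alpha_adj:.6f}", f"{z_crit:.3f}"]
    return "| " + " | ".join(cells) + " |"

for m in M_VALUES:
    print(matrix_row(m))
```

The m = 1 row reproduces the familiar uncorrected critical values (1.645, 1.960, 2.576), which is a convenient sanity check.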
Worked Examples Across Seven Research Contexts
Each example below shows how to identify m, compute αadj, and find the corrected critical value for a specific research scenario.
Example 1 — Multi-Arm Clinical Trial (Oncology)
Setup: A trial tests a drug against control on five endpoints: overall survival, progression-free survival, objective response rate, toxicity grade, and quality-of-life score. αfamily = 0.05.
Reporting: "A Bonferroni correction was applied across m = 5 primary endpoints (αadj = 0.010). Endpoint 1 (OS): z = 2.74, p = 0.006 < 0.010 — significant. Endpoint 3 (ORR): z = 2.31, p = 0.021 > 0.010 — not significant after correction."
Example 2 — Post-Hoc ANOVA (Psychology, 4 Groups)
Setup: A study compares cognitive performance across four sleep deprivation conditions (0 h, 12 h, 24 h, 48 h awake). A significant one-way ANOVA triggers pairwise t-tests.
Decision rule: Under the large-sample z approximation, each pairwise test rejects H₀ only if |z| ≥ 2.638. For small n, use the t-distribution with the appropriate df instead: with n = 15 per group and df = 28 per comparison, the two-tailed Bonferroni tcrit at αadj = 0.00833 is approximately 2.84.
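The z and t critical values for this example can be computed side by side. This sketch assumes SciPy is installed and that each pairwise comparison is a two-sample t-test with df = 15 + 15 − 2 = 28.

```python
from scipy.stats import norm, t

alpha_family = 0.05
m = 6          # all pairwise comparisons among 4 groups
df = 28        # two groups of n = 15: 15 + 15 - 2

alpha_adj = alpha_family / m
z_crit = norm.ppf(1 - alpha_adj / 2)     # large-sample approximation
t_crit = t.ppf(1 - alpha_adj / 2, df)    # small-sample critical value

print(f"alpha_adj = {alpha_adj:.5f}")
print(f"z_crit = {z_crit:.3f}, t_crit (df = {df}) = {t_crit:.3f}")
```

The t critical value is noticeably larger than the z value at this sample size, so using the z table here would be anti-conservative.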
Example 3 — Genomic SNP Association (Bioinformatics)
Setup: A candidate-gene study tests 100 SNPs for association with rheumatoid arthritis. αfamily = 0.05.
Note: Genome-wide association studies (GWAS) often test millions of SNPs. For m = 1,000,000 the Bonferroni threshold is approximately zcrit = 5.45, equivalent to p < 5 × 10⁻⁸. This is why GWAS discoveries require very large samples and replication cohorts.
Example 4 — Psychometric Assessment (3 Scales × 2 Timepoints)
Setup: A study evaluates three psychometric outcomes (anxiety, depression, QoL) at two timepoints (post-treatment and 6-month follow-up), giving m = 6 planned tests.
Example 5 — Industrial Quality Control (8 Machine Outputs)
Setup: Monitoring physical tolerances across 8 concurrent drill-press outputs against a target spec. αfamily = 0.01 (strict manufacturing standard).
Example 6 — A/B/n Testing (4 Checkout Variants)
Setup: An e-commerce platform compares 4 new checkout designs against the current control. αfamily = 0.05.
Practical note: In A/B testing, traffic is typically split evenly. With n = 10,000 users per variant and a baseline conversion rate of 5%, the minimum detectable effect at this corrected threshold requires roughly 14% relative lift instead of the uncorrected 10%.
Example 7 — Neuroimaging Regions of Interest (20 Structures)
Setup: A structural MRI study tests grey matter volume differences across 20 cortical regions between patients and controls. αfamily = 0.05.
Alternative: For whole-brain voxel-wise analyses with thousands of voxels, FDR (Benjamini-Hochberg) is typically preferred over Bonferroni because it has substantially higher power when many true effects are present across the brain.
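The adjusted thresholds for the seven setups above can be computed in one pass. This is a sketch under stated assumptions: m and αfamily are taken from each setup, except Example 4, where no αfamily is given and 0.05 is assumed. The GWAS figure from the note in Example 3 is included as a check.

```python
from statistics import NormalDist

def adjusted(alpha_family: float, m: int):
    """Return (alpha_adj, two-tailed z_crit) for a Bonferroni family of m tests."""
    alpha_adj = alpha_family / m
    return alpha_adj, NormalDist().inv_cdf(1 - alpha_adj / 2)

# (label, m, alpha_family) for each worked example.
examples = [
    ("Ex 1: clinical trial endpoints", 5, 0.05),
    ("Ex 2: post-hoc ANOVA, 4 groups", 6, 0.05),
    ("Ex 3: candidate-gene SNPs", 100, 0.05),
    ("Ex 4: 3 scales x 2 timepoints", 6, 0.05),   # alpha_family assumed, not stated
    ("Ex 5: quality control outputs", 8, 0.01),
    ("Ex 6: A/B/n checkout variants", 4, 0.05),
    ("Ex 7: neuroimaging regions", 20, 0.05),
]

results = {label: adjusted(alpha, m) for label, m, alpha in examples}
for label, (alpha_adj, z_crit) in results.items():
    print(f"{label:34s} alpha_adj = {alpha_adj:.6f}  z_crit = {z_crit:.3f}")

# GWAS-scale check from the note in Example 3: m = 1,000,000 tests.
gwas_alpha_adj, gwas_z = adjusted(0.05, 1_000_000)
print(f"GWAS (m = 1,000,000): alpha_adj = {gwas_alpha_adj:.1e}, z_crit = {gwas_z:.2f}")
```

For small samples these z values should be replaced by t critical values at the appropriate degrees of freedom.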
Bonferroni vs. Alternative Multiple Testing Corrections
Selecting the right correction depends on your error control objective, the number of tests, and how correlated those tests are. The table below maps each method to its primary use case.
| Procedure | Error Control | Power | Best For | Key Limitation |
|---|---|---|---|---|
| Bonferroni | FWER | Low — most conservative | Small m (<10), confirmatory endpoints, independent tests | Over-conservative with correlated tests or large m |
| Holm (Step-Down) | FWER | Moderate — always ≥ Bonferroni | Same FWER guarantee as Bonferroni, any m | Slightly more complex; requires sorted p-values |
| Šidák | FWER | Marginally higher than Bonferroni | Independent orthogonal contrasts | Assumes test independence; fails for correlated tests |
| Tukey HSD | FWER | High for all-pairwise ANOVA | All pairwise comparisons of group means after ANOVA | Only valid for balanced ANOVA post-hoc; not general |
| Benjamini-Hochberg (FDR) | FDR (expected false discovery proportion) | High — especially for large m | Genomics, neuroimaging, large-scale screening | Allows some false positives; requires replication |
| Benjamini-Yekutieli (FDR) | FDR — valid under any dependence | Lower than BH | Correlated tests where BH assumptions may fail | More conservative than standard BH; rarely needed |
When Holm Is Preferable to Bonferroni
Holm's step-down method tests hypotheses in order from smallest to largest p-value. The smallest p-value is compared against the Bonferroni threshold α/m, and each subsequent p-value against a progressively less strict threshold: α/(m − 1), α/(m − 2), and so on, stopping at the first non-rejection. The result is the same FWER guarantee as Bonferroni with power that is never lower: Holm rejects every hypothesis Bonferroni rejects and can detect effects Bonferroni misses. For this reason, Holm is widely recommended over Bonferroni when FWER control is needed. See the hypothesis testing guide for a full comparison.
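A minimal pure-Python sketch of the step-down procedure, compared with plain Bonferroni. The function names and the p-values are hypothetical; in practice a library routine such as R's p.adjust(method = "holm") would normally be used.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 where p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    reject = [False] * m
    for rank, idx in enumerate(order):       # rank = 0 for the smallest p-value
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break                            # stop at the first non-rejection
    return reject

# Hypothetical raw p-values for six tests.
p_values = [0.004, 0.009, 0.031, 0.045, 0.121, 0.550]

print("Bonferroni:", bonferroni_reject(p_values))
print("Holm:      ", holm_reject(p_values))
```

In this hypothetical set, p = 0.009 misses the Bonferroni threshold (0.05/6 ≈ 0.0083) but clears Holm's second-step threshold (0.05/5 = 0.01), so Holm rejects two hypotheses where Bonferroni rejects one.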
Common Mistakes in Bonferroni Correction
Pooling Unrelated Tests
Combining tests from separate studies or unrelated research questions into one family inflates m needlessly, destroying power for all tests. Only tests addressing the same scientific question should share a correction family.
Multiplying α Instead of Dividing
αadj = α/m (divide α to get threshold). The Bonferroni adjusted p-value = p × m (multiply p). Confusing the two leads to wrong conclusions. Dividing the p-value or multiplying α are both errors.
Post-Hoc Selection of m
Setting m after seeing which tests were significant — to shrink the correction — is a form of p-hacking. The number of comparisons must be determined by the study design and pre-registered, not by results.
Applying Bonferroni to Subgroups Separately
If a study tests the same hypothesis in 5 subgroups plus the full sample, all 6 tests belong to the same family (m = 6). Applying Bonferroni within each subgroup separately (m = 1 per subgroup) violates the family-wise logic.
Software Implementation
Both R and Python compute Bonferroni-adjusted p-values and critical values directly. Paste the snippets below into your analysis script.
R — Using p.adjust()
raw_p <- c(0.004, 0.012, 0.031, 0.045, 0.121, 0.550)
# Bonferroni-adjusted p-values (multiply by m, capped at 1)
adj_p <- p.adjust(raw_p, method = "bonferroni")
print(adj_p)
# [1] 0.024 0.072 0.186 0.270 0.726 1.000
# Alternative methods available: "holm", "hochberg", "BH", "BY"
holm_p <- p.adjust(raw_p, method = "holm")
Python — Using statsmodels
from statsmodels.stats.multitest import multipletests
raw_p = [0.004, 0.012, 0.031, 0.045, 0.121, 0.550]
# Bonferroni correction
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')
print(f"Adjusted p-values: {adj_p}")
print(f"Rejected H0: {reject}")
# Holm–Bonferroni (more power, same FWER guarantee)
reject_holm, adj_holm, _, _ = multipletests(raw_p, alpha=0.05, method='holm')
Computing zcrit for Any m and α in Python
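from scipy.stats import norm

def bonferroni_zcrit(alpha_family, m, tails=2):
    """Return the Bonferroni-adjusted z critical value for m tests."""
    alpha_adj = alpha_family / m  # per-test significance threshold
    if tails == 2:
        return norm.ppf(1 - alpha_adj / 2)  # two-tailed
    return norm.ppf(1 - alpha_adj)  # one-tailed

print(round(bonferroni_zcrit(0.05, 10), 3))  # → 2.807
print(round(bonferroni_zcrit(0.05, 20), 3))  # → 3.023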
Frequently Asked Questions
What is a Bonferroni-adjusted critical value?
A Bonferroni-adjusted critical value is the modified z (or t) threshold applied to each of m simultaneous hypothesis tests to hold the family-wise error rate at α. It equals InvNorm(1 − αadj/2) for two-tailed tests, where αadj = α/m. Every comparison must clear this higher bar to be counted as significant.
Why is Bonferroni called conservative?
The correction relies on Boole's inequality, which bounds FWER ≤ Σαadj for any correlation structure among tests. When tests are positively correlated, which is common in practice, the actual FWER is lower than this bound, meaning the correction imposes a stricter threshold than necessary. The Holm method improves on this while keeping the same FWER guarantee; the Šidák method is marginally less conservative but assumes independent tests.
Can Bonferroni be applied to t-tests instead of z-tests?
Yes. For t-tests with finite sample sizes, use the tcrit from the t-distribution table at αadj/2 and the relevant degrees of freedom, rather than the z table. The zcrit values in this table are valid approximations when n is large (>30 per group). For small samples, always use the t-distribution.
How many comparisons require Bonferroni correction?
There is no fixed cutoff — in principle, any m ≥ 2 tests that share a decision-making family should be considered. In practice, many researchers apply Bonferroni from m = 3 onward, since with m = 2 the uncorrected FWER at α = 0.05 is only ~9.75%, which some consider acceptable. For m = 1, no correction is needed (αadj = α). The question of which tests belong in the same family is a scientific judgment, not a statistical one.
What does it mean if my result is not significant after Bonferroni?
A result that reaches p < 0.05 but not p < αadj does not mean the effect is absent, only that the evidence is insufficient to declare significance under the stricter family-wise threshold. This is especially common with large m and borderline p-values. In such cases, reporting the uncorrected p, the effect size, and a confidence interval gives readers the full picture and allows them to assess practical importance independently of the correction.
How do Bonferroni-adjusted confidence intervals work?
For m simultaneous confidence intervals with a family-wise coverage of (1 − α), each individual interval uses (1 − αadj) = (1 − α/m) as its nominal confidence level. The critical value in the CI formula becomes zcrit from the Bonferroni table. The resulting set of intervals collectively contains all true parameter values with probability ≥ 1 − α. For example, with α = 0.05 and m = 5, each CI is a 99% interval (αadj = 0.01), and together they form a 95% simultaneous confidence region.
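A sketch of simultaneous intervals built this way, assuming normally distributed estimators with known standard errors. The function name and the estimates are hypothetical.

```python
from statistics import NormalDist

def bonferroni_intervals(estimates, std_errors, alpha_family=0.05):
    """Simultaneous z-based CIs: each interval uses level 1 - alpha_family / m."""
    m = len(estimates)
    alpha_adj = alpha_family / m
    z_crit = NormalDist().inv_cdf(1 - alpha_adj / 2)
    return [(est - z_crit * se, est + z_crit * se)
            for est, se in zip(estimates, std_errors)]

# Hypothetical mean differences and standard errors for m = 5 comparisons.
estimates = [0.0, 1.2, -0.4, 2.5, 0.9]
std_errors = [1.0, 0.5, 0.3, 0.8, 0.6]

for est, (lo, hi) in zip(estimates, bonferroni_intervals(estimates, std_errors)):
    print(f"estimate = {est:5.2f}: 99% CI = ({lo:6.3f}, {hi:6.3f})")
```

With m = 5 each interval uses the 99% critical value (about 2.576) in place of 1.96, so every interval is roughly 31% wider than its unadjusted 95% counterpart.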
References & Further Reading
Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8, 3–62. The original paper establishing the inequality underlying the correction.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70. The step-down extension that improves power while keeping the same FWER guarantee.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300. Introduced the FDR concept and the BH procedure. doi:10.1111/j.2517-6161.1995.tb02031.x
NIST/SEMATECH e-Handbook of Statistical Methods (2013). Section 7.3.6: Multiple comparisons. National Institute of Standards and Technology. itl.nist.gov
Westfall, P. H., & Young, S. S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. Wiley. Comprehensive treatment of FWER and FDR methods with worked examples across biomedical research contexts.
Understanding What Bonferroni Correction Does and Does Not Guarantee
What It Guarantees
Bonferroni guarantees that the probability of at least one false positive among all m tests is ≤ αfamily. This holds regardless of the correlation structure among the tests, because Boole's inequality does not require independence. It is a strong, unconditional bound on FWER.
What It Does Not Guarantee
Bonferroni does not control the false discovery rate (FDR) — the proportion of significant results that are false positives. With 100 tests and 50 true effects, a significant Bonferroni result is very likely a true positive, but the correction is also likely to miss many of the true 50. FDR methods are designed for exactly this trade-off. See the hypothesis testing guide for a full comparison of error control frameworks.
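The trade-off can be illustrated by running Bonferroni and the Benjamini-Hochberg step-up procedure on the same p-values. This is a minimal pure-Python sketch; the function names and the ten p-values are hypothetical, and BH's FDR guarantee is assumed to apply (independent or positively dependent tests).

```python
def bonferroni_reject(p_values, alpha=0.05):
    """FWER control: reject H0 where p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, q=0.05):
    """FDR control: find the largest rank k with p_(k) <= (k / m) * q,
    then reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            n_reject = rank              # keep the largest qualifying rank
    reject = [False] * m
    for idx in order[:n_reject]:
        reject[idx] = True
    return reject

# Hypothetical raw p-values for ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

print("Bonferroni rejections:", sum(bonferroni_reject(p_values)))
print("BH rejections:        ", sum(benjamini_hochberg_reject(p_values)))
```

On these values Bonferroni rejects one hypothesis and BH rejects two, reflecting BH's willingness to accept a controlled proportion of false discoveries in exchange for power.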
Statistical Significance vs. Effect Size
A result that survives Bonferroni correction is statistically significant — but the correction says nothing about whether the effect is practically meaningful. Always report effect sizes (Cohen's d, η², odds ratio) alongside corrected p-values. A corrected p = 0.003 with Cohen's d = 0.08 is statistically significant but trivially small. See the effect size guide for detail.