What Is a P-Value? (Definition)
The core idea is a thought experiment: if the null hypothesis were true and you repeated the same study many times, how often would you see a result at least as extreme as yours? A p-value of 0.04 means about 4% of the time, which is infrequent enough to raise doubt about H₀. A p-value of 0.60 means about 60% of the time: your result is entirely ordinary under H₀ and gives no reason to reject it.
That probability is measured against a pre-set threshold called the significance level, written α. The most common value is α = 0.05, established by convention in R.A. Fisher's 1925 work. When p < α, the result is called statistically significant and you reject H₀. When p ≥ α, you fail to reject H₀ — never "accept" it, because absence of evidence is not evidence of absence.
This is covered as part of the broader hypothesis testing framework at Statistics Fundamentals, where the full 6-step procedure is described in detail. The p-value specifically answers step 5 of that procedure.
- What it measures: Compatibility of your sample data with the null hypothesis
- Range: Always between 0 and 1. Cannot be negative or greater than 1.
- Small p-value: Your data would be rare if H₀ were true — evidence against H₀
- Large p-value: Your data is consistent with H₀ — no reason to reject it
- Decision rule: Reject H₀ when p < α (usually 0.05)
- What it does not measure: The probability H₀ is true, the size of an effect, or practical importance
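The thought experiment behind the definition can be sketched as a small simulation. The study numbers here (null mean 100, known SD 15, n = 25, observed mean 106) are hypothetical and chosen only for illustration; the function name `simulated_p_value` is likewise an assumption, not part of any library.

```python
import random
from statistics import NormalDist, fmean

def simulated_p_value(observed_mean, mu0, sigma, n, reps=10_000, seed=0):
    """Estimate a two-tailed p-value by repeating the study under H0."""
    rng = random.Random(seed)
    observed_gap = abs(observed_mean - mu0)
    extreme = 0
    for _ in range(reps):
        # Draw one imaginary study of size n from a world where H0 is true.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        # Count it if its mean lands at least as far from mu0 as ours did.
        if abs(fmean(sample) - mu0) >= observed_gap:
            extreme += 1
    return extreme / reps

# Hypothetical study: H0 mean 100, known SD 15, n = 25, observed mean 106.
p_sim = simulated_p_value(106, mu0=100, sigma=15, n=25)

# Analytic two-tailed p-value for comparison (z = 2.0).
z = (106 - 100) / (15 / 25 ** 0.5)
p_exact = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"simulated p = {p_sim:.4f}, exact p = {p_exact:.4f}")
```

The simulated fraction should land close to the analytic tail area, and the two get closer as `reps` grows.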
How to Interpret a P-Value
Interpretation is where most mistakes happen. The p-value tells you one thing: how surprising your data would be if H₀ were true. It tells you nothing else directly. Here is a concrete framework for reading any p-value you encounter.
Evidence Strength Scale
P-Value Interpretation Guide
These labels are conventions, not hard rules. The American Statistical Association's 2016 statement — Wasserstein & Lazar (2016) — explicitly cautions against treating 0.05 as a bright line between meaningful and meaningless results. Context matters: a p-value of 0.04 in a small exploratory study calls for different weight than the same value in a pre-registered trial with n = 5,000.
One-Tailed vs Two-Tailed P-Values
The directionality of H₁ determines which tail of the distribution you count. A two-tailed test asks "is there an effect in either direction?" and counts both tails. For a symmetric null distribution, its p-value is therefore twice the one-tailed p-value whenever the observed effect lies in the hypothesized direction, which makes the two-tailed test the more conservative choice. Always match the tail count to your H₁, chosen before data collection, not after.
| Test Type | H₁ | P-Value Calculation | Use When |
|---|---|---|---|
| Two-tailed | μ ≠ μ₀ | p = 2 × P(Z > \|z\|) | You expect an effect but not its direction |
| Right-tailed (upper) | μ > μ₀ | p = P(Z > z) | You expect the parameter to be larger |
| Left-tailed (lower) | μ < μ₀ | p = P(Z < z) | You expect the parameter to be smaller |
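The three rows of the table translate directly into code. This is a minimal sketch using Python's standard-library `NormalDist`; the helper name `z_p_value` and the test statistic 1.8 are illustrative assumptions.

```python
from statistics import NormalDist

def z_p_value(z, tail="two"):
    """Return the p-value for a z statistic; tail is 'two', 'right', or 'left'."""
    std_normal = NormalDist()  # mean 0, SD 1
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z)))
    if tail == "right":
        return 1 - std_normal.cdf(z)
    if tail == "left":
        return std_normal.cdf(z)
    raise ValueError("tail must be 'two', 'right', or 'left'")

z = 1.8  # hypothetical test statistic
for tail in ("two", "right", "left"):
    print(tail, round(z_p_value(z, tail), 4))
```

With a positive z, the right-tailed value is half the two-tailed one, while the left-tailed value is large because the data point away from that alternative.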
P-Value Formula and Calculation
There is no single p-value formula because the calculation depends on the test statistic your study uses. The steps are always the same: compute the test statistic, then find the probability of observing that value or more extreme under the null distribution.
From a Z-Test
z = (x̄ − μ₀) / (σ/√n), with two-tailed p = 2 × [1 − Φ(|z|)]
x̄ = sample mean
μ₀ = null hypothesis mean
σ = known population SD
n = sample size
Φ is the standard normal CDF. Once you have z, you look up the area in the tail of the normal distribution using a z-table. For |z| = 1.96 the one-tail area is 0.025, so the two-tailed p-value is 0.05 — exactly the conventional cutoff.
From a T-Test
t = (x̄ − μ₀) / (s/√n)
s = sample SD
df = n − 1 (one-sample)
Once you have t and df, you find p using the t-distribution with df degrees of freedom. Because the t-distribution has heavier tails than the normal, the same test statistic produces a larger p-value than a z-test would — the extra uncertainty from not knowing σ is accounted for. Use the t-distribution table or software for the exact p-value.
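To see the heavier tails at work without a table, the t-distribution tail area can be approximated numerically. The sketch below integrates the t density with Simpson's rule using only the standard library; the function names and the inputs (statistic 2.0, df = 15) are illustrative assumptions, and dedicated statistical software would normally be used instead.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """P(T > |t|), via Simpson's rule on [0, |t|] and symmetry about 0."""
    b = abs(t)
    if b == 0:
        return 0.5
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

stat, df = 2.0, 15  # hypothetical statistic and degrees of freedom
p_t = 2 * t_upper_tail(stat, df)
p_z = 2 * (1 - NormalDist().cdf(stat))
print(f"t-based p = {p_t:.4f}, z-based p = {p_z:.4f}")
```

For the same statistic, the t-based p-value comes out larger than the z-based one, and the gap narrows as df increases.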
From a Chi-Square Test
χ² = Σ (O − E)² / E
O = observed frequency
E = expected frequency
df = (rows−1)(cols−1) for a test of independence
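The p-value from a chi-square test is always an upper-tail area: χ² cannot be negative, and any departure of observed from expected counts, in either direction, pushes it upward. A large χ² means observed and expected counts diverge greatly, which is evidence against the null hypothesis of independence (or, in a goodness-of-fit test, against the hypothesized distribution). See the chi-square table for critical values.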
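For whole-number degrees of freedom, the upper-tail area can be computed without a table. The sketch below uses the standard recurrence for the regularized upper incomplete gamma function, Q(a + 1, y) = Q(a, y) + yᵃe⁻ʸ/Γ(a + 1), starting from Q(1/2, y) = erfc(√y) or Q(1, y) = e⁻ʸ. The function name `chi2_upper_tail` and the sample inputs are assumptions for illustration.

```python
import math

def chi2_upper_tail(x, df):
    """P(chi-square with integer df > x) via the incomplete-gamma recurrence."""
    if x <= 0:
        return 1.0
    y = x / 2
    if df % 2:                       # odd df: start from Q(1/2, y)
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:                            # even df: start from Q(1, y)
        a, q = 1.0, math.exp(-y)
    # Step a up by 1 until it reaches df / 2.
    while a < df / 2:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q

# Hypothetical statistics for a few degrees of freedom.
for x, df in ((3.841, 1), (5.991, 2), (7.815, 3)):
    print(f"chi2 = {x}, df = {df}: p = {chi2_upper_tail(x, df):.4f}")
```

The three inputs are the familiar α = 0.05 critical values for df = 1, 2, and 3, so each printed p-value should be close to 0.05.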
Calculating P-Values in Software
| Software | Function / Command | Returns |
|---|---|---|
| Excel | =NORM.S.DIST(-ABS(z),TRUE)*2 | Two-tailed z p-value |
| Excel | =T.DIST.2T(ABS(t),df) | Two-tailed t p-value |
| Excel | =CHISQ.DIST.RT(chi2,df) | Chi-square p-value |
| R | 2*pnorm(-abs(z)) | Two-tailed z p-value |
| R | 2*pt(-abs(t), df) | Two-tailed t p-value |
| R | pchisq(chi2, df, lower.tail=FALSE) | Chi-square p-value |
| Python (scipy) | stats.norm.sf(abs(z))*2 | Two-tailed z p-value |
| Python (scipy) | stats.t.sf(abs(t), df)*2 | Two-tailed t p-value |
| SPSS | Reported automatically in output tables | Listed as "Sig." |
Worked Examples — P-Values Step by Step
Each example follows the same sequence: state hypotheses, set α, compute the test statistic, find the p-value, and draw a conclusion. The arithmetic is shown in full so you can see exactly where the number comes from.
Example 1 — One-Sample Z-Test
A bottling plant claims its bottles contain μ = 500 mL on average (σ = 12 mL known). Quality control samples 64 bottles and finds x̄ = 503.6 mL. Find the p-value at α = 0.05 for a two-tailed test.
Hypotheses: H₀: μ = 500 mL | H₁: μ ≠ 500 mL (two-tailed)
Significance level: α = 0.05. Critical values: z = ±1.96 for a two-tailed test.
Standard error: SE = σ/√n = 12/√64 = 12/8 = 1.5
Test statistic: z = (503.6 − 500) / 1.5 = 3.6 / 1.5 = 2.40
P-value: One-tail area for z = 2.40 is P(Z > 2.40) = 0.0082 (from the z-table). Two-tailed p = 2 × 0.0082 = 0.0164 ≈ 0.016
Decision: p = 0.016 < α = 0.05 → Reject H₀
✅ Conclusion: At the 5% level, there is sufficient evidence that the mean fill volume differs from 500 mL. The sample mean of 503.6 mL is statistically significantly higher than the claimed average.
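The arithmetic in Example 1 can be checked in a few lines. This is a sketch using the standard library's `NormalDist`; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma = 500.0, 12.0      # claimed mean and known SD (mL)
n, sample_mean = 64, 503.6    # sample size and observed mean (mL)

se = sigma / sqrt(n)                      # standard error, 12 / 8
z = (sample_mean - mu0) / se              # about 2.40
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"SE = {se:.2f}, z = {z:.2f}, p = {p_two_tailed:.4f}")
print("Reject H0" if p_two_tailed < 0.05 else "Fail to reject H0")
```

The computed p-value agrees with the table-based value of roughly 0.016.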
Example 2 — One-Sample T-Test
A sleep researcher believes adults sleep less than the recommended 8 hours. A sample of 16 adults yields x̄ = 7.1 hours with s = 1.2 hours. Is there evidence the population mean is below 8 hours? Use α = 0.05, one-tailed.
df = 16 − 1 = 15
Left-tailed test (H₁: μ < 8)
Hypotheses: H₀: μ = 8 hours | H₁: μ < 8 hours (left-tailed)
α = 0.05. With df = 15, the one-tailed critical value is t* = −1.753 from the t-distribution table.
Test statistic: t = (7.1 − 8) / (1.2/4) = −0.9 / 0.3 = −3.00
P-value: P(t₁₅ < −3.00). From the t-distribution, the area in the left tail for t = −3.00 with df = 15 gives p ≈ 0.005
Decision: p = 0.005 < α = 0.05 → Reject H₀. Also |t| = 3.00 > 1.753.
✅ Conclusion: At the 5% level, the evidence supports that adults sleep significantly less than 8 hours on average. See the full one-sample t-test guide for the complete methodology.
Example 3 — Two-Sample T-Test
Two teaching methods are compared. Group A (n = 20, x̄ = 78, s = 9) and Group B (n = 20, x̄ = 83, s = 11). Is the mean difference statistically significant at α = 0.05?
Hypotheses: H₀: μ_A = μ_B | H₁: μ_A ≠ μ_B (two-tailed)
Standard error (unpooled, as in Welch's test): SE = √(9²/20 + 11²/20) = √(81/20 + 121/20) = √(10.1) = 3.178
Test statistic: t = (78 − 83) / 3.178 = −5 / 3.178 = −1.573
P-value: With df ≈ 37 (Welch's approximation), two-tailed p ≈ 0.124. See the two-sample t-test page for the df formula.
Decision: p = 0.124 > α = 0.05 → Fail to reject H₀. The 5-point difference does not reach significance with these sample sizes and variances.
⚠️ Conclusion: There is not sufficient evidence at α = 0.05 to conclude the two teaching methods produce different mean scores. Note: "fail to reject" does not mean the methods are identical — it means the data are inconclusive.
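The test statistic and the approximate degrees of freedom in Example 3 can be reproduced with the Welch-Satterthwaite formula. This sketch stops at t and df; the function name `welch_t` is an assumption, and the p-value itself would come from a t-distribution routine or table.

```python
from math import sqrt

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    va, vb = sd_a ** 2 / n_a, sd_b ** 2 / n_b   # squared standard errors
    t = (mean_a - mean_b) / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df

# Group A and Group B from Example 3.
t, df = welch_t(78, 9, 20, 83, 11, 20)
print(f"t = {t:.3f}, df = {df:.1f}")
```

The df comes out between 36 and 37, which is why the example quotes df ≈ 37.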
Example 4 — Chi-Square Test of Independence
A marketing analyst surveys 200 customers to test whether purchase decision (Yes/No) is independent of ad type (Video/Banner). Observed: Video-Yes=60, Video-No=40, Banner-Yes=45, Banner-No=55.
Hypotheses: H₀: Purchase and ad type are independent | H₁: They are not independent
Expected frequencies: E(Video-Yes) = (100×105)/200 = 52.5 | E(Video-No) = 47.5 | E(Banner-Yes) = 52.5 | E(Banner-No) = 47.5
Test statistic: χ² = (60−52.5)²/52.5 + (40−47.5)²/47.5 + (45−52.5)²/52.5 + (55−47.5)²/47.5 = 1.071 + 1.184 + 1.071 + 1.184 = 4.511
P-value: df = (2−1)(2−1) = 1. From the chi-square table, the critical value for df = 1 at α = 0.05 is 3.841. Our χ² = 4.511 > 3.841, so p < 0.05; the exact upper-tail area is p ≈ 0.034
Decision: p = 0.034 < α = 0.05 → Reject H₀
✅ Conclusion: At α = 0.05, there is significant evidence that purchase decision and ad type are not independent — video ads produced proportionally more purchases.
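Example 4 can be recomputed from the observed counts alone. The sketch below derives the expected frequencies from the row and column totals; the closed form `erfc(sqrt(chi2 / 2))` for the p-value is valid only for df = 1, which is the case for a 2×2 table.

```python
from math import erfc, sqrt

observed = [[60, 40],   # Video:  Yes, No
            [45, 55]]   # Banner: Yes, No

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under independence: row total * column total / n
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

# For df = 1 only, the upper-tail area has the closed form erfc(sqrt(chi2 / 2)).
p_value = erfc(sqrt(chi2 / 2))
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```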
Example 5 — P-Value in Simple Linear Regression
In a simple linear regression of sales on advertising spend (n = 25), the estimated slope is b₁ = 2.35 with SE(b₁) = 0.78. Is the slope significantly different from zero?
Hypotheses: H₀: β₁ = 0 (no linear relationship) | H₁: β₁ ≠ 0
Test statistic: t = b₁ / SE(b₁) = 2.35 / 0.78 = 3.013
Degrees of freedom: df = n − 2 = 25 − 2 = 23
P-value: Two-tailed, t(23) = 3.013. The t-table gives critical value 2.069 at α = 0.05 and 2.807 at α = 0.01. Since 3.013 > 2.807, p < 0.01 (more precisely, p ≈ 0.006).
Decision: p ≈ 0.006 < α = 0.05 → Reject H₀. The slope is statistically significant.
✅ Conclusion: Advertising spend has a statistically significant positive linear relationship with sales. See the full simple linear regression guide for how slope, intercept, and R² work together.
P-Value vs Significance Level — Key Differences
These two quantities are easy to confuse but they play completely different roles. The significance level is a decision parameter you choose; the p-value is a statistic you calculate.
| Aspect | P-Value | Significance Level (α) |
|---|---|---|
| What it is | Calculated from your sample data | Set by the researcher before data collection |
| When determined | After running the test | Before collecting data |
| Typical values | Anything between 0 and 1 | 0.05 (default), 0.01, or 0.10 |
| What it represents | Evidence against H₀ in your specific sample | Maximum acceptable rate of false positives |
| Role in decision | Compared to α to reach a decision | The threshold p must beat to reject H₀ |
| Can you change it? | No — it's fixed by your data | Yes — but only before seeing the p-value |
Raising the significance level after calculating the p-value in order to reach significance is a form of "p-hacking": it inflates the actual false positive rate well above the nominal α. Set α first; then calculate p; then compare.
P-Values and Confidence Intervals
P-values and confidence intervals carry closely related information in different formats. When both are built from the same test statistic and standard error, they agree for the same α: if the p-value from a two-tailed test is less than 0.05, the 95% confidence interval for the parameter will not contain the null value, and vice versa.
| P-Value Result | 95% Confidence Interval | Interpretation |
|---|---|---|
| p < 0.05 | Does not contain H₀ value | Statistically significant at α = 0.05 |
| p = 0.05 | Boundary touches H₀ value | Exactly at the threshold |
| p > 0.05 | Contains H₀ value | Not statistically significant at α = 0.05 |
The confidence interval adds something the p-value alone cannot provide: the range of plausible values for the parameter. A 95% confidence interval for the mean tells you both whether the result is significant and how large the effect plausibly is. The ASA and most statistical style guides now recommend reporting both, not just the p-value.
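The agreement between the two can be illustrated with the bottle-fill numbers from Example 1. This is a sketch for the known-σ z-test only; the helper name `z_test_with_ci` is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def z_test_with_ci(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (two-tailed p, CI lower, CI upper) for a known-sigma z-test."""
    std_normal = NormalDist()
    se = sigma / sqrt(n)
    z = (sample_mean - mu0) / se
    p = 2 * (1 - std_normal.cdf(abs(z)))
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    return p, sample_mean - z_crit * se, sample_mean + z_crit * se

# Numbers from Example 1 (bottle fill volumes).
p, lo, hi = z_test_with_ci(503.6, mu0=500, sigma=12, n=64)
excludes_null = not (lo <= 500 <= hi)
print(f"p = {p:.4f}, 95% CI = [{lo:.2f}, {hi:.2f}], excludes 500: {excludes_null}")
```

Because the interval and the test share the same standard error and critical value, the interval excludes 500 mL exactly when p falls below 0.05.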
Statistical Significance vs Practical Significance
A p-value tells you how incompatible your data are with the null hypothesis; it says nothing about whether the effect is large enough to matter. With a large enough sample, even a trivially small difference becomes statistically significant.
Concrete Example
A drug reduces blood pressure by 0.4 mmHg. Is that meaningful?
With n = 50,000 participants, a 0.4 mmHg reduction might produce p = 0.0001 — highly significant. Clinically, a reduction of less than 1 mmHg is considered negligible. The p-value cannot tell you this. Effect size measures like Cohen's d, odds ratios, or the raw difference in the original units answer the practical question.
| Measure | What It Answers | Example |
|---|---|---|
| P-value | Is there evidence for an effect? | p = 0.003 → Yes, reject H₀ |
| Effect size (Cohen's d) | How large is the effect? | d = 0.12 → very small |
| Confidence interval | What range of values is plausible? | 95% CI: [0.1, 0.7] mmHg reduction |
| Power | What was the chance of detecting a real effect? | 80% power → reasonable study design |
See the Cohen's d guide for the standard effect size measure for t-tests, and the Pearson correlation page for effect size in correlation analysis.
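The effect of sample size on the p-value can be sketched for the 0.4 mmHg example. The standard deviation of 15 mmHg is an assumption made purely for illustration, as is the one-sample z-test design, so the printed p-values are not meant to reproduce the figure quoted above.

```python
from math import sqrt
from statistics import NormalDist

effect = 0.4   # mean reduction in mmHg, from the example above
sd = 15.0      # ASSUMED standard deviation of the reduction (hypothetical)

cohens_d = effect / sd   # standardized effect size; does not depend on n

results = []
for n in (100, 1_000, 10_000, 50_000):
    z = effect / (sd / sqrt(n))                    # one-sample z statistic
    p = 2 * (1 - NormalDist().cdf(z))              # two-tailed p-value
    results.append((n, p))
    print(f"n = {n:>6}: z = {z:5.2f}, p = {p:.2g}, d = {cohens_d:.3f}")
```

The effect size stays fixed and tiny while the p-value shrinks steadily as n grows, which is the gap between statistical and practical significance.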
Five Common P-Value Misconceptions
The 2016 ASA statement on p-values identified widespread misuse in published research. The five misconceptions below appear repeatedly in textbooks, papers, and statistical reporting.
| Misconception | What People Believe | What Is Actually True |
|---|---|---|
| #1 — Null probability | p-value = P(H₀ is true) | p-value = P(data this extreme \| H₀ is true). These are completely different conditional probabilities. |
| #2 — Proof of hypothesis | p < 0.05 proves the alternative hypothesis | p < 0.05 means data are inconsistent with H₀ at this level. It establishes statistical significance, not truth. |
| #3 — No effect | p > 0.05 means there is no effect | p > 0.05 means the data are insufficient to reject H₀. The effect might exist but the study lacked power to detect it. |
| #4 — Significance = importance | A statistically significant result is practically important | Statistical significance depends on sample size. A large n can produce p < 0.05 for a negligibly small effect. |
| #5 — Replication | p = 0.05 means a 95% chance of replication | The replication rate depends on effect size, power, and study design — not directly on the p-value. |
The American Statistical Association published a landmark 2016 statement authored by Wasserstein & Lazar outlining six principles for proper p-value use, followed by a 2019 special issue. The core message: "A p-value, or statistical significance, does not measure the size of an effect or the importance of a result." It should be one piece of evidence, not the sole decision criterion.
P-Value Decision Tables
Common Significance Thresholds
| P-Value Range | Interpretation | Typical Conclusion |
|---|---|---|
| p < 0.001 | Very strong evidence against H₀ | Reject H₀ at α = 0.001, 0.01, and 0.05 |
| 0.001 ≤ p < 0.01 | Strong evidence against H₀ | Reject H₀ at α = 0.01 and 0.05 |
| 0.01 ≤ p < 0.05 | Moderate evidence; statistically significant | Reject H₀ at α = 0.05 but not at α = 0.01 |
| 0.05 ≤ p < 0.10 | Marginal evidence; not significant at standard α | Fail to reject H₀ at α = 0.05 |
| p ≥ 0.10 | Weak or no evidence against H₀ | Fail to reject H₀ at all common thresholds |
Decision Rule by Alpha Level
| Significance Level (α) | Reject H₀ when | Common Use Case |
|---|---|---|
| α = 0.10 | p < 0.10 | Exploratory research, weak evidence sufficient |
| α = 0.05 | p < 0.05 | Default in most social sciences and business |
| α = 0.01 | p < 0.01 | Medical research, clinical trials |
| α = 0.001 | p < 0.001 | High-stakes decisions, physics experiments |
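The two tables above can be expressed as a pair of small helpers. The function names are illustrative, and the labels are the conventional ones from the first table rather than hard rules.

```python
def evidence_label(p):
    """Map a p-value to the conventional label from the thresholds table."""
    if not 0 <= p <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        return "very strong"
    if p < 0.01:
        return "strong"
    if p < 0.05:
        return "moderate"
    if p < 0.10:
        return "marginal"
    return "weak or none"

def rejected_at(p, alphas=(0.10, 0.05, 0.01, 0.001)):
    """List the significance levels at which H0 would be rejected (p < alpha)."""
    return [a for a in alphas if p < a]

# p-values taken from Examples 1 and 3, plus one hypothetical small value.
for p in (0.0004, 0.016, 0.124):
    print(p, evidence_label(p), rejected_at(p))
```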
Real-World Applications
P-values appear in every quantitative field. Here is how each domain uses them in practice.
Clinical Research
Phase III trials typically use a two-sided α of 0.05 (equivalently, 0.025 one-sided) to demonstrate drug efficacy. Regulatory bodies like the FDA generally expect results to meet this threshold for approval.
Psychology
Experimental psychology has moved toward reporting effect sizes and confidence intervals alongside p-values following reproducibility concerns in the 2010s.
A/B Testing
Product teams test whether a new button color, page layout, or pricing change produces a statistically significant improvement in conversion rates.
Quality Control
Manufacturers test whether process changes produce mean outputs that differ significantly from specification using one-sample or two-sample t-tests.
Economics
Regression coefficients are reported with p-values to test whether variables like income, education, or policy changes have statistically significant effects.
Genomics
Genome-wide association studies test millions of variants simultaneously, requiring Bonferroni-corrected thresholds as low as p < 5×10⁻⁸ to control the family-wise error rate.
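The genome-wide threshold follows from the Bonferroni rule of dividing α by the number of tests: 0.05 divided by one million tests gives 5×10⁻⁸. A minimal sketch, with hypothetical function names and a hypothetical list of four p-values:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate <= alpha."""
    return alpha / n_tests

def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values survive a Bonferroni correction."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [p < cutoff for p in p_values]

# One million tests at a family-wise alpha of 0.05.
print(bonferroni_threshold(0.05, 1_000_000))

# Hypothetical p-values from four tests; the cutoff is 0.05 / 4 = 0.0125.
print(bonferroni_significant([0.001, 0.012, 0.03, 0.2]))
```

Note that 0.03 would count as significant on its own at α = 0.05 but does not survive the correction for four tests.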
How to Report P-Values
Reporting standards have tightened across journals. The following guidance reflects APA 7th edition and most major statistical style guidelines.
Report the exact p-value
Write p = 0.032, not "p < 0.05." Exact values give readers more information and allow independent evaluation. Round to two or three decimal places.
Use correct notation for very small values
When p is extremely small, write p < 0.001 rather than p = 0.0000002. Do not write p = 0.000 — this implies zero probability, which is incorrect.
Report alongside the test statistic
Give the full result: t(29) = 3.14, p = 0.004 or χ²(2) = 8.71, p = 0.013. This lets readers verify your calculation and assess it in context.
Include effect size and confidence interval
p-values alone are insufficient. The APA manual recommends reporting effect size (Cohen's d, η², r) and a confidence interval alongside every significant result.
State the conclusion in plain language
Write "there was a significant difference in mean scores, t(38) = 2.67, p = 0.011" — not just "p was significant." The statistic should support a substantive claim.
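The reporting rules above can be captured in a small formatting helper. This is a sketch only; the function name `format_p` and its `apa` flag are assumptions, with `apa=True` dropping the leading zero as APA style prefers.

```python
def format_p(p, decimals=3, apa=False):
    """Format a p-value for reporting; apa=True drops the leading zero."""
    if not 0 <= p <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        text = "p < 0.001"          # never report p = 0.000
    else:
        text = f"p = {p:.{decimals}f}"
    return text.replace("0.", ".", 1) if apa else text

# Pair the p-value with its test statistic, as recommended above.
print(f"t(29) = 3.14, {format_p(0.004)}")
print(f"t(29) = 3.14, {format_p(0.004, apa=True)}")
print(format_p(0.0000002))
```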
Related Statistical Concepts
| Concept | Relationship to P-Values | Learn More |
|---|---|---|
| Null Hypothesis (H₀) | P-value assumes H₀ is true during calculation | Null and Alternative Hypotheses |
| Confidence Intervals | Dual relationship: p < α ↔ CI excludes H₀ value | Confidence Intervals Guide |
| Z-Score | Z is the test statistic; p comes from the z-distribution | Z-Score Guide |
| Normal Distribution | Z-test p-values are tail areas under the normal curve, computed from its CDF | Normal Distribution |
| Sampling Distribution | P-values rely on the sampling distribution of the test statistic | Sampling Distributions |
| Type I Error | α is the Type I error rate — P(reject H₀ \| H₀ true) | Hypothesis Testing |
| Cohen's d | Effect size: complements p-value with magnitude | Cohen's d |
| Degrees of Freedom | Required for t and chi-square p-value calculations | Degrees of Freedom |
| ANOVA | F-statistic maps to a p-value via the F-distribution | ANOVA Guide |
P-Value Reference Cheat Sheet
| Symbol / Term | Meaning | Typical Value |
|---|---|---|
| p | Probability of data this extreme under H₀ | 0 to 1 |
| α | Pre-set significance level | 0.05 |
| H₀ | Null hypothesis — the default claim being tested | e.g. μ = 0 |
| H₁ | Alternative hypothesis — what you're testing for | e.g. μ ≠ 0 |
| z | Z test statistic (normal distribution, σ known) | (x̄ − μ₀)/(σ/√n) |
| t | T test statistic (t-distribution, σ unknown) | (x̄ − μ₀)/(s/√n) |
| χ² | Chi-square statistic (categorical data) | Σ(O−E)²/E |
| F | F statistic (ANOVA, ratio of variances) | MS_between/MS_within |
| df | Degrees of freedom — affects tail area | n−1 (one-sample t) |
| SE | Standard error of the mean | σ/√n or s/√n |
| β | Type II error rate (probability of missing real effect) | 0.20 common |
| Power | Probability of detecting a real effect (1 − β) | 0.80 common |
Frequently Asked Questions
A p-value is the probability of obtaining a test statistic at least as extreme as the one from your sample data, assuming the null hypothesis is true. It measures how consistent your data are with H₀. A small p-value (say, 0.02) means your data would be unusual if H₀ were true — giving you reason to doubt H₀. It does not tell you the probability that H₀ itself is true.
A p-value of 0.05 means that, assuming H₀ is true, a result this extreme or more extreme would occur 5% of the time by random sampling variation alone. It sits exactly at the conventional threshold: under the strict rule p < α used on this page it does not quite reject H₀ at α = 0.05, while under the p ≤ α convention it just barely does. Either way this is not strong evidence; it is the borderline for calling a result statistically significant by convention.
A smaller p-value means stronger evidence against the null hypothesis. Whether "better" is the right word depends on context. For a researcher trying to detect a real effect, a smaller p-value is more convincing. However, a tiny p-value does not mean the effect is large or important — it can arise from a trivially small effect in a very large sample. Always pair p-values with effect sizes and confidence intervals.
Step 1: Compute the test statistic (z, t, χ², or F) from your data. Step 2: Identify the null distribution and degrees of freedom. Step 3: Find the area in the tail(s) beyond your test statistic using a statistical table. For a z-test, look up |z| in the z-table and take the complementary area; double it for two-tailed. For a t-test, use the t-distribution table with the appropriate degrees of freedom (df = n − 1 for a one-sample test). Statistical software automates all of this.
The significance level (α) is a threshold you decide on before collecting data — typically 0.05. The p-value is what you calculate from your data after running the test. The decision rule is: if p < α, reject H₀. Think of α as the bar you've set and p as the score your data achieved. You cannot move the bar after seeing the score.
The five most common: (1) The p-value is the probability H₀ is true — false; it is the probability of the data given H₀. (2) p < 0.05 proves the alternative hypothesis — false; it establishes statistical significance, not truth. (3) A non-significant result means no effect — false; it means insufficient evidence. (4) Statistical significance means the result is important — false; significance is about evidence, not magnitude. (5) A significant p-value will replicate — false; replication depends on power, not just the p-value.
They are two sides of the same inference. A two-tailed test with α = 0.05 produces p < 0.05 if and only if the 95% confidence interval does not contain the null value. The confidence interval adds what the p-value lacks: the range of plausible parameter values. Both should be reported together — the p-value for the decision, the confidence interval for context about magnitude.
In regression output, each coefficient has an associated p-value testing whether that coefficient is significantly different from zero. A small p-value (below your chosen α) for a slope coefficient means the predictor has a statistically significant linear association with the outcome, controlling for other variables in the model. The overall F-test p-value tests whether the model as a whole explains significant variance. Always also report R² and the confidence intervals for coefficients.
No. P-values are probabilities and are always between 0 and 1 inclusive. If a calculation yields a p-value greater than 1, there is an error in the computation. P-values also cannot be negative. For continuous test statistics such as z or t, the p-value is never exactly 0; software that displays 0.000 is rounding a very small positive number.
Direct comparison only makes sense when both p-values come from the same type of test on the same type of data. A smaller p-value indicates stronger evidence against H₀ in that particular test. You cannot conclude that a result with p = 0.01 is "twice as important" as one with p = 0.02 — the p-value scale is not linear in evidence strength. For comparing results across studies, meta-analysis and effect sizes are more informative than raw p-value comparisons.
APA 7th edition recommends: (1) Report exact p-values to two or three decimal places, e.g., p = 0.032. (2) For very small values, write p < 0.001. (3) Never write p = 0.000. (4) Include the test statistic, degrees of freedom, and p-value together: t(24) = 2.45, p = 0.022. (5) Omit the leading zero before the decimal since p cannot exceed 1: write p = .032 in APA style, though p = 0.032 is also widely accepted.