Inferential Statistics · Data Analysis · 22 min read · April 28, 2026
BY: Statistics Fundamentals Team
Reviewed By: Minsa A (Senior Statistics Editor)

Hypothesis Testing in Statistics: The Complete Guide to Steps, Types & Examples

Hypothesis testing is a statistical procedure used to decide whether sample data provides enough evidence to reject a claim about a population.

It is the foundation of inferential statistics — the tool researchers, data scientists, and analysts use every time they need to separate a real signal from random noise.

This guide covers every element: the logic behind hypothesis testing, the 5-step procedure, p-values and significance levels, every major test type, Type I and II errors, statistical power, worked examples, and the relationship between hypothesis tests and confidence intervals.

What Is Hypothesis Testing?

Hypothesis testing is a formal statistical procedure for evaluating whether sample data supports or contradicts a claim about a population. You start with a specific claim — the null hypothesis — and use probability to decide whether the data is consistent with that claim or whether the evidence against it is strong enough to reject it.

Think of it like a trial. The null hypothesis is the presumption of innocence: the defendant is innocent until the evidence proves otherwise beyond a reasonable doubt. In hypothesis testing, "beyond a reasonable doubt" is defined by a probability threshold called the significance level.

💡
The Core Question

Every hypothesis test answers the same question: "If the null hypothesis were true, how likely is it that we would observe data at least this extreme?" The p-value gives that probability.

🔑 Key Takeaways

The most important ideas every student and analyst should understand before running a hypothesis test.

The null hypothesis is always what you test. You never directly test the alternative hypothesis — you test whether the null is consistent with your data.

You never "accept" the null hypothesis. You either reject it or "fail to reject" it — absence of evidence is not evidence of absence.

Statistical significance is not practical significance. A result can be statistically significant with a tiny, meaningless effect size when the sample is very large.

Set α before collecting data. Choosing the significance level after seeing the results — "p-hacking" — inflates false positive rates.

The test statistic quantifies the gap between observed and expected. The larger its magnitude, the stronger the evidence against the null hypothesis.

Choosing the wrong test type invalidates results. The choice between z, t, chi-square, and ANOVA depends on data type, sample size, and number of groups.

Null and Alternative Hypotheses

Every hypothesis test begins with two competing statements about a population parameter.

The Null Hypothesis (H₀)

The null hypothesis is the default claim — it asserts that there is no effect, no difference, or no relationship in the population. It is the statement you are testing. Common formulations:

  • H₀: μ = μ₀ (the population mean equals a specific value)
  • H₀: μ₁ = μ₂ (two group means are equal)
  • H₀: p = p₀ (a proportion equals a specific value)
  • H₀: variables X and Y are independent
⚠️
Critical wording distinction

Say "fail to reject H₀," never "accept H₀." When a test finds no significant result, it means the data didn't provide enough evidence to rule out the null — not that the null is proven true. The difference matters in research and in exams.

The Alternative Hypothesis (H₁ or Hₐ)

The alternative hypothesis is the research claim — the effect or difference you expect to find. It is what you gather evidence for. It must be the logical complement of H₀, and it can be directional or non-directional:

⬅️➡️

Two-tailed

H₁: μ ≠ μ₀
Tests for any difference, in either direction.

➡️

Right-tailed

H₁: μ > μ₀
Tests specifically for an increase above the null value.

⬅️

Left-tailed

H₁: μ < μ₀
Tests specifically for a decrease below the null value.

Real-World Example

Testing a New Blood Pressure Drug

A pharmaceutical company wants to know whether a new drug reduces systolic blood pressure compared to a placebo. The hypotheses are:

H₀: The drug has no effect on blood pressure (μdrug = μplacebo)
H₁: The drug reduces blood pressure (μdrug < μplacebo) — left-tailed

A two-sample t-test is run. If p < 0.05, the evidence is strong enough to reject H₀ and conclude the drug has a statistically significant effect.

The 5 Steps of Hypothesis Testing

Every hypothesis test — regardless of the test statistic used — follows the same five-step procedure. The arithmetic changes; the logic does not.

1

State the null and alternative hypotheses

Write H₀ (the claim to be tested, containing equality) and H₁ (the research claim, specifying direction if appropriate). Both must be stated before looking at any data.

2

Choose the significance level (α)

Set α before data collection — typically 0.05 for most research, 0.01 for medical or safety-critical decisions. This is the maximum tolerable probability of a false positive (Type I error).

3

Select the appropriate test and compute the test statistic

Choose the test based on data type, number of groups, and whether population parameters are known (see Types of Tests below). Calculate the test statistic (z, t, F, or χ²) from your sample data using the relevant formula.

4

Find the p-value or compare to the critical value

P-value approach: Calculate the probability of observing a test statistic at least as extreme as yours, assuming H₀ is true. If p ≤ α, reject H₀.
Critical value approach: Find the boundary value at α from statistical tables. If your test statistic exceeds it, reject H₀.

5

Draw a conclusion in context

State whether you reject or fail to reject H₀, then translate that decision into a plain-language conclusion about the original research question. Report the test statistic, degrees of freedom, and p-value in your write-up.
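The five steps above can be sketched in a few lines of Python. The delivery-time sample here is invented for illustration; scipy's ttest_1samp handles steps 3 and 4:

```python
from scipy import stats

# Step 1: H0: mu = 5.0 (advertised delivery time); H1: mu != 5.0 (two-tailed)
mu0 = 5.0
# Step 2: choose the significance level before seeing the data
alpha = 0.05
# Invented sample of delivery times (days)
sample = [5.2, 6.1, 4.8, 5.9, 6.3, 5.5, 4.9, 6.0, 5.7, 5.4]

# Steps 3-4: compute the t statistic and two-tailed p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)

# Step 5: state the decision in context
if p_value <= alpha:
    print(f"Reject H0: t = {t_stat:.3f}, p = {p_value:.4f}")
else:
    print(f"Fail to reject H0: t = {t_stat:.3f}, p = {p_value:.4f}")
```

For this sample the mean delivery time (5.58 days) is far enough above 5.0 that the test rejects H₀ at α = 0.05.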

The P-value Explained

The p-value is the probability of obtaining a test statistic at least as extreme as the one calculated from your sample, assuming the null hypothesis is true. It is the most widely reported measure of evidence against H₀.

How to Interpret the P-value

P-value | Evidence Against H₀ | Decision at α = 0.05
p > 0.10 | Weak or none | Fail to reject H₀
0.05 < p ≤ 0.10 | Marginal / suggestive | Fail to reject H₀
0.01 < p ≤ 0.05 | Moderate | Reject H₀
0.001 < p ≤ 0.01 | Strong | Reject H₀
p ≤ 0.001 | Very strong | Reject H₀
⚠️
What the p-value does NOT mean

The p-value is not the probability that H₀ is true. It is not the probability that the result was due to chance. And p < 0.05 does not prove that your hypothesis is correct — it only means the data is unlikely under the null. Effect size and confidence intervals carry equally important information.

Significance Level (α) vs P-value

The significance level α is chosen before the test. The p-value is calculated after the test. The decision rule is simple: if the p-value falls at or below α, reject H₀. If not, fail to reject it.

0.05
Standard α in most research disciplines
0.01
Stricter α used in medical and safety contexts
≈3 × 10⁻⁷
α corresponding to the one-sided five-sigma rule in particle physics

Test Statistic and Critical Value

The test statistic is a single number calculated from sample data that measures how far the observed result deviates from what H₀ predicts. The further it falls from the center of the null distribution, the stronger the evidence against H₀.

General form of a test statistic Test Statistic = (Observed Value − Null Value) ÷ Standard Error

The critical value is the threshold the test statistic must exceed to reject H₀ at the chosen α level. It is read from a statistical table (z-table, t-table, chi-square table, or F-table) based on α and the degrees of freedom.

Test Statistic | Distribution Used | Common Use Case | Key Input
Z | Standard normal | Mean or proportion, large n or known σ | Population σ known
t | t-distribution | Mean, small n or unknown σ | Degrees of freedom = n − 1
χ² (chi-square) | Chi-square distribution | Categorical data, goodness-of-fit, independence | df = (rows − 1)(cols − 1)
F | F-distribution | Comparing 3+ group means (ANOVA), variances | df between and within groups

For one- and two-tailed z-tests, the most commonly referenced critical values at α = 0.05 are ±1.96 (two-tailed) and 1.645 (one-tailed right) or −1.645 (one-tailed left). See the Z-table and T-distribution table for full critical value reference.
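These critical values can also be reproduced with scipy's inverse-CDF (ppf) functions, which is handy when a printed table isn't at hand:

```python
from scipy import stats

alpha = 0.05
# Two-tailed z critical value: alpha is split across both tails
z_two = stats.norm.ppf(1 - alpha / 2)      # ≈ 1.96
# One-tailed (right) z critical value
z_one = stats.norm.ppf(1 - alpha)          # ≈ 1.645
# t critical values depend on degrees of freedom, e.g. df = 24
t_two = stats.t.ppf(1 - alpha / 2, df=24)  # ≈ 2.064

print(round(z_two, 3), round(z_one, 3), round(t_two, 3))
```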

Types of Hypothesis Tests

Choosing the right test is as important as running it correctly. The decision depends on: how many groups you are comparing, the data type (continuous vs. categorical), whether the population variance is known, and whether the samples are independent.

Z-test

Used to test a population mean or proportion when the population standard deviation (σ) is known, or when the sample size is large enough (n ≥ 30) for the central limit theorem to apply.

Z-test for a mean z = (x̄ − μ₀) ÷ (σ ÷ √n)

T-test

The most common test for comparing means. Used when the population standard deviation is unknown and estimated from the sample. Three main forms:

One-sample t-test

When to use

Comparing a single sample mean to a known or hypothesized population value. Example: does the average delivery time differ from the advertised 5 days?

Two-sample (independent) t-test

When to use

Comparing means from two separate, independent groups. Example: do men and women differ in average resting heart rate?

Paired t-test

When to use

Comparing means from matched pairs or the same subjects measured twice. Example: does a training program improve test scores (before vs. after)?

One-sample t-test statistic t = (x̄ − μ₀) ÷ (s ÷ √n) df = n − 1

Chi-Square Test

Used for categorical data. Two main applications: the goodness-of-fit test (does an observed frequency distribution match an expected one?) and the test of independence (are two categorical variables related?).

Chi-square test statistic χ² = Σ [(Observed − Expected)² ÷ Expected]

The null hypothesis for a chi-square test of independence states that the two variables are not related. The null hypothesis for a goodness-of-fit test states that the observed frequencies match the expected distribution. Reference the chi-square table for critical values.

ANOVA (Analysis of Variance)

ANOVA tests whether three or more group means differ, using the F-statistic. It partitions total variance into between-group variance and within-group variance. A significant F-statistic tells you that at least one group differs — post-hoc tests (Tukey, Bonferroni) identify which pairs.

F-statistic for one-way ANOVA F = Mean Square Between (MSB) ÷ Mean Square Within (MSW)

The null hypothesis for ANOVA is H₀: μ₁ = μ₂ = ... = μₖ (all group means are equal). See the F-table for critical values.
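As a quick sketch, scipy's f_oneway runs a one-way ANOVA directly. The three groups below are invented exam scores for illustration (post-hoc comparisons would be a separate step):

```python
from scipy import stats

# Invented exam scores under three teaching methods
group_a = [82, 85, 88, 79, 84]
group_b = [75, 78, 72, 80, 76]
group_c = [90, 88, 93, 87, 91]

# F = MSB / MSW; a large F means the group means differ more
# than within-group variability would explain
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(round(f_stat, 1), p_value < 0.05)
```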

Proportion Test (Z-test for Proportions)

Tests whether a sample proportion differs from a hypothesized value, or whether two sample proportions differ from each other. Based on the normal approximation to the binomial, valid when np ≥ 5 and n(1−p) ≥ 5.

Z-test for one proportion z = (p̂ − p₀) ÷ √[p₀(1 − p₀) ÷ n]

Which Test Should You Use?

Scenario | Data Type | Groups | Use This Test
Compare one mean to a value, σ known or n ≥ 30 | Continuous | 1 | Z-test
Compare one mean to a value, σ unknown | Continuous | 1 | One-sample t-test
Compare two independent group means | Continuous | 2 | Two-sample t-test
Compare before/after or matched pairs | Continuous | 2 (matched) | Paired t-test
Compare 3+ group means | Continuous | 3+ | One-way ANOVA
Test association between two categorical variables | Categorical | n/a | Chi-square independence test
Test whether frequencies match a distribution | Categorical | n/a | Chi-square goodness-of-fit
Compare one proportion to a value | Proportion | 1 | One-sample z-test for proportions
Compare two proportions | Proportion | 2 | Two-sample z-test for proportions

Type I and Type II Errors

Because hypothesis tests use probability, they can lead to two kinds of incorrect conclusions. Understanding these errors — and what controls them — is central to interpreting statistical results correctly.

Type I Error (False Positive)

Rejecting H₀ when H₀ is actually true. You conclude there is an effect, but in reality there isn't one.

Probability: Equal to the significance level α.

Example: Concluding a drug works when the improvement was due to chance.

Controlled by setting α before the test

Type II Error (False Negative)

Failing to reject H₀ when H₀ is actually false. You miss a real effect because the evidence wasn't strong enough to detect it.

Probability: Called β. Reduced by increasing sample size and effect size.

Example: Concluding a drug doesn't work when it actually does — the study was just underpowered.

Controlled by increasing statistical power (1 − β)
Decision \ Reality | H₀ is True | H₀ is False
Reject H₀ | Type I Error (α) | Correct (Power = 1 − β)
Fail to reject H₀ | Correct (1 − α) | Type II Error (β)
⚖️
The Error Trade-off

Reducing α (making it harder to reject H₀) decreases Type I errors but increases Type II errors, because a stricter threshold makes it harder to detect real effects. The two error types are inversely related at a fixed sample size. The solution is to increase sample size — this reduces both errors simultaneously.

Statistical Power of a Hypothesis Test

Statistical power is the probability that a test correctly rejects a false null hypothesis. It equals 1 − β. A power of 0.80 means an 80% chance of detecting a real effect when one exists — the conventional minimum in behavioral and social research.

Power must be calculated before data collection, not after. Running a study without a power calculation risks wasting resources on a test too weak to detect the effect of interest. A "non-significant" result in an underpowered study cannot be interpreted as evidence of no effect.

📏

Sample size

The single most controllable driver of power. Doubling n substantially increases power. Calculate the minimum n required before starting.

📐

Effect size

Larger true differences are easier to detect. Small effects require much larger samples. Use Cohen's d for means, r for correlations, and Cohen's w for chi-square tests.

🎯

Significance level (α)

A higher α (e.g., 0.10 vs. 0.05) increases power but also raises the false positive rate. This tradeoff must be justified by the context.

📊

Variability

Less variability in the outcome makes it easier to detect true effects. Tighter measurement instruments and standardized conditions improve power.

0.80
Conventional minimum power threshold in most research fields
>50%
Type II error rate in many underpowered clinical trials
4×
Approximate increase in n needed to halve the detectable effect size (power scales with √n)
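The interplay of these drivers can be sketched analytically for the simplest case, a right-tailed one-sample z-test, where power = 1 − Φ(z₁₋α − d√n). The effect sizes and sample sizes below are illustrative:

```python
from math import sqrt
from scipy import stats

def power_one_sided_z(d, n, alpha=0.05):
    """Power of a right-tailed one-sample z-test for
    standardized effect size d (Cohen's d) and sample size n."""
    z_crit = stats.norm.ppf(1 - alpha)
    # Under H1 the z statistic is centered at d * sqrt(n)
    return stats.norm.sf(z_crit - d * sqrt(n))

print(round(power_one_sided_z(0.5, 30), 2))  # medium effect, n = 30
print(round(power_one_sided_z(0.5, 60), 2))  # doubling n raises power
print(round(power_one_sided_z(0.2, 30), 2))  # small effects need larger samples
```

Raising α (say, to 0.10) also raises power in this formula, which is exactly the Type I / Type II trade-off described above.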

Confidence Intervals and Hypothesis Testing

Hypothesis tests and confidence intervals are two sides of the same coin — they draw on the same mathematical framework and reach equivalent conclusions.

A 95% confidence interval gives the range of population parameter values consistent with your sample at α = 0.05. The hypothesis test at α = 0.05 asks whether the null value falls inside that range:

  • If the null hypothesis value falls outside the 95% CI → the two-tailed test rejects H₀ at α = 0.05.
  • If the null hypothesis value falls inside the 95% CI → the test fails to reject H₀ at α = 0.05.
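This equivalence can be checked numerically. A minimal sketch using made-up summary statistics (mean 52.3, sd 8.1, n = 40, null value 50):

```python
from math import sqrt
from scipy import stats

# Hypothetical summary statistics
xbar, s, n, mu0, alpha = 52.3, 8.1, 40, 50.0, 0.05

se = s / sqrt(n)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
ci = (xbar - t_crit * se, xbar + t_crit * se)  # 95% confidence interval

t_stat = (xbar - mu0) / se
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # two-tailed test

null_inside_ci = ci[0] <= mu0 <= ci[1]
reject = p_value <= alpha
# The two decisions always agree: null inside CI <=> fail to reject
print(null_inside_ci, reject)  # prints: True False
```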
Feature | Hypothesis Test | Confidence Interval
Output | Binary: reject or fail to reject H₀ | Range of plausible parameter values
Information provided | Whether an effect exists | Direction and magnitude of the effect
Linked to α | Directly — α sets the decision boundary | Confidence level = 1 − α (e.g., 95% CI for α = 0.05)
Preferred when | Simple yes/no decision needed | Effect size and precision are important
Equivalent? | Yes — at α = 0.05, a two-tailed test and 95% CI always agree on the reject/fail-to-reject decision
Best Practice

Report both the hypothesis test result (p-value, test statistic, df) and the confidence interval. Together they tell readers whether an effect was detected and how large it might plausibly be — far more informative than a p-value alone.

Hypothesis Testing: Worked Examples

Example 1: One-Sample T-test

Worked Example

Do students score differently from the national average?

A school reports that the national average on a standardized exam is 75 points. A sample of 25 students from one school scored a mean of 78.4 with a standard deviation of 9.2. Test at α = 0.05 whether this school's students perform differently from the national average.

1

State hypotheses: H₀: μ = 75 (school mean equals national average)  |  H₁: μ ≠ 75 (two-tailed, because "differently" means either direction)

2

Significance level: α = 0.05. Two-tailed test → each tail = 0.025.

3

Calculate t: t = (78.4 − 75) ÷ (9.2 ÷ √25) = 3.4 ÷ 1.84 = 1.848. Degrees of freedom = 25 − 1 = 24.

4

Critical value: From the t-distribution table at df = 24 and α/2 = 0.025, the critical value is ±2.064. Since |1.848| < 2.064, the test statistic does not fall in the rejection region. The p-value ≈ 0.077.

5

Decision: p = 0.077 > α = 0.05. Fail to reject H₀.

At the 5% significance level, there is not enough evidence to conclude that students at this school perform differently from the national average. Note this is not proof that no difference exists — the sample size was small.
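The same arithmetic can be reproduced from the summary statistics in a few lines of Python:

```python
from math import sqrt
from scipy import stats

# Summary statistics from the worked example
xbar, mu0, s, n = 78.4, 75.0, 9.2, 25

t_stat = (xbar - mu0) / (s / sqrt(n))            # 3.4 / 1.84
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # two-tailed, df = 24

print(round(t_stat, 3), round(p_value, 3))
```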

Example 2: Chi-Square Test of Independence

Worked Example

Is there an association between exercise frequency and sleep quality?

A survey of 200 adults cross-tabulates exercise frequency (regular / irregular) against reported sleep quality (good / poor). The observed counts are: regular/good = 72, regular/poor = 28, irregular/good = 48, irregular/poor = 52. Test at α = 0.05 whether exercise and sleep quality are independent.

1

State hypotheses: H₀: Exercise frequency and sleep quality are independent. H₁: They are not independent.

2

Significance level: α = 0.05. df = (2−1)(2−1) = 1. Critical value from the chi-square table = 3.841.

3

Calculate expected frequencies: Row totals: regular = 100, irregular = 100. Column totals: good = 120, poor = 80. Expected: reg/good = 60, reg/poor = 40, irreg/good = 60, irreg/poor = 40.

4

Calculate χ²: = (72−60)²/60 + (28−40)²/40 + (48−60)²/60 + (52−40)²/40 = 2.4 + 3.6 + 2.4 + 3.6 = 12.0. p-value < 0.001.

5

Decision: χ² = 12.0 > critical value 3.841, and p < 0.001 < α. Reject H₀.

There is statistically significant evidence that exercise frequency and sleep quality are associated (χ²(1) = 12.0, p < 0.001). Regular exercisers reported better sleep more often than would be expected by chance. This is an association, not causation.
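The hand calculation can be checked with scipy's chi2_contingency; passing correction=False disables the Yates continuity correction so the statistic matches the formula above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = regular/irregular exercise, cols = good/poor sleep
observed = np.array([[72, 28],
                     [48, 52]])

# correction=False: no Yates continuity correction, matching the hand calculation
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 1), dof, p < 0.001)
```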

Example 3: Z-test for Proportions

Worked Example

Did a new website design improve the conversion rate?

A company's old website had a 12% conversion rate. After redesigning the site, 84 out of 600 visitors converted (14%). Test at α = 0.05 whether the new design has a higher conversion rate than the old one.

1

State hypotheses: H₀: p = 0.12 (conversion rate is unchanged). H₁: p > 0.12 (conversion rate increased) — right-tailed test.

2

Significance level: α = 0.05. Critical z = 1.645 (right-tailed).

3

Calculate z: p̂ = 84/600 = 0.14. z = (0.14 − 0.12) ÷ √[0.12 × 0.88 ÷ 600] = 0.02 ÷ 0.01327 = 1.508.

4

Compare: z = 1.508 < 1.645. p-value ≈ 0.066.

5

Decision: p = 0.066 > α = 0.05. Fail to reject H₀.

At the 5% level, the data does not provide sufficient evidence to conclude that the new design increased the conversion rate. The observed 2-percentage-point gain (from 12% to 14%) is not statistically significant at this sample size, though it is suggestive (p = 0.066). A larger sample would provide more conclusive results.
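The calculation is easy to reproduce in Python, using the normal survival function for the right-tailed p-value:

```python
from math import sqrt
from scipy import stats

p0, n, successes = 0.12, 600, 84
p_hat = successes / n  # 0.14

se = sqrt(p0 * (1 - p0) / n)  # standard error under H0
z = (p_hat - p0) / se
p_value = stats.norm.sf(z)    # right-tailed p-value

print(round(z, 3), round(p_value, 3))
```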

Advanced Topics in Hypothesis Testing

Multiple Hypothesis Testing

When you run many tests simultaneously — for example, comparing 10 biomarkers across two groups — the probability of at least one false positive increases rapidly. With 20 independent tests at α = 0.05, the chance of at least one false positive is about 64% (1 − 0.95²⁰), and you expect one false positive on average.

The Bonferroni correction adjusts α by dividing it by the number of tests: if you run k tests, use α/k as the threshold for each. More powerful alternatives include the Benjamini-Hochberg procedure, which controls the false discovery rate rather than the family-wise error rate.
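Both numbers are quick to verify:

```python
# Family-wise error rate (FWER) for k independent tests, each at level alpha
k, alpha = 20, 0.05
fwer = 1 - (1 - alpha) ** k
print(round(fwer, 3))   # chance of at least one false positive across 20 tests

# Bonferroni correction: test each hypothesis at alpha / k instead
bonferroni_alpha = alpha / k
print(bonferroni_alpha)
```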

Bayesian Hypothesis Testing

Classical (frequentist) hypothesis testing answers: "If H₀ is true, how likely is this data?" Bayesian hypothesis testing answers a different question: "Given this data, how likely is H₀ relative to H₁?" The Bayes factor is the ratio of these likelihoods.

Bayesian methods require specifying prior beliefs about the parameters, which can be an advantage (they incorporate existing knowledge) or a subject of debate (the prior choice influences conclusions). For routine research, frequentist methods remain standard; Bayesian approaches are particularly useful when prior information is strong or sequential testing is needed.

Hypothesis Testing in Regression Analysis

In linear regression, hypothesis tests evaluate whether individual predictors significantly explain variance in the outcome. The null hypothesis for each coefficient is H₀: β = 0 (the predictor has no effect on Y, holding others constant). The t-statistic = coefficient ÷ standard error, compared against a t-distribution. The overall model fit is tested with an F-test (H₀: all slopes = 0 simultaneously).
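A minimal sketch with invented data: scipy's linregress reports the two-sided p-value for H₀: β = 0 alongside the slope estimate.

```python
import numpy as np
from scipy import stats

# Invented data: does x linearly predict y?
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.0])

res = stats.linregress(x, y)
# res.pvalue is the two-sided p-value of the t-test for H0: slope = 0
print(f"slope = {res.slope:.3f}, p = {res.pvalue:.1e}")
```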

Bootstrap Hypothesis Testing

Bootstrap tests make no distributional assumptions. They estimate the null distribution by resampling from the observed data thousands of times and measuring how often the resampled statistic exceeds the observed one. Useful when sample sizes are small or the test statistic does not follow a known distribution.
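A close relative is the permutation test, which builds the null distribution by reshuffling group labels rather than resampling with replacement; the resampling logic is the same in spirit. A minimal sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented data: do the two groups differ in mean?
a = np.array([5.1, 4.8, 5.5, 5.0, 4.9, 5.2, 5.3, 4.7])
b = np.array([5.6, 5.9, 5.4, 6.0, 5.8, 5.5, 6.1, 5.7])
observed_diff = b.mean() - a.mean()

pooled = np.concatenate([a, b])
n_iter = 10_000
count = 0
for _ in range(n_iter):
    perm = rng.permutation(pooled)  # reshuffle group labels under H0
    diff = perm[len(a):].mean() - perm[:len(a)].mean()
    if diff >= observed_diff:       # right-tailed: H1 says b > a
        count += 1

p_value = count / n_iter
print(round(observed_diff, 4), p_value)
```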

Hypothesis Testing at a Glance

Concept | Definition | Key Point
H₀ (Null hypothesis) | The claim of no effect or no difference | Always contains equality; never "accepted," only failed to reject
H₁ (Alternative) | The research claim — what you expect to find | Can be one-tailed or two-tailed
Significance level (α) | Threshold for rejecting H₀ | Set before data collection; usually 0.05
Test statistic | Measured deviation from the null value | Z, t, χ², or F depending on test type
P-value | Probability of this data if H₀ were true | If p ≤ α, reject H₀
Type I error | Rejecting a true H₀ | Probability = α
Type II error | Failing to reject a false H₀ | Probability = β; reduced by larger n
Statistical power | Probability of detecting a real effect | Power = 1 − β; target ≥ 0.80

Conclusion

Hypothesis testing gives you a principled, repeatable method for making decisions from data under uncertainty. The procedure is consistent across every test type: state the hypotheses, set α, calculate the test statistic, find the p-value, and draw a conclusion in context.

Getting it right means more than following the steps. It means choosing the correct test for your data, setting α before you look at results, reporting effect sizes alongside p-values, and interpreting "fail to reject" as absence of evidence — not as proof the null is true.

Statistical significance and practical significance are different things. A large sample can make a trivial difference statistically significant. Always pair your test result with a confidence interval and, where appropriate, a measure of effect size. That combination tells readers what they need to know: whether an effect was detected, and how large it might plausibly be.

Read More on Statistics Fundamentals

Statistics and Probability

The foundational principles behind sampling distributions and probability — essential context for hypothesis testing.

Read More →

Study Design

How hypotheses are embedded in research — from observational studies to RCTs. Understanding design determines which test is appropriate.

Read More →

Z-table (Standard Normal)

Look up critical values and probabilities for z-tests and two-tailed hypothesis tests at common significance levels.

View Z-table →

T-distribution Table

Find critical t-values for one- and two-sample t-tests at any degrees of freedom and significance level.

View T-table →

Chi-square Table

Reference chi-square critical values for goodness-of-fit and tests of independence across all common degrees of freedom.

View Chi-square Table →

F-table

Critical F-values for ANOVA and variance tests at α = 0.05 and 0.01, by numerator and denominator degrees of freedom.

View F-table →

Descriptive Statistics

Means, standard deviations, and distributions — the building blocks that feed into every hypothesis test formula.

Read More →

Probability Calculator

Compute probabilities for normal, binomial, and other distributions — useful for calculating p-values by hand.

Use Calculator →

FAQs About Hypothesis Testing

What is hypothesis testing?

Hypothesis testing is a statistical procedure used to decide whether sample data provides enough evidence to reject a claim about a population parameter. You compare observed data to what would be expected if the null hypothesis were true, using the p-value or a critical value to guide the decision.

What does the p-value mean?

The p-value is the probability of observing a test statistic at least as extreme as the one calculated from your sample, assuming the null hypothesis is true. It is not the probability that H₀ is true. A p-value ≤ α (typically 0.05) leads to rejecting H₀.

When do you reject the null hypothesis?

Reject the null hypothesis when the p-value is less than or equal to the pre-set significance level α (usually 0.05). Equivalently, reject H₀ when the calculated test statistic exceeds the critical value at the chosen α. Both approaches give the same decision.

What is the difference between Type I and Type II errors?

A Type I error (false positive) is rejecting a null hypothesis that is actually true. Its probability equals α, the significance level. A Type II error (false negative) is failing to reject a null hypothesis that is actually false. Its probability is β. Statistical power = 1 − β. Reducing α increases β, and vice versa, at a fixed sample size.

What is the difference between one-tailed and two-tailed tests?

A two-tailed test checks for a difference in either direction (H₁: μ ≠ μ₀) and splits α between both tails of the distribution. A one-tailed test checks only one direction — either H₁: μ > μ₀ (right-tailed) or H₁: μ < μ₀ (left-tailed) — and concentrates all of α in one tail. Two-tailed tests are the default unless you have a strong directional prediction established before data collection.

When should you use a z-test instead of a t-test?

Use a z-test when the population standard deviation (σ) is known, or when the sample is large (n ≥ 30) and the central limit theorem applies. Use a t-test when σ is unknown and must be estimated from the sample — which is the situation in the vast majority of real research. As sample size grows, the t-distribution approaches the normal distribution, so the distinction matters most in small samples.

What is statistical power and why does it matter?

Statistical power (1 − β) is the probability that a test will detect a real effect when one exists. A power of 0.80 means an 80% chance of a true positive. Underpowered studies — those with insufficient sample sizes — frequently produce false negatives and waste research resources. Power should be calculated before data collection to determine the minimum sample size required.

Hypothesis testing produces a binary decision: reject or fail to reject H₀. A confidence interval gives a range of plausible values for the population parameter, providing information about the magnitude and direction of the effect. The two are mathematically equivalent at the same α level: if a 95% CI excludes the null value, the two-tailed test at α = 0.05 will reject H₀, and vice versa. Reporting both is best practice.