BY: Statistics Fundamentals Team
Reviewed By: Minsa A (Senior Statistics Editor)

T Test Calculator

A free online t test calculator supporting all four test types: one-sample, two-sample independent, paired, and Welch’s t test. Enter your summary statistics to get the t statistic, p-value, degrees of freedom, confidence interval, and effect size instantly — with step-by-step solutions.

Formula: t = (x̄ − μ₀) / (s / √n),  df = n − 1

Tests whether a sample mean differs significantly from a known or hypothesized population mean (μ₀).

Formula: t = (x̄₁ − x̄₂) / (sₚ √(1/n₁ + 1/n₂)),  df = n₁ + n₂ − 2

Tests whether two independent groups have significantly different means. Assumes equal population variances (use Welch’s tab if uncertain).

Formula: t = d̄ / (s_d / √n),  df = n − 1

Tests whether the mean difference between paired observations (before/after, matched subjects) is significantly different from zero. Enter the mean and SD of the difference scores d = x₁ − x₂.

Formula: t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂),  df: Welch–Satterthwaite

The recommended default for two independent groups. Does not assume equal variances. The degrees of freedom are estimated using the Welch–Satterthwaite equation.

What Is a T Test?

A t test is a parametric statistical hypothesis test used to determine whether there is a statistically significant difference between the mean of a sample and a known value, or between the means of two groups. It produces a t statistic and a corresponding p-value. If the p-value is less than your chosen significance level α (typically 0.05), you reject the null hypothesis and conclude the observed difference is unlikely due to chance alone.

T tests were developed by William Sealy Gosset, who published his work under the pseudonym “Student” in 1908 while working as a statistician at Guinness Brewery. They are used when the population standard deviation is unknown and sample sizes are relatively small — the conditions that apply in most real-world research. According to the NIST Engineering Statistics Handbook, the t test is one of the most widely used statistical procedures in experimental science, clinical research, and quality engineering.

Four Types of T Test: Which One Applies to Your Data?

The choice of t test depends on your study design. The wrong test type produces incorrect degrees of freedom and p-values. Use this table to decide:

One-Sample T Test

Compare a sample mean to a hypothesized population mean (μ₀). Use when you have one group and a reference value. Example: does this batch of medicine contain 500mg on average?

Two-Sample Independent T Test

Compare means of two completely separate groups. Use when the groups share no relationship. Example: exam scores from two different classes taught by different methods.

Paired T Test

Compare two measurements from the same subjects (before/after, matched pairs). Eliminates between-subject variability. Example: blood pressure before and after a medication trial.

Welch’s T Test

Like the two-sample t test but does not assume equal population variances. The safer default for independent groups. Uses the Welch–Satterthwaite equation for degrees of freedom.

[Figure: flowchart for choosing the correct t test type]

T Test Formulas: Every Symbol Defined

The table below is a reference for students and researchers. Every formula used by the calculator above is listed here with full symbol definitions.

Concept | Formula | Symbol Key | Use Case
One-Sample t Statistic | t = (x̄ − μ₀) / (s / √n) | x̄: sample mean; μ₀: null mean; s: sample SD; n: sample size | Test if one sample mean differs from a known value
Two-Sample t Statistic | t = (x̄₁ − x̄₂) / (sₚ √(1/n₁ + 1/n₂)) | sₚ: pooled SD; n₁, n₂: group sizes | Compare two independent groups (equal variances assumed)
Pooled Standard Deviation | sₚ = √[((n₁−1)s₁² + (n₂−1)s₂²) / (n₁ + n₂ − 2)] | s₁, s₂: group SDs | Required for two-sample t test denominator
Paired t Statistic | t = d̄ / (s_d / √n) | d̄: mean difference; s_d: SD of differences; n: pairs | Before/after or matched-pair designs
Welch’s t Statistic | t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂) | Does not pool SD; each group uses its own variance | Two independent groups with unequal variances
Welch–Satterthwaite df | df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)] | Yields non-integer df; more conservative than n₁ + n₂ − 2 | Welch’s t test only
Degrees of Freedom | One/Paired: n − 1  •  Two-sample: n₁ + n₂ − 2 | Determines shape of the t distribution used for p-value | All t tests
P-value (two-tailed) | p = 2 × P(T_df ≥ |t|) | T_df: t distribution with df degrees of freedom | Probability of |t| this large if H₀ is true
Confidence Interval | x̄ ± t* · (s / √n) | t*: critical value of t_df at chosen α/2 | Range likely containing the true mean
Standard Error | SE = s / √n | Measures precision of the sample mean estimate | Denominator of t statistic; CI calculation
Cohen’s d (Effect Size) | d = (x̄₁ − x̄₂) / sₚ | d < 0.2 negligible; 0.2–0.5 small; 0.5–0.8 medium; >0.8 large | Practical magnitude of difference; independent of n
Statistical Significance | p < α (reject H₀) | α = 0.05 by convention; 0.01 for stricter standards | Decision rule for hypothesis testing
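
The Welch–Satterthwaite formula in the table above is the only one that is awkward to evaluate by hand. A minimal Python sketch (the function name is ours, not part of any library):

```python
def welch_satterthwaite_df(s1, n1, s2, n2):
    """Welch-Satterthwaite degrees of freedom from group SDs and sizes."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Example: SD=12.1 with n=18 versus SD=5.8 with n=22
print(round(welch_satterthwaite_df(12.1, 18, 5.8, 22), 1))  # non-integer df near 23.3
```

Note that the result is smaller than the pooled df of n₁ + n₂ − 2 = 38, which is exactly what makes Welch’s test more conservative.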

Assumptions of the T Test (And How to Check Them)

T tests are parametric tests that require specific data conditions. Violating these assumptions can produce incorrect p-values and misleading conclusions. Before running any t test, verify these four assumptions:

1. Normality of the dependent variable

The data (or difference scores for paired tests) should be approximately normally distributed. Test with the Shapiro–Wilk test (n < 50) or visually with a Q-Q plot. The t test is robust to modest departures from normality when n > 30 (Central Limit Theorem).
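
In Python, the Shapiro–Wilk check can be run with SciPy before t testing (simulated data shown for illustration; any sample array works):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
scores = rng.normal(loc=75, scale=9.2, size=30)  # simulated exam scores

w_stat, p = stats.shapiro(scores)
# A large p-value gives no evidence against normality; a small one
# suggests considering a non-parametric test instead
print("normality plausible" if p > 0.05 else "normality questionable")
```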

2. Independence of observations

Each data point must be independent of others. Observations from the same person at the same time violate this. If you have repeated measures, use the paired t test, not the independent t test.

3. Homogeneity of variance (two-sample t test only)

The two groups should have equal population variances. Check with Levene’s test or an F-test. If the variance ratio (larger/smaller SD) exceeds 2, use Welch’s t test instead. Welch’s is robust to unequal variances and unequal sample sizes.
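
This decision rule can be automated with SciPy’s Levene test (simulated groups shown for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.normal(82, 12, size=18)  # higher-variance group
group2 = rng.normal(89, 6, size=22)   # lower-variance group

_, levene_p = stats.levene(group1, group2)
sd_big = max(group1.std(ddof=1), group2.std(ddof=1))
sd_small = min(group1.std(ddof=1), group2.std(ddof=1))
sd_ratio = sd_big / sd_small

# Fall back to Welch's t test if either check flags unequal variances
use_welch = (levene_p < 0.05) or (sd_ratio > 2)
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=not use_welch)
```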

4. Interval or ratio scale of measurement

The outcome variable must be measured on an interval or ratio scale (continuous numeric data). For ordinal data or ranked scores, consider the Mann–Whitney U test (non-parametric alternative to independent t test) or the Wilcoxon signed-rank test (alternative to paired t test).

How to Calculate a T Test by Hand (Step-by-Step)

Manual calculation builds genuine understanding of what the calculator is doing. Here is the one-sample t test worked in full with a real dataset.

Problem: A professor claims the class average on a statistics exam should be 75 points (μ₀ = 75). A sample of 30 students has a mean of x̄ = 78.5 and a standard deviation of s = 9.2. Test at α = 0.05 whether the class mean differs from the claimed value (two-tailed).

Step 1 — State the hypotheses

H₀: μ = 75 (the class mean equals the claimed value)
H₁: μ ≠ 75 (the class mean differs from the claimed value)

Step 2 — Calculate the Standard Error

SE = s / √n = 9.2 / √30 = 9.2 / 5.477 = 1.680

Step 3 — Calculate the t Statistic

t = (x̄ − μ₀) / SE = (78.5 − 75) / 1.680 = 3.5 / 1.680 = 2.083

Step 4 — Determine degrees of freedom

df = n − 1 = 30 − 1 = 29

Step 5 — Find the p-value

For t = 2.083 with df = 29 (two-tailed): t-table lookup shows the critical value at α = 0.05 is t* = 2.045. Since |2.083| > 2.045, the result is significant. The exact p-value ≈ 0.046.

Step 6 — Make the decision and interpret

Since p = 0.046 < α = 0.05, reject H₀. There is sufficient evidence that the class mean (x̄ = 78.5) differs significantly from the claimed value of 75. The 95% CI for the true mean is approximately [75.1, 81.9]. Effect size: d = 3.5 / 9.2 = 0.38 (small-to-medium effect).

Conclusion: t(29) = 2.083, p = 0.046, d = 0.38. The class mean of 78.5 is statistically significantly higher than the claimed 75 at α = 0.05, with a small-to-medium practical effect. Verify these calculations using the One-Sample tab of the calculator above.
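
The six steps above can be reproduced from the summary statistics alone using SciPy’s t distribution:

```python
import math
from scipy import stats

x_bar, mu0, s, n = 78.5, 75.0, 9.2, 30

se = s / math.sqrt(n)                  # Step 2: standard error
t = (x_bar - mu0) / se                 # Step 3: t statistic
df = n - 1                             # Step 4: degrees of freedom
p = 2 * stats.t.sf(abs(t), df)         # Step 5: two-tailed p-value
t_crit = stats.t.ppf(0.975, df)        # critical value at alpha = 0.05
ci = (x_bar - t_crit * se, x_bar + t_crit * se)

print(f"t({df}) = {t:.3f}, p = {p:.3f}, 95% CI = [{ci[0]:.1f}, {ci[1]:.1f}]")
```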

How to Interpret T Test Results

The t test produces five numbers you need to report and interpret: the t statistic, degrees of freedom, p-value, confidence interval, and effect size. Here is what each one means in plain English.

T statistic: The signal-to-noise ratio. It measures how many standard errors the sample mean is from the null hypothesis value. Larger |t| = stronger evidence against H₀. Positive t means the sample mean is above μ₀; negative t means it is below.
P-value: The probability of observing a t statistic as extreme as yours, assuming H₀ is true. A p-value of 0.03 means there is only a 3% chance of getting this result by random sampling error if the null were true — not a 97% probability that H₁ is true. The p-value does not measure the size or importance of an effect.
Confidence interval: A range of plausible values for the true population mean (or difference). A 95% CI means that if you repeated the study 100 times, approximately 95 of the intervals would contain the true value. If the CI for a difference does not include 0, the difference is statistically significant at α = 0.05.
Cohen’s d (effect size): Reports the practical magnitude of the difference in standard deviation units. Statistical significance tells you whether an effect exists; effect size tells you whether it matters. A p-value of 0.0001 with d = 0.05 means a real but trivially small effect.

Effect Size Interpretation (Cohen, 1988)

d < 0.2 | Negligible | Practically unimportant
0.2–0.5 | Small | Noticeable in large groups
0.5–0.8 | Medium | Visible to a careful observer
d > 0.8 | Large | Clearly meaningful difference
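
These thresholds translate directly into a small helper (the function names here are ours, chosen for illustration):

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d from summary statistics, using the pooled SD."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                          / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

def interpret_d(d):
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"

d = cohens_d(9.7, 2.4, 45, 8.2, 2.1, 40)
print(round(d, 2), interpret_d(d))  # a medium effect near 0.66
```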

What Does the P-Value Really Mean? (Common Misconceptions)

The p-value is the most misunderstood concept in statistics. The American Statistical Association issued a formal statement in 2016 clarifying what p-values do and do not mean, precisely because misinterpretation is so widespread.

[Figure: t distribution showing two-tailed rejection regions at α = 0.05]

✓ What the p-value IS

P(data at least this extreme | H₀ is true). A small p-value means the observed result is unlikely under H₀. A p-value is a probability about data, not about hypotheses.

✗ What the p-value is NOT

  • P(H₀ is true) — WRONG
  • P(H₁ is true) — WRONG
  • P(result is a fluke) — WRONG
  • The probability that the result will replicate — WRONG

A p-value of 0.04 does not mean “there is only a 4% chance the null hypothesis is true.” It means: if the null were true and you ran this study many times, only 4% of those studies would produce a t statistic as large as the one you observed. According to OpenIntro Statistics, always report effect size alongside p-values to give readers a complete picture of what the data show.
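
The definition can be verified by simulation: when H₀ is actually true, p-values below 0.05 occur in roughly 5% of studies (simulated data; the exact rate varies with the seed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, alpha, hits = 5_000, 0.05, 0

for _ in range(n_sims):
    sample = rng.normal(75, 9.2, size=30)        # H0 is true: mu really is 75
    _, p = stats.ttest_1samp(sample, popmean=75)
    if p < alpha:
        hits += 1

print(hits / n_sims)  # close to alpha = 0.05
```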

Paired vs. Independent T Test: How to Choose

Feature | Paired T Test | Independent T Test
Subjects | Same subjects at two timepoints | Two separate, unrelated groups
Unit of analysis | Difference score per subject (d = x₁ − x₂) | Each group’s sample mean
Controls individual variation? | Yes — eliminates between-subject noise | No
Degrees of freedom | n − 1 (n = number of pairs) | n₁ + n₂ − 2
Typical use | Before/after studies, matched case-control | Two-group clinical trials, A/B tests
Statistical power | Higher (smaller error variance) | Lower (larger error variance)

T Test vs. Z Test: When to Use Which

Feature | T Test | Z Test
Population SD (σ) known? | No (uses sample SD) | Yes
Sample size | Any size; especially useful n < 30 | n > 30 (approximately)
Distribution used | t distribution (heavier tails) | Standard normal Z distribution
Critical value at α = 0.05 (two-tailed) | ~2.045 (df=29) to ~1.96 (df=∞) | 1.96 (fixed)
Converges to Z distribution? | Yes, as n → ∞ | N/A
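
The convergence in the last row is easy to see with SciPy: the two-tailed critical value of the t distribution falls toward the z value of 1.96 as df grows:

```python
from scipy import stats

# Two-tailed critical values at alpha = 0.05 for increasing df
for df in (5, 29, 100, 1000):
    print(df, round(stats.t.ppf(0.975, df), 3))
# Values shrink from about 2.571 at df=5 toward 1.96 as df increases
```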

Four Real-World Case Studies

Each case study below uses a different t test type. The datasets are original worked examples included for educational use and citation by instructors, researchers, and data analysts.

Case Study 1: A/B Test — Website Conversion Rates (Two-Sample)

Setup: A marketing team tests two landing page designs. Control (Group A, n=40): mean conversion 8.2%, SD=2.1%. Treatment (Group B, n=45): mean conversion 9.7%, SD=2.4%. Is the difference significant at α=0.05?

t = (9.7 − 8.2) / (sₚ × √(1/40 + 1/45)) ≈ 3.05   |   df = 83   |   p ≈ 0.003   |   d = 0.66 (medium)

Conclusion: t(83) = 3.05, p = 0.003. The treatment page produces a statistically significant improvement in conversion rate, with a medium effect size. The team should deploy the treatment page.
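
SciPy can run this test directly from the summary statistics, without raw data, via `ttest_ind_from_stats`:

```python
from scipy import stats

# Summary statistics from the A/B test (pooled, equal-variance form)
t, p = stats.ttest_ind_from_stats(mean1=9.7, std1=2.4, nobs1=45,
                                  mean2=8.2, std2=2.1, nobs2=40,
                                  equal_var=True)
print(round(t, 2), round(p, 3))
```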

Case Study 2: Medical Trial — Blood Pressure Before/After Treatment (Paired)

Setup: 20 patients have blood pressure measured before and after 8 weeks of treatment. Mean difference d̄ = −8.5 mmHg, SD of differences s_d = 6.3 mmHg. Did the treatment significantly reduce blood pressure?

t = −8.5 / (6.3 / √20) = −8.5 / 1.409 = −6.03   |   df = 19   |   p < 0.001   |   d = 1.35 (large)

Conclusion: t(19) = −6.03, p < 0.001, d = 1.35. The treatment produced a highly significant, large reduction in blood pressure (−8.5 mmHg on average). The result is both statistically significant and clinically meaningful.

Case Study 3: Education — Two Teaching Methods (Welch’s)

Setup: Class A (n=18, x̄=82.4, SD=12.1) used traditional lecture; Class B (n=22, x̄=89.1, SD=5.8) used active learning. The variances differ substantially (ratio ≈ 4.3), so Welch’s t test is appropriate.

t = (89.1 − 82.4) / √(12.1²/18 + 5.8²/22) ≈ 2.16   |   df (W-S) ≈ 23.3   |   p ≈ 0.042   |   d = 0.73 (medium)

Conclusion: t(23.3) = 2.16, p = 0.042, d = 0.73. Active learning produced significantly higher exam scores than traditional lecture, with a medium effect size. Welch’s test was essential here: a pooled t test with these unequal variances would have over-stated the degrees of freedom and under-stated the p-value.
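
The same `ttest_ind_from_stats` call handles Welch’s version when `equal_var=False`:

```python
from scipy import stats

t, p = stats.ttest_ind_from_stats(mean1=89.1, std1=5.8, nobs1=22,
                                  mean2=82.4, std2=12.1, nobs2=18,
                                  equal_var=False)  # Welch's t test
print(round(t, 2), round(p, 3))
```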

Case Study 4: Manufacturing — Single Batch vs. Specification (One-Sample)

Setup: A pharmaceutical batch must contain 500mg of active ingredient per tablet. Quality control tests n=25 tablets: x̄=503.2mg, SD=4.8mg. Is the batch out of specification?

t = (503.2 − 500) / (4.8 / √25) = 3.2 / 0.96 = 3.33   |   df = 24   |   p = 0.003   |   d = 0.67

Conclusion: t(24) = 3.33, p = 0.003. The batch mean of 503.2mg is statistically significantly above the 500mg specification, indicating a systematic over-filling process that should be investigated and corrected.

T Test in Python, Excel, and R

For data analysts working programmatically, these are the standard functions for each t test type across common tools.

Python (SciPy)

from scipy import stats
import numpy as np

# One-sample t test
t_stat, p_val = stats.ttest_1samp(data, popmean=75)

# Two-sample independent t test (equal variances assumed)
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=True)

# Welch's t test (unequal variances — recommended default)
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)

# Paired t test
t_stat, p_val = stats.ttest_rel(before, after)

# Effect size (Cohen's d) — not in SciPy, calculate manually
n1, n2 = len(group1), len(group2)
pooled_sd = np.sqrt(((n1 - 1) * np.var(group1, ddof=1) +
                     (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2))
cohens_d = (np.mean(group1) - np.mean(group2)) / pooled_sd

Microsoft Excel

=T.TEST(array1, array2, 2, 1)    // Paired t test, two-tailed
=T.TEST(array1, array2, 2, 2)    // Independent, equal variance, two-tailed
=T.TEST(array1, array2, 2, 3)    // Welch's (unequal variance), two-tailed
=T.DIST.2T(ABS(t_stat), df)      // Two-tailed p-value from t statistic
=T.INV.2T(0.05, df)              // Critical t value at alpha = 0.05

R

t.test(x, mu = 75)                          # One-sample
t.test(group1, group2, var.equal = TRUE)    # Two-sample (equal variance)
t.test(group1, group2, var.equal = FALSE)   # Welch's (default in R)
t.test(before, after, paired = TRUE)        # Paired
# R reports: t, df, p-value, 95% CI, sample means

Sources and Further Reading

Authority sources cited in this guide:

  • Gosset, W. S. (“Student”). (1908). The Probable Error of a Mean. Biometrika, 6(1), 1–25. The original t test paper. jstor.org
  • National Institute of Standards and Technology (NIST). Engineering Statistics Handbook — t Test. itl.nist.gov
  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates. [Source for effect size conventions d = 0.2/0.5/0.8]
  • Welch, B. L. (1947). The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika, 34(1–2), 28–35. [Welch’s t test original paper]
  • MIT OpenCourseWare. 18.650 Statistics for Applications. ocw.mit.edu
  • Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129–133. ASA p-value statement
  • Diez, D., Çetinkaya-Rundel, M., & Barr, C. (2022). OpenIntro Statistics (4th ed.). Free open-access textbook. openintro.org
  • Penn State Department of Statistics. STAT 415: Introduction to Mathematical Statistics. online.stat.psu.edu

FAQs

What is a t test and when should I use it?

A t test is a parametric statistical test used to determine whether the means of one or two groups differ significantly. Use it when: (1) the outcome variable is continuous, (2) the population standard deviation σ is unknown, and (3) the data are approximately normally distributed or n > 30. If you have three or more groups, use ANOVA instead. If the normality assumption fails for small samples, use the non-parametric Mann–Whitney U or Wilcoxon test.

What is the difference between a one-tailed and a two-tailed t test?

A two-tailed test asks whether the means differ in either direction (H₁: μ ≠ μ₀) and splits α across both tails of the t distribution. A one-tailed test asks whether the mean is specifically higher (H₁: μ > μ₀) or lower (H₁: μ < μ₀) and puts all α in one tail. One-tailed tests are more powerful but only appropriate when you have a directional hypothesis established before collecting data. Using one-tailed post-hoc to get significance is p-hacking.

What exactly does the p-value mean?

The p-value is the probability of observing a t statistic as extreme as the one calculated, if the null hypothesis H₀ were true. A p-value of 0.04 means: if there truly were no effect, random sampling would produce a t this extreme only 4% of the time. It does not mean there is a 96% probability that H₁ is true. The ASA’s 2016 statement emphasizes that statistical significance at p < 0.05 does not by itself measure the importance, size, or replicability of an effect.

How are degrees of freedom calculated for each t test type?

One-sample t test: df = n − 1. Paired t test: df = n − 1 (where n is the number of pairs). Independent two-sample t test: df = n₁ + n₂ − 2. Welch’s t test: df is estimated by the Welch–Satterthwaite equation and is usually non-integer and smaller than n₁ + n₂ − 2, making the test more conservative. Larger df makes the t distribution closer to the normal distribution, so critical values decrease toward 1.96.

What is Welch’s t test and when should I use it?

Welch’s t test is a modification of the two-sample t test that does not assume equal population variances. It uses a separate variance estimate for each group instead of a pooled variance, and the Welch–Satterthwaite equation to estimate degrees of freedom. Use Welch’s when: (1) groups have different SDs (variance ratio > 2), (2) groups have very different sample sizes, or (3) you are unsure about the equal-variance assumption. Many statisticians recommend using Welch’s by default for all two-sample comparisons.

Why does effect size matter if I already have a p-value?

Effect size measures the practical magnitude of a difference, independent of sample size. Cohen’s d is the standard effect size for t tests: d = difference in means / pooled SD. A study with n=10,000 can produce p < 0.001 for a completely trivial difference (d = 0.01). Conversely, an important clinical effect (d = 0.8) might show p = 0.12 with only n=15. Always report both p-values and effect sizes. Effect size is what tells you whether a statistically significant result is scientifically or practically important.

What are the assumptions of the t test?

Four assumptions: (1) The dependent variable is measured on an interval or ratio scale. (2) Observations are independent of each other. (3) The data are approximately normally distributed — the t test is robust to modest violations when n > 30. (4) For two-sample t tests: the two populations have equal variances (use Welch’s if violated). You can check normality with the Shapiro–Wilk test and check variance equality with Levene’s test. If both normality and small sample size are issues, use the Mann–Whitney U test.

What is the difference between a t test and a z test?

The key difference is whether the population standard deviation σ is known. If σ is known, use a z test (uses the standard normal distribution). If σ is unknown and must be estimated from the sample, use a t test (uses the t distribution with heavier tails). In practice, σ is rarely known, so t tests are used almost universally. When sample size is very large (n > 100), the t and z tests give almost identical results because the t distribution converges to the normal distribution.

How do I report t test results in APA format?

APA format requires: t(df) = [t value], p = [exact p-value], d = [Cohen’s d]. Example: “The treatment group scored significantly higher than the control group, t(48) = 3.12, p = .003, d = 0.88.” Always report exact p-values rather than “p < 0.05”. Include the 95% confidence interval for the mean difference. If the result is not significant, report “t(df) = [value], p = [value], which did not reach significance at α = .05.”

Is p = 0.05 statistically significant?

Technically, “p < 0.05” is significant and “p = 0.05” is not (the rule is strict inequality). However, treating 0.049 as significant and 0.051 as not is arbitrary dichotomization. The α = 0.05 threshold is a convention from R. A. Fisher, not a law of nature. A p-value of 0.05 provides weak evidence against H₀. Report the exact p-value, the effect size, and the confidence interval, and let readers judge the practical importance. Avoid “borderline significant” — a result either meets your pre-specified α or it does not.