What Is a Two Sample t-Test?
The test answers one concrete question: given that two sample means look different, how likely is that gap if the populations were actually identical? It accounts for the spread and size of each group, not just their averages. A mean difference of 5 points matters a lot in a tight distribution with n=200; it barely registers when the standard deviation is 40 and each group has only 8 observations.
Other names for the same procedure include the independent samples t-test, independent groups t-test, and unpaired t-test. Statistical software tends to use its own labeling: SPSS calls the output "Independent Samples T-Test," R exposes it through t.test(), and Python provides scipy.stats.ttest_ind(). The underlying calculation is the same across all three. The full theoretical context sits within the hypothesis testing framework covered on Statistics Fundamentals.
- Formula (Welch's): t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
- Default choice: Welch's t-test — works for both equal and unequal variances, unlike Student's
- Decision rule: Reject H₀ if |t| > t_critical, or equivalently if p < α (usually 0.05)
- Effect size: Always report Cohen's d alongside p — a tiny p-value from n=5000 can represent a negligible effect
- Robustness: The t-test tolerates mild non-normality when both groups have n > 30 and the sample sizes are approximately equal
- Wrong test? Use a paired t-test for repeated measures, Mann-Whitney U for small non-normal samples, or one-way ANOVA for three or more groups
The Two Sample t-Test Formula: Welch's and Student's
Two versions of the formula exist because they make different assumptions about variance. Welch's version is the modern default; Student's version is taught heavily in introductory courses but requires an additional condition that is often violated in practice.
Welch's t-Test Formula (Recommended Default)
t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

where:
- x̄₁, x̄₂ = sample means of Group 1 and 2
- s₁², s₂² = sample variances (using the n−1 denominator)
- n₁, n₂ = sample sizes (can be unequal)
- t = resulting test statistic
The numerator captures the raw difference between the two group averages. The denominator — the standard error of the difference — scales that gap by how much variability exists in each group and how many observations were collected. As either sample size grows, the denominator shrinks, and the t-statistic grows. That behavior matches intuition: a mean difference of 5 points is more convincing with n=300 per group than with n=10.
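To make that behavior concrete, here is a minimal Python sketch of Welch's formula; the numbers are illustrative, chosen to show the same 5-point gap at two different sample sizes:

```python
import math

def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Welch's t-statistic computed from summary statistics."""
    se = math.sqrt(var1 / n1 + var2 / n2)  # standard error of the difference
    return (mean1 - mean2) / se

# The same 5-point gap becomes far more convincing as n grows:
print(welch_t(74, 64, 10, 79, 49, 10))    # ≈ -1.49
print(welch_t(74, 64, 300, 79, 49, 300))  # ≈ -8.15
```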
Student's t-Test Formula (Equal Variances Required)
t = (x̄₁ − x̄₂) / (s_p × √(1/n₁ + 1/n₂))

where:
- s_p² = pooled variance (weighted average of both sample variances)
- df = n₁ + n₂ − 2
- Requires: σ₁² = σ₂² (equal variances in the populations)
Student's formula pools the two variances into a single estimate, which is efficient when the equal-variance assumption actually holds. The degrees of freedom calculation is also simpler: n₁ + n₂ − 2. The problem is that equal variance must be verified with a test like Levene's before proceeding. When that assumption fails and you use Student's formula anyway, the Type I error rate inflates above the nominal α = 0.05.
Which Formula Should You Use?
Always use Welch's t-test unless you have a specific, tested reason to assume equal variances. Welch's formula produces valid results under both equal and unequal variance conditions. The cost of using it when variances are equal is trivially small. The cost of using Student's formula when variances are unequal can be substantial.
| Feature | Student's t-Test | Welch's t-Test |
|---|---|---|
| Variance assumption | Equal (homogeneous) | None required |
| Degrees of freedom | n₁ + n₂ − 2 (simple) | Welch–Satterthwaite equation |
| When to use | Only when Levene's test gives p > 0.05 and you have a specific reason to prefer pooling | The default for most analyses |
| If assumption violated | Type I error inflates above α | Remains valid |
| R argument | var.equal = TRUE | var.equal = FALSE (the t.test() default) |
| Python argument | equal_var=True (SciPy's default) | equal_var=False (recommended) |
| SPSS output row | "Equal variances assumed" | "Equal variances not assumed" |
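In code, the last rows of the table reduce to a single argument. A minimal SciPy sketch on simulated data (group parameters are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=5, size=40)   # tight spread
group_b = rng.normal(loc=52, scale=15, size=40)  # three times the spread

# Welch's t-test: no equal-variance assumption (recommended)
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Student's t-test: pooled variance, assumes equal variances
t_student, p_student = stats.ttest_ind(group_a, group_b, equal_var=True)

print(f"Welch's:   t = {t_welch:.3f}, p = {p_welch:.4f}")
print(f"Student's: t = {t_student:.3f}, p = {p_student:.4f}")
```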
5 Assumptions of the Two Sample t-Test
The t-test gives valid p-values only when the underlying data meets certain conditions. Some of these are absolute requirements; others are more flexible than most introductory courses suggest.
| # | Assumption | How to Check | What Breaks If Violated |
|---|---|---|---|
| 1 | Independence — observations within and between groups have no relationship | Examine study design; cannot be tested statistically | Invalid standard error; cannot be fixed post-hoc |
| 2 | Continuous data — the outcome variable is interval or ratio scale | Check measurement type before analysis | Meaningless mean computation for categorical data |
| 3 | Approximate normality — each group's data follows a roughly normal distribution | Shapiro-Wilk test, Q-Q plots, histogram inspection | Invalid p-values, especially with small n (see robustness note below) |
| 4 | No extreme outliers — no single observation that strongly distorts the mean | Boxplot, z-scores beyond ±3 | Biased mean and variance estimates |
| 5 | Homogeneity of variance — both populations share the same variance (Student's only) | Levene's test or Bartlett's test | Inflated Type I error rate when using Student's formula |
The two sample t-test tolerates violations of normality well when two conditions hold simultaneously: sample sizes are approximately equal between groups, and each group has n > 30. In those cases, the Central Limit Theorem ensures the sampling distribution of the mean difference approaches normality regardless of the underlying data's shape. If your histogram looks slightly right-skewed but you have 45 observations per group with balanced sizes, your p-value remains valid. Reserve the non-parametric Mann-Whitney U test for genuinely small samples (n < 15) combined with strong skew or multiple outliers.
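Assumptions 3 and 5 can be checked in a few lines with SciPy; a sketch on simulated data (the groups mirror the worked example later in this guide):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(74, 8, 30)
group_b = rng.normal(79, 7, 30)

# Shapiro-Wilk per group: p < 0.05 suggests non-normality
# (pair with a Q-Q plot; the test is oversensitive at large n)
print(stats.shapiro(group_a))
print(stats.shapiro(group_b))

# Levene's test: only relevant if you are considering Student's variant
print(stats.levene(group_a, group_b))
```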
Degrees of Freedom: Simple vs. Welch–Satterthwaite
Degrees of freedom determine which t-distribution you compare your test statistic against. Larger df means the t-distribution more closely resembles the normal distribution, producing smaller critical values and narrower confidence intervals.
Student's t-test df: df = n₁ + n₂ − 2. Two parameters (the two group means) are estimated from the data, so two degrees of freedom are consumed.
Welch's df (Welch–Satterthwaite equation):
df = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1) ]
Welch's df is always less than or equal to Student's df = n₁ + n₂ − 2. This is intentional: using fewer df produces wider critical value boundaries, which builds in extra caution for the variance uncertainty. Statistical software handles this automatically; you rarely need to compute it by hand.
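If you do want to verify it, the equation is a few lines of Python; with the numbers from the worked example in the next section (variances 64 and 49, n = 30 each) it returns roughly 57:

```python
def welch_df(var1, n1, var2, n2):
    """Welch–Satterthwaite degrees of freedom."""
    a, b = var1 / n1, var2 / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

print(welch_df(64, 30, 49, 30))  # ≈ 57.0, vs Student's df = 30 + 30 − 2 = 58
```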
Step-by-Step Worked Example
A researcher investigates whether background music during study affects exam performance. Students are randomly assigned to two conditions: Group A studies with music (n = 30, x̄ = 74, s = 8), and Group B studies in silence (n = 30, x̄ = 79, s = 7). Does the mean difference reach statistical significance at α = 0.05?
Step 1 — State the Hypotheses
Two-tailed test: no prior expectation about direction
Null hypothesis: μ₁ = μ₂ — the two populations have equal means. Music has no effect on exam scores.
Alternative hypothesis: μ₁ ≠ μ₂ — the population means differ. Music does affect exam scores (two-tailed).
Step 2 — Check Assumptions
All five assumptions are satisfied in this scenario
Independence: Students were randomly assigned to conditions and studied separately.
Continuous data: Exam scores are ratio-scale measurements.
Normality: n = 30 per group with equal sizes — the Central Limit Theorem applies. The t-test is robust here.
Variances: Group A: s² = 64; Group B: s² = 49. Variances differ, so Welch's t-test is appropriate.
Step 3 — Calculate the t-Statistic
Group A: n=30, x̄=74, s=8 | Group B: n=30, x̄=79, s=7
Compute variance terms: s₁²/n₁ = 64/30 = 2.133; s₂²/n₂ = 49/30 = 1.633
Standard error of the difference: SE = √(2.133 + 1.633) = √3.767 ≈ 1.941
t-statistic: t = (74 − 79) / 1.941 = −5 / 1.941 ≈ −2.576
Degrees of freedom (Welch–Satterthwaite): df ≈ (2.133 + 1.633)² / [(2.133²/29) + (1.633²/29)] ≈ 57
Critical value: At α = 0.05, two-tailed, df = 57: t_critical ≈ ±2.002. Use the t-distribution table for exact critical values.
✓ |−2.576| = 2.576 > 2.002 → Reject H₀. The mean exam score difference between study-with-music (74) and study-in-silence (79) is statistically significant at α = 0.05. t(57) = −2.576, p ≈ 0.013.
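SciPy reproduces the whole calculation from the summary statistics alone via scipy.stats.ttest_ind_from_stats:

```python
from scipy import stats

# Welch's test from summary statistics (no raw data needed)
result = stats.ttest_ind_from_stats(
    mean1=74, std1=8, nobs1=30,
    mean2=79, std2=7, nobs2=30,
    equal_var=False,
)
print(result)  # statistic ≈ -2.576, pvalue ≈ 0.013
```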
Step 4 — Interpret the Result with Effect Size
p = 0.013 means the result is statistically significant, not that it is important. Compute Cohen's d: d = (x̄₁ − x̄₂) / s_pooled. With s_pooled = √[(29×64 + 29×49)/58] ≈ 7.52, we get |d| = 5/7.52 ≈ 0.66 — a medium effect. This tells researchers the practical magnitude, not just whether the difference cleared an arbitrary threshold.
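SciPy does not report Cohen's d, so it is typically computed by hand. A small sketch matching the calculation above:

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d using the pooled standard deviation."""
    s_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / s_pooled

print(cohens_d(74, 8, 30, 79, 7, 30))  # ≈ -0.665, |d| ≈ 0.66 — a medium effect
```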
Which t-Test Should You Use? Decision Guide
The two sample t-test applies in one specific situation: two independent groups measured on a continuous outcome. Step through the decision guide below before running any analysis.
📊 t-Test Selection Flowchart
- Comparing one group against a known value → one-sample t-test
- Comparing three or more groups → one-way ANOVA (multiple t-tests inflate α)
- Before/after measurements on the same subjects → paired t-test
- Small, strongly non-normal samples → Mann-Whitney U (non-parametric)
- Two independent groups, equal variances confirmed (Levene's test) → Student's t-test (pooled variance)
- Two independent groups otherwise → Welch's t-test (recommended default)
💡 When in doubt between Student's and Welch's, always choose Welch's — it is valid under both scenarios.
Two Sample vs. Paired vs. One-Sample t-Test
The three t-test variants cover different study designs. Choosing the wrong one produces incorrect degrees of freedom, wrong standard errors, and invalid p-values.
| Feature | Two-Sample (Independent) | Paired t-Test | One-Sample t-Test |
|---|---|---|---|
| Group relationship | Unrelated, separate subjects | Same subjects measured twice | One group vs. a known value |
| Example | Drug A patients vs Drug B patients | Patients before vs after treatment | Class average vs. a published norm |
| Null hypothesis | μ₁ = μ₂ | μ_d = 0 (mean difference = 0) | μ = μ₀ |
| Formula basis | Difference in group means | Mean of within-subject differences | Group mean minus hypothesized value |
| Degrees of freedom | Welch–Satterthwaite | n − 1 | n − 1 |
| More powerful when | Groups are truly independent | High correlation within pairs | The benchmark value is known, not estimated |
| R function | t.test(x, y, paired=FALSE) | t.test(x, y, paired=TRUE) | t.test(x, mu=μ₀) |
| Python function | ttest_ind(a, b) | ttest_rel(a, b) | ttest_1samp(a, popmean) |
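A quick sketch contrasting the three SciPy calls (the data are simulated; before and after share subjects, so only ttest_rel treats that pairing correctly):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
before = rng.normal(100, 15, 25)
after = before + rng.normal(3, 5, 25)  # same subjects, measured twice
other_group = rng.normal(103, 15, 25)  # separate, independent subjects

print(stats.ttest_1samp(before, popmean=100))  # one group vs known value
print(stats.ttest_ind(before, other_group, equal_var=False))  # independent groups (Welch's)
print(stats.ttest_rel(before, after))  # paired, same subjects
```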
Interpreting Results: p-Value, Confidence Interval, and Effect Size
Three outputs matter for a complete interpretation: the p-value (does the difference clear the significance threshold?), the confidence interval (what is the plausible range for the true mean difference?), and Cohen's d (how large is the effect in practical terms?).
| Output | What It Tells You | What It Does Not Tell You |
|---|---|---|
| p < 0.05 | The observed difference is unlikely under H₀ at α = 0.05 | That the effect is large or practically meaningful |
| p ≥ 0.05 | Insufficient evidence to reject H₀ | That the means are equal (absence of evidence ≠ evidence of absence) |
| 95% CI excludes 0 | Significant mean difference at α = 0.05 | The precision of the estimate (wide CI = uncertain estimate) |
| Cohen's d < 0.2 | Negligible effect regardless of p | Nothing — a significant negligible effect is still negligible |
| Cohen's d 0.2–0.5 | Small but potentially meaningful effect | Whether the effect matters in your specific context |
| Cohen's d 0.5–0.8 | Medium effect — likely noticeable in practice | — |
| Cohen's d > 0.8 | Large effect — clearly important in most contexts | — |
Running multiple t-tests on the same dataset compounds your false positive rate. With k tests at α = 0.05, the Family-Wise Error Rate = 1 − (1 − 0.05)ᵏ. At k = 5 tests, that reaches 22.6% — nearly one-in-four chance of a spurious significant result. When comparing subgroups, time points, or multiple outcomes, apply Bonferroni correction (α* = 0.05/k) or the Benjamini-Hochberg procedure to control error rates across the test family.
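The arithmetic is easy to verify directly:

```python
# Family-wise error rate for k independent tests at alpha = 0.05
alpha = 0.05
for k in (1, 3, 5, 10):
    fwer = 1 - (1 - alpha) ** k
    print(f"k = {k:2d}: FWER = {fwer:.3f}, Bonferroni alpha* = {alpha / k:.4f}")
# k = 5 gives FWER = 0.226 — the 22.6% quoted above
```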
Running a Two Sample t-Test in Python, R, and SPSS
Python — scipy.stats.ttest_ind()
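A minimal end-to-end sketch (the score vectors are invented for illustration; the confidence_interval() method on the result object requires SciPy ≥ 1.10):

```python
import numpy as np
from scipy import stats

# Two independent groups (e.g., exam scores under two study conditions)
group_a = np.array([70, 68, 81, 77, 74, 66, 79, 72, 75, 83])
group_b = np.array([80, 85, 74, 79, 88, 77, 82, 76, 84, 81])

# Welch's t-test; note SciPy's default is equal_var=True (Student's)
result = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")

# 95% confidence interval for the mean difference (SciPy >= 1.10)
ci = result.confidence_interval(confidence_level=0.95)
print(f"95% CI for the mean difference: ({ci.low:.2f}, {ci.high:.2f})")
```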
R — t.test()
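R's t.test() defaults to Welch's version (var.equal = FALSE), so no extra arguments are needed: call t.test(score ~ group, data = df) with a two-level grouping factor, or t.test(x, y) with two numeric vectors. Pass var.equal = TRUE only when you have deliberately chosen Student's pooled variant. The output reports the t-statistic, Welch degrees of freedom, p-value, and a 95% confidence interval for the mean difference.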
SPSS — Independent Samples T-Test
In SPSS, navigate to Analyze → Compare Means → Independent Samples T-Test. Move your outcome variable to the "Test Variable(s)" box and your grouping variable to the "Grouping Variable" box, then click "Define Groups" to specify the two values. The output table contains two rows: read the "Equal variances not assumed" row (Welch's) unless Levene's test shows p > 0.05, in which case the "Equal variances assumed" row applies.
Real-World Applications of the Two Sample t-Test
The test appears wherever two groups are measured and a mean comparison is required. Here are five domains where it does routine work.
Clinical Trials
Treatment vs. placebo groups: does the drug reduce blood pressure more than the control? The two sample t-test on mean blood pressure reduction answers this directly, provided randomization ensures independence.
Education Research
Do students taught with one method score higher than those taught with another? Schools and ed-tech companies use independent samples t-tests to evaluate curriculum changes before scaling.
A/B Testing
Does landing page version A produce a higher average order value than version B? When the outcome variable is a continuous metric (revenue, time on page, session depth), the two sample t-test applies.
Quality Control
Two production lines produce parts with different mean diameters. Is the difference statistically significant, or within expected process variation? The test quantifies whether a process change produced a real shift.
Machine Learning
Comparing model error rates across two architectures on independent test sets. Because test-set performance metrics (RMSE, MAE) are continuous outcomes, the independent samples t-test handles the comparison formally.
Entity and Formula Glossary
The table below defines every term used in two sample t-test analyses. Each definition uses plain language first, followed by the formal symbol.
| Term / Symbol | Formula | Plain-Language Definition |
|---|---|---|
| t-statistic | t | The ratio of the observed mean difference to the standard error; measures how many standard errors separate the two group means under H₀. |
| Null hypothesis | H₀: μ₁ = μ₂ | The default assumption that both populations share the same mean; the test checks whether the data provide enough evidence to reject this. |
| Alternative hypothesis | Hₐ: μ₁ ≠ μ₂ | The research claim that the two population means are not equal, evaluated against H₀. |
| p-value | p | The probability of observing a test statistic as extreme as the calculated one, assuming H₀ is true; values below α lead to rejection. |
| Alpha level | α = 0.05 | The pre-set significance threshold; the maximum tolerable probability of a false positive (Type I error). |
| Degrees of freedom | df | The number of independent values free to vary; determines which t-distribution to consult for the critical value. |
| Pooled variance | s_p² = [(n₁−1)s₁² + (n₂−1)s₂²] / (n₁+n₂−2) | A weighted average of the two sample variances used in Student's t-test under the equal-variance assumption. |
| Standard error of difference | SE = √(s₁²/n₁ + s₂²/n₂) | The estimated standard deviation of the sampling distribution of the mean difference; the denominator of the t-formula. |
| Cohen's d | d = (x̄₁ − x̄₂) / s_pooled | A standardized effect size measuring the mean difference in units of pooled standard deviation; independent of sample size. |
| Critical value | t_{α/2, df} | The t-statistic threshold beyond which H₀ is rejected; depends on α, tail direction, and degrees of freedom. |
| Confidence interval | CI = (x̄₁−x̄₂) ± t* × SE | A range of plausible values for the true population mean difference at the specified confidence level. |
| Levene's test | F-statistic | A preliminary hypothesis test that checks whether two groups have equal variances; determines which t-test variant applies. |
| Type I error | α | Rejecting a true null hypothesis — a false positive — controlled by the significance level choice. |
| Type II error | β | Failing to reject a false null hypothesis — a false negative — related to statistical power (1 − β). |
| Statistical power | 1 − β | The probability of correctly detecting a true effect; increases with larger sample sizes and larger effect sizes. |
| Family-Wise Error Rate | FWER = 1 − (1−α)ᵏ | The cumulative probability of at least one false positive when running k tests simultaneously on the same dataset. |
5 Common Mistakes and How to Avoid Them
| # | The Mistake | The Correct Approach |
|---|---|---|
| 1 | Using Student's t-test without first verifying equal variances — inflates the Type I error rate when variances differ | Default to Welch's t-test. Run Levene's test only if you have a reason to prefer Student's, and switch to Student's only if Levene's p > 0.05. |
| 2 | Concluding "the means are equal" when p ≥ 0.05 — this is the absence-of-evidence fallacy | Failing to reject H₀ means the data do not provide enough power to detect a difference, not that no difference exists. Compute a confidence interval to bound the plausible effect range. |
| 3 | Reporting only p and ignoring effect size — a p = 0.0001 from n=10,000 can correspond to Cohen's d = 0.04 (meaningless) | Always report Cohen's d alongside the t-statistic and p-value. Many journals now require effect size reporting as a condition of publication. |
| 4 | Running multiple t-tests on subgroups of the same dataset without correction — inflates FWER well above 5% | Apply Bonferroni correction (α* = 0.05/k) or Benjamini-Hochberg procedure when testing multiple comparisons. Consider one-way ANOVA for multiple group comparisons from the start. |
| 5 | Using a two sample t-test on matched or repeated-measures data — loses the within-subject correlation and reduces power | When the same subjects are measured twice (before/after, left/right hand, matched pairs), use a paired t-test. The pairing removes between-subject noise and produces a more powerful test. |
Peer-Reviewed References
The following peer-reviewed sources and authoritative references support the statistical claims in this guide:
- Welch, B.L. (1947). "The generalization of Student's problem when several different population variances are involved." Biometrika, 34(1–2), 28–35. — Original paper introducing the Welch's t-test correction for unequal variances. (doi:10.2307/2332510)
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Routledge. — Standard reference for effect size conventions including Cohen's d benchmarks of 0.2, 0.5, and 0.8.
- Lumley, T., Diehr, P., Emerson, S., & Chen, L. (2002). "The importance of the normality assumption in large public health data sets." Annual Review of Public Health, 23, 151–169. — Documents t-test robustness to non-normality with large, equal sample sizes. (doi:10.1146/annurev.publhealth.23.100901.140546)
- Delacre, M., Lakens, D., & Leys, C. (2017). "Why psychologists should by default use Welch's t-test instead of Student's t-test." International Review of Social Psychology, 30(1), 92–101. — Empirical case for Welch's as the universal default over Student's. (doi:10.5334/irsp.82)
- Ioannidis, J.P.A. (2005). "Why most published research findings are false." PLOS Medicine, 2(8), e124. — Foundational paper on multiple comparisons, p-hacking, and the Family-Wise Error Rate problem in published research. (doi:10.1371/journal.pmed.0020124)