What are normality tests in statistics?

Normality tests are formal statistical and visual procedures used to determine whether a dataset or regression residual pool follows a normal (Gaussian) distribution. They validate distributional assumptions required by parametric methods like t-tests, ANOVA, and linear regression.

What is the best normality test to use?

The Shapiro-Wilk test is generally the most powerful choice for sample sizes under N=2000. For larger samples, the Anderson-Darling test performs well due to its emphasis on tail deviations. Always combine formal tests with visual diagnostics like Q-Q plots.

How do you interpret a normality test p-value?

If p > 0.05, fail to reject H0 — the data shows no significant departure from normality. If p ≤ 0.05, reject H0 — the data deviates significantly from a normal distribution. With large samples, even trivial deviations can produce small p-values, so always inspect a Q-Q plot as well.

Does ANOVA require normality?

ANOVA assumes that the residuals within each group are approximately normally distributed. When sample sizes are large (N ≥ 30 per group), the Central Limit Theorem makes ANOVA robust to mild non-normality. Test normality on residuals, not the raw outcome variable.

What should I do if my data is not normally distributed?

Four main strategies exist: (1) Apply a data transformation such as log, square root, or Box-Cox. (2) Switch to a nonparametric test — Mann-Whitney instead of a t-test, or Kruskal-Wallis instead of ANOVA. (3) Use robust estimation methods. (4) Apply bootstrap resampling to construct empirical confidence intervals.

Normality Tests in Statistics: The Complete Reference Guide (2026)

What Are Normality Tests? (Definition)

Definition — Statistical Normality Tests

A normality test is a goodness-of-fit statistical procedure where the null hypothesis (H₀) states that the sampled population follows a normal (Gaussian) distribution. A significant result (p ≤ 0.05) rejects this hypothesis, indicating the data deviates significantly from a bell curve. A non-significant result (p > 0.05) means the data is consistent with the normality assumption required for parametric inference; it does not prove that the population is normal.

H₀: Data ~ Normal(μ, σ²) | Reject if p ≤ α

Every common parametric test — the one-sample t-test, two-sample t-test, paired t-test, ANOVA, and linear regression — carries a normality assumption. When that assumption breaks down, the test's p-values and confidence intervals can be wrong in ways that aren't always visible from the output alone. Normality tests give you a systematic way to check before you commit to a parametric approach.

Statistical normality refers to data following the Gaussian distribution, named after Carl Friedrich Gauss: a symmetric, bell-shaped curve where roughly 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three. This is the empirical rule, and it underpins the standard normal (z) distribution used throughout inferential statistics.
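The 68/95/99.7 figures can be checked directly from the standard normal CDF. The sketch below assumes SciPy is installed; the helper name coverage_within is illustrative and not part of any library.

```python
from scipy.stats import norm

def coverage_within(k: float) -> float:
    """Probability that a normal value lies within k standard deviations of the mean."""
    return norm.cdf(k) - norm.cdf(-k)

for k in (1, 2, 3):
    # Prints roughly 0.6827, 0.9545, 0.9973
    print(f"within {k} SD: {coverage_within(k):.4f}")
```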

⚡ Quick Reference — Normality Testing Key Facts

H₀ (null hypothesis): The data comes from a normally distributed population
H₁ (alternative hypothesis): The data does not follow a normal distribution
p > 0.05: Fail to reject H₀ — no significant departure from normality detected
p ≤ 0.05: Reject H₀ — data deviates significantly from a normal distribution
Best practice: Combine a formal test with a Q-Q plot — never rely on either alone
Large samples: Formal tests may flag trivially small deviations; weight visual diagnostics more heavily

W: Shapiro-Wilk Statistic
A²: Anderson-Darling Statistic
D: Kolmogorov-Smirnov Statistic
JB: Jarque-Bera Statistic

Complete Normality Tests Comparison Table

How to Read This Table

Each row is a different normality test. "Optimal N" is the sample size range where that test performs best. When in doubt about which test to run, Shapiro-Wilk is the default choice for most research datasets (N < 2000).

Test	Null Hypothesis (H₀)	Test Statistic	Optimal Sample Size	Key Strengths	Main Limitations	Software Support
Shapiro-Wilk	Data is normally distributed	W	3 ≤ N ≤ 2000	Highest statistical power across most distribution types	Oversensitive at large samples (N > 5000)	SPSS, R, Python, Stata, JASP, Jamovi
Anderson-Darling	Data matches a specified normal profile	A²	All sizes (N ≥ 5)	Excellent at detecting tail-region departures	Critical values vary by parameter estimation method	R, Python, Minitab, Stata
Kolmogorov-Smirnov (Lilliefors)	Sample matches a theoretical normal CDF	D	Large samples (N > 2000)	Simple concept; widely taught	Very low power for small samples; requires Lilliefors correction when parameters are estimated	SPSS, R, Python, SAS
Jarque-Bera	Skewness and kurtosis match a normal profile	JB	Large samples (N > 300)	Computes instantly; ideal for econometrics and time series	Near-zero power with small datasets	R, Python, EViews, Stata
D'Agostino-Pearson	Skewness and kurtosis are consistent with normality	K²	Medium to large (N ≥ 20)	Combines skewness and kurtosis transforms; descriptive output	Ambiguous results when skew and kurtosis trends cancel	Python (SciPy), GraphPad Prism

Test comparisons based on: Razali, N.M. & Wah, Y.B. (2011). Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests. Journal of Statistical Modeling and Analytics, 2(1), 21–33. And: NIST Statistical Engineering Division.

The five tests above share one conceptual goal — measuring how far your observed data strays from a theoretical Gaussian distribution — but they measure that distance in different ways. Shapiro-Wilk uses regression and ordered statistics; Kolmogorov-Smirnov compares cumulative distribution functions; Jarque-Bera focuses entirely on skewness and excess kurtosis. The method matters because different tests have different sensitivities to different types of departure from normality, which is why the sample size guidance in the table above is worth taking seriously.

How to Test Data for Normality: The 7-Step Workflow

Research practice combines visual inspection with a formal test — neither method alone gives a complete picture. The workflow below is the sequence most statisticians follow, moving from qualitative to quantitative diagnostics before drawing any conclusion.

Initial Visual Inspection of Raw Data

Before running any numbers, examine the raw data for obvious problems: extreme range spreads, data entry truncation errors, or impossible values. A dataset with severe data quality issues should be cleaned before normality testing. For grouped data such as ANOVA designs, inspect each group separately.

Histogram Assessment

Plot raw data frequencies across calculated bins. Look for the classic symmetric bell shape. Identify problems like bimodality (two peaks — which usually signals two distinct subgroups in the data), heavy right or left skew, or extreme ceiling/floor effects where values cluster at the maximum or minimum of the measurement scale.

Quantile-Quantile (Q-Q) Plot Analysis

Map empirical data quantiles against expected standard normal quantiles. In a clean dataset, the points cluster tightly along a straight reference line (a 45-degree diagonal when the data are standardized). An S-shaped curve signals kurtosis issues (heavy or light tails). Points curving upward at both ends indicate right skew; downward curves at both ends indicate left skew. The Q-Q plot is more informative than a histogram for detecting tail behavior.
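The coordinates behind a Q-Q plot can be computed without any plotting library. This is a minimal sketch assuming NumPy and SciPy; the function name qq_points and the plotting positions (i − 0.5)/n are choices made for illustration (other conventions exist), and the sample is simulated.

```python
import numpy as np
from scipy.stats import norm

def qq_points(values):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    observed = np.sort(np.asarray(values, dtype=float))
    n = observed.size
    # Plotting positions (i - 0.5) / n for i = 1..n
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = norm.ppf(probs)
    return theoretical, observed

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=200)  # simulated data
theo, obs = qq_points(sample)

# Correlation between the two quantile sets; values near 1 mean the
# points lie close to a straight line.
r = np.corrcoef(theo, obs)[0, 1]
print(f"Q-Q correlation: {r:.4f}")
```

Plotting obs against theo as a scatter chart gives the usual Q-Q display; the correlation is only a rough numeric summary of straightness, not a replacement for looking at the plot.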

Quantitative Skewness Evaluation

Measure the distribution's asymmetry. Perfect symmetry produces a skewness value of 0. A positive skew means a long right tail; negative skew means a long left tail. Most applied statisticians treat a skewness value within [−1.0, +1.0] as acceptable for parametric tests, though some researchers use the stricter [−0.5, +0.5] threshold for sensitive analyses. See the variance and distribution shape guide for the full formula.

Quantitative Kurtosis Evaluation

Examine tail weight relative to a normal curve. A normal distribution carries an excess kurtosis of 0 (or an absolute kurtosis of 3.0). High positive excess kurtosis (leptokurtic) points to heavy tails and a sharp peak — common in financial return data. Negative excess kurtosis (platykurtic) indicates thin tails and a flatter distribution — common when data has hard boundaries. Values of excess kurtosis between −2 and +2 are generally acceptable for parametric use.
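Skewness and excess kurtosis can be computed from the central moments in a few lines. The sketch below assumes NumPy and SciPy and uses the biased (population-moment) estimators; the function name moment_shape and the simulated data are illustrative.

```python
import numpy as np
from scipy import stats

def moment_shape(values):
    """Return (skewness, excess kurtosis) from biased central moments."""
    x = np.asarray(values, dtype=float)
    dev = x - x.mean()
    m2 = np.mean(dev**2)
    m3 = np.mean(dev**3)
    m4 = np.mean(dev**4)
    skewness = m3 / m2**1.5
    excess_kurtosis = m4 / m2**2 - 3.0
    return skewness, excess_kurtosis

rng = np.random.default_rng(1)
right_skewed = rng.exponential(scale=2.0, size=500)  # simulated right-skewed data
s, k = moment_shape(right_skewed)
print(f"skewness = {s:.3f}, excess kurtosis = {k:.3f}")

# SciPy's defaults (bias=True, fisher=True) use the same definitions
print(stats.skew(right_skewed), stats.kurtosis(right_skewed))
```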

Run the Formal Normality Test

Select and execute an objective goodness-of-fit test. For most datasets (N < 2000), use Shapiro-Wilk. For larger samples or tail-sensitive analyses, use Anderson-Darling. Generate the test statistic and p-value, then interpret against your chosen significance level (typically α = 0.05). Use the interactive calculator in Section 7 below to run a Shapiro-Wilk approximation on your own data.

Interpret Results and Choose Your Next Step

Evaluate your diagnostic findings as a whole, not in isolation. If all indicators — histogram, Q-Q plot, skewness, kurtosis, and formal test — point toward normality, proceed with your planned parametric test. If some indicators fail, consider whether the sample size is large enough for the Central Limit Theorem to compensate (see Section 6). If the violations are severe, pivot to transformation or nonparametric alternatives (Section 9).

Which Normality Test Should You Use?

Test selection depends primarily on your sample size. Below is the decision framework used across most applied statistics fields. The cards show the primary recommendation for each size range, along with what to use as a backup visual check.

Small Samples (N < 50)

Primary: Shapiro-Wilk

Highest power for detecting small-sample departures. Visual check with a strict Q-Q plot is critical because test power is genuinely low — the test may miss real non-normality. Do not use the Kolmogorov-Smirnov test here.

Medium Samples (50 ≤ N ≤ 300)

Primary: Shapiro-Wilk or Anderson-Darling

Both tests have good power in this range. Anderson-Darling is preferable when tail behavior matters (e.g., before running a regression that will be used for prediction at extremes). Back up with a Q-Q plot and histogram.

Large Samples (N > 300)

Primary: Anderson-Darling or Jarque-Bera

Formal tests become hypersensitive at large N — even trivially small deviations from normality will produce significant p-values. Rely more on visual diagnostics and consider whether the Central Limit Theorem makes normality testing unnecessary for your analysis.
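The size-based recommendations above can be written as a small lookup. This is only an encoding of this guide's thresholds, with a hypothetical function name; it is not a standard library routine.

```python
def recommend_normality_test(n: int) -> str:
    """Map a sample size to this guide's primary test recommendation."""
    if n < 3:
        raise ValueError("At least 3 observations are needed for a normality test")
    if n < 50:
        return "Shapiro-Wilk"
    if n <= 300:
        return "Shapiro-Wilk or Anderson-Darling"
    return "Anderson-Darling or Jarque-Bera"

for size in (28, 240, 5000):
    print(size, "->", recommend_normality_test(size))
```

Whatever the function returns, the visual check with a Q-Q plot still applies.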

Special Cases: Regression Residuals and ANOVA

Two situations require extra care with normality testing beyond the standard workflow.

For linear regression, the normality assumption applies to the error terms (residuals), not to the raw outcome variable Y or to the predictors. Testing the raw Y variable for normality is a common mistake. Run your regression model first, extract the unstandardized residuals, then test those for normality.

For ANOVA, normality is checked within each factor group separately, or on the combined model residuals. The overall distribution of the outcome variable across all groups is rarely normally distributed in ANOVA designs — the assumption concerns the within-group distribution, not the marginal distribution.

⚠️ Regression Normality: Common Mistake

Do not test the dependent variable Y for normality. Test the residuals from your fitted model. The regression normality assumption concerns the error terms ε, not the outcome itself. Many published analyses get this wrong.
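A small simulation illustrates why the residuals, not the outcome, are the right target. The sketch assumes NumPy and SciPy; the data are simulated with a skewed predictor and normal errors, and all variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 200
x = rng.exponential(scale=2.0, size=n)      # skewed predictor
errors = rng.normal(loc=0.0, scale=1.0, size=n)  # normal error terms
y = 2.0 + 3.0 * x + errors                  # outcome inherits the skew of x

# Fit a simple linear regression and extract unstandardized residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

_, p_raw_y = stats.shapiro(y)
_, p_resid = stats.shapiro(residuals)
print(f"Shapiro-Wilk on raw Y:     p = {p_raw_y:.4g}")
print(f"Shapiro-Wilk on residuals: p = {p_resid:.4g}")
```

In this setup the raw outcome is strongly skewed even though the model's errors are normal by construction, so a test on Y would flag a problem that the regression does not actually have.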

Worked Examples with APA Reporting

Both examples below follow the full 7-step diagnostic process. Reporting formats match the APA 7th edition statistics reporting guidelines.

Example 1 — Shapiro-Wilk Test (Clinical Trial Data)

Worked Example 1 — Shapiro-Wilk Test

A researcher measures the change in systolic blood pressure (mmHg) across 28 participants after a new antihypertensive drug. Before running a paired t-test, they must verify the normality of the blood pressure change scores.

Shapiro-Wilk Test Statistic

W = (Σ aᵢ x(ᵢ))² / Σ(xᵢ − x̄)²

aᵢ = ordered weight coefficients; x(ᵢ) = ordered sample values; W → 1 indicates normality

State hypotheses: H₀: Blood pressure changes are normally distributed | H₁: Blood pressure changes are not normally distributed

Significance level: α = 0.05 (standard in clinical research)

Choose test: Shapiro-Wilk is appropriate for N = 28 (falls within the 3 ≤ N ≤ 2000 range where it performs best)

Visual check: Q-Q plot shows points closely tracking the reference diagonal with slight scatter at the tails — consistent with approximate normality

Skewness = 0.31, Kurtosis = 2.74 (excess kurtosis = −0.26): Both values fall within acceptable ranges (skewness within ±1, excess kurtosis near 0)

Formal test output: W = 0.968, p = 0.542

Decision: p = 0.542 > α = 0.05 → Fail to reject H₀. The data shows no statistically significant departure from normality. Proceed with the paired t-test.

✅ APA Reporting: "A Shapiro-Wilk test was conducted to verify the normality assumption for systolic blood pressure changes (N = 28). The distribution did not deviate significantly from normal, W(28) = 0.97, p = .542. A parametric paired-samples t-test was therefore conducted."

Shapiro, S.S. & Wilk, M.B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3–4), 591–611. doi:10.2307/2333709

Example 2 — Anderson-Darling on OLS Regression Residuals

Worked Example 2 — Anderson-Darling Test (Regression Residuals)

An economist fits an ordinary least squares (OLS) model predicting housing prices from square footage (N = 240). Before interpreting confidence intervals and significance tests, they check the residuals for normality.

Anderson-Darling Test Statistic

A² = −N − Σ [(2i−1)/N] × [ln F(xᵢ) + ln(1−F(xₙ₊₁₋ᵢ))]

F(·) = standard normal CDF, evaluated at the standardized values; xᵢ = ordered residuals; large A² rejects H₀
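The A² formula translates almost term by term into code. This sketch assumes NumPy and SciPy, standardizes with the sample mean and the ddof=1 standard deviation, and computes the unadjusted statistic; the function name anderson_darling_normal and the simulated residuals are illustrative. The comparison with scipy.stats.anderson assumes that function reports the same unadjusted A².

```python
import numpy as np
from scipy import stats

def anderson_darling_normal(values):
    """Unadjusted Anderson-Darling A² for normality with estimated mean and SD."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    z = (x - x.mean()) / x.std(ddof=1)       # standardized ordered values
    i = np.arange(1, n + 1)
    log_cdf = stats.norm.logcdf(z)           # ln F(x_i)
    log_sf = stats.norm.logsf(z)             # ln(1 - F(x_i))
    # log_sf[::-1] pairs index i with the (n + 1 - i)-th ordered value
    return -n - np.sum((2 * i - 1) / n * (log_cdf + log_sf[::-1]))

rng = np.random.default_rng(8)
simulated_residuals = rng.normal(size=240)
a2 = anderson_darling_normal(simulated_residuals)
print(f"manual A² = {a2:.4f}")
print(f"SciPy A²  = {stats.anderson(simulated_residuals, dist='norm').statistic:.4f}")
```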

What to test: The unstandardized model residuals (ε̂), not the raw house prices or square footage values

State hypotheses: H₀: Regression residuals are normally distributed | H₁: Regression residuals are not normally distributed

Q-Q plot: Points show an upward curve in the upper tail, suggesting positive skew in the residuals — consistent with the skewness value of 0.88

Formal test output: Anderson-Darling A² = 1.42, p = 0.0008

Decision: p = 0.0008 ≤ α = 0.05 → Reject H₀. Residuals deviate significantly from normality. Standard error estimates and confidence intervals from OLS may be unreliable.

Remediation: Consider a log transformation of the outcome variable (log-linear regression), or use bootstrap resampling to construct robust confidence intervals

⚠️ APA Reporting: "Inspection of the model residuals revealed a significant departure from normality (Anderson-Darling A² = 1.42, p < .001). A log transformation was applied to the outcome variable, after which residuals showed no significant non-normality (A² = 0.38, p = .41). All regression results reported are from the log-linear model."

Normality Tests vs. the Central Limit Theorem

An important practical consideration: the Central Limit Theorem (CLT) states that as sample sizes grow large, the sampling distribution of the mean approaches normality regardless of the underlying population's distribution shape. This is covered thoroughly in the Central Limit Theorem guide on Statistics Fundamentals.
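A quick simulation shows the effect. The sketch assumes NumPy and SciPy and draws from an exponential population (theoretical skewness 2); the sample size of 30 and the number of repetitions are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# 10,000 samples of size 30 from a strongly right-skewed population
population_draws = rng.exponential(scale=1.0, size=(10_000, 30))
sample_means = population_draws.mean(axis=1)

skew_raw = stats.skew(population_draws.ravel())
skew_means = stats.skew(sample_means)
print(f"skewness of raw values:   {skew_raw:.3f}")
print(f"skewness of sample means: {skew_means:.3f}")
```

The means are far closer to symmetric than the raw values, although some skew remains at N = 30, which is why the threshold is a rule of thumb rather than a guarantee.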

When Does Normality Testing Actually Matter?

Sample size N < 30 per group → Normality testing matters most. CLT does not yet apply reliably. Run full 7-step diagnostic.

Sample size 30 ≤ N < 100 → CLT provides some protection. Test residuals and check Q-Q plot. Mild non-normality is generally tolerable.

Sample size N ≥ 100 per group → Formal tests over-reject. Rely on Q-Q plots. Focus on detecting extreme outliers rather than fine-grained distribution shape.

Regression residuals (any N) → Always test residuals regardless of sample size. Normality in residuals affects confidence interval validity even for large N.

This has a practical consequence that trips up many analysts: when N is large (say, N = 500), a Shapiro-Wilk test will often reject H₀ on real-world data, even if the departure from normality is trivially small and harmless for the parametric test you want to run. In those situations, the Q-Q plot matters more than the p-value from the formal test. A Q-Q plot where all points fall reasonably close to the diagonal tells you the distributional shape is close enough, even if the Shapiro-Wilk p-value came back at 0.03.
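The sample-size effect can be seen by testing the same mildly non-normal source at two sizes. The sketch assumes NumPy and SciPy; the t distribution with 8 degrees of freedom is an arbitrary stand-in for a slightly heavy-tailed dataset, and the exact p-values depend on the seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Symmetric, slightly heavy-tailed data (illustrative choice of distribution)
mild = rng.standard_t(df=8, size=5000)

_, p_small = stats.shapiro(mild[:50])   # first 50 observations only
_, p_large = stats.shapiro(mild)        # all 5000 observations
print(f"N = 50:   p = {p_small:.4g}")
print(f"N = 5000: p = {p_large:.4g}")
```

The underlying distribution is identical in both calls; only the amount of data changes, which is what drives the large-sample p-value toward zero.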

Interactive Shapiro-Wilk Approximation Calculator

Enter your data values below (comma-separated or one per line) to compute a Shapiro-Wilk W statistic approximation and skewness/kurtosis metrics. For datasets with N > 50, the full Shapiro-Wilk algorithm should be run in R or Python using the software tutorials below.

Step-by-Step Software Tutorials

Normality Tests in SPSS

SPSS runs both Shapiro-Wilk and Kolmogorov-Smirnov in the same dialog, along with a Q-Q plot. The path below works in SPSS versions 25 through 29.

Navigate to the Explore Dialog

Go to Analyze → Descriptive Statistics → Explore. This is the only SPSS procedure that runs formal normality tests. The standard Frequencies and Descriptives procedures do not include them.

Move Your Variable and Select Plots

Move your continuous variable into the Dependent List. If you want to test normality within groups (e.g., for ANOVA), move your grouping variable into the Factor List. Click the Plots button on the right side of the dialog.

Enable Normality Tests and Q-Q Plots

Check "Normality plots with tests" — this activates both the formal tests and the accompanying Q-Q plot. Also check "Histogram" under Descriptive. Click Continue, then OK to run.

Interpret the Output Table

SPSS generates a "Tests of Normality" table with two sub-tables: Kolmogorov-Smirnov (with Lilliefors correction) and Shapiro-Wilk. Read the Sig. column for your p-value. For N < 50, rely on the Shapiro-Wilk row. For larger samples, inspect the Q-Q plot alongside the significance values.

Normality Tests in R

R — Normality Testing with shapiro.test() and qqnorm()

# Load your data vector (replace with your actual values)
data_vector <- c(12.4, 14.2, 11.9, 15.3, 13.8, 14.1, 12.8, 13.5, 16.0, 11.2)

# 1. Shapiro-Wilk test (best for N < 2000)
sw_result <- shapiro.test(data_vector)
print(sw_result)
# Output: W = 0.xxx, p-value = 0.xxx

# 2. Anderson-Darling test (nortest package required)
# install.packages("nortest")
library(nortest)
ad.test(data_vector)

# 3. Q-Q plot with reference line
qqnorm(data_vector, main = "Q-Q Plot: Check for Normality")
qqline(data_vector, col = "red", lwd = 2)

# 4. Skewness and kurtosis (e1071 package)
library(e1071)
cat("Skewness:", skewness(data_vector), "\n")
cat("Kurtosis (excess):", kurtosis(data_vector), "\n")

R Core Team (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. www.r-project.org

Normality Tests in Python

Python — Normality Testing with scipy.stats

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Sample data (replace with your own array)
np.random.seed(42)
sample_data = stats.norm.rvs(loc=50, scale=10, size=50)

# 1. Shapiro-Wilk Test
sw_stat, sw_p = stats.shapiro(sample_data)
print(f"Shapiro-Wilk:  W = {sw_stat:.4f},  p = {sw_p:.4f}")

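# 2. Anderson-Darling Test (SciPy reports critical values rather than a p-value)
ad_result = stats.anderson(sample_data, dist='norm')
print(f"Anderson-Darling: A² = {ad_result.statistic:.4f}")
# Reject H0 at the 5% level if A² exceeds the matching critical value
crit_5 = ad_result.critical_values[list(ad_result.significance_level).index(5.0)]
print(f"5% critical value = {crit_5:.4f}")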

# 3. D'Agostino-Pearson Test (requires N >= 20)
dp_stat, dp_p = stats.normaltest(sample_data)
print(f"D'Agostino-Pearson: K² = {dp_stat:.4f},  p = {dp_p:.4f}")

# 4. Q-Q Plot
fig, ax = plt.subplots(figsize=(6, 5))
stats.probplot(sample_data, dist="norm", plot=ax)
ax.set_title("Normal Q-Q Plot")
plt.tight_layout()
plt.show()

# 5. Skewness and excess kurtosis
print(f"Skewness:         {stats.skew(sample_data):.4f}")
print(f"Excess kurtosis:  {stats.kurtosis(sample_data):.4f}")

Normality Tests in Excel

Excel does not include a native normality test in the standard Analysis ToolPak. The workaround below constructs a manual Q-Q plot using built-in functions, which is the most reliable approach available without add-ins.

Step	Excel Formula	What It Does
1. Compute sample stats	=AVERAGE(A2:A51) and =STDEV.S(A2:A51)	Gets mean and standard deviation for standardization
2. Sort data ascending	Data → Sort (smallest to largest)	Q-Q plots require ordered data
3. Compute empirical percentiles	=(RANK.EQ(A2,$A$2:$A$51,1)-0.5)/COUNT($A$2:$A$51)	Assigns each point its expected cumulative probability
4. Compute theoretical quantiles	=NORM.S.INV(C2) where C2 holds the percentile	Converts percentiles to expected z-scores under normality
5. Create Q-Q scatter plot	Insert → Chart → Scatter (X: theoretical z, Y: actual values)	Diagonal line = normal; curved line = non-normal
6. Compute skewness	=SKEW(A2:A51)	Values outside ±1 suggest meaningful skew
7. Compute kurtosis	=KURT(A2:A51)	Excel returns excess kurtosis; values outside ±2 are notable

What to Do When Data Fails the Normality Test

A significant normality test result is not the end of the analysis — it is the start of a decision. Four main strategies exist, and the right choice depends on the nature of the non-normality, the sample size, and what analysis you need to run.

Data Transformations

Apply a mathematical transformation to make the distribution more symmetric. Common choices: log(Y) for right-skewed data, √Y for count data, 1/Y for extreme right skew. Box-Cox optimization selects the best power transformation automatically. Interpret results on the transformed scale or back-transform for reporting.
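A log and a Box-Cox transformation can be compared on the same data. The sketch assumes NumPy and SciPy; the data are simulated lognormal values (strictly positive, which Box-Cox requires), and variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
raw = rng.lognormal(mean=3.0, sigma=0.8, size=300)  # simulated right-skewed data

logged = np.log(raw)
# With no lambda supplied, boxcox returns the transformed data and the
# lambda that maximizes the log-likelihood
boxcox_values, fitted_lambda = stats.boxcox(raw)

print(f"skewness raw:     {stats.skew(raw):.3f}")
print(f"skewness log:     {stats.skew(logged):.3f}")
print(f"skewness Box-Cox: {stats.skew(boxcox_values):.3f} (lambda = {fitted_lambda:.3f})")
```

A fitted lambda near 0 indicates that Box-Cox is choosing something close to the log transformation.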

Nonparametric Tests

Replace the parametric test with a nonparametric equivalent that does not assume normality. Replace a two-sample t-test with the Mann-Whitney U test. Replace a paired t-test with the Wilcoxon signed-rank test. Replace ANOVA with Kruskal-Wallis.
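All three replacements are available in scipy.stats. The sketch below assumes NumPy and SciPy and uses simulated skewed groups; the group names and effect sizes are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
control = rng.exponential(scale=1.0, size=40)
treated = rng.exponential(scale=1.0, size=40) + 2.0   # shifted group
third = rng.exponential(scale=1.0, size=40) + 1.0

# Two independent groups: Mann-Whitney U instead of a two-sample t-test
u_stat, p_mw = stats.mannwhitneyu(control, treated, alternative="two-sided")

# Paired measurements: Wilcoxon signed-rank instead of a paired t-test
before = control
after = control + rng.normal(loc=0.5, scale=1.0, size=40)
w_stat, p_wilcoxon = stats.wilcoxon(before, after)

# Three or more groups: Kruskal-Wallis instead of one-way ANOVA
h_stat, p_kw = stats.kruskal(control, treated, third)

print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_mw:.4g}")
print(f"Wilcoxon W = {w_stat:.1f}, p = {p_wilcoxon:.4g}")
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_kw:.4g}")
```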

Robust Estimation

Use estimation methods that are less sensitive to distribution shape. Trimmed mean procedures remove the most extreme values before computing the mean and standard error. Winsorizing replaces extreme values with the nearest retained value (for example, the value at a chosen percentile) rather than removing them. Both approaches reduce the influence of outliers that drive non-normality.
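A ten-value example with one outlier shows both procedures. The sketch assumes NumPy and SciPy; the data and the 10% cut on each side are made up for illustration.

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

raw_mean = data.mean()                                  # 14.5, pulled up by 100
trimmed = stats.trim_mean(data, proportiontocut=0.1)    # drops 1 and 100 -> 5.5
winsorized = winsorize(data, limits=(0.1, 0.1))         # 1 -> 2 and 100 -> 9
winsorized_mean = float(winsorized.mean())              # 5.5

print(raw_mean, trimmed, winsorized_mean)
```

Trimming discards the extreme observations, while winsorizing keeps the sample size at ten and only pulls the extremes inward.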

Bootstrap Resampling

Use bootstrap resampling to construct empirical confidence intervals without assuming any specific distribution. The bootstrap repeatedly resamples from the observed data with replacement, building an empirical sampling distribution of your test statistic. This is the most flexible approach for regression and complex models.
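A percentile bootstrap for the mean takes only a few lines of NumPy. This is a minimal sketch of one bootstrap variant (the percentile interval); the function name bootstrap_mean_ci, the resample count, and the simulated data are assumptions for illustration.

```python
import numpy as np

def bootstrap_mean_ci(values, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float)
    # Each row holds the indices of one resample drawn with replacement
    idx = rng.integers(0, x.size, size=(n_resamples, x.size))
    boot_means = x[idx].mean(axis=1)
    alpha = 1.0 - confidence
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return lower, upper

rng = np.random.default_rng(21)
skewed = rng.exponential(scale=3.0, size=60)   # simulated non-normal sample
low, high = bootstrap_mean_ci(skewed)
print(f"sample mean = {skewed.mean():.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

The same pattern extends to other statistics by replacing the mean with the quantity of interest; for regression, whole rows (or residuals) are resampled instead of single values.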

Head-to-Head Test Comparisons

Shapiro-Wilk vs. Kolmogorov-Smirnov

These two tests are frequently listed together in SPSS output under "Tests of Normality," which leads many researchers to treat them as interchangeable. They are not. Shapiro-Wilk evaluates normality by comparing the sample's variance structure to what a normal distribution would produce — it considers all pairwise ordered statistics and weights them optimally. This gives it substantially higher statistical power, meaning it is more likely to detect real non-normality when it exists.

The Kolmogorov-Smirnov test (with Lilliefors correction, which SPSS applies automatically) instead measures the maximum absolute distance between the sample's empirical cumulative distribution function and a theoretical normal CDF. This is a simpler criterion and genuinely weaker for detecting subtle departures from normality, particularly in small samples. A published comparison by Razali and Wah (2011) found Shapiro-Wilk outperformed Kolmogorov-Smirnov across all sample sizes and distribution types tested. Use Kolmogorov-Smirnov only when sample sizes are very large (N > 2000) or when software constraints leave no alternative.

Anderson-Darling vs. Jarque-Bera

Anderson-Darling and Jarque-Bera approach the normality question from different angles. Anderson-Darling modifies the Kolmogorov-Smirnov approach by applying quadratic weights to the distribution tails — squared deviations at the extremes count more. This makes it particularly well-suited to analyses where tail behavior matters, such as financial risk modeling, where extreme events have outsized consequences.

Jarque-Bera takes a completely different approach: it tests whether the sample's skewness and excess kurtosis match the values expected under normality (both equal to zero). It does not examine the full distribution shape directly, only two summary moments. This makes it fast and analytically convenient, but it will miss non-normality patterns that do not manifest in skewness or kurtosis — for example, a bimodal distribution with symmetric, moderate-kurtosis shape can fool it entirely. Jarque-Bera is most appropriate for large econometric time series datasets where computational speed matters and skewness/kurtosis departures are the primary concern.
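Because Jarque-Bera depends only on skewness S and excess kurtosis K, it can be computed by hand as JB = (N/6) × [S² + K²/4] and referred to a χ² distribution with 2 degrees of freedom. The sketch assumes NumPy and SciPy; the function name jarque_bera_manual and the simulated data are illustrative.

```python
import numpy as np
from scipy import stats

def jarque_bera_manual(values):
    """Jarque-Bera statistic and asymptotic p-value from skewness and excess kurtosis."""
    x = np.asarray(values, dtype=float)
    n = x.size
    s = stats.skew(x)        # biased moment estimator of skewness
    k = stats.kurtosis(x)    # excess kurtosis (normal = 0)
    jb = n / 6.0 * (s**2 + k**2 / 4.0)
    p_value = stats.chi2.sf(jb, df=2)
    return jb, p_value

rng = np.random.default_rng(13)
returns = rng.standard_t(df=5, size=1000)   # simulated heavy-tailed series
jb, p = jarque_bera_manual(returns)
print(f"manual JB = {jb:.3f}, p = {p:.4g}")
print(f"SciPy JB  = {stats.jarque_bera(returns).statistic:.3f}")
```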

Visual Diagnostics vs. Formal Statistical Tests

This is the most practically important comparison for applied researchers. Formal tests — Shapiro-Wilk, Anderson-Darling, and the others — produce an objective p-value that simplifies reporting and removes subjectivity from the decision. However, they have well-documented size problems: underpowered for small samples (may miss real non-normality) and overpowered for large samples (will flag trivially small deviations as significant). This means the p-value alone can mislead in both directions.

Q-Q plots and histograms provide context about how and where the distribution deviates from normal, which helps you judge whether the deviation will actually distort your analysis. A researcher seeing a Q-Q plot where all points fall within ±0.3 of the reference line can reasonably proceed with a parametric test even if Shapiro-Wilk returned p = 0.04 — the formal test's significance does not translate directly to practical impact. The best practice is to report both and let the reader see the full picture.

Multivariate Normality Tests

When working with multi-dimensional datasets in MANOVA, structural equation modeling (SEM), or discriminant analysis, univariate normality testing is not sufficient. Even if every individual variable looks normally distributed on its own, their joint distribution can still violate multivariate normality assumptions.

Mardia's Test

Evaluates multivariate skewness and kurtosis separately. Requires large sample sizes (typically N > 200 per variable) to maintain reliable power. Available in R through the MVN package.

Henze-Zirkler Test

Measures distance between the empirical characteristic function and the theoretical multivariate normal. Stable performance across a range of sample sizes. Preferred for medium-sized multivariate datasets.

Royston's Test

Combines individual Shapiro-Wilk transforms into a single multivariate score. Works well for smaller multi-dimensional datasets (N < 200 per variable). Available in R through the MVN and mvnormtest packages.
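For readers working in Python rather than R, the kurtosis half of Mardia's test can be sketched with NumPy. This is a simplified illustration, not a substitute for a vetted package: it uses the maximum-likelihood covariance, the large-sample normal approximation with mean p(p+2) and variance 8p(p+2)/N, and a hypothetical function name mardia_kurtosis.

```python
import numpy as np
from scipy import stats

def mardia_kurtosis(data):
    """Mardia's multivariate kurtosis with its asymptotic z-score and p-value."""
    x = np.asarray(data, dtype=float)
    n, p = x.shape
    centered = x - x.mean(axis=0)
    cov = centered.T @ centered / n                 # maximum-likelihood covariance
    inv_cov = np.linalg.inv(cov)
    # Squared Mahalanobis distance of each row from the mean vector
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    b2 = np.mean(d2**2)
    z = (b2 - p * (p + 2)) / np.sqrt(8.0 * p * (p + 2) / n)
    p_value = 2 * stats.norm.sf(abs(z))
    return b2, z, p_value

rng = np.random.default_rng(17)
mvn = rng.multivariate_normal(mean=[0, 0, 0], cov=np.eye(3), size=500)
heavy = rng.standard_t(df=3, size=(500, 3))          # heavy-tailed columns

b2_n, z_n, p_n = mardia_kurtosis(mvn)
b2_h, z_h, p_h = mardia_kurtosis(heavy)
print(f"multivariate normal: b2 = {b2_n:.2f}, z = {z_n:.2f}")
print(f"heavy-tailed:        b2 = {b2_h:.2f}, z = {z_h:.2f}")
```

For three variables the expected kurtosis under multivariate normality is 3 × 5 = 15, so values of b2 far above that point to heavy joint tails.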

Common Misconceptions About Normality Testing

Misconception	Incorrect Interpretation	Correct Interpretation
Non-significant result proves normality	p > 0.05 means the data IS normal	p > 0.05 means there is not enough evidence to reject normality — the data is consistent with normal, which is different from proving it
Test raw Y in regression	Check if the outcome variable is normally distributed	Check if the model residuals are normally distributed — the assumption applies to errors, not outcomes
Formal test is always definitive	If Shapiro-Wilk says significant, stop the parametric analysis	Always pair the formal test with a Q-Q plot — a significant p-value with a nearly straight Q-Q may still support parametric use
KS test is as good as Shapiro-Wilk	Both tests in the SPSS output are equally reliable	Shapiro-Wilk has substantially higher power for most sample sizes; the KS test should be a secondary reference at most
Large N makes normality irrelevant	With 500+ observations, I do not need to check normality	Large N makes formal tests hypersensitive, but regression residuals still need checking because normality affects confidence interval validity at all sample sizes

Normality Testing: Frequently Asked Questions

FAQ

Does a non-significant normality test (p > 0.05) prove my data is perfectly normal?

No. A non-significant p-value means there is insufficient evidence to reject the null hypothesis of normality — it does not confirm that the population is exactly normal. The distinction matters particularly for small samples where the test has low power and may miss genuine non-normality. Always supplement the p-value with a visual check.

FAQ

Why does the Kolmogorov-Smirnov test flag almost everything as non-normal in large datasets?

As sample sizes increase, all formal statistical tests gain power. With N = 1000, even a deviation from normality too small to matter practically will produce a p-value below 0.05. This is a feature of hypothesis testing logic, not a flaw in the data. For large samples, the Q-Q plot tells you more about whether the deviation matters than the formal test p-value does.

FAQ

Can I run a normality test on Likert-scale or binary data?

Normality tests apply to continuous numeric data only. Binary data (0/1) or ordinal Likert responses (1–5) violate the continuous distribution assumption underlying Shapiro-Wilk and all similar tests. For these data types, use appropriate nonparametric or categorical methods rather than checking normality first.

FAQ

How many data points do I need for Shapiro-Wilk to be reliable?

The Shapiro-Wilk test is typically recommended for sample sizes from N = 3 to N = 2000 (R's shapiro.test() accepts up to 5000 values), though its power is genuinely limited below N = 10: the test will rarely reject H₀ even for clearly non-normal data with very small samples. For N between 10 and 30, use Shapiro-Wilk but weight the Q-Q plot heavily in your decision. For N > 2000, switch to Anderson-Darling.

FAQ

Is normality required for simple linear regression?

The normality assumption in simple linear regression applies to the residuals (error terms), not to the predictor X or the outcome Y individually. OLS coefficient estimates are unbiased whether or not residuals are normal; normality becomes important for the validity of t-tests on coefficients, confidence intervals, and prediction intervals. In large samples, the CLT makes the coefficient tests and confidence intervals approximately valid even without perfect residual normality, but prediction intervals for individual observations still depend on the shape of the error distribution.

Formula Glossary

Term	Definition	Formula / Key Value	Interpretation Guide
Shapiro-Wilk (W)	Goodness-of-fit test using regression of ordered statistics against expected normal order statistics	W = (Σ aᵢ x(ᵢ))² / Σ(xᵢ − x̄)²	W ranges from 0 to 1. Values close to 1 support normality. Small p-values indicate significant departure.
Anderson-Darling (A²)	Weighted modification of KS test emphasizing distribution tails	A² = −N − Σ [(2i−1)/N] × [ln F(xᵢ) + ln(1−F(xₙ₊₁₋ᵢ))]	Larger A² values indicate greater deviation from normality. Compare to critical values for significance.
Skewness (S)	Measure of distribution asymmetry around the mean	S = [Σ(xᵢ−x̄)³/n] / [Σ(xᵢ−x̄)²/n]^(3/2)	S = 0: perfectly symmetric. S > 0: right tail. S < 0: left tail. |S| > 1: meaningful skew.
Excess Kurtosis (K)	Measure of tail weight relative to a normal distribution	K = [Σ(xᵢ−x̄)⁴/n] / [Σ(xᵢ−x̄)²/n]² − 3	K = 0: normal tails. K > 0: heavy tails (leptokurtic). K < 0: light tails (platykurtic).
Jarque-Bera (JB)	Asymptotic test combining skewness and kurtosis departures	JB = (N/6) × [S² + K²/4], with K the excess kurtosis	JB ~ χ²(2) under H₀. Large JB values reject normality. Best for N > 300.
Q-Q Plot	Graphical comparison of empirical vs. theoretical quantiles	x: theoretical normal quantiles, y: observed data quantiles	Points on diagonal: normal. S-curve: kurtosis issue. Curved ends: skew. Scattered points: non-normal.
