What Is Pearson Correlation?
The full name is the Pearson product-moment correlation coefficient, introduced by Karl Pearson in 1896 based on earlier work by Francis Galton on regression toward the mean. The "product-moment" refers to the fact that r is computed from the products of mean-centered values — what statisticians call cross-products or moments about the mean.
What r measures specifically is linear association. Two variables can be strongly related in a curved or U-shaped way and still produce r ≈ 0 if the relationship is not well-described by a straight line. Always plot your data in a scatter plot before interpreting r — the number alone does not tell the full story.
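The point about curved relationships can be checked with a few lines of Python. This is a minimal sketch on hypothetical data (the `pearson_r` helper and the parabola are illustrative assumptions, not part of the original article): Y is fully determined by X, yet the linear correlation is zero.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson r from the deviation-score definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y depends on x exactly, but through a parabola.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r_u_shape = pearson_r(x, y)
print(f"r for y = x^2 on symmetric x: {r_u_shape:.3f}")
```

The positive and negative cross-products cancel, so r alone would suggest "no relationship" even though a scatter plot shows a perfect U.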
A high r only confirms a linear pattern in your sample data. It does not mean one variable causes the other to change. A confounding third variable, coincidence, or reverse causation can all produce the same coefficient. Establishing causation requires controlled experiments or causal inference methods.
Pearson Correlation Formula
The formula comes directly from the definition of covariance. Covariance measures how two variables vary together, but its magnitude depends on the measurement units of each variable. Dividing by both standard deviations removes the units and constrains the result to [−1, +1].
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]
where:
r = Pearson correlation coefficient
xᵢ = each observed X value
yᵢ = each observed Y value
x̄ = mean of X
ȳ = mean of Y
n = number of pairs
The numerator, Σ(xᵢ − x̄)(yᵢ − ȳ), is the sum of cross-products of deviations from the mean — this is n−1 times the sample covariance. The denominator scales it by the product of both standard deviations, making r unitless and bounded.
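To see why dividing by the standard deviations matters, the sketch below computes covariance and r for the same heights expressed in centimetres and in inches. The data values and function names are hypothetical, chosen only for illustration.

```python
import math

def covariance(xs, ys):
    """Sample covariance: sum of cross-products divided by n - 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

# Hypothetical heights (cm) and weights (kg), for illustration only.
height_cm = [152.0, 160.0, 168.0, 175.0, 183.0]
weight_kg = [51.0, 58.0, 63.0, 72.0, 80.0]
height_in = [h / 2.54 for h in height_cm]  # same heights, different unit

cov_cm = covariance(height_cm, weight_kg)
cov_in = covariance(height_in, weight_kg)
r_cm = pearson_r(height_cm, weight_kg)
r_in = pearson_r(height_in, weight_kg)
print(f"covariance: {cov_cm:.2f} (cm) vs {cov_in:.2f} (in)")
print(f"r:          {r_cm:.4f} (cm) vs {r_in:.4f} (in)")
```

The covariance shrinks by a factor of 2.54 when the unit changes, while r stays the same, which is the sense in which r is unitless.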
Equivalent Computational Form
For hand calculation with a data table, the equivalent raw-score formula avoids computing means first:
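r = [nΣxy − ΣxΣy] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}
where:
n = number of data pairs
Σx, Σy = sums of the x values and of the y values
Σxy = sum of each xᵢyᵢ product
Σx² = sum of squared x values
Σy² = sum of squared y values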
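The raw-score form described above can be sketched as a function that needs only n and five running sums. The function name and the sample data are hypothetical.

```python
import math

def pearson_r_raw(xs, ys):
    """Raw-score form: needs only n and five running sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical paired data, for illustration only.
x = [2, 4, 5, 7, 9]
y = [10, 14, 13, 20, 24]
r_raw = pearson_r_raw(x, y)
print(f"r (raw-score form) = {r_raw:.4f}")
```

This form is convenient on paper, but in floating-point arithmetic it subtracts two large, nearly equal numbers, so the deviation-score form is usually the safer choice in software.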
Population vs Sample Notation
The formula above gives the sample correlation r, computed from n observed pairs; it estimates the true population correlation, denoted ρ (the Greek letter rho). For the population, replace the sums with expected values: ρ = Cov(X,Y) / (σₓ · σᵧ). In practice you almost always work with sample data, so r is what you calculate.
Coefficient of Determination (R²)
Squaring the Pearson r gives R², the proportion of variance in one variable that is statistically accounted for by linear association with the other. If r = 0.80, then R² = 0.64, meaning 64% of the variance in Y is explained by the linear relationship with X. R² appears again in simple linear regression, where it measures overall model fit.
How to Interpret Pearson r
The sign tells you direction; the absolute value tells you strength. The bands below are rules of thumb common in behavioral and social science research, not fixed standards. Jacob Cohen's (1988) widely cited benchmarks are somewhat lower: |r| of about 0.10 is small, 0.30 medium, and 0.50 large. In physics or engineering, tolerances are often much tighter.
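[Figure: correlation strength scale running from −1.0 (perfect negative) through −0.6 (strong negative), −0.3 (weak negative), 0 (none), +0.3 (weak positive), and +0.6 (strong positive) to +1.0 (perfect positive)]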
| r Value | Interpretation | R² (Variance Explained) | Example Context |
|---|---|---|---|
| −1.0 | Perfect negative linear relationship | 100% | Theoretical only |
| −0.80 to −1.0 | Very strong negative | 64–100% | Price vs demand (economics) |
| −0.60 to −0.79 | Strong negative | 36–62% | Exercise frequency vs resting heart rate |
| −0.40 to −0.59 | Moderate negative | 16–35% | Stress vs sleep quality |
| −0.20 to −0.39 | Weak negative | 4–15% | Commute time vs job satisfaction |
| −0.19 to +0.19 | Very weak or no linear relationship | 0–4% | Shoe size vs IQ |
| +0.20 to +0.39 | Weak positive | 4–15% | Height vs shoe size |
| +0.40 to +0.59 | Moderate positive | 16–35% | Study hours vs exam score |
| +0.60 to +0.79 | Strong positive | 36–62% | SAT math vs SAT verbal |
| +0.80 to +1.0 | Very strong positive | 64–100% | Height (cm) vs height (inches) |
| +1.0 | Perfect positive linear relationship | 100% | Same variable measured twice |
With a large sample (n = 1,000), r = 0.08 can be statistically significant (p < 0.05) even though it explains less than 1% of the variance. Always report both r and R², and consider whether the effect size is meaningful for your specific context — not just whether p is below 0.05.
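The gap between statistical and practical significance can be checked numerically. The sketch below assumes SciPy is installed and uses `scipy.stats.t.sf` for the tail probability; the variable names are illustrative.

```python
import math
from scipy import stats

r, n = 0.08, 1000
df = n - 2
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)  # two-tailed p-value
r_squared = r ** 2

print(f"t({df}) = {t_stat:.2f}, p = {p_two_tailed:.4f}, R^2 = {r_squared:.4f}")
```

The p-value falls below 0.05 while R² stays under 0.01, so the result is "significant" yet explains almost none of the variance.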
How to Calculate Pearson Correlation (7 Steps)
The following worked calculation uses a small illustrative dataset: hours studied per week and final exam percentage for 6 students. Each step maps directly to one part of the formula.
Dataset: n = 6 students. X = weekly study hours, Y = exam percentage.
| Student | X (hours) | Y (score %) |
|---|---|---|
| A | 4 | 65 |
| B | 6 | 70 |
| C | 8 | 75 |
| D | 10 | 82 |
| E | 12 | 88 |
| F | 14 | 93 |
Calculate the means:
x̄ = (4+6+8+10+12+14)/6 = 54/6 = 9.0
ȳ = (65+70+75+82+88+93)/6 = 473/6 = 78.83
Compute deviations from the mean (xᵢ − x̄) and (yᵢ − ȳ):
A: (4−9)=−5, (65−78.83)=−13.83 | B: (6−9)=−3, (70−78.83)=−8.83
C: (8−9)=−1, (75−78.83)=−3.83 | D: (10−9)=+1, (82−78.83)=+3.17
E: (12−9)=+3, (88−78.83)=+9.17 | F: (14−9)=+5, (93−78.83)=+14.17
Compute cross-products (xᵢ − x̄)(yᵢ − ȳ):
A: (−5)(−13.83)=69.15 | B: (−3)(−8.83)=26.49 | C: (−1)(−3.83)=3.83
D: (1)(3.17)=3.17 | E: (3)(9.17)=27.51 | F: (5)(14.17)=70.85
Σ(xᵢ − x̄)(yᵢ − ȳ) = 201.00
Compute Σ(xᵢ − x̄)²:
(−5)²+(−3)²+(−1)²+(1)²+(3)²+(5)² = 25+9+1+1+9+25 = 70
Compute Σ(yᵢ − ȳ)²:
(−13.83)²+(−8.83)²+(−3.83)²+(3.17)²+(9.17)²+(14.17)² ≈ 191.27+77.97+14.67+10.05+84.09+200.79 = 578.84
Apply the formula:
r = 201.00 / √(70 × 578.84) = 201.00 / √40,518.8 = 201.00 / 201.29 ≈ 0.999
Interpret: r = 0.999 indicates a near-perfect positive linear relationship. Squaring the unrounded r (0.9985) gives R² ≈ 0.997, so roughly 99.7% of the variance in exam scores is explained by the linear relationship with weekly study hours in this sample.
✅ r = 0.999. Students who study more hours score higher on exams in an almost perfectly linear pattern. Causation would require a controlled experiment — other variables (ability, prior knowledge) are not held constant here.
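The hand calculation above can be mirrored step by step in Python. This is a sketch using only the standard library; because it keeps full precision instead of rounding deviations to two decimals, the last digit of some intermediate values may differ slightly from the hand-worked figures.

```python
import math

hours = [4, 6, 8, 10, 12, 14]
scores = [65, 70, 75, 82, 88, 93]
n = len(hours)

# Step 1: means
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Steps 2-3: deviations and their cross-products
dev_x = [x - mean_x for x in hours]
dev_y = [y - mean_y for y in scores]
sum_cross = sum(dx * dy for dx, dy in zip(dev_x, dev_y))

# Steps 4-5: sums of squared deviations
sum_sq_x = sum(dx ** 2 for dx in dev_x)
sum_sq_y = sum(dy ** 2 for dy in dev_y)

# Step 6: apply the formula
r = sum_cross / math.sqrt(sum_sq_x * sum_sq_y)

print(f"means: {mean_x:.2f}, {mean_y:.2f}")
print(f"sum of cross-products: {sum_cross:.2f}")
print(f"sums of squares: {sum_sq_x:.2f}, {sum_sq_y:.2f}")
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
```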
Pearson Correlation Assumptions
Pearson r is only a valid and interpretable measure when the following five conditions hold. Violating them does not always make r numerically impossible to compute — it just means the number does not mean what you think it means.
Continuous Variables
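Both X and Y should be measured on a continuous interval or ratio scale. For ordinal data (ranked categories), use Spearman's rank correlation instead. When one variable is binary and the other continuous, the result is usually reported as the point-biserial correlation, which is Pearson r computed with the binary variable coded 0/1.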
Linear Relationship
The underlying relationship between X and Y must be approximately linear. A scatter plot will reveal curves, U-shapes, or other non-linear patterns that Pearson r will underestimate. Spearman captures monotonic (but not necessarily linear) relationships.
Independence of Observations
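Each pair (xᵢ, yᵢ) must come from a different, independent unit. Repeated measures from the same individual, time-series data with autocorrelation, or clustered samples all violate this assumption. They can inflate r and typically make the significance test overconfident, because the effective number of independent observations is smaller than n.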
No Extreme Outliers
A single outlier can shift r by 0.3 or more in a small sample. Check scatter plots for leverage points before reporting r. Robust alternatives include Spearman's ρ or the winsorized correlation for datasets with outliers.
Approximate Bivariate Normality
For significance testing (the t-test below), the pair (X, Y) should follow an approximate bivariate normal distribution. With large samples (n > 30) this matters less due to the central limit theorem — see the central limit theorem guide. For small n, check histograms and Q-Q plots.
Before reporting r: (1) confirm both variables are continuous, (2) inspect a scatter plot for linearity and outliers, (3) verify observations are independent. These three steps catch the most common errors. Full normality testing matters mainly when n < 30.
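The outlier assumption above is easy to demonstrate. In this sketch (hypothetical data, illustrative helper name), eight points with no linear trend get one extreme point added, and r is computed before and after.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson r from the deviation-score definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with no linear trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 3, 6, 2, 7, 4, 6, 3]

r_before = pearson_r(x, y)
r_after = pearson_r(x + [20], y + [20])  # add a single extreme point

print(f"r without the outlier: {r_before:.2f}")
print(f"r with the outlier:    {r_after:.2f}")
```

One point far from the rest is enough to turn an essentially flat cloud into an apparently strong correlation, which is why the scatter plot check comes first.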
Hypothesis Testing for Pearson Correlation
Computing r tells you the sample correlation. To decide whether the result reflects a true population correlation — or could plausibly be produced by chance from data where ρ = 0 — you run a significance test using a t-statistic. This connects directly to hypothesis testing principles.
Setting Up the Hypotheses
- H₀: ρ = 0 — The population correlation is zero; any sample r is due to chance
- H₁: ρ ≠ 0 — Two-tailed test: the population has a nonzero correlation (either direction)
- H₁: ρ > 0 — One-tailed test: the population correlation is positive
- H₁: ρ < 0 — One-tailed test: the population correlation is negative
The t-Statistic
t = r · √(n − 2) / √(1 − r²)
where:
r = sample Pearson coefficient
n = number of pairs
df = n − 2 degrees of freedom
This t-statistic follows a t-distribution with df = n − 2 degrees of freedom under H₀. Compare it to the critical value from a t-distribution table for your α and number of tails, or read the p-value directly from software. The n − 2 reflects the two parameters estimated from the data: the intercept and slope of the regression line, which is where correlation and regression connect.
Given: r = 0.72, n = 18 pairs, two-tailed test, α = 0.05
Hypotheses: H₀: ρ = 0 | H₁: ρ ≠ 0 (two-tailed)
Degrees of freedom: df = 18 − 2 = 16. From the Pearson correlation table, the critical r at df=16, α=0.05 two-tailed is 0.468.
Calculate t: t = 0.72 × √(16) / √(1 − 0.72²) = 0.72 × 4 / √(1 − 0.518) = 2.88 / √0.482 = 2.88 / 0.694 = 4.15
Critical value: For df=16, two-tailed, α=0.05: t* = ±2.120. Our |t| = 4.15 > 2.120.
p-value: p ≈ 0.0008 (well below 0.05)
✅ Reject H₀. With r = 0.72, t(16) = 4.15, p ≈ 0.001, the correlation is statistically significant at α = 0.05. There is evidence of a positive linear relationship in the population.
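The worked test above can be reproduced in a few lines. This sketch assumes SciPy is available and uses `scipy.stats.t.ppf` and `scipy.stats.t.sf`; converting the critical t back into a critical r is shown only to connect the two tables.

```python
import math
from scipy import stats

r, n, alpha = 0.72, 18, 0.05
df = n - 2

t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
t_crit = stats.t.ppf(1 - alpha / 2, df)        # two-tailed critical t
r_crit = t_crit / math.sqrt(t_crit ** 2 + df)  # same cutoff expressed as r
p_value = 2 * stats.t.sf(abs(t_stat), df)

print(f"t({df}) = {t_stat:.2f}, critical t = {t_crit:.3f}, critical r = {r_crit:.3f}")
print(f"p = {p_value:.4f}, reject H0: {p_value < alpha}")
```

Comparing |t| with the critical t, |r| with the critical r, or p with α always leads to the same decision; they are three views of one test.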
APA-Style Reporting
In research papers, report the sample size, r, degrees of freedom, and p-value together: r(16) = .72, p = .001. Some journals also require the 95% confidence interval for r, computed using Fisher's z-transformation.
Pearson Correlation Examples
Example 1 — Health Research: Blood Pressure and Age
A researcher records age (years) and systolic blood pressure (mmHg) for 8 adults to see whether age predicts blood pressure.
| Person | Age (X) | SBP mmHg (Y) |
|---|---|---|
| 1 | 25 | 115 |
| 2 | 31 | 122 |
| 3 | 38 | 127 |
| 4 | 45 | 134 |
| 5 | 52 | 140 |
| 6 | 58 | 148 |
| 7 | 65 | 155 |
| 8 | 72 | 162 |
x̄ = 48.25, ȳ = 137.875
Σ(xᵢ−x̄)(yᵢ−ȳ) = 1,896.25 | Σ(xᵢ−x̄)² = 1,907.5 | Σ(yᵢ−ȳ)² = 1,890.875
r = 1,896.25 / √(1,907.5 × 1,890.875) = 1,896.25 / √3,606,844.1 ≈ 1,896.25 / 1,899.17 ≈ 0.998
✅ r ≈ 0.998. Systolic blood pressure rises in near-perfect linear proportion with age in this sample. R² ≈ 0.997. Note: this sample is small and this relationship would require a larger study before drawing medical conclusions.
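The sums for this example can be recomputed directly from the table. The sketch below uses only the standard library and the data exactly as listed; the variable names are illustrative.

```python
import math

age = [25, 31, 38, 45, 52, 58, 65, 72]
sbp = [115, 122, 127, 134, 140, 148, 155, 162]

n = len(age)
mean_age, mean_sbp = sum(age) / n, sum(sbp) / n
sum_cross = sum((a - mean_age) * (s - mean_sbp) for a, s in zip(age, sbp))
sum_sq_age = sum((a - mean_age) ** 2 for a in age)
sum_sq_sbp = sum((s - mean_sbp) ** 2 for s in sbp)
r_bp = sum_cross / math.sqrt(sum_sq_age * sum_sq_sbp)

print(f"means: {mean_age}, {mean_sbp}")
print(f"cross-products: {sum_cross}, squares: {sum_sq_age}, {sum_sq_sbp}")
print(f"r = {r_bp:.4f}, r squared = {r_bp ** 2:.4f}")
```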
Example 2 — Marketing: Advertising Spend vs Revenue
Monthly ad spend ($000s) and revenue ($000s) across 6 months. Does advertising drive revenue in this dataset?
| Month | Ad Spend X ($k) | Revenue Y ($k) |
|---|---|---|
| Jan | 10 | 82 |
| Feb | 14 | 97 |
| Mar | 18 | 103 |
| Apr | 20 | 114 |
| May | 25 | 125 |
| Jun | 30 | 141 |
x̄ = 19.5, ȳ = 110.33
Σ(xᵢ−x̄)(yᵢ−ȳ) = 758 | Σ(xᵢ−x̄)² = 263.5 | Σ(yᵢ−ȳ)² ≈ 2,203.3
r = 758 / √(263.5 × 2,203.3) ≈ 758 / √580,570 ≈ 758 / 761.95 ≈ 0.995
✅ r ≈ 0.995 (very strong positive). R² ≈ 0.990. Nearly all variance in monthly revenue is explained by the linear relationship with advertising spend. To model this formally, use simple linear regression.
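As a cross-check on this example, the sketch below computes the three sums by hand with NumPy arrays and compares the result against `numpy.corrcoef`. It assumes NumPy is installed; the variable names are illustrative.

```python
import numpy as np

ad_spend = np.array([10, 14, 18, 20, 25, 30], dtype=float)    # $k
revenue = np.array([82, 97, 103, 114, 125, 141], dtype=float)  # $k

dx = ad_spend - ad_spend.mean()
dy = revenue - revenue.mean()
sum_cross, sum_sq_x, sum_sq_y = (dx * dy).sum(), (dx ** 2).sum(), (dy ** 2).sum()
r_manual = sum_cross / np.sqrt(sum_sq_x * sum_sq_y)
r_numpy = np.corrcoef(ad_spend, revenue)[0, 1]  # library cross-check

print(f"sums: {sum_cross:.1f}, {sum_sq_x:.1f}, {sum_sq_y:.1f}")
print(f"r = {r_manual:.4f} (manual), {r_numpy:.4f} (np.corrcoef)")
```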
Pearson vs Spearman vs Kendall
Three correlation coefficients are used routinely in statistics. Choosing the wrong one gives a number that does not answer your actual question. The decision depends on your data type, whether you expect a linear or just monotonic relationship, and how sensitive you need to be to outliers.
| Feature | Pearson r | Spearman ρ | Kendall τ |
|---|---|---|---|
| Relationship type measured | Linear only | Monotonic (any direction) | Monotonic (concordance-based) |
| Data type required | Continuous (interval/ratio) | Ordinal or continuous | Ordinal or continuous |
| Distributional assumption | Approximate bivariate normality | None (non-parametric) | None (non-parametric) |
| Sensitivity to outliers | High | Low (ranks reduce influence) | Low |
| Effect of tied ranks | N/A | Requires tie correction | Handles ties naturally |
| Preferred sample size | n ≥ 10, larger better | n ≥ 10 | Better with small n |
| Common use cases | Physical measurements, finance, psychological scales | Survey Likert data, non-normal variables | Small samples, heavy ties |
A simple rule: use Pearson when your data is continuous and you have checked the scatter plot for linearity and absence of extreme outliers. Switch to Spearman when the relationship might be monotonic-but-curved, when data is ordinal, or when outliers are present. Kendall tau is preferred for very small samples or datasets with many tied values.
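The difference between linear and monotonic association shows up clearly on a curved but strictly increasing relationship. The sketch below assumes SciPy and NumPy are installed and uses `scipy.stats.pearsonr`, `spearmanr`, and `kendalltau` on hypothetical exponential data.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y grows exponentially with x (monotonic, not linear).
x = np.arange(1, 11, dtype=float)
y = np.exp(x / 2)

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
kendall_tau, _ = stats.kendalltau(x, y)

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
print(f"Kendall tau  = {kendall_tau:.3f}")
```

Because the ranks of x and y agree perfectly, the two rank-based coefficients reach 1, while Pearson r is pulled below 1 by the curvature.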
Pearson r and simple linear regression answer related but different questions. r measures the strength of the linear association (symmetric — it does not matter which variable is X or Y). Regression estimates the predicted change in Y for each one-unit change in X (directional). When you need to predict or control for multiple variables, move from correlation to multiple linear regression.
Real-World Applications
Pearson correlation appears in virtually every quantitative field. Below are eight domains where it is routinely used as a first-pass analytical tool before more complex modelling.
Medical Research
Correlating biomarkers — cholesterol levels vs cardiovascular risk, age vs bone density — to identify variables worth investigating in clinical trials.
Finance
Measuring portfolio diversification: a low r between two assets means holding both reduces total risk. Pairs trading identifies stocks with high historical r.
Education Research
Relating study time, attendance, or prior grades to exam outcomes. Helps curriculum designers identify which inputs predict achievement.
Marketing Analytics
Connecting advertising spend to conversion rates, or customer satisfaction scores to retention. Guides budget allocation decisions.
Machine Learning
Feature selection: high r between a feature and the target suggests predictive value. High r between two features (multicollinearity) can harm regression models.
Psychology
Relating test scores across cognitive domains, validating psychometric instruments, and studying personality trait associations in survey research.
Economics
GDP growth vs unemployment, inflation vs interest rates, trade volume vs currency strength — correlations that inform macroeconomic forecasting.
Environmental Science
Linking temperature changes to species distribution shifts, or precipitation levels to crop yields across geographic regions.
Pearson Correlation Matrix
When you have more than two variables, computing r for every pair produces a correlation matrix. Each cell shows the Pearson r between that row variable and that column variable. The diagonal is always 1.0 (a variable is perfectly correlated with itself). The matrix is symmetric: r(X,Y) = r(Y,X).
| Variable | Age | Income | Education (yrs) | Health Score |
|---|---|---|---|---|
| Age | 1.00 | 0.24 | −0.08 | −0.41 |
| Income | 0.24 | 1.00 | 0.61 | 0.38 |
| Education (yrs) | −0.08 | 0.61 | 1.00 | 0.29 |
| Health Score | −0.41 | 0.38 | 0.29 | 1.00 |
Reading this example matrix: Income and Education have the strongest correlation (r = 0.61). Age has a moderate negative relationship with Health Score (r = −0.41), meaning older individuals in this sample tend to have lower health scores. Before building a regression model with multiple predictors, scan the matrix for high pairwise correlations (|r| > 0.70) between predictor variables, which would signal multicollinearity — an issue addressed in the multiple linear regression guide.
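A correlation matrix like the one above is typically produced in one call. The sketch below uses `numpy.corrcoef` on synthetic data generated with a fixed seed; the variables and coefficients are hypothetical and are not the data behind the table.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Hypothetical synthetic variables (NOT the survey data in the table above).
n = 200
education = rng.normal(14, 2, n)
income = 3.0 * education + rng.normal(0, 8, n)
age = rng.normal(45, 12, n)

data = np.column_stack([age, income, education])
corr = np.corrcoef(data, rowvar=False)  # columns are variables

names = ["age", "income", "education"]
for name, row in zip(names, corr):
    print(f"{name:>10}", " ".join(f"{v:6.2f}" for v in row))

# Flag predictor pairs above the |r| > 0.70 screening threshold.
high = [(names[i], names[j]) for i in range(3) for j in range(i + 1, 3)
        if abs(corr[i, j]) > 0.70]
print("pairs with |r| > 0.70:", high)
```

Passing `rowvar=False` tells NumPy that each column is a variable; the default treats each row as one.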
Common Mistakes and Misconceptions
| Mistake | What People Think | What Is Actually True |
|---|---|---|
| Confusing r with R² | r = 0.70 means 70% of variance explained | R² = 0.70² = 0.49, so only 49% is explained |
| Inferring causation | High r means X causes Y | Correlation only confirms a linear pattern; causation needs experimental design |
| Ignoring the scatter plot | r tells the whole story | Anscombe's quartet shows four datasets with identical r but completely different patterns |
| Non-linear data | r = 0 means no relationship | A perfect U-shaped relationship produces r ≈ 0; Pearson misses non-linear patterns |
| Using r with ordinal data | Likert scale data is "basically continuous" | Ordinal variables require Spearman; Pearson assumes equal intervals between values |
| Truncated range | r reflects the true population relationship | Sampling only a restricted range of X can dramatically reduce r (range restriction bias) |
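The truncated-range row in the table above can be illustrated with a small simulation. This is a sketch on hypothetical data with a fixed random seed: the same linear relationship is measured once over the full range of X and once over a narrow middle slice.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson r from the deviation-score definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical data: y = x plus noise, with x spanning 0..99.
x_full = list(range(100))
y_full = [x + random.gauss(0, 15) for x in x_full]

# Keep only the middle slice of x, as if sampling a narrow range.
pairs = [(x, y) for x, y in zip(x_full, y_full) if 40 <= x < 60]
x_cut = [p[0] for p in pairs]
y_cut = [p[1] for p in pairs]

r_full = pearson_r(x_full, y_full)
r_restricted = pearson_r(x_cut, y_cut)
print(f"full range: r = {r_full:.2f}; restricted range: r = {r_restricted:.2f}")
```

The underlying slope and noise are identical in both cases; only the spread of X changes, and r drops with it.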
Frequently Asked Questions
Can Pearson r be negative?
Yes. A negative r means the two variables move in opposite directions: as X increases, Y tends to decrease. For example, r between hours of sleep deprivation and cognitive performance is negative — more deprivation, lower performance. The strength interpretation uses the absolute value; r = −0.75 and r = +0.75 describe equally strong relationships, just in opposite directions.
What sample size does Pearson correlation need?
There is no single rule, but r becomes unstable in very small samples (n < 10). For n = 5, a sample r of 0.80 is not statistically significant at α = 0.05 (two-tailed). A common practical minimum is n ≥ 30, where the central limit theorem makes p-values more reliable. To detect r = 0.30 with 80% power at α = 0.05 (two-tailed), you need roughly n = 84 pairs. For exact planning, use a dedicated power calculator; the sample size calculator can help.
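A rough version of the power calculation can be sketched with Fisher's z approximation, using only the standard library. The function name is hypothetical, and this is an approximation rather than the exact method a power calculator would use, so its answer can differ from the quoted figure by a pair or two.

```python
import math
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate pairs needed to detect r (two-tailed), via Fisher's z."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)

n_needed = n_for_correlation(0.30)
print(f"approximate n to detect r = 0.30: {n_needed}")
```

Smaller target correlations drive the required n up quickly, since n grows roughly with 1/r².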
What is Anscombe's Quartet?
Francis Anscombe constructed four datasets in 1973, each with nearly identical means, variances, and Pearson r ≈ 0.816, but completely different scatter plots — one linear, one curved, one with a single outlier driving the correlation, one with a vertical cluster. The quartet is a classic demonstration that r must always be paired with a scatter plot. It is why the first step in any correlation analysis is visualizing the data.
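The quartet is small enough to check by hand. The sketch below uses the data values as they are commonly reproduced from Anscombe's 1973 paper (transcribed here, so treat them as an assumption) and computes r for each dataset with a plain-Python helper.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson r from the deviation-score definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

quartet_r = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in quartet_r.items():
    print(f"dataset {name}: r = {r:.3f}")
```

All four coefficients agree to about three decimals, which is exactly why the number cannot substitute for the plot.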
How does Pearson r relate to linear regression slope?
In simple linear regression, the standardized slope (beta coefficient) equals the Pearson r when both variables are z-scored. More concretely, the slope b in Y = a + bX relates to r via: b = r · (sᵧ / sₓ), where sᵧ and sₓ are the standard deviations of Y and X. So r and the regression slope carry the same directional information, but r is dimensionless while b has the units of Y per unit of X.
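The identity b = r · (sᵧ / sₓ) can be verified numerically. The sketch below reuses the study-hours data from the worked example and compares the ordinary least-squares slope with the slope rebuilt from r; variable names are illustrative.

```python
import math

hours = [4, 6, 8, 10, 12, 14]
scores = [65, 70, 75, 82, 88, 93]
n = len(hours)
mx, my = sum(hours) / n, sum(scores) / n

sxy = sum((x - mx) * (y - my) for x, y in zip(hours, scores))
sxx = sum((x - mx) ** 2 for x in hours)
syy = sum((y - my) ** 2 for y in scores)

r = sxy / math.sqrt(sxx * syy)
s_x = math.sqrt(sxx / (n - 1))  # sample standard deviation of X
s_y = math.sqrt(syy / (n - 1))  # sample standard deviation of Y

slope_least_squares = sxy / sxx  # ordinary least-squares slope
slope_from_r = r * (s_y / s_x)   # b = r * (s_y / s_x)

print(f"least-squares slope: {slope_least_squares:.4f}")
print(f"r * (s_y / s_x):     {slope_from_r:.4f}")
```

Both routes give the same slope in exam points per study hour; r simply strips the units away.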
What is Fisher's z-transformation?
Because r does not follow a normal distribution (especially near ±1), comparing two correlations or computing confidence intervals requires converting r to Fisher's z: z = 0.5 · ln[(1+r)/(1−r)]. The z-score is approximately normally distributed with standard error 1/√(n−3), which makes it suitable for building confidence intervals or testing whether two independent r values differ significantly.
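A confidence interval built this way takes only a few lines. The sketch below uses the standard library (`math.atanh` is the same function as the z formula above) and applies it to the r = 0.72, n = 18 example from the hypothesis-testing section; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, confidence=0.95):
    """Confidence interval for a correlation via Fisher's z-transformation."""
    z = math.atanh(r)  # same as 0.5 * ln((1 + r) / (1 - r))
    se = 1 / math.sqrt(n - 3)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Build the interval on the z scale, then transform back to the r scale.
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

ci_low, ci_high = fisher_ci(0.72, 18)
print(f"95% CI for r = 0.72, n = 18: [{ci_low:.3f}, {ci_high:.3f}]")
```

The interval is asymmetric around r because the back-transformation compresses values near ±1.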
When should I use the Pearson correlation table?
The Pearson correlation table gives critical r values for specific degrees of freedom (df = n−2) and significance levels. If your calculated |r| exceeds the critical value in the table, the correlation is statistically significant. It is the manual alternative to computing a t-statistic when doing by-hand work or checking software output.
Pearson Correlation in Software
Python (SciPy)
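import numpy as np
from scipy import stats  # provides pearsonr

# Study hours and exam scores from the worked example
x = np.array([4, 6, 8, 10, 12, 14])
y = np.array([65, 70, 75, 82, 88, 93])
r, p_value = stats.pearsonr(x, y)  # sample r and two-tailed p-value
print(f"r = {r:.4f}, p = {p_value:.4f}")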
R
x <- c(4, 6, 8, 10, 12, 14)
y <- c(65, 70, 75, 82, 88, 93)
cor.test(x, y, method = "pearson")
# Returns r, t, df, p-value, and 95% CI
Excel
Use the built-in function =CORREL(array1, array2) to compute r directly from two columns of data. For the p-value, you need to compute the t-statistic manually: =r*SQRT(n-2)/SQRT(1-r^2), then use =T.DIST.2T(ABS(t), n-2) for a two-tailed p-value. The online correlation calculator handles this automatically.
Pearson Correlation Cheat Sheet
| Item | Formula / Value | Notes |
|---|---|---|
| Sample Pearson r | Σ(xᵢ−x̄)(yᵢ−ȳ) / √[Σ(xᵢ−x̄)² · Σ(yᵢ−ȳ)²] | Ranges from −1 to +1 |
| Raw-score formula | [nΣxy − ΣxΣy] / √{[nΣx²−(Σx)²][nΣy²−(Σy)²]} | Easier for hand computation |
| Population correlation | ρ = Cov(X,Y) / (σₓ · σᵧ) | Estimated by sample r |
| Coefficient of determination | R² = r² | Proportion of variance explained |
| t-test statistic | t = r√(n−2) / √(1−r²) | df = n − 2 |
| Fisher's z-transform | z = 0.5 · ln[(1+r)/(1−r)] | Used for CIs and comparing r values |
| Interpretation: weak | |r| = 0.10 to 0.29 | Cohen's (1988) benchmarks |
| Interpretation: moderate | |r| = 0.30 to 0.49 | |
| Interpretation: strong | |r| = 0.50 and above | Context-dependent |
| Null hypothesis | H₀: ρ = 0 | No population linear association |
| Decision rule | Reject H₀ if |t| > t*(df, α) | Or if p < α |
| Key assumption | Linear relationship + continuous data | Check with scatter plot first |