32 min read · June 10, 2026
BY: Statistics Fundamentals Team
Reviewed By: Minsa A (Senior Statistics Editor)

Pearson Correlation: Coefficient Guide, Formula & Interpretation

A student scores 78 on a practice exam and 82 on the real one. Another scores 65 and 70. A third, 90 and 91. Is there a pattern? Does a higher practice score actually predict a higher grade? Pearson correlation gives you a single number, r, that measures how tightly two variables move together along a straight line, and whether the direction is positive or negative.

This guide covers the complete Pearson correlation coefficient: its definition, formula, step-by-step calculation, interpretation scale, five assumptions, significance testing with a t-statistic, worked examples across three fields, comparisons with Spearman and other methods, and an interactive calculator. It also explains the one thing correlation cannot tell you — causation.

What You'll Learn
  • ✓ The definition and intuition behind Pearson's r
  • ✓ The exact formula, derived from covariance
  • ✓ A 7-step manual calculation with a real dataset
  • ✓ How to interpret r values: the full scale from 0 to ±1
  • ✓ The five assumptions that must hold for r to be valid
  • ✓ Hypothesis testing for correlation significance
  • ✓ Pearson vs Spearman vs Kendall — when to use each
  • ✓ Real-world applications in research, business, and data science

What Is Pearson Correlation?

Definition — Pearson Correlation Coefficient
Pearson correlation is a standardized measure of the strength and direction of a linear relationship between two continuous variables. The coefficient, written r, ranges from −1 to +1. A value near +1 means both variables tend to increase together; near −1 means one rises as the other falls; near 0 means no detectable linear pattern.
−1 ≤ r ≤ +1

The full name is the Pearson product-moment correlation coefficient, introduced by Karl Pearson in 1896 based on earlier work by Francis Galton on regression toward the mean. The "product-moment" refers to the fact that r is computed from the products of mean-centered values — what statisticians call cross-products or moments about the mean.

What r measures specifically is linear association. Two variables can be strongly related in a curved or U-shaped way and still produce r ≈ 0 if the relationship is not well-described by a straight line. Always plot your data in a scatter plot before interpreting r — the number alone does not tell the full story.
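As a minimal sketch of that point, the snippet below builds a perfectly U-shaped relationship and computes r with NumPy. The data (integers from −5 to 5 and their squares) are an illustrative assumption, not taken from any dataset in this guide.

```python
import numpy as np

# Symmetric x values with a perfect quadratic (U-shaped) dependence.
x = np.arange(-5, 6, dtype=float)   # -5, -4, ..., 5
y = x ** 2                          # y is fully determined by x

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r(x, y).
r_u_shape = np.corrcoef(x, y)[0, 1]
print(f"r for a perfect U-shape: {r_u_shape:.3f}")
```

Even though y is completely determined by x, the positive and negative cross-products cancel, so the linear coefficient is essentially zero.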

⚠️
Correlation ≠ Causation

A high r only confirms a linear pattern in your sample data. It does not mean one variable causes the other to change. A confounding third variable, coincidence, or reverse causation can all produce the same coefficient. Establishing causation requires controlled experiments or causal inference methods.

+1.0: Perfect positive linear relationship
0: No linear association
−1.0: Perfect negative linear relationship
r²: Proportion of variance explained (R²)
Karl Pearson (1896). "Mathematical Contributions to the Theory of Evolution." Philosophical Transactions of the Royal Society A, 187, 253–318. The modern formula and notation are documented in the NIST Engineering Statistics Handbook §6.3.5.

Pearson Correlation Formula

The formula comes directly from the definition of covariance. Covariance measures how two variables vary together, but its magnitude depends on the measurement units of each variable. Dividing by both standard deviations removes the units and constrains the result to [−1, +1].

Pearson Correlation Coefficient — Sample Formula
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]
r = Pearson correlation coefficient
xᵢ = each observed X value
yᵢ = each observed Y value
x̄ = mean of X
ȳ = mean of Y
n = number of pairs

The numerator, Σ(xᵢ − x̄)(yᵢ − ȳ), is the sum of cross-products of deviations from the mean, which equals n−1 times the sample covariance. The denominator equals n−1 times the product of the two sample standard deviations, so the n−1 factors cancel and r is left unitless and bounded.
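The link between covariance and r can be checked numerically. The sketch below assumes NumPy and reuses the study-hours data from the worked calculation later in this guide; the variable names are illustrative.

```python
import numpy as np

x = np.array([4, 6, 8, 10, 12, 14], dtype=float)     # weekly study hours
y = np.array([65, 70, 75, 82, 88, 93], dtype=float)  # exam percentage

# Sample covariance and sample standard deviations (ddof=1 divides by n - 1).
cov_xy = np.cov(x, y, ddof=1)[0, 1]
s_x = np.std(x, ddof=1)
s_y = np.std(y, ddof=1)
r_from_cov = cov_xy / (s_x * s_y)

# The same quantity from raw sums of deviations; the (n - 1) factors cancel.
dx, dy = x - x.mean(), y - y.mean()
r_from_sums = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print(f"r from covariance: {r_from_cov:.4f}")
print(f"r from sums:       {r_from_sums:.4f}")
```

Both routes give the same value because dividing the covariance by the two standard deviations only rescales the sums that already appear in the formula.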

Equivalent Computational Form

For hand calculation with a data table, the equivalent raw-score formula avoids computing means first:

Raw-Score Computational Formula
r = [nΣxy − ΣxΣy] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}
n = number of data pairs
Σxy = sum of each xᵢyᵢ product
Σx, Σy = sums of the x and y values
Σx², Σy² = sums of the squared x and y values
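A short sketch of the raw-score formula in plain Python, using the study-hours data from the worked calculation later in this guide. Only running totals are needed; no means are computed.

```python
import math

x = [4, 6, 8, 10, 12, 14]        # weekly study hours
y = [65, 70, 75, 82, 88, 93]     # exam percentage
n = len(x)

# Running totals that a hand-calculation table would accumulate.
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_raw = numerator / denominator

print(f"numerator = {numerator}, r = {r_raw:.4f}")
```

With integer inputs the numerator and both bracketed terms stay exact integers until the final square root, which is one reason this form suits hand calculation.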

Population vs Sample Notation

The formula above gives the sample coefficient r, computed from n observed pairs. It estimates the true population correlation, denoted ρ (the Greek letter rho). For the population, replace the sums with expected values: ρ = Cov(X,Y) / (σₓ · σᵧ). In practice you almost always work with sample data, so r is what you calculate.

Coefficient of Determination (R²)

Squaring the Pearson r gives R², the proportion of variance in one variable that is statistically accounted for by linear association with the other. If r = 0.80, then R² = 0.64, meaning 64% of the variance in Y is explained by the linear relationship with X. R² appears again in simple linear regression, where it measures overall model fit.

Coefficient of Determination
R² = r²
Ranges from 0 (no explanation) to 1 (perfect explanation)
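To see why R² = r² is the same quantity that simple linear regression reports, the sketch below fits a least-squares line with NumPy and compares 1 − SS_res/SS_tot against r². It reuses the study-hours data from the worked calculation later in this guide; the variable names are illustrative.

```python
import numpy as np

x = np.array([4, 6, 8, 10, 12, 14], dtype=float)
y = np.array([65, 70, 75, 82, 88, 93], dtype=float)

r = np.corrcoef(x, y)[0, 1]

# Least-squares line y = intercept + slope * x.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

ss_res = np.sum(residuals ** 2)        # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y
r_squared_regression = 1 - ss_res / ss_tot

print(f"r^2 = {r ** 2:.4f}, regression R^2 = {r_squared_regression:.4f}")
```

The two numbers agree for a one-predictor linear model; with several predictors, R² no longer equals the square of any single pairwise r.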

How to Interpret Pearson r

The sign tells you direction; the absolute value tells you strength. The five-band scale below is a convention widely used in behavioral and social science research. It is stricter than the benchmarks proposed by Jacob Cohen (1988), which treat |r| of 0.10, 0.30, and 0.50 as small, medium, and large effects. In physics or engineering, tolerances are often much tighter.

Pearson r Scale: Direction and Strength
−1.0 Perfect negative | −0.6 Strong negative | −0.3 Weak negative | 0 None | +0.3 Weak positive | +0.6 Strong positive | +1.0 Perfect positive
r Value | Interpretation | R² (Variance Explained) | Example Context
−1.0 | Perfect negative linear relationship | 100% | Theoretical only
−0.80 to −1.0 | Very strong negative | 64–100% | Price vs demand (economics)
−0.60 to −0.79 | Strong negative | 36–62% | Exercise frequency vs resting heart rate
−0.40 to −0.59 | Moderate negative | 16–35% | Stress vs sleep quality
−0.20 to −0.39 | Weak negative | 4–15% | Commute time vs job satisfaction
−0.19 to +0.19 | Very weak or no linear relationship | 0–4% | Shoe size vs IQ
+0.20 to +0.39 | Weak positive | 4–15% | Height vs shoe size
+0.40 to +0.59 | Moderate positive | 16–35% | Study hours vs exam score
+0.60 to +0.79 | Strong positive | 36–62% | SAT math vs SAT verbal
+0.80 to +1.0 | Very strong positive | 64–100% | Height (cm) vs height (inches)
+1.0 | Perfect positive linear relationship | 100% | Same variable measured twice
📌
Statistical vs Practical Significance

With a large sample (n = 1,000), r = 0.08 can be statistically significant (p < 0.05) even though it explains less than 1% of the variance. Always report both r and R², and consider whether the effect size is meaningful for your specific context — not just whether p is below 0.05.
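The figures in that note can be reproduced with the t-test described later in this guide. This is a small sketch assuming SciPy is available; it uses `scipy.stats.t.sf` for the upper-tail probability.

```python
from math import sqrt
from scipy import stats

r, n = 0.08, 1000
df = n - 2

# t-statistic for H0: rho = 0, then the two-tailed p-value.
t_stat = r * sqrt(df) / sqrt(1 - r ** 2)
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)

variance_explained = r ** 2

print(f"t = {t_stat:.3f}, p = {p_two_tailed:.4f}")
print(f"variance explained = {variance_explained:.2%}")
```

The p-value falls below 0.05 while the variance explained stays under 1%, which is the gap between statistical and practical significance that the note describes.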

How to Calculate Pearson Correlation (7 Steps)

The following worked calculation uses a small illustrative dataset: hours studied per week and final exam percentage for 6 students. Each step maps directly to one part of the formula.

Manual Calculation — Study Hours vs Exam Score

Dataset: n = 6 students. X = weekly study hours, Y = exam percentage.

Student | X (hours) | Y (score %)
A | 4 | 65
B | 6 | 70
C | 8 | 75
D | 10 | 82
E | 12 | 88
F | 14 | 93
Step 1. Calculate the means:
x̄ = (4+6+8+10+12+14)/6 = 54/6 = 9.0
ȳ = (65+70+75+82+88+93)/6 = 473/6 = 78.83

Step 2. Compute deviations from the mean (xᵢ − x̄) and (yᵢ − ȳ):
A: (4−9)=−5, (65−78.83)=−13.83  |  B: (6−9)=−3, (70−78.83)=−8.83
C: (8−9)=−1, (75−78.83)=−3.83  |  D: (10−9)=+1, (82−78.83)=+3.17
E: (12−9)=+3, (88−78.83)=+9.17  |  F: (14−9)=+5, (93−78.83)=+14.17

Step 3. Compute cross-products (xᵢ − x̄)(yᵢ − ȳ):
A: (−5)(−13.83)=69.15  |  B: (−3)(−8.83)=26.49  |  C: (−1)(−3.83)=3.83
D: (1)(3.17)=3.17  |  E: (3)(9.17)=27.51  |  F: (5)(14.17)=70.85
Σ(xᵢ − x̄)(yᵢ − ȳ) = 201.00

Step 4. Compute Σ(xᵢ − x̄)²:
(−5)²+(−3)²+(−1)²+(1)²+(3)²+(5)² = 25+9+1+1+9+25 = 70

Step 5. Compute Σ(yᵢ − ȳ)²:
(−13.83)²+(−8.83)²+(−3.83)²+(3.17)²+(9.17)²+(14.17)² ≈ 191.27+77.97+14.67+10.05+84.09+200.79 = 578.84

Step 6. Apply the formula:
r = 201.00 / √(70 × 578.84) = 201.00 / √40,518.8 = 201.00 / 201.29 ≈ 0.999

Step 7. Interpret: r = 0.999 indicates a near-perfect positive linear relationship. Squaring the unrounded value (r ≈ 0.9986) gives R² ≈ 0.997, so roughly 99.7% of the variance in exam scores is explained by the linear relationship with weekly study hours in this sample.

✅ r = 0.999. Students who study more hours score higher on exams in an almost perfectly linear pattern. Causation would require a controlled experiment — other variables (ability, prior knowledge) are not held constant here.

Calculation methodology follows NIST Engineering Statistics Handbook Chapter 6. For software implementation, see the SciPy pearsonr documentation.
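The seven steps can be mirrored line by line in plain Python. This is a sketch for checking the hand arithmetic, not a replacement for a library routine; the variable names are illustrative.

```python
import math

hours = [4, 6, 8, 10, 12, 14]        # X: weekly study hours
scores = [65, 70, 75, 82, 88, 93]    # Y: exam percentage
n = len(hours)

# Step 1: means
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Step 2: deviations from the mean
dev_x = [x - mean_x for x in hours]
dev_y = [y - mean_y for y in scores]

# Step 3: sum of cross-products
sum_cross = sum(dx * dy for dx, dy in zip(dev_x, dev_y))

# Steps 4 and 5: sums of squared deviations
ss_x = sum(dx ** 2 for dx in dev_x)
ss_y = sum(dy ** 2 for dy in dev_y)

# Step 6: apply the formula
r = sum_cross / math.sqrt(ss_x * ss_y)

# Step 7: coefficient of determination for the interpretation
r_squared = r ** 2

print(f"sum of cross-products = {sum_cross:.2f}")
print(f"SSx = {ss_x:.2f}, SSy = {ss_y:.2f}")
print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
```

Working at full precision avoids the small rounding drift that builds up when each deviation is rounded to two decimals by hand. Squaring the unrounded r gives an R² close to 0.997.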

Pearson Correlation Calculator

Enter paired X and Y values separated by commas (or spaces). The calculator computes r, R², the t-statistic, p-value, and gives a plain-English interpretation. Separate the two series with a new line, or paste them into the two boxes below.

Pearson Correlation Assumptions

Pearson r is only a valid and interpretable measure when the following five conditions hold. Violating them does not always make r numerically impossible to compute — it just means the number does not mean what you think it means.

Assumption 1

Continuous Variables

Both X and Y must be measured on a continuous interval or ratio scale. Pearson r is not appropriate for ordinal data (ranked categories); use Spearman instead. When one variable is binary and the other continuous, the same calculation is reported as the point-biserial correlation.

Assumption 2

Linear Relationship

The underlying relationship between X and Y must be approximately linear. A scatter plot will reveal curves, U-shapes, or other non-linear patterns that Pearson r will underestimate. Spearman captures monotonic (but not necessarily linear) relationships.

Assumption 3

Independence of Observations

Each pair (xᵢ, yᵢ) must come from a different, independent unit. Repeated measures from the same individual, time-series data with autocorrelation, or clustered samples all violate this assumption and can inflate r.

Assumption 4

No Extreme Outliers

A single outlier can shift r by 0.3 or more in a small sample. Check scatter plots for leverage points before reporting r. Robust alternatives include Spearman's ρ or the winsorized correlation for datasets with outliers.

Assumption 5

Approximate Bivariate Normality

For significance testing (the t-test below), the pair (X, Y) should follow an approximate bivariate normal distribution. With large samples (n > 30) this matters less due to the central limit theorem — see the central limit theorem guide. For small n, check histograms and Q-Q plots.

Quick Assumption Check

Before reporting r: (1) confirm both variables are continuous, (2) inspect a scatter plot for linearity and outliers, (3) verify observations are independent. These three steps catch the most common errors. Full normality testing matters mainly when n < 30.
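To illustrate the outlier sensitivity described under Assumption 4, the sketch below recomputes r after adding one extreme point to the six-student study-hours data. The added point (16 hours, 40%) is a made-up value chosen only for demonstration.

```python
import numpy as np

x = np.array([4, 6, 8, 10, 12, 14], dtype=float)
y = np.array([65, 70, 75, 82, 88, 93], dtype=float)
r_clean = np.corrcoef(x, y)[0, 1]

# Hypothetical outlier: a student who studied 16 hours but scored 40%.
x_out = np.append(x, 16.0)
y_out = np.append(y, 40.0)
r_with_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(f"r without outlier: {r_clean:.3f}")
print(f"r with outlier:    {r_with_outlier:.3f}")
```

One added observation is enough to move r from nearly +1 to roughly zero in a sample this small, which is why the scatter plot check comes before reporting the coefficient.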

Hypothesis Testing for Pearson Correlation

Computing r tells you the sample correlation. To decide whether the result reflects a true population correlation — or could plausibly be produced by chance from data where ρ = 0 — you run a significance test using a t-statistic. This connects directly to hypothesis testing principles.

Setting Up the Hypotheses

Hypotheses for Pearson Correlation Test
  • H₀: ρ = 0 — The population correlation is zero; any sample r is due to chance
  • H₁: ρ ≠ 0 — Two-tailed test: the population has a nonzero correlation (either direction)
  • H₁: ρ > 0 — One-tailed test: the population correlation is positive
  • H₁: ρ < 0 — One-tailed test: the population correlation is negative

The t-Statistic

t-Test for Pearson Correlation
t = r · √(n − 2) / √(1 − r²)
r = sample Pearson coefficient
n = number of pairs
df = n − 2 degrees of freedom

This t-statistic follows a t-distribution with df = n − 2 degrees of freedom under H₀. Compare it to the critical value from a t-distribution table for your α and number of tails, or read the p-value directly from software. The n − 2 reflects the two parameters estimated from the data: the intercept and slope of the regression line that underlies the correlation.

Significance Test — Is r = 0.72 significant at α = 0.05?

Given: r = 0.72, n = 18 pairs, two-tailed test, α = 0.05

Step 1. Hypotheses: H₀: ρ = 0  |  H₁: ρ ≠ 0 (two-tailed)

Step 2. Degrees of freedom: df = 18 − 2 = 16. From the Pearson correlation table, the critical r at df=16, α=0.05 two-tailed is 0.468.

Step 3. Calculate t: t = 0.72 × √(16) / √(1 − 0.72²) = 0.72 × 4 / √(1 − 0.518) = 2.88 / √0.482 = 2.88 / 0.694 = 4.15

Step 4. Critical value: For df=16, two-tailed, α=0.05: t* = ±2.120. Our |t| = 4.15 > 2.120.

Step 5. p-value: p ≈ 0.0008 (well below 0.05)

✅ Reject H₀. With r = 0.72, t(16) = 4.15, p ≈ 0.001, the correlation is statistically significant at α = 0.05. There is evidence of a positive linear relationship in the population.
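The same test can be sketched in Python, assuming SciPy is available. It reproduces the t-statistic, the critical t, and the critical r from the table, using the identity r_crit = t* / √(t*² + df), which follows from solving the t formula for r.

```python
from math import sqrt
from scipy import stats

r, n, alpha = 0.72, 18, 0.05
df = n - 2

# Test statistic under H0: rho = 0.
t_stat = r * sqrt(df) / sqrt(1 - r ** 2)

# Two-tailed critical value and p-value from the t-distribution.
t_crit = stats.t.ppf(1 - alpha / 2, df)
p_value = 2 * stats.t.sf(abs(t_stat), df)

# Critical r implied by the critical t (what a correlation table lists).
r_crit = t_crit / sqrt(t_crit ** 2 + df)

reject_null = abs(t_stat) > t_crit
print(f"t({df}) = {t_stat:.2f}, t* = {t_crit:.3f}, r* = {r_crit:.3f}")
print(f"p = {p_value:.4f}, reject H0: {reject_null}")
```

Comparing |t| with t* and comparing |r| with r* are the same decision expressed on two scales, so either check is sufficient.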

APA-Style Reporting

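In research papers, report r with its degrees of freedom and the p-value together: r(16) = .72, p < .001. APA style gives the degrees of freedom (n − 2) in parentheses rather than the sample size, and writes p < .001 when the exact value falls below .001, as it does here (p ≈ .0008). Some journals also require the 95% confidence interval for r, computed using Fisher's z-transformation.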

Pearson Correlation Examples

Example 1 — Health Research: Blood Pressure and Age

Worked Example — Clinical Research

A researcher records age (years) and systolic blood pressure (mmHg) for 8 adults to see whether age predicts blood pressure.

Person | Age (X) | SBP mmHg (Y)
1 | 25 | 115
2 | 31 | 122
3 | 38 | 127
4 | 45 | 134
5 | 52 | 140
6 | 58 | 148
7 | 65 | 155
8 | 72 | 162
Step 1. x̄ = 48.25, ȳ = 137.875

Step 2. Σ(xᵢ−x̄)(yᵢ−ȳ) = 1,896.25  |  Σ(xᵢ−x̄)² = 1,907.5  |  Σ(yᵢ−ȳ)² = 1,890.875

Step 3. r = 1,896.25 / √(1,907.5 × 1,890.875) = 1,896.25 / √3,606,844.1 ≈ 1,896.25 / 1,899.17 ≈ 0.998

✅ r ≈ 0.998. Systolic blood pressure rises in near-perfect linear proportion with age in this sample. R² ≈ 0.997. Note: this sample is small and this relationship would require a larger study before drawing medical conclusions.

Example 2 — Marketing: Advertising Spend vs Revenue

Worked Example — Business Analytics

Monthly ad spend ($000s) and revenue ($000s) across 6 months. Does advertising drive revenue in this dataset?

Month | Ad Spend X ($k) | Revenue Y ($k)
Jan | 10 | 82
Feb | 14 | 97
Mar | 18 | 103
Apr | 20 | 114
May | 25 | 125
Jun | 30 | 141
Step 1. x̄ = 19.5, ȳ = 110.33

Step 2. Σ(xᵢ−x̄)(yᵢ−ȳ) = 758  |  Σ(xᵢ−x̄)² = 263.5  |  Σ(yᵢ−ȳ)² ≈ 2,203.3

Step 3. r = 758 / √(263.5 × 2,203.3) ≈ 758 / √580,570 ≈ 758 / 761.95 ≈ 0.995

✅ r ≈ 0.995 (very strong positive). R² ≈ 0.990. Nearly all variance in monthly revenue is explained by the linear relationship with advertising spend. To model this formally, use simple linear regression.

Pearson vs Spearman vs Kendall

Three correlation coefficients are used routinely in statistics. Choosing the wrong one gives a number that does not answer your actual question. The decision depends on your data type, whether you expect a linear or just monotonic relationship, and how sensitive you need to be to outliers.

Feature | Pearson r | Spearman ρ | Kendall τ
Relationship type measured | Linear only | Monotonic (any direction) | Monotonic (concordance-based)
Data type required | Continuous (interval/ratio) | Ordinal or continuous | Ordinal or continuous
Distributional assumption | Approximate bivariate normality | None (non-parametric) | None (non-parametric)
Sensitivity to outliers | High | Low (ranks reduce influence) | Low
Effect of tied ranks | N/A | Requires tie correction | Handles ties naturally
Preferred sample size | n ≥ 10, larger better | n ≥ 10 | Better with small n
Common use cases | Physical measurements, finance, psychological scales | Survey Likert data, non-normal variables | Small samples, heavy ties

A simple rule: use Pearson when your data is continuous and you have checked the scatter plot for linearity and absence of extreme outliers. Switch to Spearman when the relationship might be monotonic-but-curved, when data is ordinal, or when outliers are present. Kendall tau is preferred for very small samples or datasets with many tied values.
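The difference between linear and monotonic association shows up clearly on curved data. The sketch below assumes SciPy and uses a made-up exponential relationship; `pearsonr`, `spearmanr`, and `kendalltau` are all in `scipy.stats`.

```python
import numpy as np
from scipy import stats

# Strictly increasing but strongly curved relationship (illustrative data).
x = np.arange(1, 11, dtype=float)
y = np.exp(x)

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
kendall_tau, _ = stats.kendalltau(x, y)

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
print(f"Kendall tau  = {kendall_tau:.3f}")
```

Because y never decreases as x increases, both rank-based coefficients reach their maximum, while Pearson r is pulled well below 1 by the curvature.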

📊
Pearson vs Regression

Pearson r and simple linear regression answer related but different questions. r measures the strength of the linear association (symmetric — it does not matter which variable is X or Y). Regression estimates the predicted change in Y for each one-unit change in X (directional). When you need to predict or control for multiple variables, move from correlation to multiple linear regression.

Real-World Applications

Pearson correlation appears in virtually every quantitative field. Below are eight domains where it is routinely used as a first-pass analytical tool before more complex modelling.

🧬

Medical Research

Correlating biomarkers — cholesterol levels vs cardiovascular risk, age vs bone density — to identify variables worth investigating in clinical trials.

📈

Finance

Measuring portfolio diversification: a low r between two assets means holding both reduces total risk. Pairs trading identifies stocks with high historical r.

🎓

Education Research

Relating study time, attendance, or prior grades to exam outcomes. Helps curriculum designers identify which inputs predict achievement.

🛒

Marketing Analytics

Connecting advertising spend to conversion rates, or customer satisfaction scores to retention. Guides budget allocation decisions.

🤖

Machine Learning

Feature selection: high r between a feature and the target suggests predictive value. High r between two features (multicollinearity) can harm regression models.

🧠

Psychology

Relating test scores across cognitive domains, validating psychometric instruments, and studying personality trait associations in survey research.

🌍

Economics

GDP growth vs unemployment, inflation vs interest rates, trade volume vs currency strength — correlations that inform macroeconomic forecasting.

🌱

Environmental Science

Linking temperature changes to species distribution shifts, or precipitation levels to crop yields across geographic regions.

Pearson Correlation Matrix

When you have more than two variables, computing r for every pair produces a correlation matrix. Each cell shows the Pearson r between that row variable and that column variable. The diagonal is always 1.0 (a variable is perfectly correlated with itself). The matrix is symmetric: r(X,Y) = r(Y,X).

Variable | Age | Income | Education (yrs) | Health Score
Age | 1.00 | 0.24 | −0.08 | −0.41
Income | 0.24 | 1.00 | 0.61 | 0.38
Education (yrs) | −0.08 | 0.61 | 1.00 | 0.29
Health Score | −0.41 | 0.38 | 0.29 | 1.00

Reading this example matrix: Income and Education have the strongest correlation (r = 0.61). Age has a moderate negative relationship with Health Score (r = −0.41), meaning older individuals in this sample tend to have lower health scores. Before building a regression model with multiple predictors, scan the matrix for high pairwise correlations (|r| > 0.70) between predictor variables, which would signal multicollinearity — an issue addressed in the multiple linear regression guide.
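A correlation matrix like the one above can be produced with `numpy.corrcoef`. The sketch below uses simulated data: the four variables, their generating equations, and the seed are all assumptions made for illustration and do not reproduce the values in the example matrix.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 200

# Simulated variables (hypothetical relationships, for illustration only).
age = rng.uniform(20, 70, n)
education = rng.normal(14, 2, n)
income = 20 + 3 * education + rng.normal(0, 8, n)
health = 90 - 0.3 * age + rng.normal(0, 8, n)

names = ["Age", "Income", "Education", "Health"]
data = np.column_stack([age, income, education, health])

# rowvar=False tells NumPy that each column is a variable.
corr = np.corrcoef(data, rowvar=False)

# Flag pairs whose |r| exceeds the multicollinearity screening threshold.
threshold = 0.70
high_pairs = [
    (names[i], names[j], round(float(corr[i, j]), 2))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]

print(np.round(corr, 2))
print("Pairs above threshold:", high_pairs)
```

Only the upper triangle is scanned because the matrix is symmetric, so each pair needs to be checked once.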

Common Mistakes and Misconceptions

Mistake | What People Think | What Is Actually True
Confusing r with R² | r = 0.70 means 70% of variance explained | R² = 0.70² = 0.49, so only 49% is explained
Inferring causation | High r means X causes Y | Correlation only confirms a linear pattern; causation needs experimental design
Ignoring the scatter plot | r tells the whole story | Anscombe's quartet shows four datasets with identical r but completely different patterns
Non-linear data | r = 0 means no relationship | A perfect U-shaped relationship produces r ≈ 0; Pearson misses non-linear patterns
Using r with ordinal data | Likert scale data is "basically continuous" | Ordinal variables require Spearman; Pearson assumes equal intervals between values
Truncated range | r reflects the true population relationship | Sampling only a restricted range of X can dramatically reduce r (range restriction bias)
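Range restriction, the last row of the table, is easy to demonstrate by simulation. The sketch below is an illustration under assumed parameters (a population correlation of 0.6, a cutoff at x > 1, and a fixed seed), not an analysis of real data.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 5000

# By construction: var(y) = 0.36 + 0.64 = 1, so the population correlation is 0.6.
x = rng.normal(0, 1, n)
y = 0.6 * x + rng.normal(0, 0.8, n)

r_full = np.corrcoef(x, y)[0, 1]

# Keep only observations from the upper part of the x range.
keep = x > 1.0
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(f"full range:       r = {r_full:.3f} (n = {n})")
print(f"restricted range: r = {r_restricted:.3f} (n = {keep.sum()})")
```

The underlying relationship is identical in both samples; only the spread of x differs, and that alone is enough to shrink r noticeably.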

Frequently Asked Questions

Can Pearson r be negative?

Yes. A negative r means the two variables move in opposite directions: as X increases, Y tends to decrease. For example, r between hours of sleep deprivation and cognitive performance is negative — more deprivation, lower performance. The strength interpretation uses the absolute value; r = −0.75 and r = +0.75 describe equally strong relationships, just in opposite directions.

What sample size does Pearson correlation need?

There is no single rule, but r becomes unstable in very small samples (n < 10). For n = 5, a sample r of 0.80 is not statistically significant at α = 0.05 (two-tailed). A common practical minimum is n ≥ 30, where the central limit theorem makes p-values less sensitive to non-normality. To detect r = 0.30 with 80% power at α = 0.05, you need roughly n = 84 pairs; use a dedicated power calculator for exact figures. The sample size calculator can help with planning.
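Both figures in this answer can be approximated in a few lines, assuming SciPy. The sample-size function uses the Fisher z approximation, which is one common planning formula and may differ by a pair or so from exact power software; the function name `pairs_needed` is hypothetical.

```python
import math
from scipy import stats

def pairs_needed(r_target, alpha=0.05, power=0.80):
    """Approximate n for a two-tailed test of rho = 0 (Fisher z approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(r_target)) ** 2 + 3)

n_required = pairs_needed(0.30)

# Small-sample check: r = 0.80 with n = 5 pairs (df = 3), two-tailed.
r_small, n_small = 0.80, 5
t_small = r_small * math.sqrt(n_small - 2) / math.sqrt(1 - r_small ** 2)
p_small = 2 * stats.t.sf(t_small, n_small - 2)

print(f"pairs needed for r = 0.30: about {n_required}")
print(f"n = 5, r = 0.80: t = {t_small:.2f}, p = {p_small:.3f}")
```

The approximation lands close to the figure quoted above, and the small-sample check confirms that even a large r can fail to reach significance with only five pairs.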

What is Anscombe's Quartet?

Francis Anscombe constructed four datasets in 1973, each with nearly identical means, variances, and Pearson r ≈ 0.816, but completely different scatter plots: one linear, one curved, one linear except for a single outlier, and one vertical cluster in which a single high-leverage point produces the entire correlation. The quartet is a classic demonstration that r must always be paired with a scatter plot. It is why the first step in any correlation analysis is visualizing the data.

How does Pearson r relate to linear regression slope?

In simple linear regression, the standardized slope (beta coefficient) equals the Pearson r when both variables are z-scored. More concretely, the slope b in Y = a + bX relates to r via: b = r · (sᵧ / sₓ), where sᵧ and sₓ are the standard deviations of Y and X. So r and the regression slope carry the same directional information, but r is dimensionless while b has the units of Y per unit of X.
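Both identities can be verified numerically. The sketch below assumes NumPy and reuses the study-hours data from the worked calculation; `np.polyfit` with `deg=1` returns the least-squares slope and intercept.

```python
import numpy as np

x = np.array([4, 6, 8, 10, 12, 14], dtype=float)
y = np.array([65, 70, 75, 82, 88, 93], dtype=float)

r = np.corrcoef(x, y)[0, 1]
s_x, s_y = np.std(x, ddof=1), np.std(y, ddof=1)

# Slope predicted by the identity b = r * (s_y / s_x).
slope_from_r = r * s_y / s_x

# Slope from an ordinary least-squares fit.
slope_fit, intercept_fit = np.polyfit(x, y, deg=1)

# Standardized slope: regress z-scored y on z-scored x.
z_x = (x - x.mean()) / s_x
z_y = (y - y.mean()) / s_y
beta_std = np.polyfit(z_x, z_y, deg=1)[0]

print(f"b from r = {slope_from_r:.4f}, b from fit = {slope_fit:.4f}")
print(f"standardized slope = {beta_std:.4f}, r = {r:.4f}")
```

Here the slope is in exam points per additional study hour, while r and the standardized slope are unit-free.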

What is Fisher's z-transformation?

Because r does not follow a normal distribution (especially near ±1), comparing two correlations or computing confidence intervals requires converting r to Fisher's z: z = 0.5 · ln[(1+r)/(1−r)]. The z-score is approximately normally distributed with standard error 1/√(n−3), which makes it suitable for building confidence intervals or testing whether two independent r values differ significantly.
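As a sketch of the confidence-interval use, the code below applies the transformation to the earlier significance-test example (r = 0.72, n = 18), assuming SciPy for the normal quantile. `math.atanh` is the same function as 0.5 · ln[(1+r)/(1−r)], and `math.tanh` converts back.

```python
import math
from scipy import stats

r, n, confidence = 0.72, 18, 0.95

z = math.atanh(r)                 # Fisher's z
se = 1 / math.sqrt(n - 3)         # standard error on the z scale
z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)

# Build the interval on the z scale, then transform back to the r scale.
lower = math.tanh(z - z_crit * se)
upper = math.tanh(z + z_crit * se)

print(f"z = {z:.4f}, SE = {se:.4f}")
print(f"95% CI for r: ({lower:.2f}, {upper:.2f})")
```

The interval is not symmetric around 0.72 on the r scale, which reflects the skew in the sampling distribution of r that the transformation is meant to correct.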

When should I use the Pearson correlation table?

The Pearson correlation table gives critical r values for specific degrees of freedom (df = n−2) and significance levels. If your calculated |r| exceeds the critical value in the table, the correlation is statistically significant. It is the manual alternative to computing a t-statistic when doing by-hand work or checking software output.

Pearson Correlation in Software

Python (SciPy)

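```python
import numpy as np
from scipy import stats

# Paired observations: weekly study hours (x) and exam percentage (y)
x = np.array([4, 6, 8, 10, 12, 14])
y = np.array([65, 70, 75, 82, 88, 93])

# pearsonr returns the sample r and the two-sided p-value for H0: rho = 0
r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.4f}, p = {p_value:.4f}")
```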

R

```r
# Paired observations: weekly study hours (x) and exam percentage (y)
x <- c(4, 6, 8, 10, 12, 14)
y <- c(65, 70, 75, 82, 88, 93)

# Returns r, t, df, p-value, and 95% CI
cor.test(x, y, method = "pearson")
```

Excel

Use the built-in function =CORREL(array1, array2) to compute r directly from two columns of data. For the p-value, you need to compute the t-statistic manually: =r*SQRT(n-2)/SQRT(1-r^2), then use =T.DIST.2T(ABS(t), n-2) for a two-tailed p-value. The online correlation calculator handles this automatically.

Pearson Correlation Cheat Sheet

Item | Formula / Value | Notes
Sample Pearson r | Σ(xᵢ−x̄)(yᵢ−ȳ) / √[Σ(xᵢ−x̄)² · Σ(yᵢ−ȳ)²] | Ranges from −1 to +1
Raw-score formula | [nΣxy − ΣxΣy] / √{[nΣx²−(Σx)²][nΣy²−(Σy)²]} | Easier for hand computation
Population correlation | ρ = Cov(X,Y) / (σₓ · σᵧ) | Estimated by sample r
Coefficient of determination | R² = r² | Proportion of variance explained
t-test statistic | t = r√(n−2) / √(1−r²) | df = n − 2
Fisher's z-transform | z = 0.5 · ln[(1+r)/(1−r)] | Used for CIs and comparing r values
Interpretation: weak | |r| = 0.10 to 0.29 | Cohen's (1988) benchmarks
Interpretation: moderate | |r| = 0.30 to 0.49 |
Interpretation: strong | |r| = 0.50 and above | Context-dependent
Null hypothesis | H₀: ρ = 0 | No population linear association
Decision rule | Reject H₀ if |t| > t*(df, α) | Or if p < α
Key assumption | Linear relationship + continuous data | Check with scatter plot first