1. What Is a Scatter Plot?
A scatter plot displays paired observations as individual points on two numeric axes: each dot marks one (x, y) pair. The X-axis typically holds the independent variable, the one you control or suspect is doing the influencing (e.g., hours studied). The Y-axis holds the dependent variable, the outcome you are measuring (e.g., exam score). That said, the choice of axis does not impose a causal claim; it is a convention for legibility.
Anatomy of a Scatter Plot
Every scatter plot needs four components: a labeled X-axis, a labeled Y-axis with units, individual data points, and a title describing the relationship. A fifth, optional element is a trend line that makes the direction explicit. Without axis labels, a scatter plot communicates nothing.
- X-axis: Independent variable (predictor) — what you control or suspect causes change
- Y-axis: Dependent variable (outcome) — what you measure as a response
- Each dot: One observation — one pair of (x, y) values
- Pattern direction: Upward slope = positive; downward slope = negative; no slope = zero correlation
- Pattern tightness: Tightly clustered = strong; widely scattered = weak or none
- Line of best fit: The least-squares regression line minimizing the sum of squared residuals
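The anatomy above can be sketched in a few lines of Python. This is a minimal illustration, assuming matplotlib and numpy are installed; it uses the five-student study-time dataset from the worked example later in this guide, and the trend line is the least-squares fit described in the list.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

# Paired observations: hours studied (X) vs. exam score (Y)
hours = np.array([2, 4, 5, 7, 8])
scores = np.array([55, 70, 75, 85, 92])

fig, ax = plt.subplots()
ax.scatter(hours, scores)                       # individual data points
ax.set_xlabel("Hours Studied")                  # labeled X-axis (independent)
ax.set_ylabel("Exam Score (points)")            # labeled Y-axis with units
ax.set_title("Study Time vs. Exam Performance") # descriptive title

# Optional trend line: least-squares fit y = slope*x + intercept
slope, intercept = np.polyfit(hours, scores, 1)
ax.plot(hours, slope * hours + intercept)

fig.savefig("scatter.png")
```

The upward slope of the fitted line is what the "pattern direction" bullet refers to: positive slope, positive correlation.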
2. Types of Correlation in Scatter Plots
The direction and strength of a scatter plot pattern correspond to a specific correlation type. Recognizing these patterns visually is the first analytical skill; measuring them numerically comes next.
Strong Positive
Points rise steeply and cluster tightly. As X increases, Y increases consistently. Example: height and weight.
Moderate Positive
Clear upward trend but with notable scatter. Example: income and education level.
No Correlation
Random cloud of points — no discernible trend in either direction. Example: shoe size and IQ score.
Moderate Negative
Downward trend with scatter. Example: stress level and job satisfaction.
Strong Negative
Points fall steeply and cluster tightly. As X increases, Y decreases. Example: temperature and heating costs.
Non-Linear Correlation: The Hidden Trap
Pearson's r measures only linear association. A dataset can have a near-perfect curved (quadratic or exponential) relationship and still return r ≈ 0, because the upward and downward portions cancel each other in the formula. This is exactly what Anscombe's Quartet demonstrates — covered in detail in Section 6.
Always plot your data before computing r. A value of r = 0.05 does not mean no relationship exists — it means no linear relationship. The true pattern could be quadratic, sinusoidal, or otherwise curved. For non-linear data, use Spearman's ρ or fit a non-linear model.
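A minimal demonstration of the trap, assuming numpy is available: a perfect, fully deterministic quadratic relationship that Pearson's r nonetheless scores as zero, because the rising and falling halves cancel.

```python
import numpy as np

# A perfect (deterministic) quadratic relationship: y = x^2
x = np.arange(-5, 6)   # symmetric around 0
y = x ** 2

# Pearson's r sees no *linear* trend: the upward and downward
# halves of the parabola cancel exactly in the covariance sum
r = np.corrcoef(x, y)[0, 1]
print(round(r, 4))  # → 0.0
```

A scatter plot of this data reveals the parabola instantly, which is exactly why plotting must come before computing r. (Note that Spearman's ρ would also be near zero here, since the relationship is not monotonic; Spearman helps for monotonic curves, while a shape like this needs a non-linear model.)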
3. Pearson's Correlation Coefficient: The Formula
Pearson's r is the standard measure of linear correlation. It takes every pair of data points, measures how far each falls from its respective mean, multiplies those deviations together, and then normalizes by the spread of both variables. The result is always a number between −1 and +1:

\( r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y} \)

where:
- xᵢ, yᵢ = individual values
- x̄, ȳ = sample means
- sₓ, sᵧ = sample standard deviations
- n = number of data pairs
- r ∈ [−1, +1]
Step-by-Step Calculation: Pearson's r for 5 Students
Using the study hours and exam score dataset: the five students provide five (x, y) pairs. Work through each column before computing the final ratio.
Dataset: Hours Studied (X) vs. Exam Score (Y), n = 5. Find r.
| Student | Hours (xᵢ) | Score (yᵢ) | xᵢ − x̄ | yᵢ − ȳ | (xᵢ−x̄)(yᵢ−ȳ) | (xᵢ−x̄)² | (yᵢ−ȳ)² |
|---|---|---|---|---|---|---|---|
| A | 2 | 55 | −3.2 | −20.4 | 65.28 | 10.24 | 416.16 |
| B | 4 | 70 | −1.2 | −5.4 | 6.48 | 1.44 | 29.16 |
| C | 5 | 75 | −0.2 | −0.4 | 0.08 | 0.04 | 0.16 |
| D | 7 | 85 | 1.8 | 9.6 | 17.28 | 3.24 | 92.16 |
| E | 8 | 92 | 2.8 | 16.6 | 46.48 | 7.84 | 275.56 |
| Mean / Σ | x̄ = 5.2 | ȳ = 75.4 | — | — | Σ = 135.6 | Σ = 22.8 | Σ = 813.2 |
Compute sₓ: \( s_x = \sqrt{\frac{22.8}{4}} = \sqrt{5.7} \approx 2.387 \)
Compute sᵧ: \( s_y = \sqrt{\frac{813.2}{4}} = \sqrt{203.3} \approx 14.258 \)
Apply the formula: \( r = \frac{135.6}{4 \times 2.387 \times 14.258} = \frac{135.6}{136.1} \approx \mathbf{0.996} \)
✓ r = 0.996. This indicates a very strong positive linear correlation between study hours and exam scores, very close to the perfect linear relationship of r = +1.0.
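As a sanity check on the hand calculation, a short numpy sketch (assuming numpy is installed) reproduces each column sum and the final ratio:

```python
import numpy as np

hours = np.array([2, 4, 5, 7, 8])
scores = np.array([55, 70, 75, 85, 92])

# Reproduce the table columns
x_dev = hours - hours.mean()     # deviations from x̄ = 5.2
y_dev = scores - scores.mean()   # deviations from ȳ = 75.4

sum_xy = (x_dev * y_dev).sum()   # Σ(x−x̄)(y−ȳ) = 135.6
sum_xx = (x_dev ** 2).sum()      # Σ(x−x̄)²     = 22.8
sum_yy = (y_dev ** 2).sum()      # Σ(y−ȳ)²     = 813.2

# Equivalent form of the Pearson formula: r = Σxy / √(Σxx · Σyy)
r = sum_xy / np.sqrt(sum_xx * sum_yy)
print(round(r, 3))  # → 0.996
```

Letting the computer recompute every intermediate sum is the easiest way to catch a slipped deviation or a mistyped square.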
Interpreting the Strength of r
| \|r\| Value | Strength | Visual on Plot | Typical Context |
|---|---|---|---|
| 0.00 – 0.19 | Very weak / negligible | Near-random scatter | Exploratory, large datasets |
| 0.20 – 0.39 | Weak | Faint trend visible | Social science pilot studies |
| 0.40 – 0.59 | Moderate | Clear trend, wide spread | Psychology, economics |
| 0.60 – 0.79 | Strong | Narrow scatter around trend | Clinical and behavioral research |
| 0.80 – 1.00 | Very strong | Tight, near-linear cluster | Physical sciences, engineering |
Benchmarks adapted from Evans (1996). "Strength" thresholds vary by field — physics routinely demands |r| > 0.99 while psychology considers |r| > 0.50 meaningful. Always interpret r in context.
4. R-Squared: Explained Variance
Once you have Pearson's r, one more calculation gives you a more interpretable number: \( R^2 = r^2 \). Called the coefficient of determination, R² measures the proportion of variation in Y that the X variable accounts for.
For the five-student example, r = 0.996 gives R² ≈ 0.992, meaning study hours account for roughly 99.2% of the variation in exam scores. The remaining 0.8% is explained by other factors: prior knowledge, sleep the night before, test anxiety, or random chance. R² gives you an honest account of how much explanatory work your X variable is doing, expressed as a plain percentage.
Think of R² as how much of the "story" X tells about Y. An R² of 0.25 means X explains one-quarter of why Y values differ across observations. Three-quarters of that variation comes from somewhere else — which you have not measured yet.
5. Statistical Significance: The p-Value
Computing r from a sample answers one question: how correlated are these particular observations? But you often want to know whether the correlation you found is likely to reflect a real relationship in the broader population — or whether it could have appeared in a random sample even if the true population correlation is zero.
The p-value for Pearson's r tests the null hypothesis H₀: the population correlation ρ = 0. The test statistic follows a t-distribution with n − 2 degrees of freedom:

\( t = r\sqrt{\frac{n-2}{1-r^2}} \)
For our 5-student example: \( t = 0.996\sqrt{\frac{3}{1-0.992}} \approx 19 \), df = 3, which gives p < 0.001: statistically significant despite only five observations, because the correlation is so strong.
| p-value | Decision | Meaning |
|---|---|---|
| p < 0.001 | Reject H₀ | Highly significant — very unlikely to occur by chance |
| 0.001 ≤ p < 0.01 | Reject H₀ | Very significant |
| 0.01 ≤ p < 0.05 | Reject H₀ | Statistically significant at α = 0.05 |
| p ≥ 0.05 | Fail to reject H₀ | Not statistically significant — insufficient evidence |
With a large enough sample, even r = 0.05 can produce p < 0.05. Statistical significance does not mean practical importance. Always report both r and p together — the correlation tells you the effect size; the p-value tells you how much to trust it given your sample. For further grounding in hypothesis testing logic, see the hypothesis testing guide.
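In practice you rarely read p off a table. A sketch, assuming scipy is installed: scipy.stats.pearsonr returns both the effect size and the two-sided p-value in one call, and the same p-value can be re-derived by hand from the t formula above.

```python
import numpy as np
from scipy import stats

hours = [2, 4, 5, 7, 8]
scores = [55, 70, 75, 85, 92]

# pearsonr returns both the effect size (r) and the two-sided p-value
r, p = stats.pearsonr(hours, scores)

# The same p-value by hand: t-statistic with n − 2 degrees of freedom
n = len(hours)
t = r * np.sqrt((n - 2) / (1 - r ** 2))
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided tail probability

print(f"r = {r:.3f}, t = {t:.2f}, p = {p:.4f}")
```

Reporting the pair (r, p) directly from this call satisfies the "report both" rule: r is the effect size, p is the evidence that it is not a sampling fluke.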
6. Correlation vs. Causation — The Most Misunderstood Distinction in Statistics ⭐
"Correlation is a necessary but not sufficient condition for causation. Causation requires temporal precedence, a plausible mechanism, and the ruling out of confounds." — Judea Pearl, Causality (2009)
Scatter plots show you that two variables move together. They cannot tell you why. There are four distinct reasons any correlation can appear in data, and only one of them involves genuine causation.
✓ True Causation
X directly produces the change in Y through a plausible biological, physical, or social mechanism. Example: Smoking and lung cancer — established through decades of randomized trials and biological mechanism research (DNA damage from carcinogens). Correlation alone was the first signal; mechanism research confirmed causation.
⚠ Confounding Variable (Third Variable)
A hidden variable Z causes both X and Y independently. Example: Ice cream sales and drowning rates are positively correlated — not because ice cream causes drowning, but because summer heat (Z) independently drives both. Remove the summer months and the correlation drops toward zero.
↩ Reverse Causation
Y actually causes X, not the other way around. Example: Hospital data shows patients in intensive care units have higher mortality rates. Do ICUs cause death? No — sicker patients are admitted to ICUs. The causal arrow runs in the opposite direction from the correlation's implication.
✕ Spurious Correlation (Pure Coincidence)
Both variables happen to share a trend over time with no causal connection whatsoever. Tyler Vigen (2015) documented that per-capita cheese consumption in the United States correlates at r = 0.947 with deaths by bedsheet-tangling. No mechanism; no confound; just two upward-trending time series that look similar over the same period.
Three Real-World Case Studies
Case Study 1
Shoe Size and Reading Ability in Children
Studies find a positive correlation between children's shoe size and reading test scores. The explanation is not that large feet improve literacy. Age is the confounding variable: older children have larger feet and more reading practice. Control for age and the correlation disappears entirely. This is a textbook third-variable problem.
Case Study 2
Social Media Use and Teen Anxiety
Widely reported as a causal relationship, but Orben and Przybylski (2019, Nature Human Behaviour) ran large preregistered analyses and found effect sizes comparable to wearing glasses or eating potatoes — small enough that the causal interpretation is scientifically contested. Bidirectional causation is plausible (anxious teens may seek social media more), as are several confounders. The scatter plot correlation is real; the causal story is not settled.
Case Study 3
Income and Life Expectancy Across Countries
The correlation is strong and real (r ≈ 0.82 in cross-national data). A plausible causal pathway exists — higher income enables better nutrition, healthcare, and safety. But the full picture includes education as a mediator, policy as a moderator, and historical wealth distribution as a confound. Correlation launched the research question; decades of careful causal inference work continues to refine what the relationship actually means.
Anscombe's Quartet — Why Visualization Is Mandatory
Francis Anscombe (1973) constructed four datasets that expose a critical limitation of relying on r alone. All four share nearly identical summary statistics, yet they look completely different when plotted.
| Summary Statistic | Dataset I | Dataset II | Dataset III | Dataset IV |
|---|---|---|---|---|
| Mean of X | 9.0 | 9.0 | 9.0 | 9.0 |
| Mean of Y | 7.50 | 7.50 | 7.50 | 7.50 |
| Variance of X | 11.0 | 11.0 | 11.0 | 11.0 |
| Variance of Y | 4.12 | 4.12 | 4.12 | 4.12 |
| Pearson's r | 0.816 | 0.816 | 0.816 | 0.816 |
| Regression line | y = 3 + 0.5x | y = 3 + 0.5x | y = 3 + 0.5x | y = 3 + 0.5x |
| True pattern | Linear ✓ | Quadratic curve | One outlier inflating r | Single leverage point |
The lesson: a linear model is appropriate only for Dataset I. In Dataset II, the relationship is quadratic and fitting a line is the wrong model. In Dataset III, one outlier artificially inflates r — remove it and the correlation near-disappears. In Dataset IV, a single high-leverage data point creates the entire apparent correlation. Reporting r = 0.816 for all four datasets without plotting would produce four identical analyses for four structurally different situations.
Anscombe (1973) and later datasets like the "datasaurus dozen" (Matejka & Fitzmaurice, 2017) prove the same point: summary statistics including r are insufficient without their scatter plot. This is not optional good practice — it is a prerequisite for valid analysis. Source: Anscombe, F.J. (1973). "Graphs in Statistical Analysis." The American Statistician, 27(1), 17–21.
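The quartet is easy to verify yourself. The sketch below hard-codes the published values (transcribed from Anscombe, 1973; datasets I–III share one X column, dataset IV has its own) and confirms that all four report essentially the same r:

```python
import numpy as np

# Anscombe's Quartet: four datasets with near-identical summary statistics
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# All four datasets return (nearly) the same Pearson's r ≈ 0.816
for name, (x, y) in quartet.items():
    r = np.corrcoef(x, y)[0, 1]
    print(f"Dataset {name}: r = {r:.3f}")
```

Plot any two of these side by side and the identical r values become the argument: the number cannot distinguish a line from a parabola, an outlier, or a single leverage point.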
7. How to Create Scatter Plots and Calculate Correlation
Three common routes to the same number:
- Python: scipy.stats.pearsonr()
- R: cor() with a ggplot2 scatter plot
- Excel: CORREL() and a Scatter Chart
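If your data already lives in a table, pandas offers another route. A minimal sketch, assuming pandas is installed; the column names hours and score are chosen for this illustration:

```python
import pandas as pd

# Paired observations as a two-column table
df = pd.DataFrame({
    "hours": [2, 4, 5, 7, 8],
    "score": [55, 70, 75, 85, 92],
})

# Series.corr() computes Pearson's r by default (method="pearson")
r = df["hours"].corr(df["score"])
print(round(r, 3))  # → 0.996
```

The same call accepts method="spearman" when the relationship is monotonic but not linear.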
| Tool | Correlation Function | Free? |
|---|---|---|
| Python (scipy) | stats.pearsonr(x, y) | Yes |
| R | cor(x, y, method="pearson") | Yes |
| Excel | =CORREL(array1, array2) | Paid (free online) |
| SPSS | Analyze → Correlate → Bivariate | Paid |
| Tableau | Built-in scatter + trend line | Freemium |
Common Scatter Plot and Correlation Mistakes
| # | Mistake | Correct Approach |
|---|---|---|
| 1 | Computing r and stopping — never looking at the scatter plot | Plot first, always. Anscombe's Quartet proves that identical r values can represent four different data structures. |
| 2 | Treating a statistically significant r as proof of practical importance | With n = 10,000, even r = 0.03 reaches p < 0.05. Report r alongside p — the effect size matters more than significance alone. |
| 3 | Concluding that correlation implies causation | Causation requires a plausible mechanism, temporal precedence, and ruling out confounds. Correlation is a necessary first step, not a sufficient conclusion. |
| 4 | Using Pearson's r on non-linear or ordinal data | Use Spearman's ρ for ranked/ordinal data or when the relationship is monotonic but not linear. Verify linearity with a scatter plot first. |
| 5 | Ignoring outliers when they are visible on the scatter plot | A single outlier can shift r by 0.1–0.3 in small samples. Report r with and without the outlier, and investigate its source before deciding whether to include it. |
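Mistake #5 is easy to demonstrate. In this sketch (assuming numpy; the dataset and the outlier point are invented for illustration), one wild observation drags r from near-perfect down to weak:

```python
import numpy as np

# A small sample with a clear positive trend
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2, 3, 5, 6, 6, 8, 9, 10])

r_clean = np.corrcoef(x, y)[0, 1]

# Append one hypothetical outlier: large x, tiny y
x_out = np.append(x, 9)
y_out = np.append(y, 1)

r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(round(r_clean, 3), round(r_outlier, 3))  # → 0.989 0.394
```

This is why the table recommends reporting r with and without the outlier: a single point moved the conclusion from "very strong" to "weak", and only the scatter plot tells you which number to trust.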
Formula Reference — Scatter Plots & Correlation
The table below condenses every key formula from this guide into one scannable reference.
| Formula / Term | Expression | Range | What It Measures |
|---|---|---|---|
| Pearson's r | \( r = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{(n-1)s_x s_y} \) | −1 to +1 | Strength and direction of linear relationship |
| Coefficient of Determination | \( R^2 = r^2 \) | 0 to 1 | Proportion of variance in Y explained by X |
| t-statistic for r | \( t = r\sqrt{\frac{n-2}{1-r^2}} \) | −∞ to +∞ | Tests H₀: ρ = 0; df = n − 2 |
| Spearman's ρ | \( \rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)} \) | −1 to +1 | Monotonic relationship (non-parametric) |
| Covariance | \( s_{xy} = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{n-1} \) | Unbounded | Direction of joint variability (unnormalized) |
| Standard Error of r | \( SE_r = \sqrt{\frac{1-r^2}{n-2}} \) | ≥ 0 | Sampling variability of the r estimate |
| r = +1.0 | Perfect positive linear | — | All points lie exactly on an upward line |
| r = 0.0 | No linear correlation | — | Knowing X gives no linear information about Y |
| r = −1.0 | Perfect negative linear | — | All points lie exactly on a downward line |
Continue Learning at Statistics Fundamentals
Related Topics
Scatter plots and correlation connect directly to regression, hypothesis testing, and the foundations of data visualization. The guides below form the natural reading sequence before and after this topic.
- Data Visualization — Parent section: chart types, when to use each, and how to make them clear
- Descriptive Statistics — Mean, variance, and standard deviation — the inputs every correlation formula requires
- Hypothesis Testing — How to formally test whether a computed r reflects a real population correlation
- Normal Distribution — Pearson's r assumes approximate bivariate normality; this guide explains what that means
- Sampling Distributions — Why r from a sample differs from the population ρ, and how to quantify that uncertainty
- Confidence Intervals — How to build a 95% CI for r using Fisher's z-transformation
- Z-Score — Pearson's r can be written as the average product of paired z-scores, so standardization underlies the formula
- Statistics and Probability — Foundational probability concepts that p-values and correlation significance depend on
- Statistics Calculators — Full suite of computational tools including the correlation and descriptive statistics calculators
- Statistics Glossary — Quick definitions for every term used in this guide
- Khan Academy — Scatter Plots & Correlation — Interactive exercises covering scatter plot interpretation and Pearson's r
- NIST Engineering Statistics Handbook — Scatter Plots — Authoritative reference from the National Institute of Standards and Technology
- OpenIntro Statistics (free PDF) — Open-source textbook with chapters on correlation and linear regression
- Wikipedia — Anscombe's Quartet — Original paper context: Anscombe (1973), The American Statistician
- Orben & Przybylski (2019) — Nature Human Behaviour — Preregistered analysis of screen time and adolescent well-being correlation