BY: Statistics Fundamentals Team
Reviewed By: Minsa A (Senior Statistics Editor)

Simple Linear Regression Calculator

Find the line of best fit for any paired dataset using the least squares method. Enter your X and Y values, and the calculator returns the regression equation ŷ = mx + b, slope, y-intercept, R² value, Pearson's r, and a residuals table — with full step-by-step working.

Simple Linear Regression Calculator

Equation ŷ = β₀ + β₁x Slope β₁ = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²)

Enter one X and Y pair per row, separated by a comma, space, or tab. You need at least 3 data pairs.

One value per line (predictor variable)
One value per line (response variable)
Formula ŷ = β₀ + β₁ ⋅ X Status Run Regression tab first

First calculate regression in the left tab, then enter any X value here to get the predicted Y. The equation ŷ = β₀ + β₁x is applied automatically.

Enter the independent variable value

What Is Simple Linear Regression?

Simple linear regression is a method for modeling the straight-line relationship between one independent variable (X) and one continuous dependent variable (Y). The goal is to find the single line that minimizes the total squared distance between each observed data point and the line itself — a technique called ordinary least squares (OLS).

Think of how gas mileage drops as vehicle weight increases, or how monthly energy bills climb as winter temperatures fall. In both cases, one numeric variable reliably tracks another in a roughly straight-line pattern. Regression puts a precise mathematical equation on that pattern, so you can predict an unknown Y value from any given X. The resulting line is called the line of best fit, or the regression line.

According to the NIST/SEMATECH Engineering Statistics Handbook, linear regression is the most widely applied modeling technique in statistics because its outputs — slope, intercept, and R² — are interpretable without specialist training, yet the method underlies everything from econometric forecasts to clinical dose-response modeling.

Scatter plot with simple linear regression line showing slope, intercept, and residuals labeled.

The Simple Linear Regression Formula

The regression equation takes the form ŷ = β₀ + β₁x, where β₁ is the slope and β₀ is the y-intercept. Both are derived from your data using least squares formulas that no amount of eyeballing or guessing can match for accuracy.

Slope (β₁)

β₁ = [nΣXY − ΣX ⋅ ΣY] / [nΣX² − (ΣX)²] n = number of data pairs XY = product of each pair X² = square of each X value

Y-Intercept (β₀)

β₀ = (ΣY − β₁ΣX) / n Equivalently: β₀ = Ȳ − β₁X̄ X̄ = mean of X values Ȳ = mean of Y values

Coefficient of Determination (R²)

R² = 1 − SSE / SST SSE = Σ(Y − Ŷ)² (residuals SS) SST = Σ(Y − Ȳ)² (total SS) SSR = SST − SSE (regression SS) Range: 0 ≤ R² ≤ 1

Pearson's Correlation (r)

r = [nΣXY − ΣXΣY] / √{[nΣX²−(ΣX)²][nΣY²−(ΣY)²]} Range: −1 ≤ r ≤ +1 R² = r² (always) sign(r) = sign(β₁)

The slope β₁ tells you how many units Y is expected to increase for each one-unit increase in X. A slope of 150 in a house-price model means each additional square foot is associated with $150 more in predicted price. The intercept β₀ is Y when X = 0 — sometimes interpretable (as a baseline cost), sometimes not (a house with zero square feet has no meaning, so β₀ is a mathematical anchor rather than a real prediction).

Step-by-Step Worked Example: Real Estate Pricing

Problem: A real estate analyst collects data on five properties. Does square footage (X) linearly predict sale price (Y)? Find the regression equation, R², and the predicted price for a 1,900 sq ft home.

The dataset below maps square footage to sale price. You can load this exact example into the calculator above using the "Example (Real Estate)" button.

Property X (sq ft) Y (Price $) XY Ŷ (Predicted) Residual e
House 11,200200,0001,440,000240,000,000197,500+2,500
House 21,500245,0002,250,000367,500,000242,500+2,500
House 31,800290,0003,240,000522,000,000287,500+2,500
House 42,100335,0004,410,000703,500,000332,500+2,500
House 52,400380,0005,760,000912,000,000377,500+2,500
Totals 9,000 1,450,000 17,100,000 2,745,000,000

n = 5, ΣX = 9,000, ΣY = 1,450,000, ΣX² = 17,100,000, ΣXY = 2,745,000,000

Step 1 — Calculate the slope (β₁)

β₁ = [5 × 2,745,000,000 − 9,000 × 1,450,000] / [5 × 17,100,000 − 9,000²]
= [13,725,000,000 − 13,050,000,000] / [85,500,000 − 81,000,000]
= 675,000,000 / 4,500,000 = 150

Step 2 — Calculate the y-intercept (β₀)

X̄ = 9,000 / 5 = 1,800    Ȳ = 1,450,000 / 5 = 290,000
β₀ = Ȳ − β₁ ⋅ X̄ = 290,000 − 150 × 1,800 = 290,000 − 270,000 = 20,000

Step 3 — Write the regression equation

ŷ = 20,000 + 150x
Each additional square foot adds $150 to predicted price, with a baseline of $20,000.

Step 4 — Compute R²

SST = (200,000−290,000)² + (245,000−290,000)² + … = 28,250,000,000
SSE = 2,500² × 5 = 31,250,000
R² = 1 − 31,250,000 / 28,250,000,000 ≈ 0.9989

Step 5 — Predict price for 1,900 sq ft

ŷ = 20,000 + 150 × 1,900 = 20,000 + 285,000 = $305,000

Result: Regression equation is ŷ = 20,000 + 150x. R² = 0.9989 means square footage explains 99.89% of the variance in price — a near-perfect linear fit in this small, uniform dataset. A 1,900 sq ft home is predicted to sell for $305,000. Use the "Example (Real Estate)" button in the calculator above to verify each step.

Interpreting Your Results: R², Pearson's r, and Residuals

Three outputs carry most of the interpretive weight in a simple linear regression: the equation itself, R², and the residual pattern. Getting all three right is what separates a useful model from a misleading one.

Residual plot showing randomly distributed residuals around zero, indicating homoscedasticity.
Metric Formula What It Measures Interpretation Guide
Regression Equation ŷ = β₀ + β₁x Predicts Y from any X value β₁ is the change in Y per unit of X; β₀ is Y when X = 0
Slope (β₁) (nΣXY−ΣXΣY) / (nΣX²−(ΣX)²) Rate of change in Y per unit X Positive = Y rises as X rises; Negative = inverse relationship
Y-Intercept (β₀) Ȳ − β₁X̄ Predicted Y when X = 0 May not be meaningful if X = 0 is outside the data range
1 − SSE/SST Proportion of Y variance explained by X 0 = no fit; 1 = perfect fit; context determines what is "good"
Pearson's r √R² ⋅ sign(β₁) Strength and direction of linear association |r| > 0.7 = strong; 0.4–0.7 = moderate; <0.4 = weak
Residual (e) Y − Ŷ Prediction error for each observation Should be random; patterns suggest violated assumptions
SSE Σ(Y−Ŷ)² Total unexplained variation Lower is better; used to calculate R² and standard error
SST Σ(Y−Ȳ)² Total variation in Y Baseline; SSR + SSE always equals SST

Difference Between r and R²

Pearson's r and R² both measure model fit, but they serve different questions. An r of +0.91 tells you the relationship is strong and positive. R² = 0.83 (which is 0.91²) tells you the model accounts for 83% of the variance in Y — leaving 17% unexplained by X. In simple linear regression these two always relate by R² = r². In multiple regression, R² still works the same way but r no longer applies the same connection.

Assumptions of Simple Linear Regression (LINE)

Regression results are only trustworthy when the four assumptions hold. Statisticians use the acronym LINE to remember them. Violating these doesn't always break the model — but it does change which conclusions you can draw.

L
Linearity

The relationship between X and Y is linear — a straight line accurately represents the pattern. Check with a scatter plot before running regression. If the cloud of points curves, a linear model will miss the shape of the data.

I
Independence

Observations are independent of one another. Violated in time-series data (today's value depends on yesterday's) and in clustered samples. Independence cannot be checked from the data alone — it depends on your study design.

N
Normality

The residuals — not the raw data — are approximately normally distributed. This matters most for small samples when using the model for inference (confidence intervals, p-values). Large samples are robust by the Central Limit Theorem.

E
Equal Variance (Homoscedasticity)

Residual variance stays constant across all X values. A funnel shape in a residual plot — where scatter fans out as X increases — signals heteroscedasticity. Weighted least squares or log-transformation often fixes this.

Penn State's STAT 501 course, available via online.stat.psu.edu, provides a thorough walkthrough of diagnosing each assumption using residual plots, Q-Q plots, and the Durbin-Watson test for independence. Their materials are widely cited in applied statistics curricula for good reason — the worked examples track industrial and biological datasets where at least one assumption is always slightly bent.

Simple Linear Regression in Python, Excel, and R

For analysts working with larger datasets, the following code snippets replicate exactly what this calculator computes — useful for verifying outputs before integrating regression into a larger workflow.

Python (NumPy / SciPy / Scikit-Learn)

import numpy as np
from scipy import stats

x = np.array([1200, 1500, 1800, 2100, 2400])
y = np.array([200000, 245000, 290000, 335000, 380000])

slope, intercept, r, p_value, std_err = stats.linregress(x, y)

print(f"Slope (β₁):     {slope:.4f}")        # 150.0
print(f"Intercept (β₀): {intercept:.4f}")    # 20000.0
print(f"R² value:       {r**2:.4f}")         # 0.9989
print(f"Pearson r:      {r:.4f}")            # 0.9994
print(f"Predicted (1900 sqft): {intercept + slope * 1900:.0f}")  # 305000

Microsoft Excel (LINEST function)

=LINEST(B2:B6, A2:A6, TRUE, TRUE)   // Returns full stats array: slope, intercept, R², etc.

// Individual outputs:
=SLOPE(B2:B6, A2:A6)               // Slope β₁  → 150
=INTERCEPT(B2:B6, A2:A6)           // Intercept β₀ → 20000
=RSQ(B2:B6, A2:A6)                 // R² → 0.9989
=CORREL(A2:A6, B2:B6)              // Pearson r → 0.9994
=FORECAST.LINEAR(1900, B2:B6, A2:A6)  // Predicted Y → 305000

Note: Excel's LINEST vs. this calculator agrees to all decimal places on slope and intercept. The calculator above vs. =LINEST() is a useful cross-check students can run side by side for any homework dataset.

R

x <- c(1200, 1500, 1800, 2100, 2400)
y <- c(200000, 245000, 290000, 335000, 380000)

model <- lm(y ~ x)
summary(model)
# Coefficients: (Intercept) = 20000, x = 150
# R-squared: 0.9989

predict(model, newdata = data.frame(x = 1900))  # 305000
plot(x, y); abline(model, col = "blue")         # Scatter + regression line

Simple vs. Multiple Linear Regression: When to Use Which

The choice between simple and multiple regression comes down to how many independent variables you need to describe reality accurately. Simple regression works when one predictor genuinely captures most of the variation. Once a second variable adds meaningful explanatory power — and is not just noise — multiple regression is the appropriate extension.

FeatureSimple Linear RegressionMultiple Linear Regression
PredictorsOne (X)Two or more (X₁, X₂, …)
Equationŷ = β₀ + β₁xŷ = β₀ + β₁x₁ + β₂x₂ + …
Slope interpretationChange in Y per unit XChange in Y per unit X₆ holding all other X constant
Fit metricAdjusted R² (penalizes extra predictors)
Typical useInitial exploration; one clear driverReal-world prediction where multiple factors matter
New riskOmitted variable biasMulticollinearity between predictors

For a full treatment of extending the model beyond one predictor, see the multiple linear regression guide on this site. When your outcome is categorical rather than continuous, logistic regression is the standard choice — it models the log-odds of a binary event rather than a continuous Y.

Related Topics on Statistics Fundamentals

Linear regression connects to probability, hypothesis testing, and visualization. These pages build the surrounding foundation.

Sources and Further Reading

Authority sources cited in this guide:

  • National Institute of Standards and Technology (NIST). NIST/SEMATECH e-Handbook of Statistical Methods — Linear Least Squares Regression. itl.nist.gov
  • Penn State Eberly College of Science. STAT 501: Regression Methods — Lesson 1: Simple Linear Regression. online.stat.psu.edu
  • UCLA Institute for Digital Research and Education. Introduction to Linear Regression Analysis. stats.oarc.ucla.edu
  • MIT OpenCourseWare. 18.650 Statistics for Applications — Regression Models. ocw.mit.edu
  • Montgomery, D.C., Peck, E.A., & Vining, G.G. Introduction to Linear Regression Analysis, 5th ed. Wiley, 2012.
  • Wikipedia contributors. Simple linear regression. en.wikipedia.org

Frequently Asked Questions

The regression equation is ŷ = β₀ + β₁x. The slope is β₁ = [nΣXY − ΣXΣY] / [nΣX² − (ΣX)²] and the y-intercept is β₀ = Ȳ − β₁X̄. Both are derived from your data using the least squares method, which minimizes the sum of squared residuals between observed Y values and the regression line.

R² (the coefficient of determination) is the proportion of total variance in Y that the regression model explains. An R² of 0.85 means 85% of the spread in Y is accounted for by X alone; the remaining 15% is unexplained variation. R² ranges from 0 to 1. Whether a given value is "good" depends on context — physical sciences often see R² above 0.95, while social science models with R² of 0.40 can still be valuable.

Pearson's r is the correlation coefficient — it ranges from −1 to +1 and captures both the strength and direction of the linear relationship. R² is simply r squared (always positive) and tells you the fraction of variance explained. An r of −0.90 means a strong negative relationship; its R² = 0.81 means the model explains 81% of variance regardless of direction. In simple linear regression, R² = r² exactly. In multiple regression, R² extends naturally but r no longer applies the same way.

A residual is the vertical gap between an observed data point and the regression line: e = Y − Ŷ. Points above the line have positive residuals; points below have negative ones. The least squares method minimizes Σe² — the sum of all squared residuals — which is why it produces the best possible straight-line fit. Examining residuals after fitting is essential: patterns (curves, funnels, clusters) mean the linear model isn't capturing all the structure in your data.

The four assumptions (LINE): Linearity — Y and X have a linear relationship, visible in a scatter plot; Independence — observations don't depend on each other (violated in time-series without correction); Normality — residuals are approximately normally distributed (test with a Q-Q plot); Equal variance / Homoscedasticity — residual spread is constant across all X values (a funnel pattern in a residual plot signals a problem). Violating these doesn't always invalidate the model but does limit which inferences you can trust.

Simple linear regression uses one predictor variable: ŷ = β₀ + β₁x. Multiple linear regression uses two or more: ŷ = β₀ + β₁x₁ + β₂x₂ + …. In multiple regression, each slope β₆ represents the effect of that variable while holding all other predictors constant — which changes the interpretation substantially. See the multiple linear regression guide for the full extension with adjusted R² and F-test details.

Outliers at the extremes of X — points with high leverage — can pull the regression line toward them and substantially change the slope. A single high-leverage outlier can inflate R² (making the model look better than it is) or deflate it, depending on where the point falls relative to the true trend. Always examine the scatter plot and residuals table before trusting the model. If a genuine data error is found, remove or correct it. If the outlier is a real observation, consider reporting results with and without it.

No. Regression reveals statistical association — not causation. A high R² means X is a reliable predictor of Y in the observed data, but the causal mechanism could run the other direction, both variables might respond to a third confounding variable, or the pattern could be coincidental. Establishing causation requires either a controlled experiment with random assignment or a well-designed natural experiment. This is why researchers carefully distinguish "X predicts Y" from "X causes Y" when reporting regression findings.