BY: Statistics Fundamentals Team
Reviewed By: Minsa A (Senior Statistics Editor)

Regression Line Scatter Plot

Create publication-ready regression scatter plots from your X,Y data. The tool draws the scatter plot, fits the least-squares regression line, plots confidence and prediction bands, and produces a complete regression analysis: R², ANOVA table, t-test on slope, residuals table, and step-by-step calculations.

Regression Scatter Plot Maker

Equation: ŷ = b₀ + b₁x
Least squares minimizes: Σ(yᵢ − ŷᵢ)²
CI for mean: ŷ ± t* · Sₑ · √(1/n + (x−x̄)²/Sxx)

Generate the regression plot first, then enter an X value below to get the predicted Ŷ with confidence interval (for mean response) and prediction interval (for individual observation).

What Is Simple Linear Regression?

Simple linear regression fits a straight line through a set of X,Y data points, choosing the line that minimizes the sum of squared residuals: the squared vertical distances between each observed Y and the predicted Ŷ on the line. The resulting line is called the least-squares regression line, and its equation is ŷ = b₀ + b₁x, where b₀ is the y-intercept and b₁ is the slope.

According to the NIST Engineering Statistics Handbook, linear regression is one of the most widely used statistical methods in science, engineering, economics, and the social sciences. It serves two primary purposes: explaining the relationship between variables (inference) and predicting future Y values for given X values (prediction).

Understanding the Regression Equation

Slope (b₁): The amount Y changes for every 1-unit increase in X. A slope of 2.3 means "on average, Y increases by 2.3 units when X increases by 1." Calculated as b₁ = Sxy/Sxx, where Sxy = Σ(xᵢ−x̄)(yᵢ−ȳ) and Sxx = Σ(xᵢ−x̄)².
Y-intercept (b₀): The predicted value of Y when X = 0. Calculated as b₀ = ȳ − b₁x̄. The intercept is not always meaningful — if X = 0 is outside the data range, the intercept is a mathematical artifact and should not be interpreted literally.
R² (coefficient of determination): The proportion of variance in Y explained by the linear model. R² = SSR/SST = 1 − SSE/SST. An R² of 0.84 means 84% of the variation in Y is explained by the regression model. The remaining 16% is unexplained residual variation.
Standard error of estimate (Sₑ): The typical distance between observed Y values and the regression line, in the same units as Y. Sₑ = √(SSE/(n−2)), where n−2 accounts for the two estimated coefficients b₀ and b₁. A smaller Sₑ means the data lie closer to the line. Sₑ is used in both confidence intervals and prediction intervals.
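
The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation; the function name fit_line and the five data points are made up for the example.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit; returns b0, b1, R^2 and the standard error of estimate."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx               # slope = Sxy / Sxx
    b0 = y_bar - b1 * x_bar      # intercept = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    r2 = 1 - sse / sst
    se = math.sqrt(sse / (n - 2))
    return b0, b1, r2, se

# Made-up illustrative data, not taken from the tool.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1, r2, se = fit_line(xs, ys)
print(f"y-hat = {b0:.2f} + {b1:.2f}x, R^2 = {r2:.2f}, Se = {se:.3f}")
```

For this data Sxx = 10 and Sxy = 6, so the sketch gives ŷ = 2.20 + 0.60x with R² = 0.60 and Sₑ ≈ 0.894.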

Confidence Bands vs Prediction Intervals

Interval type | Estimates | Width | Formula
Confidence band (CI) | Mean Y over all observations at X = x* | Narrower; approaches 0 as n→∞ | ŷ ± t* · Sₑ · √(1/n + (x*−x̄)²/Sxx)
Prediction interval (PI) | An individual new observation at X = x* | Always wider than the CI | ŷ ± t* · Sₑ · √(1 + 1/n + (x*−x̄)²/Sxx)

Both intervals widen as x* moves farther from the mean x̄, because the (x*−x̄)² term under the square root grows; this gives the confidence band its "bowtie" shape. The farther you predict from the center of the data, the less reliable the estimate.
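
As a rough sketch of the two interval formulas, the Python below computes the half-widths of the CI and PI at a few values of x*. The function name intervals and the data are hypothetical, and the critical value t* is passed in by hand (read from a t table) rather than computed, to keep the example free of external libraries.

```python
import math

def intervals(xs, ys, x_star, t_star):
    """Return (y_hat, ci_half, pi_half) at x_star.

    t_star is the critical t value for df = n - 2. It is supplied by the
    caller (e.g. from a t table); this sketch does not compute it.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2))
    h = 1 / n + (x_star - x_bar) ** 2 / sxx  # term shared by both formulas
    y_hat = b0 + b1 * x_star
    ci_half = t_star * se * math.sqrt(h)      # mean response
    pi_half = t_star * se * math.sqrt(1 + h)  # individual observation
    return y_hat, ci_half, pi_half

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
T_STAR = 3.182  # two-sided 95% critical t for df = 3, read from a t table

results = {x_star: intervals(xs, ys, x_star, T_STAR) for x_star in (3, 5, 8)}
for x_star, (y_hat, ci, pi) in results.items():
    print(f"x* = {x_star}: y-hat = {y_hat:.2f}, CI +/- {ci:.2f}, PI +/- {pi:.2f}")
```

Here x* = 3 is the mean of the X values, x* = 5 is the edge of the data, and x* = 8 is an extrapolation. Both half-widths grow in that order, and the PI is wider than the CI at every point.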

Testing Whether the Slope Is Significant

The t-test for the slope tests H₀: β₁ = 0 (no linear relationship) against H₁: β₁ ≠ 0. The test statistic is t = b₁/Sb₁, where Sb₁ = Sₑ/√Sxx is the standard error of the slope. With df = n−2, compare to the critical t value. If |t| exceeds the critical value (or p < α), reject H₀ and conclude the slope is significantly different from zero — evidence of a linear relationship.
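
The test can be sketched as follows. This is an illustration under stated assumptions: the function name slope_t_test and the data are made up, and the critical value is supplied by the caller from a t table instead of being computed.

```python
import math

def slope_t_test(xs, ys, t_crit):
    """Return (t_stat, df, reject_h0) for H0: beta1 = 0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2))
    sb1 = se / math.sqrt(sxx)   # standard error of the slope
    t_stat = b1 / sb1
    return t_stat, n - 2, abs(t_stat) > t_crit

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# 3.182 is the two-sided 5% critical t for df = 3, read from a t table.
t_stat, df, reject = slope_t_test(xs, ys, t_crit=3.182)
print(f"t = {t_stat:.3f} on {df} df, reject H0: {reject}")
```

With only five points, the slope of 0.6 gives t ≈ 2.121, which falls short of the critical value, so H₀ is not rejected at α = 0.05 in this example.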

Regression Assumptions (LINE)

L — Linearity: The relationship between X and Y is truly linear. Check by looking at the scatter plot — is a straight line reasonable? A curved pattern suggests a non-linear model (polynomial, logarithmic) may fit better.
I — Independence: Observations are independent of each other. Violated in time-series data (autocorrelation). Check with the Durbin-Watson test or by plotting residuals in order.
N — Normality of residuals: Residuals eᵢ = yᵢ − ŷᵢ should follow a normal distribution. Check with a normal probability plot (Q-Q plot) of residuals or the Shapiro-Wilk test.
E — Equal variance (homoscedasticity): The spread of residuals should be roughly constant across all values of X. Fan-shaped patterns in the residual plot indicate heteroscedasticity — a violation requiring transformation or weighted regression.
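
As one concrete example of these checks, the Durbin-Watson statistic mentioned under Independence is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below assumes the observations are listed in time order; the function name, the data, and the fitted coefficients are made up for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in observation order."""
    num = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den

# Made-up data, with residuals from its least-squares line y-hat = 2.2 + 0.6x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
residuals = [y - (2.2 + 0.6 * x) for x, y in zip(xs, ys)]

dw = durbin_watson(residuals)
print(f"Durbin-Watson = {dw:.2f}")
```

The statistic lies between 0 and 4. Values near 2 suggest little first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. A formal test compares the statistic with tabulated bounds, which this sketch does not do.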

Sources & further reading:

  • NIST Engineering Statistics Handbook — Linear Regression
  • Montgomery, D.C., Peck, E.A., & Vining, G.G. (2012). Introduction to Linear Regression Analysis, 5th ed. Wiley.
  • Kutner, M.H. et al. (2004). Applied Linear Statistical Models, 5th ed. McGraw-Hill. [Standard reference for regression diagnostics]

Frequently Asked Questions

What is the difference between correlation and regression?

Correlation (r) measures the strength and direction of the linear relationship between X and Y; it is a dimensionless number between −1 and +1. Regression goes further: it finds the equation of the line (ŷ = b₀ + b₁x) that best describes how Y changes as X changes, and allows you to make predictions. Correlation is symmetric (the r between X and Y equals the r between Y and X); regression is not (swapping X and Y gives a different line).
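
A small Python sketch makes the symmetry point concrete. The helper names and data are made up for illustration.

```python
import math

def cross_dev(us, vs):
    """Sum of cross-deviations: sum((u - u_bar) * (v - v_bar))."""
    u_bar = sum(us) / len(us)
    v_bar = sum(vs) / len(vs)
    return sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))

def corr(us, vs):
    """Pearson correlation coefficient r."""
    return cross_dev(us, vs) / math.sqrt(cross_dev(us, us) * cross_dev(vs, vs))

def slope(xs, ys):
    """Slope of the least-squares line predicting ys from xs."""
    return cross_dev(xs, ys) / cross_dev(xs, xs)

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_xy = corr(xs, ys)
r_yx = corr(ys, xs)
b_y_on_x = slope(xs, ys)   # predicting Y from X
b_x_on_y = slope(ys, xs)   # predicting X from Y

print(f"r(X,Y) = {r_xy:.4f}, r(Y,X) = {r_yx:.4f}")
print(f"slope Y on X = {b_y_on_x:.2f}, slope X on Y = {b_x_on_y:.2f}")
```

The two correlations are identical, while the two slopes differ (0.60 versus 1.00 here). The slopes are still linked: their product equals r².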

What does R² mean?

R² measures the proportion of variance in Y that is explained by the linear model. R² = 0.75 means 75% of the variation in Y is accounted for by the regression line; the remaining 25% is unexplained. R² ranges from 0 (the line explains nothing) to 1 (perfect fit). Context matters: an R² of 0.3 can be meaningful in social science, while an R² of 0.95 might be expected in controlled lab experiments.

How do I interpret the slope?

The slope b₁ tells you: for every 1-unit increase in X, Y is predicted to change by b₁ units on average. If b₁ = 2.5, X is "hours studied," and Y is "exam score," then each additional hour of study is associated with 2.5 more points on average. The slope is statistically significant only if the p-value for the t-test on b₁ is less than α (typically 0.05).

What is a residual?

A residual is the difference between the observed Y and the predicted Ŷ: eᵢ = yᵢ − ŷᵢ. Positive residuals mean the actual value was above the line; negative residuals mean it was below. In a good regression model, residuals should be randomly scattered around zero with no pattern. Systematic patterns in residuals (curves, fan shapes, outliers) suggest model violations that should be investigated.
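
A residuals table can be sketched as below. The function name residual_table and the data are made up, and the coefficients b0 = 2.2 and b1 = 0.6 are the least-squares fit for this particular data set, passed in by hand.

```python
def residual_table(xs, ys, b0, b1):
    """Return rows of (x, y, y_hat, residual) for the line y_hat = b0 + b1*x."""
    rows = []
    for x, y in zip(xs, ys):
        y_hat = b0 + b1 * x
        rows.append((x, y, y_hat, y - y_hat))
    return rows

# Made-up illustrative data and its least-squares coefficients.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
rows = residual_table(xs, ys, b0=2.2, b1=0.6)

print(f"{'x':>3} {'y':>3} {'y-hat':>6} {'e':>6}")
for x, y, y_hat, e in rows:
    print(f"{x:>3} {y:>3} {y_hat:>6.2f} {e:>6.2f}")

total = sum(e for *_, e in rows)
```

When the line is the least-squares fit with an intercept, the residuals sum to zero (up to floating-point rounding), so total is a quick sanity check on the coefficients.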

What is extrapolation, and why is it risky?

Extrapolation means predicting Y for an X value outside the range of the original data. A regression line is only validated within the observed X range; extending it beyond that range assumes the linear relationship continues indefinitely, which is rarely true. Prediction intervals grow very wide outside the data range, and the linear pattern may break down. Use the Predict tab, but stay within the data's X range for reliable estimates.

What does the ANOVA table show?

The ANOVA table partitions the total variation in Y (SST) into two components: SSR (variation explained by the regression model) and SSE (unexplained residual variation), so SST = SSR + SSE. The F-statistic F = MSR/MSE = (SSR/1)/(SSE/(n−2)) tests whether the overall regression model is significant. In simple linear regression, F equals the square of the t statistic for the slope, so the two tests agree. A large F with a small p-value means the linear model explains a significant portion of the variation in Y.
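
The sums of squares and the F-statistic can be computed with a short sketch like this one. The function name anova and the data are made up for illustration, and the p-value is left out because it needs an F distribution that the Python standard library does not provide.

```python
def anova(xs, ys):
    """Sums of squares, mean squares and F for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hats = [b0 + b1 * x for x in xs]
    ssr = sum((yh - y_bar) ** 2 for yh in y_hats)          # explained
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))  # residual
    sst = sum((y - y_bar) ** 2 for y in ys)                # total
    msr = ssr / 1          # regression df = 1
    mse = sse / (n - 2)    # residual df = n - 2
    return {"SSR": ssr, "SSE": sse, "SST": sst,
            "MSR": msr, "MSE": mse, "F": msr / mse}

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
table = anova(xs, ys)

print(f"Regression  SS = {table['SSR']:.2f}  df = 1  MS = {table['MSR']:.2f}")
print(f"Residual    SS = {table['SSE']:.2f}  df = {len(xs) - 2}  MS = {table['MSE']:.2f}")
print(f"Total       SS = {table['SST']:.2f}  df = {len(xs) - 1}")
print(f"F = {table['F']:.2f}")
```

For this data SSR = 3.6, SSE = 2.4 and SST = 6.0, giving F = 4.5 on 1 and 3 degrees of freedom.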