Linear Regression Overview

Jul 24, 2025

Overview

This lecture introduces the main ideas of linear regression, focusing on fitting lines to data, calculating r squared, and determining statistical significance with p-values.

Introduction to Linear Regression

  • Linear regression (general linear models) is a method for modeling relationships between variables.
  • The three main steps are fitting a line using least squares, calculating r squared, and determining a p-value for r squared.

Fitting a Line Using Least Squares

  • Least squares fitting involves finding a line through data points that minimizes the sum of squared residuals.
  • A residual is the vertical distance between a data point and the fitted line, i.e., the observed value minus the value the line predicts.
  • The fit is found by rotating and shifting candidate lines until the sum of squared residuals is as small as possible.
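The steps above can be sketched numerically. This is a minimal illustration with made-up weight/size numbers; `np.polyfit` with `deg=1` finds the slope and intercept that minimize the sum of squared residuals.

```python
import numpy as np

# Hypothetical data: mouse weight (x) and mouse size (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Least squares: choose the slope and intercept that minimize
# sum((y - (slope*x + intercept))**2)
slope, intercept = np.polyfit(x, y, deg=1)

residuals = y - (slope * x + intercept)
ss_fit = np.sum(residuals ** 2)  # sum of squared residuals around the fit
print(slope, intercept, ss_fit)
```

For these numbers the best fit is roughly size = 0.97 × weight + 0.13; no other line gives a smaller sum of squared residuals.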

Calculating r Squared (R²)

  • R squared measures how much variation in the dependent variable (e.g., mouse size) can be explained by the independent variable (e.g., mouse weight).
  • R² = (SS(mean) − SS(fit)) / SS(mean), i.e., the variation around the mean minus the variation around the fit, divided by the variation around the mean.
  • Example: If r squared = 0.6, then 60% of the variation is explained by the model.
  • Using more variables (e.g., tail length) involves fitting a plane and estimating more parameters.
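The R² formula above translates directly into code. Continuing with the same made-up data, SS(mean) is the variation around the mean and SS(fit) is the variation left over around the fitted line:

```python
import numpy as np

# Same hypothetical data: mouse weight (x) and mouse size (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

slope, intercept = np.polyfit(x, y, deg=1)

ss_mean = np.sum((y - y.mean()) ** 2)                 # variation around the mean
ss_fit = np.sum((y - (slope * x + intercept)) ** 2)   # variation around the fit

# R² = fraction of the variation explained by the fitted line
r_squared = (ss_mean - ss_fit) / ss_mean
print(r_squared)
```

Here R² comes out close to 0.99, meaning nearly all of the variation in size is explained by weight in this toy example.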

Adjusted R Squared and Overfitting

  • Adding parameters never worsens the fit: at worst, least squares gives a useless parameter a coefficient of (near) zero, so SS(fit) can only stay the same or shrink.
  • Adjusted r squared corrects for the number of parameters to avoid overfitting.
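One common form of the correction (the one most statistics packages report) penalizes R² by the number of predictors p relative to the sample size n. A small sketch with made-up values:

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R²: penalize R² by the number of predictors p,
    given n data points. More predictors -> larger penalty."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# With R² = 0.6 and 10 samples, the penalty grows as predictors are added,
# so a model that "explains" the same variation with more parameters scores lower:
print(adjusted_r_squared(0.6, n=10, p=1))
print(adjusted_r_squared(0.6, n=10, p=4))
```

With one predictor the adjusted value is 0.55; with four predictors and the same raw R² it drops to 0.28, which is exactly the overfitting guard described above.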

Calculating Statistical Significance (p-value)

  • The F statistic is used to assess if the r squared is significant.
  • F = [(SS(mean) − SS(fit)) / (p_fit − p_mean)] / [SS(fit) / (n − p_fit)], where p_fit and p_mean are the numbers of parameters in the fitted model and in the mean-only model, and n is the number of data points.
  • Degrees of freedom account for the number of parameters estimated.
  • The p-value is the probability that random data would produce an F value at least as extreme as the observed one; in practice it is read off the tail of the F-distribution.
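Putting the pieces together for a simple line fit: the fitted model has two parameters (slope and intercept), the mean-only model has one, and the p-value is the F-distribution tail area beyond the observed F. This sketch assumes SciPy is available; the data are the same made-up numbers as above.

```python
import numpy as np
from scipy.stats import f as f_dist  # assumes SciPy is installed

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
n = len(y)

slope, intercept = np.polyfit(x, y, deg=1)
ss_mean = np.sum((y - y.mean()) ** 2)
ss_fit = np.sum((y - (slope * x + intercept)) ** 2)

p_fit, p_mean = 2, 1   # parameters in the fit (slope, intercept) vs. the mean alone
dfn = p_fit - p_mean   # degrees of freedom for the explained variation
dfd = n - p_fit        # degrees of freedom for the leftover variation

F = ((ss_mean - ss_fit) / dfn) / (ss_fit / dfd)
p_value = f_dist.sf(F, dfn, dfd)  # tail area of the F-distribution beyond F
print(F, p_value)
```

For this toy dataset F is large and the p-value is well below 0.05, so the high R² would be judged statistically significant.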

Summary

  • Linear regression quantifies relationships with r squared (should be high) and tests their significance with a p-value (should be low).
  • Both a large r squared and a small p-value are needed for meaningful results.

Key Terms & Definitions

  • Residual — The vertical distance between a data point and the fitted line or plane (observed value minus predicted value).
  • Least Squares — Method to fit a line by minimizing the sum of squared residuals.
  • r Squared (R²) — Proportion of variance in the dependent variable explained by the model.
  • Sum of Squares Around the Mean (SS mean) — Sum of squared differences between data points and their mean.
  • Sum of Squares Around the Fit (SS fit) — Sum of squared differences between data points and the fitted line.
  • Degrees of Freedom — The number of independent pieces of information remaining after the model's parameters have been estimated.
  • F Statistic — Ratio used to determine if the explained variance is statistically significant.
  • Adjusted R Squared — R squared adjusted for the number of predictors in the model.

Action Items / Next Steps

  • Review slides or materials on least squares fitting and r squared.
  • Practice calculating r squared and p-values for small datasets.
  • Read up on adjusted r squared and the F distribution for deeper understanding.