Overview
This lecture introduces the main ideas of linear regression, focusing on fitting lines to data, calculating r squared, and determining statistical significance with p-values.
Introduction to Linear Regression
- Linear regression (general linear models) is a method for modeling relationships between variables.
- The three main steps are fitting a line using least squares, calculating r squared, and determining a p-value for r squared.
Fitting a Line Using Least Squares
- Least squares fitting finds the line through the data that minimizes the sum of squared residuals.
- A residual is the vertical distance from a data point to the fitted line.
- Conceptually, the line is rotated (its slope and intercept adjusted) until the sum of squared residuals is as small as possible.
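The fitting step above can be sketched with the closed-form least-squares solution for a single predictor; the mouse weight and size values below are made up for illustration:

```python
# Minimal sketch: fit size = intercept + slope * weight by least squares.
# The closed-form slope/intercept below minimize the sum of squared residuals.
def least_squares_fit(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope

weights = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical mouse weights
sizes = [2.1, 3.9, 6.2, 7.8, 10.1]    # hypothetical mouse sizes
intercept, slope = least_squares_fit(weights, sizes)
# Residuals: vertical distances from each point to the fitted line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(weights, sizes)]
```

Any other slope/intercept pair would give a larger sum of squared residuals, which is what makes this fit the "least squares" fit.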
Calculating r Squared (R²)
- R squared measures how much variation in the dependent variable (e.g., mouse size) can be explained by the independent variable (e.g., mouse weight).
- r squared = (SS(mean) − SS(fit)) / SS(mean), i.e., (variation around the mean − variation around the fit) / variation around the mean.
- Example: If r squared = 0.6, then 60% of the variation is explained by the model.
- Using more variables (e.g., tail length) involves fitting a plane and estimating more parameters.
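The r squared calculation above follows directly from the two sums of squares; a minimal sketch, with made-up sizes and predictions from a previously fitted line:

```python
# r squared = (SS(mean) - SS(fit)) / SS(mean); data are illustrative.
def r_squared(observed, predicted):
    mean_y = sum(observed) / len(observed)
    # Variation around the mean (SS mean).
    ss_mean = sum((yi - mean_y) ** 2 for yi in observed)
    # Variation around the fitted line (SS fit).
    ss_fit = sum((yi - pi) ** 2 for yi, pi in zip(observed, predicted))
    return (ss_mean - ss_fit) / ss_mean

sizes = [2.1, 3.9, 6.2, 7.8, 10.1]                       # hypothetical mouse sizes
predicted = [0.05 + 1.99 * w for w in [1, 2, 3, 4, 5]]   # from a least-squares line fitted elsewhere
r2 = r_squared(sizes, predicted)                          # close to 1: weight explains most variation
```

A perfect fit gives r squared = 1 (SS fit is zero), while predicting the mean for every point gives r squared = 0.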
Adjusted R Squared and Overfitting
- Adding parameters never worsens the fit, because least squares can set useless parameters to zero, so SS(fit) never increases.
- Adjusted r squared corrects for the number of parameters to avoid overfitting.
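One standard form of this correction (not derived in the lecture) penalizes r squared by the number of predictors p relative to the sample size n; a minimal sketch with illustrative numbers:

```python
# Adjusted r squared: shrinks r squared to account for the number of predictors.
# Standard textbook formula; the n and p values below are made up.
def adjusted_r_squared(r2, n, p):
    """n = number of observations, p = number of predictors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj = adjusted_r_squared(r2=0.6, n=12, p=2)  # smaller than the raw 0.6
```

Adding a predictor that does not improve the raw r squared lowers the adjusted value, which is how the adjustment discourages overfitting.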
Calculating Statistical Significance (p-value)
- The F statistic is used to assess if the r squared is significant.
- F = [(SS(mean) − SS(fit)) / (p_fit − p_mean)] / [SS(fit) / (n − p_fit)]: the variation explained by the extra parameters divided by the variation left unexplained, each scaled by its degrees of freedom.
- Here p_fit and p_mean are the numbers of parameters in the fitted line and in the mean (p_mean = 1), and n is the number of data points.
- The p-value is the probability of getting an F value at least as extreme as the observed one by chance; it can be read from the F-distribution or approximated by generating many random datasets and tallying their F values.
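The F statistic for a simple one-predictor regression can be sketched as follows; the data and fitted line are the same illustrative values as above, and the parameter counts (p_fit = 2 for slope plus intercept, p_mean = 1) are the usual ones for this case:

```python
# F = [(SS(mean) - SS(fit)) / (p_fit - p_mean)] / [SS(fit) / (n - p_fit)].
# Data are illustrative; p_fit = 2 (slope + intercept), p_mean = 1 (mean only).
def f_statistic(observed, predicted, p_fit=2, p_mean=1):
    n = len(observed)
    mean_y = sum(observed) / n
    ss_mean = sum((yi - mean_y) ** 2 for yi in observed)
    ss_fit = sum((yi - pi) ** 2 for yi, pi in zip(observed, predicted))
    explained = (ss_mean - ss_fit) / (p_fit - p_mean)   # per extra parameter
    unexplained = ss_fit / (n - p_fit)                  # per residual degree of freedom
    return explained / unexplained

sizes = [2.1, 3.9, 6.2, 7.8, 10.1]                       # hypothetical mouse sizes
predicted = [0.05 + 1.99 * w for w in [1, 2, 3, 4, 5]]   # from a least-squares line fitted elsewhere
F = f_statistic(sizes, predicted)
# The p-value is the upper tail of the F-distribution with
# (p_fit - p_mean, n - p_fit) degrees of freedom, e.g. scipy.stats.f.sf(F, 1, 3).
```

A large F means the extra parameters explain far more variation than is left in the residuals, which is what pushes the p-value down.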
Summary
- Linear regression quantifies the strength of a relationship with r squared and tests its significance with a p-value.
- Both a large r squared and a small p-value are needed for meaningful results.
Key Terms & Definitions
- Residual — The distance between a data point and the fitted line or plane.
- Least Squares — Method to fit a line by minimizing the sum of squared residuals.
- r Squared (R²) — Proportion of variance in the dependent variable explained by the model.
- Sum of Squares Around the Mean (SS mean) — Sum of squared differences between data points and their mean.
- Sum of Squares Around the Fit (SS fit) — Sum of squared differences between data points and the fitted line.
- Degrees of Freedom — The number of independent pieces of information remaining after the model's parameters are estimated.
- F Statistic — Ratio used to determine if the explained variance is statistically significant.
- Adjusted R Squared — R squared adjusted for the number of predictors in the model.
Action Items / Next Steps
- Review slides or materials on least squares fitting and r squared.
- Practice calculating r squared and p-values for small datasets.
- Read up on adjusted r squared and the F distribution for deeper understanding.