Understanding Linear Regression Concepts

Sep 22, 2024

Lecture Notes on Linear Regression

Overview of Linear Regression

  • Assumption: The input X is drawn from some distribution over R^p; the output Y is a real number (Y ∈ R).
  • Input Matrix (X): Can be of dimensions n x (p + 1) or n x p.
    • n x (p + 1) is used when there is an explicit intercept (beta naught).
    • A column of ones is added to account for the intercept.
    • When no intercept is used, the input is p-dimensional.
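The construction above can be sketched in numpy; the toy values and variable names here are illustrative, not from the lecture:

```python
import numpy as np

# Toy data: n = 5 observations, p = 2 raw input variables (values made up for illustration).
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.5],
                  [3.0, 3.5],
                  [4.0, 2.0],
                  [5.0, 4.5]])
n, p = X_raw.shape

# Explicit intercept (beta naught): prepend a column of ones, giving an n x (p + 1) matrix.
X = np.hstack([np.ones((n, 1)), X_raw])
print(X.shape)  # (5, 3), i.e. n x (p + 1)
```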

Minimizing Sum Squared Error

  • Focus on minimizing the sum squared error in regression tasks.
  • Reviewed simple linear regression and multiple inputs.
  • Importance of interpreting multiple regression coefficients in terms of univariate regressions.
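A minimal numpy sketch of minimizing the sum of squared errors ||y - X beta||^2 for a multiple-input model; the data and true coefficients are synthetic, chosen only to illustrate the fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X_raw = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -0.5])        # [intercept, beta_1, beta_2], illustrative values
X = np.hstack([np.ones((n, 1)), X_raw])       # design matrix with intercept column
y = X @ beta_true + 0.1 * rng.normal(size=n)  # linear model plus small noise

# lstsq returns the coefficients minimizing the sum of squared errors.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_hat) ** 2)         # residual sum of squares at the minimum
```

With little noise, beta_hat recovers beta_true closely, consistent with least squares being unbiased when the linear model is correct.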

Drawbacks of Linear Regression

  1. High Variance: While least squares fitting provides low bias, it can have high variance.
  2. Bias-Variance Trade-off:
    • Least squares gives zero bias if linear model is correct.
    • Introducing constraints can reduce variance, but increase bias.
    • The goal is to reduce the complexity (number of variables) to achieve better prediction accuracy.

Subset Selection in Linear Regression

  • Subset Selection: Choosing a subset of input variables to fit the model.
    • Reduces variance and improves prediction accuracy.
    • Interpretability improves by focusing on fewer variables.
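Subset selection in its most direct (exhaustive) form can be sketched as follows; this is a generic illustration in numpy, not code from the lecture, and the toy data is made up:

```python
import numpy as np
from itertools import combinations

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

def best_subset(X_raw, y, k):
    """Among all size-k subsets of the input columns, return the subset with the
    smallest RSS (intercept always included). Cost grows combinatorially in p."""
    n = len(y)
    ones = np.ones((n, 1))
    best = None
    for cols in combinations(range(X_raw.shape[1]), k):
        r = rss(np.hstack([ones, X_raw[:, cols]]), y)
        if best is None or r < best[1]:
            best = (cols, r)
    return best

# Illustrative usage: only column 1 actually drives y, so best_subset should find it.
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(60, 4))
y = 3.0 * X_raw[:, 1] + rng.normal(scale=0.1, size=60)
cols, best_rss = best_subset(X_raw, y, k=1)
```

The exhaustive loop makes the computational cost concrete: with p variables there are C(p, k) subsets per size k, which is why more efficient procedures (leaps and bounds, stepwise methods) are discussed below.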

Techniques for Subset Selection

  1. Combinatorial Selection: Exhaustive search for best subsets is computationally expensive.
  2. Algorithms for Efficient Selection:
    • Leaps and Bounds: Efficient algorithm for subset selection.
    • Forward Stepwise Selection: Greedy approach that adds variables one at a time based on fit improvement.
      • Start with the intercept, add variables that improve the fit without disturbing selected variables.
      • Stop if no remaining variables improve the residual error.
  3. Backward Elimination: Start with all variables and remove one at a time.
    • Works only if the number of data points exceeds the number of dimensions (n > p), so the full model can be fit.
    • Greediness may not always yield the best fit, hence hybrid approaches might be beneficial.
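The forward stepwise procedure described above can be sketched in numpy; the stopping tolerance and the synthetic usage data are assumptions for illustration:

```python
import numpy as np

def forward_stepwise(X_raw, y, tol=1e-8):
    """Greedy forward selection: start from the intercept-only model and repeatedly
    add the variable that most reduces the RSS, stopping when no remaining
    variable improves the fit by more than tol."""
    n, p = X_raw.shape

    def rss(A):
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.sum((y - A @ beta) ** 2)

    selected = []
    X = np.ones((n, 1))          # start with the intercept only
    current = rss(X)
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        # Score each candidate by the RSS of the model with it added.
        best_rss, best_j = min(
            (rss(np.hstack([X, X_raw[:, [j]]])), j) for j in candidates
        )
        if current - best_rss <= tol:   # no remaining variable improves the fit
            break
        selected.append(best_j)
        X = np.hstack([X, X_raw[:, [best_j]]])
        current = best_rss
    return selected

# Illustrative usage: y depends only on columns 0 and 2, so the greedy search
# should select exactly those and then stop.
rng = np.random.default_rng(2)
X_raw = rng.normal(size=(100, 5))
y = 2.0 * X_raw[:, 0] - 1.0 * X_raw[:, 2]
chosen = forward_stepwise(X_raw, y)
```

Note the greedy character: each added variable is kept, so a variable that looks best early may crowd out a combination that would have fit better, which is the motivation for hybrid forward/backward approaches.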

Practical Applications and Observations

  • Many statistical packages utilize forward stepwise selection due to its effectiveness on various datasets.
  • Greedy approaches like forward selection may perform comparably to more exhaustive methods in real-world scenarios.