Linear Regression, Gradient Descent, and Normal Equations

Jul 21, 2024

Introduction

  • Focus: Linear regression, gradient descent, and normal equations
  • Importance of foundational concepts for future algorithms

Linear Regression

  • Simplest learning algorithm
  • Example: Predicting house prices (supervised learning)
  • Supervised learning: Mapping input (e.g., house size) to output (e.g., house price)
  • Output is a continuous value (regression problem)

Motivation

  • Example of house prices: Data from Craigslist, Portland, Oregon
  • Training set: Size and price of houses
  • Goal: Fit a straight line to the data
  • Hypothesis: Function that inputs house size and outputs estimated price

Notation and Hypothesis Representation

  • Hypothesis function (h): a linear function of the input features
    • Mathematically: h(x) = θ₀ + θ₁x₁ + θ₂x₂ + ...
    • For multiple input features: h(x) = Σ(θ_j * x_j) for j = 0 to n, where n = number of features and x₀ = 1 by convention (see the sketch after this list)
    • Parameters θ are chosen to fit the model
  • Notation: inputs x, outputs y; the i-th training example is written (x^(i), y^(i)), and m denotes the number of training examples
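
A minimal NumPy sketch of the hypothesis, assuming θ is stored as a vector and each input x carries a leading x₀ = 1; the example numbers are made up for illustration:

    import numpy as np

    def hypothesis(theta, x):
        # h(x) = theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n, with x_0 = 1 included in x
        return np.dot(theta, x)

    theta = np.array([50.0, 0.1])      # theta_0 (intercept), theta_1 (price per square foot)
    x = np.array([1.0, 2104.0])        # x_0 = 1, x_1 = house size in square feet
    print(hypothesis(theta, x))        # estimated price for this house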

Cost Function (J(θ))

  • Goal: Minimize the error between predicted and actual values
  • Cost function J(θ): Measures how well the hypothesis fits the training data
    • Sum of squared errors between predicted and actual prices
    • J(θ) = 1/(2m) Σ(h_θ(x^(i)) - y^(i))², summing i from 1 to m over the training examples
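
A short sketch of this cost function under the 1/(2m) scaling, assuming X is the design matrix (one row per training example, with a leading column of ones) and y the vector of prices:

    import numpy as np

    def cost(theta, X, y):
        # J(theta) = 1/(2m) * sum over i of (h_theta(x^(i)) - y^(i))^2
        m = len(y)
        residuals = X @ theta - y          # vector of h_theta(x^(i)) - y^(i)
        return residuals @ residuals / (2 * m)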

Gradient Descent

  • Algorithm: Optimize parameters θ to minimize cost function J(θ)
  • Steps:
    • Initialize θ (e.g., to zero)
    • Update all θ_j simultaneously: θ_j := θ_j - α * ∂/∂θ_j J(θ); for linear regression, ∂/∂θ_j J(θ) = 1/m Σ(h_θ(x^(i)) - y^(i)) x_j^(i) (a code sketch follows this list)
  • Learning rate (α): Determines the step size in each iteration
  • Visualization: Process of moving towards the minimum cost by updating θ
  • Batch Gradient Descent: Uses all training examples in each iteration
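
A minimal batch gradient descent sketch consistent with the update rule above; the gradient expression follows from the 1/(2m)-scaled cost, and the function name and default values are illustrative:

    import numpy as np

    def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
        m, n = X.shape
        theta = np.zeros(n)                          # initialize theta to zero
        for _ in range(num_iters):
            gradient = X.T @ (X @ theta - y) / m     # dJ/dtheta_j for all j at once, over all m examples
            theta = theta - alpha * gradient         # simultaneous update of every theta_j
        return theta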

Stochastic Gradient Descent

  • Updates the parameters after each individual training example (see the sketch after this list)
  • Each update is cheap, so it makes progress much faster than batch gradient descent on large datasets
  • Downside: θ may never fully converge, instead oscillating around the minimum (decaying the learning rate over time helps)
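
A stochastic variant of the same update, applied one example at a time; shuffling each pass and keeping α fixed are illustrative choices, not prescribed by the lecture:

    import numpy as np

    def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=10):
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(num_epochs):
            for i in np.random.permutation(m):        # visit training examples in random order
                error = X[i] @ theta - y[i]           # h_theta(x^(i)) - y^(i) for one example
                theta = theta - alpha * error * X[i]  # update from this single example
        return theta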

Normal Equations

  • Direct method to solve for θ without iterations
  • Closed-form solution specific to linear regression (least squares)
  • Derivation:
    • Represent cost function J(θ) in matrix-vector form
    • Set the gradient ∇_θ J(θ) to zero and solve for θ
    • Formula: θ = (X^TX)⁻¹X^Ty (a code sketch follows this list)
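
A direct-solution sketch; solving the linear system X^TXθ = X^Ty is numerically preferable to forming the explicit inverse:

    import numpy as np

    def normal_equation(X, y):
        # Solve X^T X theta = X^T y, i.e. theta = (X^T X)^(-1) X^T y
        return np.linalg.solve(X.T @ X, X.T @ y)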

Important Concepts and Notations

  • Design matrix (X): matrix of input features, one training example x^(i) per row (with a leading 1 for the intercept)
  • Vector y: Column vector of outputs (house prices)
  • Trace operator and properties used in the derivation
  • Matrix derivatives for efficient computation
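
A condensed version of the derivation in this matrix-vector notation, skipping the trace manipulations (the 1/(2m) factor cancels, so it does not change the result):

    J(θ) = 1/(2m) (Xθ - y)^T (Xθ - y)
    ∇_θ J(θ) = 1/m (X^TXθ - X^Ty)
    Setting ∇_θ J(θ) = 0 gives X^TXθ = X^Ty, hence θ = (X^TX)⁻¹X^Ty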

Key Takeaways

  • Understanding linear regression and gradient descent is foundational for more complex algorithms
  • Gradient descent: iterative optimization technique, batch vs. stochastic
  • Normal equations: direct solution for linear regression

Common Questions

  • Learning rate choice: Empirically chosen, starting with common values (e.g., 0.01)
  • Handling a non-invertible X^TX in the normal equations: use the pseudo-inverse or remove redundant (linearly dependent) features (see the sketch below)
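
A sketch of the pseudo-inverse fallback mentioned above, using NumPy's np.linalg.pinv:

    import numpy as np

    def normal_equation_pinv(X, y):
        # The Moore-Penrose pseudo-inverse still yields a solution when X^T X is
        # singular, e.g. when features are linearly dependent (redundant)
        return np.linalg.pinv(X.T @ X) @ X.T @ y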