Linear Regression, Gradient Descent, and Normal Equations

Jul 21, 2024

Introduction

  • Focus: Linear regression, gradient descent, and normal equations
  • Importance of foundational concepts for future algorithms

Linear Regression

  • Simplest learning algorithm
  • Example: Predicting house prices (supervised learning)
  • Supervised learning: Mapping input (e.g., house size) to output (e.g., house price)
  • Output is a continuous value (regression problem)

Motivation

  • Example of house prices: Data from Craigslist, Portland, Oregon
  • Training set: Size and price of houses
  • Goal: Fit a straight line to the data
  • Hypothesis: Function that inputs house size and outputs estimated price

Notation and Hypothesis Representation

  • Hypothesis function (h): a linear function of the input features
    • Mathematically: h(x) = θ₀ + θ₁x₁ + θ₂x₂ + ...
    • For multiple input features: h(x) = Σ(θ_j * x_j) for j = 0 to n, where n = number of features and x₀ = 1 by convention (see the sketch after this list)
    • Parameters θ are chosen to fit the model
  • Notation: inputs x, outputs y; the i-th training example is written (x^(i), y^(i)), and m denotes the number of training examples
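
A minimal NumPy sketch of the hypothesis, assuming θ is stored as a vector and each input x carries a leading x₀ = 1; the example numbers are made up for illustration:

    import numpy as np

    def hypothesis(theta, x):
        # h(x) = theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n, with x_0 = 1 included in x
        return np.dot(theta, x)

    theta = np.array([50.0, 0.1])      # theta_0 (intercept), theta_1 (price per square foot)
    x = np.array([1.0, 2104.0])        # x_0 = 1, x_1 = house size in square feet
    print(hypothesis(theta, x))        # estimated price for this house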

Cost Function (J(θ))

  • Goal: Minimize the error between predicted and actual values
  • Cost function J(θ): Measures how well the hypothesis fits the training data
    • Sum of squared errors between predicted and actual prices
    • J(θ) = 1/(2m) Σ(h_θ(x^(i)) - y^(i))², summing i from 1 to m over the training examples
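
A short sketch of this cost function under the 1/(2m) scaling, assuming X is the design matrix (one row per training example, with a leading column of ones) and y the vector of prices:

    import numpy as np

    def cost(theta, X, y):
        # J(theta) = 1/(2m) * sum over i of (h_theta(x^(i)) - y^(i))^2
        m = len(y)
        residuals = X @ theta - y          # vector of h_theta(x^(i)) - y^(i)
        return residuals @ residuals / (2 * m)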

Gradient Descent

  • Algorithm: Optimize parameters θ to minimize cost function J(θ)
  • Steps:
    • Initialize θ (e.g., to zero)
    • Update all θ_j simultaneously: θ_j := θ_j - α * ∂/∂θ_j J(θ); for linear regression, ∂/∂θ_j J(θ) = 1/m Σ(h_θ(x^(i)) - y^(i)) x_j^(i) (a code sketch follows this list)
  • Learning rate (α): Determines the step size in each iteration
  • Visualization: Process of moving towards the minimum cost by updating θ
  • Batch Gradient Descent: Uses all training examples in each iteration
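
A minimal batch gradient descent sketch consistent with the update rule above; the gradient expression follows from the 1/(2m)-scaled cost, and the function name and default values are illustrative:

    import numpy as np

    def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
        m, n = X.shape
        theta = np.zeros(n)                          # initialize theta to zero
        for _ in range(num_iters):
            gradient = X.T @ (X @ theta - y) / m     # dJ/dtheta_j for all j at once, over all m examples
            theta = theta - alpha * gradient         # simultaneous update of every theta_j
        return theta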

Stochastic Gradient Descent

  • Updates the parameters after each individual training example (see the sketch after this list)
  • Each update is cheap, so it makes progress much faster than batch gradient descent on large datasets
  • Downside: θ may never fully converge, instead oscillating around the minimum (decaying the learning rate over time helps)
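
A stochastic variant of the same update, applied one example at a time; shuffling each pass and keeping α fixed are illustrative choices, not prescribed by the lecture:

    import numpy as np

    def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=10):
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(num_epochs):
            for i in np.random.permutation(m):        # visit training examples in random order
                error = X[i] @ theta - y[i]           # h_theta(x^(i)) - y^(i) for one example
                theta = theta - alpha * error * X[i]  # update from this single example
        return theta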

Normal Equations

  • Direct method to solve for θ without iterations
  • Closed-form solution specific to linear regression (least squares)
  • Derivation:
    • Represent cost function J(θ) in matrix-vector form
    • Set the gradient ∇_θ J(θ) to zero and solve for θ
    • Formula: θ = (X^TX)⁻¹X^Ty (a code sketch follows this list)
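
A direct-solution sketch; solving the linear system X^TXθ = X^Ty is numerically preferable to forming the explicit inverse:

    import numpy as np

    def normal_equation(X, y):
        # Solve X^T X theta = X^T y, i.e. theta = (X^T X)^(-1) X^T y
        return np.linalg.solve(X.T @ X, X.T @ y)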

Important Concepts and Notations

  • Design matrix (X): matrix of input features, one training example x^(i) per row (with a leading 1 for the intercept)
  • Vector y: Column vector of outputs (house prices)
  • Trace operator and properties used in the derivation
  • Matrix derivatives for efficient computation
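
A condensed version of the derivation in this matrix-vector notation, skipping the trace manipulations (the 1/(2m) factor cancels, so it does not change the result):

    J(θ) = 1/(2m) (Xθ - y)^T (Xθ - y)
    ∇_θ J(θ) = 1/m (X^TXθ - X^Ty)
    Setting ∇_θ J(θ) = 0 gives X^TXθ = X^Ty, hence θ = (X^TX)⁻¹X^Ty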

Key Takeaways

  • Understanding linear regression and gradient descent is foundational for more complex algorithms
  • Gradient descent: iterative optimization technique, batch vs. stochastic
  • Normal equations: direct solution for linear regression

Common Questions

  • Learning rate choice: Empirically chosen, starting with common values (e.g., 0.01)
  • Handling a non-invertible X^TX in the normal equations: use the pseudo-inverse or remove redundant (linearly dependent) features (see the sketch below)
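
A sketch of the pseudo-inverse fallback mentioned above, using NumPy's np.linalg.pinv:

    import numpy as np

    def normal_equation_pinv(X, y):
        # The Moore-Penrose pseudo-inverse still yields a solution when X^T X is
        # singular, e.g. when features are linearly dependent (redundant)
        return np.linalg.pinv(X.T @ X) @ X.T @ y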