Overview
This lecture explains why regularization helps reduce overfitting and variance in neural networks, with intuition, examples, and implementation tips.
Why Regularization Reduces Overfitting
- Overfitting occurs when a large, complex neural network models training data too closely, capturing noise.
- Regularization adds a penalty term to the cost function, discouraging large weights in the network (the regularized cost is written out after this list).
- A high regularization parameter (lambda) pushes weights toward zero, simplifying the network.
- Simpler networks are less likely to overfit, behaving more like shallow models (e.g., logistic regression).
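For reference, the L2-regularized cost discussed in this lecture takes the standard form of the data loss plus a Frobenius-norm penalty over all weight matrices (notation follows the Key Terms section below):

```latex
J\big(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}\big)
  = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)
  + \frac{\lambda}{2m} \sum_{l=1}^{L} \lVert W^{[l]} \rVert_F^{2},
\qquad
\lVert W^{[l]} \rVert_F^{2} = \sum_{i}\sum_{j} \big(W^{[l]}_{ij}\big)^{2}
```

Larger values of λ make the penalty term dominate the cost, which is what drives the weights toward zero during gradient descent.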
Intuition From Activation Functions
- For the tanh activation, small weights keep the pre-activation values (Z) in the near-linear region around zero (see the numeric sketch after this list).
- If every layer behaves roughly linearly, the whole network composes to a roughly linear function of its input, no matter how deep it is.
- Small weights across layers therefore keep activations in this near-linear range, reducing the network's ability to fit complex, nonlinear decision boundaries.
- This limits overfitting by preventing the model from capturing intricate patterns in noise.
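To see the near-linearity numerically, here is a minimal NumPy sketch (not part of the lecture code) comparing tanh on small versus large pre-activation values:

```python
import numpy as np

# With small weights, Z = W a + b stays close to zero, where tanh(z) ~ z,
# so each unit behaves almost like a linear unit.
z_small = np.array([-0.10, -0.05, 0.05, 0.10])
z_large = np.array([-3.0, -1.5, 1.5, 3.0])

print(np.tanh(z_small))  # ~[-0.0997, -0.0500, 0.0500, 0.0997]: nearly equal to z itself
print(np.tanh(z_large))  # ~[-0.995, -0.905, 0.905, 0.995]: saturated, clearly nonlinear
```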
Implementation Tips for Regularization
- The regularized cost function J includes both the loss and the regularization penalty.
- When plotting the cost during gradient descent, plot the new (regularized) definition of J, penalty term included (a minimal sketch follows this list).
- If you plot only the original unregularized loss, it may not decrease monotonically even though the full regularized cost J does.
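The sketch below shows one way to compute the regularized cost, assuming the unregularized cross-entropy cost has already been computed; the function and argument names (l2_regularized_cost, lambd, the example numbers) are illustrative, not from the lecture:

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    """Return J = data cost + (lambda / (2m)) * sum of squared weight entries."""
    l2_penalty = sum(np.sum(np.square(W)) for W in weights.values())
    return cross_entropy_cost + (lambd / (2 * m)) * l2_penalty

# Illustrative usage with made-up numbers: track and plot THIS value during
# training, not the cross-entropy term alone.
weights = {"W1": np.random.randn(4, 3) * 0.01, "W2": np.random.randn(1, 4) * 0.01}
J = l2_regularized_cost(cross_entropy_cost=0.65, weights=weights, lambd=0.7, m=1000)
print(J)
```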
Key Terms & Definitions
- Overfitting – When a model learns noise and details in the training data, reducing its generalization ability.
- Regularization – Technique that penalizes large model weights to simplify the model and prevent overfitting.
- L2 Regularization (Frobenius norm) – Adds the sum of squared weight-matrix entries to the cost function.
- Lambda (λ) – The regularization parameter controlling penalty strength.
- Variance – Model sensitivity to small fluctuations in the training set.
- Tanh Activation Function – A nonlinear function used in neural networks, near-linear around zero.
Action Items / Next Steps
- When implementing regularization, always plot the cost including the penalty term.
- Explore dropout regularization as covered in the next lecture.