Neural Networks and CNNs Explained

Aug 29, 2024

Lecture on Neural Networks and Convolutional Neural Networks

Administrative Information

  • Justin, a co-instructor, was introduced.
  • Assignment 2 is out:
    • It's long; start early.
    • Due next Friday.
    • Involves implementing neural networks, forward/backward passes, batch normalization, dropout, and convolutional networks.

Training Neural Networks

  • Four-Step Process (sketched in code after this list):
    1. Sample a small batch from the dataset.
    2. Forward propagate to get the loss.
    3. Backpropagate to compute gradients.
    4. Perform parameter update.
  • Importance of activation functions:
    • Without them, stacked layers collapse into a single linear classifier.
    • The non-linearity is what lets the network fit complex data.
  • Weight Initialization:
    • Too small: activations shrink toward zero as they pass through successive layers.
    • Too large: activations grow and explode (or saturate the non-linearities).
    • Xavier initialization (scaling each weight matrix by 1/sqrt of its fan-in) provides a balanced start.
  • Batch Normalization:
    • Alleviates weight initialization issues.
    • Makes training more robust.
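
To make the four-step loop and Xavier initialization concrete, here is a minimal NumPy sketch for a hypothetical two-layer ReLU network; the layer sizes, batch size, learning rate, and random stand-in data are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C, lr = 200, 100, 10, 1e-3   # input dim, hidden dim, classes, learning rate

# Xavier initialization: scale by 1/sqrt(fan-in) so activations neither
# vanish nor explode at the start of training.
W1 = rng.standard_normal((D, H)) / np.sqrt(D)
W2 = rng.standard_normal((H, C)) / np.sqrt(H)

for step in range(100):
    # 1. Sample a small batch (random data stands in for a real dataset).
    X = rng.standard_normal((64, D))
    y = rng.integers(0, C, size=64)

    # 2. Forward pass: ReLU hidden layer, softmax loss.
    h = np.maximum(0, X @ W1)
    scores = h @ W2
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(64), y]).mean()

    # 3. Backward pass: gradients of the loss w.r.t. W2 and W1.
    dscores = probs.copy()
    dscores[np.arange(64), y] -= 1
    dscores /= 64
    dW2 = h.T @ dscores
    dh = dscores @ W2.T
    dh[h <= 0] = 0
    dW1 = X.T @ dh

    # 4. Parameter update (vanilla SGD here).
    W1 -= lr * dW1
    W2 -= lr * dW2

print("final batch loss:", loss)
```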

Parameter Update Schemes

  • Stochastic Gradient Descent (SGD):
    • Steps the parameters in the negative gradient direction, scaled by the learning rate (all of the update rules in this list are sketched in code after the list).
  • Momentum Update:
    • Uses past gradients to build velocity.
    • Speeds up progress along shallow directions of the loss and damps oscillations along steep directions.
  • Nesterov Momentum:
    • Evaluates the gradient at the "looked-ahead" position, i.e. after the velocity step.
    • Often converges faster than standard momentum.
  • Adaptive Gradient (AdaGrad):
    • Scales each parameter's learning rate by the accumulated sum of its squared gradients.
    • Because that sum only grows, the effective learning rate decays toward zero over time.
  • RMSProp:
    • A "leaky" version of AdaGrad that keeps a decaying running average of squared gradients.
    • Prevents the effective learning rate from decaying to zero.
  • Adam:
    • Combines momentum and RMSProp.
    • Generally the best default choice.
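
As a reference, here is a minimal sketch of the update rules above applied to a single parameter array with a fake gradient; the hyperparameter values are common defaults, not values prescribed in the lecture:

```python
import numpy as np

lr, mu, decay_rate, beta1, beta2, eps = 1e-3, 0.9, 0.99, 0.9, 0.999, 1e-8
w, dw = np.zeros(10), np.ones(10)   # stand-in parameters and gradient

# Vanilla SGD: step in the negative gradient direction.
w -= lr * dw

# Momentum: build up a velocity from past gradients.
v = np.zeros_like(w)
v = mu * v - lr * dw
w += v

# Nesterov momentum (common "look-ahead" reformulation).
v_prev = v.copy()
v = mu * v - lr * dw
w += -mu * v_prev + (1 + mu) * v

# AdaGrad: the accumulated sum of squared gradients only grows,
# so the effective step size decays toward zero.
g2 = np.zeros_like(w)
g2 += dw**2
w -= lr * dw / (np.sqrt(g2) + eps)

# RMSProp: a leaky running average instead of a sum, so the step
# size does not decay all the way to zero.
cache = np.zeros_like(w)
cache = decay_rate * cache + (1 - decay_rate) * dw**2
w -= lr * dw / (np.sqrt(cache) + eps)

# Adam: momentum-style first moment + RMSProp-style second moment,
# with bias correction for the first few steps.
m, vt, t = np.zeros_like(w), np.zeros_like(w), 1
m = beta1 * m + (1 - beta1) * dw
vt = beta2 * vt + (1 - beta2) * dw**2
m_hat, v_hat = m / (1 - beta1**t), vt / (1 - beta2**t)
w -= lr * m_hat / (np.sqrt(v_hat) + eps)
```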

Second-Order Methods

  • Use both gradient and Hessian (curvature) information, as in the Newton update sketched below.
  • Converge in fewer iterations and need no learning-rate hyperparameter.
  • Impractical for deep networks: for N parameters the Hessian has on the order of N² entries, and inverting it is even more expensive.
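
For reference, the canonical second-order step is the Newton update, in which the inverse Hessian rescales the gradient and plays the role of the learning rate (written here in generic loss notation J(θ), an assumption about the lecture's notation):

```latex
\theta \;\leftarrow\; \theta \;-\; \big[\nabla_{\theta}^{2} J(\theta)\big]^{-1} \nabla_{\theta} J(\theta)
```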

Learning Rate Decay

  • Start with a high learning rate and decay it over time.
  • Various decay schemes: step decay, exponential decay, etc. (two of them are sketched below).
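
A minimal sketch of two common schedules; the base rate, drop factor, and decay constant are illustrative assumptions:

```python
import math

lr0 = 1e-2  # assumed base learning rate

def step_decay(epoch, drop=0.5, every=10):
    # Multiply the learning rate by `drop` every `every` epochs.
    return lr0 * (drop ** (epoch // every))

def exponential_decay(epoch, k=0.05):
    # lr = lr0 * exp(-k * epoch)
    return lr0 * math.exp(-k * epoch)

print(step_decay(25), exponential_decay(25))
```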

Model Ensembles

  • Training several independent models and averaging their predictions typically improves performance.
  • Cheaper techniques that approximate an ensemble:
    • Averaging the predictions of multiple checkpoints of a single model.
    • Keeping a running average of the weights during training and using it at test time (sketched below).
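
A minimal sketch of the running-average trick; the smoothing factor and the fake gradient updates are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal(100)   # parameters being trained
W_test = W.copy()              # smoothed copy used at evaluation time

for step in range(1000):
    dW = rng.standard_normal(100) * 0.01    # stand-in for a real gradient
    W -= 1e-3 * dW                          # normal parameter update
    W_test = 0.995 * W_test + 0.005 * W     # exponential running average of the weights
```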

Dropout

  • Randomly set a fraction of neurons' activations to zero during training to prevent overfitting.
  • Encourages redundancy in the feature representation, since no single neuron can be relied on.
  • At test time all neurons are active, so activations must be scaled by the keep probability to match their expected values during training.
  • "Inverted dropout" performs that scaling during training instead, leaving the test-time forward pass unchanged (sketched below).

Convolutional Neural Networks (CNNs)

  • Historical context: Inspired by Hubel and Wiesel's visual cortex studies.
  • Layers of simple and complex cells.
  • Architecture advances:
    • From early models (the Neocognitron) to modern architectures (AlexNet, VGG, etc.).
  • Applications:
    • Image classification, retrieval, detection, segmentation.
    • Non-visual tasks: speech, text, etc.
  • Real-world uses:
    • Self-driving cars, facial recognition, medical imaging, etc.

Summary

  • Use Adam for parameter updates as a default choice.
  • Explore model ensembles for performance gains.
  • Dropout effectively reduces overfitting by promoting redundancy.
  • CNNs are powerful tools for a wide range of applications beyond just image processing.