Neural Networks and CNNs Explained

Aug 29, 2024

Lecture on Neural Networks and Convolutional Neural Networks

Administrative Information

  • Justin, a co-instructor, was introduced.
  • Assignment 2 is out:
    • It's long; start early.
    • Due next Friday.
    • Involves implementing neural networks, forward/backward passes, batch normalization, dropout, and convolutional networks.

Training Neural Networks

  • Four-Step Process (sketched in code after this list):
    1. Sample a small batch from the dataset.
    2. Forward propagate to get the loss.
    3. Backpropagate to compute gradients.
    4. Perform parameter update.
  • Importance of activation functions:
    • Without them, stacked layers collapse into a single linear classifier.
    • The non-linearity is what lets the network fit complex data.
  • Weight Initialization:
    • Too small: activations shrink toward zero as they pass through successive layers.
    • Too large: activations grow and explode (or saturate the non-linearities).
    • Xavier initialization (scaling each weight matrix by 1/sqrt of its fan-in) provides a balanced start.
  • Batch Normalization:
    • Alleviates weight initialization issues.
    • Makes training more robust.
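
To make the four-step loop and Xavier initialization concrete, here is a minimal NumPy sketch for a hypothetical two-layer ReLU network; the layer sizes, batch size, learning rate, and random stand-in data are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C, lr = 200, 100, 10, 1e-3   # input dim, hidden dim, classes, learning rate

# Xavier initialization: scale by 1/sqrt(fan-in) so activations neither
# vanish nor explode at the start of training.
W1 = rng.standard_normal((D, H)) / np.sqrt(D)
W2 = rng.standard_normal((H, C)) / np.sqrt(H)

for step in range(100):
    # 1. Sample a small batch (random data stands in for a real dataset).
    X = rng.standard_normal((64, D))
    y = rng.integers(0, C, size=64)

    # 2. Forward pass: ReLU hidden layer, softmax loss.
    h = np.maximum(0, X @ W1)
    scores = h @ W2
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(64), y]).mean()

    # 3. Backward pass: gradients of the loss w.r.t. W2 and W1.
    dscores = probs.copy()
    dscores[np.arange(64), y] -= 1
    dscores /= 64
    dW2 = h.T @ dscores
    dh = dscores @ W2.T
    dh[h <= 0] = 0
    dW1 = X.T @ dh

    # 4. Parameter update (vanilla SGD here).
    W1 -= lr * dW1
    W2 -= lr * dW2

print("final batch loss:", loss)
```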

Parameter Update Schemes

  • Stochastic Gradient Descent (SGD):
    • Steps the parameters in the negative gradient direction, scaled by the learning rate (all of the update rules in this list are sketched in code after the list).
  • Momentum Update:
    • Uses past gradients to build velocity.
    • Speeds up progress along shallow directions of the loss and damps oscillations along steep directions.
  • Nesterov Momentum:
    • Evaluates the gradient at the "looked-ahead" position, i.e. after the velocity step.
    • Often converges faster than standard momentum.
  • Adaptive Gradient (AdaGrad):
    • Scales each parameter's learning rate by the accumulated sum of its squared gradients.
    • Because that sum only grows, the effective learning rate decays toward zero over time.
  • RMSProp:
    • A "leaky" version of AdaGrad that keeps a decaying running average of squared gradients.
    • Prevents the effective learning rate from decaying to zero.
  • Adam:
    • Combines momentum and RMSProp.
    • Generally the best default choice.
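
As a reference, here is a minimal sketch of the update rules above applied to a single parameter array with a fake gradient; the hyperparameter values are common defaults, not values prescribed in the lecture:

```python
import numpy as np

lr, mu, decay_rate, beta1, beta2, eps = 1e-3, 0.9, 0.99, 0.9, 0.999, 1e-8
w, dw = np.zeros(10), np.ones(10)   # stand-in parameters and gradient

# Vanilla SGD: step in the negative gradient direction.
w -= lr * dw

# Momentum: build up a velocity from past gradients.
v = np.zeros_like(w)
v = mu * v - lr * dw
w += v

# Nesterov momentum (common "look-ahead" reformulation).
v_prev = v.copy()
v = mu * v - lr * dw
w += -mu * v_prev + (1 + mu) * v

# AdaGrad: the accumulated sum of squared gradients only grows,
# so the effective step size decays toward zero.
g2 = np.zeros_like(w)
g2 += dw**2
w -= lr * dw / (np.sqrt(g2) + eps)

# RMSProp: a leaky running average instead of a sum, so the step
# size does not decay all the way to zero.
cache = np.zeros_like(w)
cache = decay_rate * cache + (1 - decay_rate) * dw**2
w -= lr * dw / (np.sqrt(cache) + eps)

# Adam: momentum-style first moment + RMSProp-style second moment,
# with bias correction for the first few steps.
m, vt, t = np.zeros_like(w), np.zeros_like(w), 1
m = beta1 * m + (1 - beta1) * dw
vt = beta2 * vt + (1 - beta2) * dw**2
m_hat, v_hat = m / (1 - beta1**t), vt / (1 - beta2**t)
w -= lr * m_hat / (np.sqrt(v_hat) + eps)
```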

Second-Order Methods

  • Use both gradient and Hessian (curvature) information, as in the Newton update sketched below.
  • Converge in fewer iterations and need no learning-rate hyperparameter.
  • Impractical for deep networks: for N parameters the Hessian has on the order of N² entries, and inverting it is even more expensive.
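
For reference, the canonical second-order step is the Newton update, in which the inverse Hessian rescales the gradient and plays the role of the learning rate (written here in generic loss notation J(θ), an assumption about the lecture's notation):

```latex
\theta \;\leftarrow\; \theta \;-\; \big[\nabla_{\theta}^{2} J(\theta)\big]^{-1} \nabla_{\theta} J(\theta)
```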

Learning Rate Decay

  • Start with a high learning rate and decay it over time.
  • Various decay schemes: step decay, exponential decay, etc. (two of them are sketched below).
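
A minimal sketch of two common schedules; the base rate, drop factor, and decay constant are illustrative assumptions:

```python
import math

lr0 = 1e-2  # assumed base learning rate

def step_decay(epoch, drop=0.5, every=10):
    # Multiply the learning rate by `drop` every `every` epochs.
    return lr0 * (drop ** (epoch // every))

def exponential_decay(epoch, k=0.05):
    # lr = lr0 * exp(-k * epoch)
    return lr0 * math.exp(-k * epoch)

print(step_decay(25), exponential_decay(25))
```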

Model Ensembles

  • Training several independent models and averaging their predictions typically improves performance.
  • Cheaper techniques that approximate an ensemble:
    • Averaging the predictions of multiple checkpoints of a single model.
    • Keeping a running average of the weights during training and using it at test time (sketched below).
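
A minimal sketch of the running-average trick; the smoothing factor and the fake gradient updates are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal(100)   # parameters being trained
W_test = W.copy()              # smoothed copy used at evaluation time

for step in range(1000):
    dW = rng.standard_normal(100) * 0.01    # stand-in for a real gradient
    W -= 1e-3 * dW                          # normal parameter update
    W_test = 0.995 * W_test + 0.005 * W     # exponential running average of the weights
```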

Dropout

  • Randomly set a fraction of neurons' activations to zero during training to prevent overfitting.
  • Encourages redundancy in the feature representation, since no single neuron can be relied on.
  • At test time all neurons are active, so activations must be scaled by the keep probability to match their expected values during training.
  • "Inverted dropout" performs that scaling during training instead, leaving the test-time forward pass unchanged (sketched below).

Convolutional Neural Networks (CNNs)

  • Historical context: Inspired by Hubel and Wiesel's visual cortex studies.
  • Layers of simple and complex cells.
  • Architecture advances:
    • From early models (the Neocognitron) to modern architectures (AlexNet, VGG, etc.).
  • Applications:
    • Image classification, retrieval, detection, segmentation.
    • Non-visual tasks: speech, text, etc.
  • Real-world uses:
    • Self-driving cars, facial recognition, medical imaging, etc.

Summary

  • Use Adam for parameter updates as a default choice.
  • Explore model ensembles for performance gains.
  • Dropout effectively reduces overfitting by promoting redundancy.
  • CNNs are powerful tools for a wide range of applications beyond just image processing.