Overview
This lecture provides a detailed, hands-on guide to convolutional neural networks (CNNs), focusing on the VGG architecture. You’ll learn how VGG works, its design principles, and how to implement a tiny VGG model from scratch in PyTorch for image classification tasks.
VGG Architecture & Philosophy
- VGG, named after Oxford's Visual Geometry Group, is a deep CNN built by stacking small 3x3 convolutions and max pooling layers.
- Key pillars: simplicity (uniform layers), depth (many layers), and uniformity (repeating blocks).
- Introduced at ILSVRC 2014, VGG showed that increasing depth with very small (3x3) filters markedly improves accuracy, reinforcing the shift from handcrafted features to learned deep features.
- Input images are rescaled and cropped to 224x224x3 (RGB); preprocessing subtracts the training-set mean RGB value from each pixel.
- All convolutions use 3x3 filters with stride 1 and padding 1 to preserve spatial dimensions.
- ReLU activations add nonlinearity; 2x2 max pooling with stride 2 halves the spatial dimensions, which cuts computation in later layers and shrinks the input to the fully connected layers.
- The model ends with fully connected (linear/dense) layers for classification (e.g., 1000 classes in ImageNet).
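The shape rules above can be checked with the standard output-size formula, out = (in + 2*padding - kernel) // stride + 1. The sketch below is a minimal illustration: the helper names are hypothetical, and the five pooling stages follow VGG16, a detail these notes do not state explicitly.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output side length of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def vgg_spatial_sizes(size=224, num_blocks=5):
    """Side length after each conv block followed by a 2x2, stride-2 max pool."""
    sizes = [size]
    for _ in range(num_blocks):
        # 3x3 conv, stride 1, padding 1: spatial size is unchanged
        size = conv_out(size, kernel=3, stride=1, padding=1)
        # 2x2 max pool, stride 2, no padding: spatial size is halved
        size = conv_out(size, kernel=2, stride=2, padding=0)
        sizes.append(size)
    return sizes


print(conv_out(224))        # 224
print(vgg_spatial_sizes())  # [224, 112, 56, 28, 14, 7]
```

Under these assumptions the last feature map is 7x7, which is what the fully connected layers receive after flattening.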
Convolutional Math & Design
- Convolution extracts features/patterns from images via kernels (filters), which are learned during training.
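- Stacking 3x3 convolutions enlarges the effective receptive field (two layers cover 5x5, three cover 7x7) while using fewer parameters than a single large filter and adding more nonlinearities.
- VGG omits local response normalization (LRN): the authors found it did not improve accuracy while increasing memory use and computation time.
- In the original experiments, deeper configurations (VGG16: 13 conv layers, VGG19: 16 conv layers, each plus 3 fully connected layers) reached lower error than shallower ones, with gains leveling off at around 19 layers.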
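The trade-off behind stacking small filters can be made concrete with a little arithmetic. This is a minimal sketch with hypothetical helper names; it assumes stride-1 convolutions with the same number of input and output channels (64 here, chosen only for illustration) and ignores biases.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels, num_layers=1):
    """Weight count (biases ignored) with `channels` input and output channels."""
    return num_layers * kernel * kernel * channels * channels


C = 64  # illustrative channel count

print(stacked_receptive_field(2))        # 5
print(stacked_receptive_field(3))        # 7
print(conv_weights(3, C, num_layers=3))  # 110592
print(conv_weights(7, C))                # 200704
```

Three 3x3 layers see the same 7x7 region as one 7x7 layer but need 27C² weights instead of 49C², and they apply three ReLUs instead of one.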
Training VGG & Regularization
- Loss function: cross-entropy for classification; optimizer: SGD with momentum (typically 0.9).
- The original training used a batch size of 256, which balances GPU efficiency against stable gradient estimates.
- Dropout (0.5) in fully connected layers reduces overfitting.
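- Data augmentation in the original training includes random cropping, horizontal flipping, random RGB color shift, and scale jittering; rotation and perspective adjustments are common additions in practice.
- The original VGG was trained for about 74 epochs, with the learning rate divided by 10 each time validation accuracy plateaued.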
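The training recipe above maps onto a few PyTorch objects. The following is a minimal sketch, not the lecture's code: the stand-in model, the random batch, the initial learning rate of 0.01, the weight decay of 5e-4, and the `patience` value are all assumptions added for illustration.

```python
import torch
from torch import nn

# Stand-in model so the sketch runs on its own (assumption, not the real VGG).
# Dropout(0.5) sits in front of the linear layer, as in VGG's classifier.
model = nn.Sequential(nn.Flatten(), nn.Dropout(p=0.5), nn.Linear(3 * 8 * 8, 10))

loss_fn = nn.CrossEntropyLoss()  # cross-entropy for classification
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # assumed starting learning rate
    momentum=0.9,       # momentum value from the notes
    weight_decay=5e-4,  # assumed L2 penalty
)
# Divide the learning rate by 10 when the monitored accuracy stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=2
)

# One illustrative update on random data.
images = torch.randn(4, 3, 8, 8)
labels = torch.randint(0, 10, (4,))
model.train()
optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()

# Called once per epoch with the measured validation accuracy.
val_accuracy = 0.5  # placeholder value
scheduler.step(val_accuracy)
```

`ReduceLROnPlateau` is one way to express "decay when validation accuracy plateaus"; a fixed step schedule would be another reasonable choice.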
Transfer Learning with VGG
- Modes: (1) Use as-is, (2) Feature extraction (freeze convolution layers, train new classifier head), and (3) Fine-tuning (unfreeze and retrain some base layers).
- Benefits: reduces training time and data requirements, leveraging learned representations.
- VGG is widely used beyond ImageNet classification, e.g., in medical imaging and style transfer.
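Feature extraction and fine-tuning can be sketched as follows. torchvision's VGG models expose `features` (convolutional layers) and `classifier` (fully connected head); the small `StandInVGG` class here only mirrors those attribute names so the example runs without downloading pretrained weights. The class, the helper names, and the layer sizes are hypothetical.

```python
import torch
from torch import nn


class StandInVGG(nn.Module):
    """Tiny stand-in mirroring the features/classifier attribute names."""

    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(8 * 16 * 16, 32), nn.ReLU(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))


def to_feature_extractor(model, num_classes):
    """Mode 2: freeze the conv layers and attach a new output layer."""
    for param in model.features.parameters():
        param.requires_grad = False
    old_head = model.classifier[-1]
    model.classifier[-1] = nn.Linear(old_head.in_features, num_classes)
    return model


def unfreeze_last_conv(model):
    """Mode 3: re-enable training for the last conv layer only."""
    convs = [m for m in model.features if isinstance(m, nn.Conv2d)]
    for param in convs[-1].parameters():
        param.requires_grad = True
    return model


model = to_feature_extractor(StandInVGG(), num_classes=5)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.001, momentum=0.9)

logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 5])
```

Passing only the trainable parameters to the optimizer keeps the frozen layers out of the update step entirely.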
Visualization & Analysis
- Early layers detect edges/textures; deeper layers recognize shapes and complex objects.
- Activation maps (feature maps) show where in the image each filter responds; GradCAM uses gradients of the predicted class to highlight the regions most critical for the prediction.
- VGG variants include VGG-BN (with batch normalization), VGG-Face (facial recognition), and tiny VGG (for fast experimentation or mobile inference).
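One common way to capture activation maps in PyTorch is a forward hook. The sketch below assumes a small stand-in network and hypothetical layer names; it records raw activations only and does not implement GradCAM.

```python
import torch
from torch import nn

# Small stand-in network (assumption); indices 1 and 4 are the ReLU layers.
model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(4, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

activations = {}


def save_activation(name):
    """Build a hook that stores the layer's output under `name`."""
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook


handles = [
    model[1].register_forward_hook(save_activation("block1_relu")),
    model[4].register_forward_hook(save_activation("block2_relu")),
]

with torch.no_grad():
    model(torch.randn(1, 3, 64, 64))  # random image stands in for real data

for handle in handles:
    handle.remove()  # detach hooks once the maps are captured

for name, fmap in activations.items():
    print(name, tuple(fmap.shape))
```

Each stored tensor has shape (batch, channels, height, width); plotting one channel at a time as a grayscale image gives the usual feature-map view.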
Data Preparation & PyTorch Implementation
- Standard folder format: one subfolder per class inside each split, e.g., root/train/class_x, root/test/class_y, etc.
- Data loaded and transformed (resize, flip, to tensor) using torchvision’s ImageFolder and DataLoader classes.
- Custom functions provided for visualizing images, displaying samples, and plotting transforms.
Building & Training Tiny VGG in PyTorch
- Model is built using nn.Module with two conv blocks (nn.Conv2d + ReLU + MaxPool) and a classifier head (Flatten + Linear).
- Input: images resized to 64x64x3 for the tiny VGG; can be customized.
- Model summary and shape checks done with torchinfo.
- Training loop consists of training (forward, loss, backward, optimizer step) and testing phases, using accuracy as the key metric.
- Results are plotted as loss and accuracy curves, which are used to discuss signs of overfitting and the effect of hyperparameters.
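A tiny VGG along the lines described above could be sketched as follows. The class and function names, `hidden_units=10`, three output classes, and the random stand-in batches are assumptions; the lecture's own model may differ in layer counts and sizes. A dummy forward pass serves as the shape check here (torchinfo's summary is an alternative).

```python
import torch
from torch import nn


class TinyVGG(nn.Module):
    """Two conv blocks (Conv2d + ReLU + MaxPool) and a Flatten + Linear head."""

    def __init__(self, in_channels=3, hidden_units=10, num_classes=3, image_size=64):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Conv2d(in_channels, hidden_units, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
        )
        self.block2 = nn.Sequential(
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
        )
        side = image_size // 4  # two 2x2 pools halve the side length twice
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(hidden_units * side * side, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.block2(self.block1(x)))


def train_step(model, loader, loss_fn, optimizer):
    """One pass over the training data; returns mean loss and accuracy."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for images, labels in loader:
        logits = model(images)           # forward
        loss = loss_fn(logits, labels)   # loss
        optimizer.zero_grad()
        loss.backward()                  # backward
        optimizer.step()                 # optimizer step
        total_loss += loss.item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    return total_loss / seen, correct / seen


def eval_step(model, loader, loss_fn):
    """One pass over held-out data without gradient tracking."""
    model.eval()
    total_loss, correct, seen = 0.0, 0, 0
    with torch.no_grad():
        for images, labels in loader:
            logits = model(images)
            total_loss += loss_fn(logits, labels).item() * labels.size(0)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            seen += labels.size(0)
    return total_loss / seen, correct / seen


torch.manual_seed(0)
model = TinyVGG()
# Random batches stand in for a real DataLoader (assumption).
batches = [(torch.randn(8, 3, 64, 64), torch.randint(0, 3, (8,))) for _ in range(2)]
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

print(model(batches[0][0]).shape)  # torch.Size([8, 3])
train_loss, train_acc = train_step(model, batches, loss_fn, optimizer)
held_out_loss, held_out_acc = eval_step(model, batches, loss_fn)
```

Collecting the returned loss and accuracy per epoch in lists gives the data for the curves; a training loss that keeps falling while the held-out loss rises is the usual sign of overfitting.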
Key Terms & Definitions
- Convolution — Mathematical operation applying a filter/kernel over input data to extract features.
- Kernel/Filter — Small matrix of weights, learned during training, that slides over input to capture patterns.
- ReLU (Rectified Linear Unit) — Nonlinear activation function; outputs max(0, x).
- Max Pooling — Downsamples feature maps by taking the maximum value in each window.
- Feature Map/Activation Map — Output of a convolution layer; represents detected features.
- Batch Size — Number of samples processed before the model updates its weights.
- Dropout — Randomly disables neurons during training to reduce overfitting.
- Transfer Learning — Reusing a pretrained model on a new, related task.
- GradCAM — Visualization technique highlighting important image regions for a given prediction.
- DataLoader — PyTorch utility for batching and shuffling data during training.
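Two of the definitions above, ReLU and max pooling, are small enough to work through by hand. This pure-Python sketch uses a made-up 4x4 feature map and hypothetical helper names.

```python
def relu(x):
    """ReLU: outputs max(0, x)."""
    return max(0.0, x)


def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a list-of-lists feature map."""
    rows, cols = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


feature_map = [
    [1.0, -2.0, 3.0, 0.5],
    [-1.0, 4.0, -0.5, 2.0],
    [0.0, -3.0, 1.5, -1.0],
    [2.5, 1.0, -2.0, 0.0],
]

activated = [[relu(v) for v in row] for row in feature_map]
pooled = max_pool_2x2(activated)

print(activated[0])  # [1.0, 0.0, 3.0, 0.5]
print(pooled)        # [[4.0, 3.0], [2.5, 1.5]]
```

Negative values are zeroed by ReLU, and each 2x2 window is reduced to its largest value, so the 4x4 map becomes 2x2.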
Action Items / Next Steps
- Review the VGG architecture and experiment with different model hyperparameters (filters, layers, learning rate).
- Practice data augmentation and visualization techniques on your own datasets.
- Implement and train a tiny VGG model in PyTorch using the steps and templates provided.
- Plot and interpret training/testing loss and accuracy curves to diagnose overfitting.
- Explore transfer learning by modifying the classifier head for other tasks or datasets.