VGG CNN Overview

Jul 9, 2025

Overview

This lecture provides a detailed, hands-on guide to convolutional neural networks (CNNs), focusing on the VGG architecture. You’ll learn how VGG works, its design principles, and how to implement a tiny VGG model from scratch in PyTorch for image classification tasks.

VGG Architecture & Philosophy

  • VGG (Visual Geometry Group) is a deep CNN known for stacking small 3x3 convolutions and max pooling layers.
  • Key pillars: simplicity (uniform layers), depth (many layers), and uniformity (repeating blocks).
  • Introduced at ILSVRC 2014, VGG marked a significant shift from handcrafted features to deep feature learning.
  • Input images are resized to 224x224x3 (RGB), with standardized preprocessing by subtracting the mean RGB value.
  • All convolutions use 3x3 filters with stride 1 and padding 1 to preserve spatial dimensions (a minimal block sketch follows this list).
  • ReLU activations add nonlinearity; 2x2 max pooling with stride 2 halves spatial dimensions, shrinking the feature maps and the computation in later layers.
  • The model ends with fully connected (linear/dense) layers for classification (e.g., 1000 classes in ImageNet).
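
To make the block structure concrete, here is a minimal PyTorch sketch of one VGG-style block; the helper name vgg_block and the channel counts are illustrative, not from the lecture:

```python
import torch
from torch import nn

# One VGG-style block: 3x3 convolutions (stride 1, padding 1, so the
# spatial size is preserved), each followed by ReLU, then a 2x2 max
# pool with stride 2 that halves height and width.
def vgg_block(in_channels: int, out_channels: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

x = torch.randn(1, 3, 224, 224)   # dummy batch of one RGB image
print(vgg_block(3, 64)(x).shape)  # torch.Size([1, 64, 112, 112])
```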

Convolutional Math & Design

  • Convolution extracts features/patterns from images via kernels (filters), which are learned during training.
  • Stacking 3x3 convolutions increases the effective receptive field (two stacked 3x3 convs cover 5x5, three cover 7x7) and introduces more nonlinearities, while using fewer parameters than a single large kernel (see the comparison below).
  • VGG omits local response normalization (LRN) for simplicity and efficiency.
  • Deeper models (VGG16: 13 conv layers, VGG19: 16 conv layers) generalize better than shallow ones.
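
A quick back-of-the-envelope comparison of the parameter counts (my own arithmetic, not a figure from the lecture), assuming C input and C output channels:

```python
# Three stacked 3x3 convs vs. one 7x7 conv with the same 7x7
# effective receptive field (biases ignored).
C = 64
stacked_3x3 = 3 * (3 * 3 * C * C)  # 110,592 weights
single_7x7 = 7 * 7 * C * C         # 200,704 weights
print(stacked_3x3, single_7x7)     # the 3x3 stack is ~45% smaller
```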

Training VGG & Regularization

  • Loss function: cross-entropy for classification; optimizer: SGD with momentum (typically 0.9).
  • Batch size of 256 balances GPU efficiency and stable gradient estimation.
  • Dropout (0.5) in fully connected layers reduces overfitting.
  • Data augmentation includes flipping, scaling (jitter), RGB shift, rotation, and perspective adjustments.
  • VGG is typically trained for ~74 epochs, with the learning rate decayed stepwise when validation accuracy plateaus (a minimal setup is sketched below).
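
A minimal PyTorch sketch of this training setup; the placeholder model, the learning rate of 0.01, and the choice of ReduceLROnPlateau as the plateau scheduler are assumptions for illustration:

```python
import torch
from torch import nn

# Placeholder model; swap in a real VGG (e.g. the TinyVGG defined later).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 3))

loss_fn = nn.CrossEntropyLoss()  # cross-entropy for classification
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# mode="max" because we watch validation *accuracy*; the LR is cut by
# 10x when accuracy stops improving for `patience` epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=3)

# Inside the epoch loop, after evaluating on the validation set:
#   scheduler.step(val_accuracy)
```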

Transfer Learning with VGG

  • Modes: (1) use as-is, (2) feature extraction (freeze the convolutional layers and train a new classifier head; see the sketch after this list), and (3) fine-tuning (unfreeze and retrain some base layers).
  • Benefits: reduces training time and data requirements, leveraging learned representations.
  • VGG is widely used beyond vision, e.g., medical imaging and style transfer.
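
A minimal feature-extraction sketch using torchvision's pretrained VGG16; NUM_CLASSES is a placeholder for your own task:

```python
from torch import nn
from torchvision import models

NUM_CLASSES = 3  # placeholder; set to your dataset's class count

# Load ImageNet-pretrained VGG16 and freeze the convolutional base.
model = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
for param in model.features.parameters():
    param.requires_grad = False

# torchvision's VGG16 classifier ends in Linear(4096, 1000);
# replace that last layer so only the new head is trained.
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)
```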

Visualization & Analysis

  • Early layers detect edges/textures; deeper layers recognize shapes and complex objects.
  • Activation maps (feature maps) show which regions contribute most; Grad-CAM highlights the areas most critical for a prediction (a hook-based way to capture activation maps is sketched after this list).
  • VGG variants include VGG-BN (with batch normalization), VGG-Face (facial recognition), and tiny VGG (for fast experimentation or mobile inference).
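
One way to inspect activation maps is a forward hook on an early layer; the layer index and names below are illustrative, assuming torchvision's VGG16 (Grad-CAM itself typically requires a separate library and is not sketched here):

```python
import torch
from torchvision import models

model = models.vgg16(weights=None).eval()  # weights omitted for a quick demo
activations = {}

def save_activation(name):
    # Forward hooks receive (module, inputs, output); we keep the output.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# features[0] is the first 3x3 conv in torchvision's VGG16.
model.features[0].register_forward_hook(save_activation("conv1_1"))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))

print(activations["conv1_1"].shape)  # torch.Size([1, 64, 224, 224])
```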

Data Preparation & PyTorch Implementation

  • Standard folder format: root/train/class_x/, root/train/class_y/, etc., mirrored under root/test/ with one subfolder per class.
  • Data is loaded and transformed (resize, flip, convert to tensor) using torchvision's ImageFolder and DataLoader classes (see the sketch after this list).
  • Custom functions provided for visualizing images, displaying samples, and plotting transforms.
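
A minimal loading pipeline under these conventions; the "data/train" and "data/test" paths, the image size, and the batch size are placeholders:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.Resize((64, 64)),             # tiny-VGG-sized inputs
    transforms.RandomHorizontalFlip(p=0.5),  # simple augmentation
    transforms.ToTensor(),
])
test_transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
])

# ImageFolder infers class labels from the subfolder names.
train_data = datasets.ImageFolder("data/train", transform=train_transform)
test_data = datasets.ImageFolder("data/test", transform=test_transform)

train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = DataLoader(test_data, batch_size=32, shuffle=False)
```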

Building & Training Tiny VGG in PyTorch

  • Model is built with nn.Module using two conv blocks (nn.Conv2d + ReLU + MaxPool) and a classifier head (Flatten + Linear); a sketch follows this list.
  • Input: images resized to 64x64x3 for the tiny VGG; can be customized.
  • Model summary and shape checks done with torchinfo.
  • The training loop consists of a training phase (forward pass, loss, backward pass, optimizer step) and a testing phase, with accuracy as the key metric.
  • Loss and accuracy curves are plotted to visualize results; the lecture discusses signs of overfitting and the effect of hyperparameters.
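
A minimal sketch of the tiny VGG and one training epoch, following the structure described above; hidden_units=10, num_classes=3, and the helper names are illustrative defaults, not prescribed by the lecture:

```python
import torch
from torch import nn

class TinyVGG(nn.Module):
    """Two conv blocks (nn.Conv2d + ReLU + MaxPool) plus a Flatten +
    Linear classifier head, sized for 64x64x3 inputs."""

    def __init__(self, in_channels: int = 3, hidden_units: int = 10,
                 num_classes: int = 3):
        super().__init__()
        self.block_1 = nn.Sequential(
            nn.Conv2d(in_channels, hidden_units, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),  # 64x64 -> 32x32
        )
        self.block_2 = nn.Sequential(
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),  # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(hidden_units * 16 * 16, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.block_2(self.block_1(x)))

def train_step(model, loader, loss_fn, optimizer, device="cpu"):
    """One epoch of training: forward pass, loss, backward pass, step."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        logits = model(X)
        loss = loss_fn(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * len(y)
        correct += (logits.argmax(dim=1) == y).sum().item()
        seen += len(y)
    return total_loss / seen, correct / seen  # mean loss, accuracy
```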

Key Terms & Definitions

  • Convolution — Mathematical operation applying a filter/kernel over input data to extract features.
  • Kernel/Filter — Small matrix of weights, learned during training, that slides over input to capture patterns.
  • ReLU (Rectified Linear Unit) — Nonlinear activation function; outputs max(0, x).
  • Max Pooling — Downsamples feature maps by taking the maximum value in each window.
  • Feature Map/Activation Map — Output of a convolution layer; represents detected features.
  • Batch Size — Number of samples processed before the model updates its weights.
  • Dropout — Randomly disables neurons during training to reduce overfitting.
  • Transfer Learning — Reusing a pretrained model on a new, related task.
  • Grad-CAM — Visualization technique highlighting important image regions for a given prediction.
  • DataLoader — PyTorch utility for batching and shuffling data during training.

Action Items / Next Steps

  • Review the VGG architecture and experiment with different model hyperparameters (filters, layers, learning rate).
  • Practice data augmentation and visualization techniques on your own datasets.
  • Implement and train a tiny VGG model in PyTorch using the steps and templates provided.
  • Plot and interpret training/testing loss and accuracy curves to diagnose overfitting.
  • Explore transfer learning by modifying the classifier head for other tasks or datasets.