VGG CNN Overview

Jul 9, 2025

Overview

This lecture provides a detailed, hands-on guide to convolutional neural networks (CNNs), focusing on the VGG architecture. You’ll learn how VGG works, its design principles, and how to implement a tiny VGG model from scratch in PyTorch for image classification tasks.

VGG Architecture & Philosophy

  • VGG (Visual Geometry Group) is a deep CNN known for stacking small 3x3 convolutions and max pooling layers.
  • Key pillars: simplicity (uniform layers), depth (many layers), and uniformity (repeating blocks).
  • Introduced at ILSVRC 2014, VGG marked a significant shift from handcrafted features to deep feature learning.
  • Input images are resized to 224x224x3 (RGB), with standardized preprocessing by subtracting the mean RGB value.
  • All convolutions use 3x3 filters with stride 1 and padding 1 to preserve spatial dimensions (a minimal block sketch follows this list).
  • ReLU activations add nonlinearity; 2x2 max pooling with stride 2 halves spatial dimensions, shrinking the feature maps and the computation in later layers.
  • The model ends with fully connected (linear/dense) layers for classification (e.g., 1000 classes in ImageNet).
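
To make the block structure concrete, here is a minimal PyTorch sketch of one VGG-style block; the helper name vgg_block and the channel counts are illustrative, not from the lecture:

```python
import torch
from torch import nn

# One VGG-style block: 3x3 convolutions (stride 1, padding 1, so the
# spatial size is preserved), each followed by ReLU, then a 2x2 max
# pool with stride 2 that halves height and width.
def vgg_block(in_channels: int, out_channels: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

x = torch.randn(1, 3, 224, 224)   # dummy batch of one RGB image
print(vgg_block(3, 64)(x).shape)  # torch.Size([1, 64, 112, 112])
```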

Convolutional Math & Design

  • Convolution extracts features/patterns from images via kernels (filters), which are learned during training.
  • Stacking 3x3 convolutions increases the effective receptive field (two stacked 3x3 convs cover 5x5, three cover 7x7) and introduces more nonlinearities, while using fewer parameters than a single large kernel (see the comparison below).
  • VGG omits local response normalization (LRN) for simplicity and efficiency.
  • Deeper models (VGG16: 13 conv layers, VGG19: 16 conv layers) generalize better than shallow ones.
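
A quick back-of-the-envelope comparison of the parameter counts (my own arithmetic, not a figure from the lecture), assuming C input and C output channels:

```python
# Three stacked 3x3 convs vs. one 7x7 conv with the same 7x7
# effective receptive field (biases ignored).
C = 64
stacked_3x3 = 3 * (3 * 3 * C * C)  # 110,592 weights
single_7x7 = 7 * 7 * C * C         # 200,704 weights
print(stacked_3x3, single_7x7)     # the 3x3 stack is ~45% smaller
```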

Training VGG & Regularization

  • Loss function: cross-entropy for classification; optimizer: SGD with momentum (typically 0.9).
  • Batch size of 256 balances GPU efficiency and stable gradient estimation.
  • Dropout (0.5) in fully connected layers reduces overfitting.
  • Data augmentation includes flipping, scaling (jitter), RGB shift, rotation, and perspective adjustments.
  • VGG is typically trained for ~74 epochs, with the learning rate decayed stepwise when validation accuracy plateaus (a minimal setup is sketched below).
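
A minimal PyTorch sketch of this training setup; the placeholder model, the learning rate of 0.01, and the choice of ReduceLROnPlateau as the plateau scheduler are assumptions for illustration:

```python
import torch
from torch import nn

# Placeholder model; swap in a real VGG (e.g. the TinyVGG defined later).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 3))

loss_fn = nn.CrossEntropyLoss()  # cross-entropy for classification
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# mode="max" because we watch validation *accuracy*; the LR is cut by
# 10x when accuracy stops improving for `patience` epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=3)

# Inside the epoch loop, after evaluating on the validation set:
#   scheduler.step(val_accuracy)
```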

Transfer Learning with VGG

  • Modes: (1) use as-is, (2) feature extraction (freeze the convolutional layers and train a new classifier head; see the sketch after this list), and (3) fine-tuning (unfreeze and retrain some base layers).
  • Benefits: reduces training time and data requirements, leveraging learned representations.
  • VGG is widely used beyond vision, e.g., medical imaging and style transfer.
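
A minimal feature-extraction sketch using torchvision's pretrained VGG16; NUM_CLASSES is a placeholder for your own task:

```python
from torch import nn
from torchvision import models

NUM_CLASSES = 3  # placeholder; set to your dataset's class count

# Load ImageNet-pretrained VGG16 and freeze the convolutional base.
model = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
for param in model.features.parameters():
    param.requires_grad = False

# torchvision's VGG16 classifier ends in Linear(4096, 1000);
# replace that last layer so only the new head is trained.
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)
```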

Visualization & Analysis

  • Early layers detect edges/textures; deeper layers recognize shapes and complex objects.
  • Activation maps (feature maps) show which regions contribute most; Grad-CAM highlights the areas most critical for a prediction (a hook-based way to capture activation maps is sketched after this list).
  • VGG variants include VGG-BN (with batch normalization), VGG-Face (facial recognition), and tiny VGG (for fast experimentation or mobile inference).
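
One way to inspect activation maps is a forward hook on an early layer; the layer index and names below are illustrative, assuming torchvision's VGG16 (Grad-CAM itself typically requires a separate library and is not sketched here):

```python
import torch
from torchvision import models

model = models.vgg16(weights=None).eval()  # weights omitted for a quick demo
activations = {}

def save_activation(name):
    # Forward hooks receive (module, inputs, output); we keep the output.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# features[0] is the first 3x3 conv in torchvision's VGG16.
model.features[0].register_forward_hook(save_activation("conv1_1"))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))

print(activations["conv1_1"].shape)  # torch.Size([1, 64, 224, 224])
```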

Data Preparation & PyTorch Implementation

  • Standard folder format: root/train/class_x/, root/train/class_y/, etc., mirrored under root/test/ with one subfolder per class.
  • Data is loaded and transformed (resize, flip, convert to tensor) using torchvision's ImageFolder and DataLoader classes (see the sketch after this list).
  • Custom functions provided for visualizing images, displaying samples, and plotting transforms.
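
A minimal loading pipeline under these conventions; the "data/train" and "data/test" paths, the image size, and the batch size are placeholders:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.Resize((64, 64)),             # tiny-VGG-sized inputs
    transforms.RandomHorizontalFlip(p=0.5),  # simple augmentation
    transforms.ToTensor(),
])
test_transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
])

# ImageFolder infers class labels from the subfolder names.
train_data = datasets.ImageFolder("data/train", transform=train_transform)
test_data = datasets.ImageFolder("data/test", transform=test_transform)

train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = DataLoader(test_data, batch_size=32, shuffle=False)
```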

Building & Training Tiny VGG in PyTorch

  • Model is built with nn.Module using two conv blocks (nn.Conv2d + ReLU + MaxPool) and a classifier head (Flatten + Linear); a sketch follows this list.
  • Input: images resized to 64x64x3 for the tiny VGG; can be customized.
  • Model summary and shape checks done with torchinfo.
  • The training loop consists of a training phase (forward pass, loss, backward pass, optimizer step) and a testing phase, with accuracy as the key metric.
  • Loss and accuracy curves are plotted to visualize results; the lecture discusses signs of overfitting and the effect of hyperparameters.
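
A minimal sketch of the tiny VGG and one training epoch, following the structure described above; hidden_units=10, num_classes=3, and the helper names are illustrative defaults, not prescribed by the lecture:

```python
import torch
from torch import nn

class TinyVGG(nn.Module):
    """Two conv blocks (nn.Conv2d + ReLU + MaxPool) plus a Flatten +
    Linear classifier head, sized for 64x64x3 inputs."""

    def __init__(self, in_channels: int = 3, hidden_units: int = 10,
                 num_classes: int = 3):
        super().__init__()
        self.block_1 = nn.Sequential(
            nn.Conv2d(in_channels, hidden_units, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),  # 64x64 -> 32x32
        )
        self.block_2 = nn.Sequential(
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),  # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(hidden_units * 16 * 16, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.block_2(self.block_1(x)))

def train_step(model, loader, loss_fn, optimizer, device="cpu"):
    """One epoch of training: forward pass, loss, backward pass, step."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        logits = model(X)
        loss = loss_fn(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * len(y)
        correct += (logits.argmax(dim=1) == y).sum().item()
        seen += len(y)
    return total_loss / seen, correct / seen  # mean loss, accuracy
```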

Key Terms & Definitions

  • Convolution — Mathematical operation applying a filter/kernel over input data to extract features.
  • Kernel/Filter — Small matrix of weights, learned during training, that slides over input to capture patterns.
  • ReLU (Rectified Linear Unit) — Nonlinear activation function; outputs max(0, x).
  • Max Pooling — Downsamples feature maps by taking the maximum value in each window.
  • Feature Map/Activation Map — Output of a convolution layer; represents detected features.
  • Batch Size — Number of samples processed before the model updates its weights.
  • Dropout — Randomly disables neurons during training to reduce overfitting.
  • Transfer Learning — Reusing a pretrained model on a new, related task.
  • Grad-CAM — Visualization technique highlighting important image regions for a given prediction.
  • DataLoader — PyTorch utility for batching and shuffling data during training.

Action Items / Next Steps

  • Review the VGG architecture and experiment with different model hyperparameters (filters, layers, learning rate).
  • Practice data augmentation and visualization techniques on your own datasets.
  • Implement and train a tiny VGG model in PyTorch using the steps and templates provided.
  • Plot and interpret training/testing loss and accuracy curves to diagnose overfitting.
  • Explore transfer learning by modifying the classifier head for other tasks or datasets.