Vision Transformers Lecture

Jul 18, 2024

Introduction

  • Today's topic: Vision Transformers
  • Related to the Transformer architecture used in GPT (Generative Pre-trained Transformer)
  • Background on Transformers:
    • Originally designed for Natural Language Processing (NLP)
    • Later adapted to vision tasks

Transformers Overview

  • GPT/ChatGPT: Examples of Transformer-based architectures that generate text
  • Vision Transformers: Apply the same architecture to image recognition

Dataset: CIFAR

  • The CIFAR datasets are used for benchmarking
  • Types of CIFAR datasets:
    • CIFAR-10: 10 classes
    • CIFAR-100: 100 classes grouped into 20 superclasses
  • CIFAR dataset details:
    • 60,000 images in total (50,000 training, 10,000 testing)
    • Image dimensions: 32x32 pixels, so the data should be handled efficiently in batches (see the loading sketch after this list)
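As a minimal loading sketch, assuming a TensorFlow/Keras workflow (the lecture does not fix a specific framework), CIFAR-100 can be pulled in with a single call:

```python
# Minimal sketch: loading CIFAR-100 with Keras (an assumed framework choice).
from tensorflow import keras

# Returns 50,000 training and 10,000 test images of shape (32, 32, 3),
# with one integer label per image for the 100 fine-grained classes.
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()

print(f"x_train: {x_train.shape}, y_train: {y_train.shape}")  # (50000, 32, 32, 3), (50000, 1)
print(f"x_test:  {x_test.shape}, y_test:  {y_test.shape}")    # (10000, 32, 32, 3), (10000, 1)
print(f"pixel range: {x_train.min()}..{x_train.max()}")       # 0..255
```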

Vision Transformer (ViT) Architecture

  • Patch Embeddings: Split the image into smaller, non-overlapping patches and project each patch into a vector
  • Positional Embeddings: Retain the spatial position of each patch (both stages are sketched after this list)
  • Transformer Encoder: Comprises two main components:
    • Self-attention mechanism (multi-head attention)
    • Feed-forward neural network (Multi-Layer Perceptron, MLP)
  • Classification Head: Final part responsible for classifying the image
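A hedged sketch of the patch and positional embedding stages, again assuming Keras: the two custom layers below split an image into patches, then project and position-encode them. The class names `Patches` and `PatchEncoder` are illustrative, not terms from the lecture.

```python
import tensorflow as tf
from tensorflow.keras import layers


class Patches(layers.Layer):
    """Split a batch of images into non-overlapping square patches."""

    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        # Flatten each patch into a vector: (batch, num_patches, patch_dims)
        patch_dims = patches.shape[-1]
        return tf.reshape(patches, [batch_size, -1, patch_dims])


class PatchEncoder(layers.Layer):
    """Project each patch and add a learned positional embedding."""

    def __init__(self, num_patches, projection_dim):
        super().__init__()
        self.num_patches = num_patches
        self.projection = layers.Dense(projection_dim)
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patches):
        # One position index per patch, so spatial order is preserved.
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        return self.projection(patches) + self.position_embedding(positions)
```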

Key Concepts

  • Multi-head attention: Enables the model to capture global dependencies and relationships between patches
  • MLP (Multi-Layer Perceptron): A stack of fully connected layers that further transforms each patch representation
  • Positional Embeddings: Encode each patch's position so spatial relationships survive flattening into a one-dimensional sequence
    • Example: Recognizing letters and their order within a word
  • Self-attention mechanism: Weighs how relevant each patch is to every other patch (see the encoder-block sketch below)
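To make these concepts concrete, the sketch below wires multi-head self-attention and an MLP into a single Transformer encoder block using standard Keras layers. The layer ordering, residual connections, and dropout rate are common design choices assumed for illustration, not specifics from the lecture.

```python
from tensorflow import keras
from tensorflow.keras import layers


def transformer_block(encoded_patches, num_heads, projection_dim, mlp_units, dropout=0.1):
    # Layer norm + multi-head self-attention over all patches (global context).
    x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
    attention_output = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=projection_dim, dropout=dropout
    )(x1, x1)
    # Residual connection around the attention sub-layer.
    x2 = layers.Add()([attention_output, encoded_patches])

    # Layer norm + position-wise MLP (feed-forward network).
    x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
    for units in mlp_units:
        x3 = layers.Dense(units, activation=keras.activations.gelu)(x3)
        x3 = layers.Dropout(dropout)(x3)

    # Second residual connection around the MLP sub-layer.
    return layers.Add()([x3, x2])
```

Pre-layer-normalization with residuals around both sub-layers is one common arrangement; other orderings are also used in practice.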

Implementation Steps

  • Input Layer: 32x32 images, upscaled to 72x72 using OpenCV for practical purposes
  • Data Augmentation: Alters training data to improve robustness (e.g., flipping images)
  • Patches Creation: Splitting images into patches for processing
  • Transformer Block: Core processing block of the ViT
  • Flattening: Converting processed patches into vectors for classification
  • Output Layer: Dense layer that classifies the image based on the learned features (an end-to-end sketch follows this list)
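Tying the steps together, a rough end-to-end assembly might look like the sketch below. It reuses the `Patches`, `PatchEncoder`, and `transformer_block` sketches from earlier sections, substitutes Keras preprocessing layers for the OpenCV resize to keep the example self-contained, and uses illustrative default sizes.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_vit_classifier(
    input_shape=(32, 32, 3),     # CIFAR image shape
    image_size=72,               # upscaled resolution
    patch_size=6,
    projection_dim=64,
    num_heads=4,
    transformer_layers=8,
    mlp_head_units=(2048, 1024),
    num_classes=100,
):
    num_patches = (image_size // patch_size) ** 2
    inputs = keras.Input(shape=input_shape)

    # Input layer + data augmentation (resize 32x32 -> 72x72, random flips).
    augmented = keras.Sequential(
        [layers.Resizing(image_size, image_size), layers.RandomFlip("horizontal")]
    )(inputs)

    # Patches creation and patch/positional embedding.
    patches = Patches(patch_size)(augmented)
    encoded = PatchEncoder(num_patches, projection_dim)(patches)

    # Stacked Transformer blocks (the core processing stage).
    for _ in range(transformer_layers):
        encoded = transformer_block(
            encoded,
            num_heads=num_heads,
            projection_dim=projection_dim,
            mlp_units=[projection_dim * 2, projection_dim],
        )

    # Flattening followed by the classification head.
    representation = layers.LayerNormalization(epsilon=1e-6)(encoded)
    representation = layers.Flatten()(representation)
    representation = layers.Dropout(0.5)(representation)
    for units in mlp_head_units:
        representation = layers.Dense(units, activation=keras.activations.gelu)(representation)
        representation = layers.Dropout(0.5)(representation)
    outputs = layers.Dense(num_classes)(representation)  # logits over the classes

    return keras.Model(inputs=inputs, outputs=outputs)
```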

Hyperparameters

  • Defined hyperparameters guide the architecture and learning process:
    • Learning rate
    • Weight decay
    • Batch size
    • Image size
    • Patch size
    • Projection dimensions
    • Attention heads
    • Transformer units
    • MLP head units
  • Adjust hyperparameters based on observed learning performance to optimize model efficiency (example values below)
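For concreteness, one possible configuration is sketched below; the specific values are illustrative assumptions rather than numbers prescribed in the lecture.

```python
# Illustrative hyperparameter values (assumed for the example, not fixed by the lecture)
learning_rate = 0.001
weight_decay = 0.0001
batch_size = 256
image_size = 72            # input resized from 32x32 up to 72x72
patch_size = 6             # 72 / 6 = 12 patches per side -> 144 patches per image
num_patches = (image_size // patch_size) ** 2
projection_dim = 64        # dimensionality of each encoded patch
num_heads = 4              # attention heads in each multi-head attention layer
transformer_units = [projection_dim * 2, projection_dim]  # MLP sizes inside each block
transformer_layers = 8     # number of stacked Transformer blocks
mlp_head_units = [2048, 1024]  # dense layer sizes in the classification head
```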

Practical Tips

  • Maintain spatial information and relationships in patches through positional embeddings
  • Use batched data processing to handle large datasets efficiently
  • Employ data augmentation techniques to improve model generalization
  • Tweak hyperparameters iteratively to find the optimal configuration

Upcoming Steps

  • Next session to cover:
    • Data augmentation techniques
    • Implementation details of the Transformer block
    • Coding specific elements such as the embeddings, patch creation, and the classification head

Q&A

  • Clarified the role and relative importance of the individual hyperparameters
  • Ensured understanding of core concepts through examples

Conclusion

  • Emphasis on understanding Vision Transformers and their practical application
  • Encouragement to experiment with datasets and hyperparameters

Next Steps: Continue with detailed coding and architectural implementation in the next session.