Vision Transformers Lecture Notes
Introduction
- Today's topic: Vision Transformers
- Related to the Transformer architecture used in GPT (Generative Pre-trained Transformer)
- Background on Transformers:
  - Initially designed for Natural Language Processing (NLP)
  - Later adapted for vision tasks
Transformers Overview
- GPT/ChatGPT: Examples of Transformer-based models that generate text
- Vision Transformers (ViT): Apply the Transformer architecture to image recognition
Dataset: CIFAR
- CIFAR dataset used for benchmarking
- Types of CIFAR datasets:
  - CIFAR-10: 10 classes
  - CIFAR-100: 100 classes divided into 20 superclasses
- CIFAR dataset details:
  - 60,000 images in total (50,000 training, 10,000 testing)
  - Image dimensions: 32x32 pixels (RGB); the large number of small images calls for efficient, batched handling (see the loading sketch after this list)
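A minimal loading sketch, assuming a TensorFlow/Keras environment (the framework choice is an assumption, not fixed by these notes):

```python
# Minimal sketch: load CIFAR-10 and confirm the 50,000 / 10,000 split of
# 32x32 RGB images. Assumes TensorFlow is installed.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

print(x_train.shape)  # (50000, 32, 32, 3)
print(x_test.shape)   # (10000, 32, 32, 3)
print(y_train.shape)  # (50000, 1) -- integer class labels 0..9
```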
Vision Transformer (ViT) Architecture
- Patch Embeddings: Split the image into smaller, non-overlapping patches (see the patch-extraction sketch after this list)
- Positional Embeddings: Retains spatial information of the patches
- Transformer Encoder: Comprises two main components:
  - Self-attention mechanisms (multi-head attention)
  - Feed-forward neural network (Multi-Layer Perceptron, MLP)
- Classification Head: Final part responsible for classifying the images
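A sketch of the patch-splitting step, assuming TensorFlow/Keras: `tf.image.extract_patches` with stride equal to the patch size produces non-overlapping patches. The 72x72 image size and patch size of 6 are illustrative values.

```python
# Sketch: split an image batch into non-overlapping patches and flatten each
# patch into a vector. Sizes here are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

class Patches(layers.Layer):
    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],  # stride == size -> no overlap
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        # (batch, num_patches, patch_size * patch_size * channels)
        patch_dims = patches.shape[-1]
        return tf.reshape(patches, [batch_size, -1, patch_dims])

# Example: one 72x72 RGB image -> 12x12 = 144 patches of length 6*6*3 = 108
dummy = tf.random.uniform((1, 72, 72, 3))
print(Patches(6)(dummy).shape)  # (1, 144, 108)
```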
Key Concepts:
- Multi-head attention: Enables the model to capture global dependencies and relationships between patches
- MLP (Multi-Layer Perceptron): A stack of fully connected layers with non-linear activations applied to each patch representation
- Positional Embeddings: Encode each patch's position so that spatial relationships survive flattening into a one-dimensional sequence (see the patch-encoder sketch after this list)
- Example: Recognizing letters and their order within a word
- Self-attention mechanism: Scores how relevant each patch is to every other patch, giving the model a global view of the image
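A sketch of a patch encoder, assuming TensorFlow/Keras: each flattened patch is linearly projected and a learned positional embedding (indexed by the patch's position in the sequence) is added, so the 1-D sequence keeps track of where each patch came from. `num_patches = 144` and `projection_dim = 64` are illustrative assumptions.

```python
# Sketch: project each flattened patch and add a learned positional embedding.
import tensorflow as tf
from tensorflow.keras import layers

class PatchEncoder(layers.Layer):
    def __init__(self, num_patches, projection_dim):
        super().__init__()
        self.num_patches = num_patches
        self.projection = layers.Dense(units=projection_dim)
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patches):
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        # Projected patch content + embedding of the patch's position
        return self.projection(patches) + self.position_embedding(positions)

# Example: 144 flattened patches of length 108 -> 144 vectors of length 64
encoded = PatchEncoder(num_patches=144, projection_dim=64)(
    tf.random.uniform((1, 144, 108))
)
print(encoded.shape)  # (1, 144, 64)
```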
Implementation Steps
- Input Layer: 32x32 input images, resized to 72x72 (using OpenCV) before patching
- Data Augmentation: Alters training data to improve robustness (e.g., flipping images)
- Patches Creation: Splitting images into patches for processing
- Transformer Block: Core processing block of the ViT (a full sketch follows this list)
- Flattening: Converting processed patches into vectors for classification
- Output Layer: Dense layer to classify the image based on learned features
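Putting the steps above together, a sketch of the Transformer blocks and classification head in Keras. It reuses the `Patches` and `PatchEncoder` layers sketched earlier; the depths and unit counts are illustrative assumptions, not the exact values from the session.

```python
# Sketch of the ViT body: (LayerNorm -> multi-head self-attention -> skip
# connection -> LayerNorm -> MLP -> skip connection) repeated, followed by
# flattening and a dense classification head.
import tensorflow as tf
from tensorflow.keras import layers

def mlp(x, hidden_units, dropout_rate):
    # Feed-forward block: dense layers with GELU activations and dropout
    for units in hidden_units:
        x = layers.Dense(units, activation="gelu")(x)
        x = layers.Dropout(dropout_rate)(x)
    return x

def build_vit(input_shape=(72, 72, 3), num_classes=10, patch_size=6,
              num_patches=144, projection_dim=64, num_heads=4,
              transformer_layers=8):
    inputs = layers.Input(shape=input_shape)
    patches = Patches(patch_size)(inputs)                 # from the earlier sketch
    x = PatchEncoder(num_patches, projection_dim)(patches)

    for _ in range(transformer_layers):
        # Multi-head self-attention over all patches (global dependencies)
        x1 = layers.LayerNormalization(epsilon=1e-6)(x)
        attn = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)
        x2 = layers.Add()([attn, x])                      # skip connection
        # Per-patch feed-forward network
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        x3 = mlp(x3, hidden_units=[projection_dim * 2, projection_dim],
                 dropout_rate=0.1)
        x = layers.Add()([x3, x2])                        # skip connection

    # Classification head: flatten the patch representations and classify
    rep = layers.LayerNormalization(epsilon=1e-6)(x)
    rep = layers.Flatten()(rep)
    rep = layers.Dropout(0.5)(rep)
    features = mlp(rep, hidden_units=[2048, 1024], dropout_rate=0.5)
    logits = layers.Dense(num_classes)(features)
    return tf.keras.Model(inputs=inputs, outputs=logits)
```

The model would then be compiled with an optimizer that supports weight decay (e.g., AdamW) and a cross-entropy loss on the logits, using the hyperparameters listed in the next section.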
Hyperparameters
- The following hyperparameters define the architecture and the training process:
  - Learning rate
  - Weight decay
  - Batch size
  - Image size
  - Patch size
  - Projection dimension
  - Number of attention heads
  - Transformer units
  - MLP head units
- Adjust hyperparameters based on training and validation performance to optimize the model (illustrative starting values are sketched below)
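Illustrative hyperparameter values for a ViT on CIFAR-10 (assumed defaults for experimentation, not necessarily the exact settings used in the session):

```python
# Assumed, illustrative hyperparameters for a ViT on CIFAR-10
learning_rate = 0.001
weight_decay = 0.0001
batch_size = 256
num_epochs = 50
image_size = 72                  # inputs resized from 32x32 to 72x72
patch_size = 6                   # (72 // 6) ** 2 = 144 patches per image
num_patches = (image_size // patch_size) ** 2
projection_dim = 64              # dimension each patch is projected to
num_heads = 4                    # heads per multi-head attention layer
transformer_units = [projection_dim * 2, projection_dim]  # MLP inside each block
transformer_layers = 8           # number of stacked Transformer blocks
mlp_head_units = [2048, 1024]    # dense layers in the classification head
```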
Practical Tips
- Maintain spatial information and relationships in patches through positional embeddings
- Use batched data processing to handle large datasets efficiently
- Employ data augmentation techniques to improve model generalization (see the pipeline sketch after this list)
- Tweak hyperparameters iteratively to find the optimal configuration
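A sketch of an augmentation pipeline built from standard Keras preprocessing layers; the resize step can stand in for the OpenCV resize mentioned above, and the specific factors are illustrative assumptions.

```python
# Sketch: resize + random flips/rotations/zooms applied to training batches
import tensorflow as tf
from tensorflow.keras import layers

data_augmentation = tf.keras.Sequential(
    [
        layers.Resizing(72, 72),                  # upscale 32x32 CIFAR images
        layers.RandomFlip("horizontal"),          # random horizontal flips
        layers.RandomRotation(factor=0.02),       # small random rotations
        layers.RandomZoom(height_factor=0.2, width_factor=0.2),
    ],
    name="data_augmentation",
)

# Usage during training, e.g. inside the model or on a batch:
# augmented = data_augmentation(x_batch)
```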
Upcoming Steps
- Next session will cover:
  - Data augmentation techniques
  - Implementation details of the Transformer block
  - Coding specific elements such as the embeddings, patches, and the classification head
Q&A
- Clarified the roles of the individual hyperparameters and their importance
- Ensured understanding of core concepts through examples
Conclusion
- Emphasis on understanding Vision Transformers and their practical application
- Encouragement to experiment with datasets and hyperparameters
Next Steps: Continue with detailed coding and architectural implementation in the next session.