Vision Transformers Lecture

Jul 18, 2024

Introduction

  • Today's topic: Vision Transformers
  • Related to the Transformer architecture used in GPT (Generative Pre-trained Transformer)
  • Background on Transformers:
    • Originally designed for Natural Language Processing (NLP)
    • Later adapted to vision tasks

Transformers Overview

  • GPT/ChatGPT: Examples of Transformer-based architectures that generate text
  • Vision Transformers: Apply the same architecture to image recognition

Dataset: CIFAR

  • The CIFAR datasets are used for benchmarking
  • Types of CIFAR datasets:
    • CIFAR-10: 10 classes
    • CIFAR-100: 100 classes grouped into 20 superclasses
  • CIFAR dataset details:
    • 60,000 images in total (50,000 training, 10,000 testing)
    • Image dimensions: 32x32 pixels, so the data should be handled efficiently in batches (see the loading sketch after this list)
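As a minimal loading sketch, assuming a TensorFlow/Keras workflow (the lecture does not fix a specific framework), CIFAR-100 can be pulled in with a single call:

```python
# Minimal sketch: loading CIFAR-100 with Keras (an assumed framework choice).
from tensorflow import keras

# Returns 50,000 training and 10,000 test images of shape (32, 32, 3),
# with one integer label per image for the 100 fine-grained classes.
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()

print(f"x_train: {x_train.shape}, y_train: {y_train.shape}")  # (50000, 32, 32, 3), (50000, 1)
print(f"x_test:  {x_test.shape}, y_test:  {y_test.shape}")    # (10000, 32, 32, 3), (10000, 1)
print(f"pixel range: {x_train.min()}..{x_train.max()}")       # 0..255
```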

Vision Transformer (ViT) Architecture

  • Patch Embeddings: Split the image into smaller, non-overlapping patches and project each patch into a vector
  • Positional Embeddings: Retain the spatial position of each patch (both stages are sketched after this list)
  • Transformer Encoder: Comprises two main components:
    • Self-attention mechanism (multi-head attention)
    • Feed-forward neural network (Multi-Layer Perceptron, MLP)
  • Classification Head: Final part responsible for classifying the image
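A hedged sketch of the patch and positional embedding stages, again assuming Keras: the two custom layers below split an image into patches, then project and position-encode them. The class names `Patches` and `PatchEncoder` are illustrative, not terms from the lecture.

```python
import tensorflow as tf
from tensorflow.keras import layers


class Patches(layers.Layer):
    """Split a batch of images into non-overlapping square patches."""

    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        # Flatten each patch into a vector: (batch, num_patches, patch_dims)
        patch_dims = patches.shape[-1]
        return tf.reshape(patches, [batch_size, -1, patch_dims])


class PatchEncoder(layers.Layer):
    """Project each patch and add a learned positional embedding."""

    def __init__(self, num_patches, projection_dim):
        super().__init__()
        self.num_patches = num_patches
        self.projection = layers.Dense(projection_dim)
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patches):
        # One position index per patch, so spatial order is preserved.
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        return self.projection(patches) + self.position_embedding(positions)
```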

Key Concepts

  • Multi-head attention: Enables the model to capture global dependencies and relationships between patches
  • MLP (Multi-Layer Perceptron): A stack of fully connected layers that further transforms each patch representation
  • Positional Embeddings: Encode each patch's position so spatial relationships survive flattening into a one-dimensional sequence
    • Example: Recognizing letters and their order within a word
  • Self-attention mechanism: Weighs how relevant each patch is to every other patch (see the encoder-block sketch below)
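To make these concepts concrete, the sketch below wires multi-head self-attention and an MLP into a single Transformer encoder block using standard Keras layers. The layer ordering, residual connections, and dropout rate are common design choices assumed for illustration, not specifics from the lecture.

```python
from tensorflow import keras
from tensorflow.keras import layers


def transformer_block(encoded_patches, num_heads, projection_dim, mlp_units, dropout=0.1):
    # Layer norm + multi-head self-attention over all patches (global context).
    x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
    attention_output = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=projection_dim, dropout=dropout
    )(x1, x1)
    # Residual connection around the attention sub-layer.
    x2 = layers.Add()([attention_output, encoded_patches])

    # Layer norm + position-wise MLP (feed-forward network).
    x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
    for units in mlp_units:
        x3 = layers.Dense(units, activation=keras.activations.gelu)(x3)
        x3 = layers.Dropout(dropout)(x3)

    # Second residual connection around the MLP sub-layer.
    return layers.Add()([x3, x2])
```

Pre-layer-normalization with residuals around both sub-layers is one common arrangement; other orderings are also used in practice.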

Implementation Steps

  • Input Layer: 32x32 images, upscaled to 72x72 using OpenCV for practical purposes
  • Data Augmentation: Alters training data to improve robustness (e.g., flipping images)
  • Patches Creation: Splitting images into patches for processing
  • Transformer Block: Core processing block of the ViT
  • Flattening: Converting processed patches into vectors for classification
  • Output Layer: Dense layer that classifies the image based on the learned features (an end-to-end sketch follows this list)
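Tying the steps together, a rough end-to-end assembly might look like the sketch below. It reuses the `Patches`, `PatchEncoder`, and `transformer_block` sketches from earlier sections, substitutes Keras preprocessing layers for the OpenCV resize to keep the example self-contained, and uses illustrative default sizes.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_vit_classifier(
    input_shape=(32, 32, 3),     # CIFAR image shape
    image_size=72,               # upscaled resolution
    patch_size=6,
    projection_dim=64,
    num_heads=4,
    transformer_layers=8,
    mlp_head_units=(2048, 1024),
    num_classes=100,
):
    num_patches = (image_size // patch_size) ** 2
    inputs = keras.Input(shape=input_shape)

    # Input layer + data augmentation (resize 32x32 -> 72x72, random flips).
    augmented = keras.Sequential(
        [layers.Resizing(image_size, image_size), layers.RandomFlip("horizontal")]
    )(inputs)

    # Patches creation and patch/positional embedding.
    patches = Patches(patch_size)(augmented)
    encoded = PatchEncoder(num_patches, projection_dim)(patches)

    # Stacked Transformer blocks (the core processing stage).
    for _ in range(transformer_layers):
        encoded = transformer_block(
            encoded,
            num_heads=num_heads,
            projection_dim=projection_dim,
            mlp_units=[projection_dim * 2, projection_dim],
        )

    # Flattening followed by the classification head.
    representation = layers.LayerNormalization(epsilon=1e-6)(encoded)
    representation = layers.Flatten()(representation)
    representation = layers.Dropout(0.5)(representation)
    for units in mlp_head_units:
        representation = layers.Dense(units, activation=keras.activations.gelu)(representation)
        representation = layers.Dropout(0.5)(representation)
    outputs = layers.Dense(num_classes)(representation)  # logits over the classes

    return keras.Model(inputs=inputs, outputs=outputs)
```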

Hyperparameters

  • Defined hyperparameters guide the architecture and learning process:
    • Learning rate
    • Weight decay
    • Batch size
    • Image size
    • Patch size
    • Projection dimensions
    • Attention heads
    • Transformer units
    • MLP head units
  • Adjust hyperparameters based on observed learning performance to optimize model efficiency (example values below)
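For concreteness, one possible configuration is sketched below; the specific values are illustrative assumptions rather than numbers prescribed in the lecture.

```python
# Illustrative hyperparameter values (assumed for the example, not fixed by the lecture)
learning_rate = 0.001
weight_decay = 0.0001
batch_size = 256
image_size = 72            # input resized from 32x32 up to 72x72
patch_size = 6             # 72 / 6 = 12 patches per side -> 144 patches per image
num_patches = (image_size // patch_size) ** 2
projection_dim = 64        # dimensionality of each encoded patch
num_heads = 4              # attention heads in each multi-head attention layer
transformer_units = [projection_dim * 2, projection_dim]  # MLP sizes inside each block
transformer_layers = 8     # number of stacked Transformer blocks
mlp_head_units = [2048, 1024]  # dense layer sizes in the classification head
```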

Practical Tips

  • Maintain spatial information and relationships in patches through positional embeddings
  • Use batched data processing to handle large datasets efficiently
  • Employ data augmentation techniques to improve model generalization
  • Tweak hyperparameters iteratively to find the optimal configuration

Upcoming Steps

  • Next session to cover:
    • Data augmentation techniques
    • Implementation details of the Transformer block
    • Coding specific elements such as the embeddings, patch creation, and the classification head

Q&A

  • Clarified the role and relative importance of the individual hyperparameters
  • Ensured understanding of core concepts through examples

Conclusion

  • Emphasis on understanding Vision Transformers and their practical application
  • Encouragement to experiment with datasets and hyperparameters

Next Steps: Continue with detailed coding and architectural implementation in the next session.