
Understanding Vision Transformers in Computer Vision

Aug 12, 2024

Notes on Vision Transformer Lecture

Introduction

  • Speaker: Rowel Atienza, Professor at the University of the Philippines
  • Topic: Recent breakthrough in computer vision: the Vision Transformer (ViT)
  • Outline:
    • Importance of Vision Transformer
    • Concept of Attention
    • Model Architecture of Vision Transformer
    • Applications
    • Limitations and Future Directions

Why Vision Transformer Matters

  • Transformer Architecture:
    • High-capacity network architecture capable of approximating complex functions.
    • Dominant in natural language processing (NLP) but less prevalent in computer vision (CV).
    • CNNs have been the traditional model for CV.
  • Breakthrough:
    • Google researchers made transformers applicable to vision.
    • Vision Transformer outperforms CNN-based models in various tasks (recognition, detection, segmentation).
  • General-purpose Architecture:
    • Can process different data formats (text, audio, image, video).
    • Promotes multimodal learning, reflecting the inherently multimodal nature of the world.

Attention Mechanism

  • Concept:
    • Attention highlights relevant features in images.
    • Example with a bird photo: Patches from the bird have high attention; background patches have low attention.
  • Mathematical Expression:
    • Patches converted into high-dimensional feature vectors.
    • Attention is computed from dot products between these vectors (a minimal sketch follows this list).
  • NLP Perspective:
    • High attention between related words (e.g., "brown" and "fox"); low attention otherwise (e.g., "brown" and "dog").
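
A minimal sketch of the idea, not taken from the lecture: the patch features here are random placeholders, and attention between patches is simply a softmax over their pairwise dot products.

```python
import torch

# Hypothetical patch features: N patches, each a d-dimensional vector.
N, d = 9, 64
features = torch.randn(N, d)

# Pairwise dot products: entry (i, j) measures how strongly patch i attends to patch j.
scores = features @ features.T / d ** 0.5   # scale by sqrt(d) for numerical stability

# Softmax turns each row of scores into a probability distribution over patches.
attn = torch.softmax(scores, dim=-1)        # shape (N, N); each row sums to 1
```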

Building the Vision Transformer Model

Image Pre-processing

  • Image to Patches:
    • Input image is split into patches (9 in the lecture example; 196 for a 224×224 ImageNet image with 16×16 patches).
  • Linear Projection:
    • Each patch reshaped into a 1D vector, multiplied by a weight matrix to produce feature vectors.
    • Can be performed by a dense layer or a strided convolution (see the sketch after this list).
  • Position Embedding:
    • Necessary to provide the model with positional information, since self-attention is permutation-invariant and otherwise has no notion of patch order.
    • Different algorithms available (sinusoidal, learnable embeddings, rotary embedding).
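
The sketch below illustrates these pre-processing steps under common ViT-Base assumptions (224×224 input, 16×16 patches, hidden size 768); it is an illustration, not the lecture's code.

```python
import torch
from torch import nn

# A convolution with kernel_size = stride = patch_size is equivalent to cutting
# the image into non-overlapping patches, flattening each into a 1D vector, and
# applying a dense layer.
patch_size, hidden = 16, 768
proj = nn.Conv2d(3, hidden, kernel_size=patch_size, stride=patch_size)

img = torch.randn(1, 3, 224, 224)            # dummy input image
feat = proj(img)                              # (1, 768, 14, 14)
tokens = feat.flatten(2).transpose(1, 2)      # (1, 196, 768): one feature vector per patch

# Learnable position embeddings (one of the options mentioned above), added to every token.
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1], hidden))
tokens = tokens + pos_embed
```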

Encoder Module

  • Encoder Structure:
    • Composed of self-attention and MLP blocks.
  • Attention Function:
    • Defined by query, key, and value.
    • Uses softmax to convert normalized dot products into probabilities.
  • Improvements:
    • Layer normalization stabilizes training.
    • Skip connections enhance performance by propagating representations across layers.
    • Multi-head self-attention improves representations by capturing diverse features (a PyTorch sketch of one encoder block follows this list).
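
A PyTorch sketch of one encoder block, assuming ViT-Base sizes; it combines multi-head self-attention, an MLP, layer normalization, and skip connections as described above, but is an illustration rather than the lecture's implementation.

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):
    """One transformer encoder block: layer norm -> multi-head self-attention,
    then layer norm -> MLP, each wrapped in a skip (residual) connection."""

    def __init__(self, hidden=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden)
        self.mlp = nn.Sequential(
            nn.Linear(hidden, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, hidden)
        )

    def forward(self, x):
        # Self-attention uses the same tokens as query, key, and value.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # MLP block with another skip connection.
        return x + self.mlp(self.norm2(x))
```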

Model Architecture Summary

  • Overall structure runs from the input image, through the linear patch projection and stacked encoder modules, to an MLP head for classification (sketched below).
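
A minimal end-to-end sketch that reuses the EncoderBlock from the previous sketch; mean pooling over tokens stands in for the class token, and all sizes are the ViT-Base defaults.

```python
import torch
from torch import nn

class SimpleViT(nn.Module):
    """Patch projection -> position embedding -> stacked encoder blocks -> MLP head."""

    def __init__(self, num_classes=1000, depth=12, hidden=768):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, hidden, kernel_size=16, stride=16)
        self.pos_embed = nn.Parameter(torch.zeros(1, 196, hidden))
        self.encoder = nn.Sequential(*[EncoderBlock(hidden) for _ in range(depth)])
        self.head = nn.Sequential(nn.LayerNorm(hidden), nn.Linear(hidden, num_classes))

    def forward(self, img):
        x = self.patch_embed(img).flatten(2).transpose(1, 2)   # (B, 196, hidden)
        x = self.encoder(x + self.pos_embed)                   # stacked encoder modules
        return self.head(x.mean(dim=1))                        # pool tokens, then classify
```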

Different Versions of Vision Transformer

  • Model Sizes:
    • Base: 12 encoder layers, hidden size 768, MLP size 3,072, 12 heads (≈86 million parameters).
    • Smaller variants: Small (≈21 million parameters) and Tiny (≈5 million parameters); reference configurations follow this list.
  • Training:
    • Pre-train on large datasets such as JFT-300M before fine-tuning on specific tasks.
    • Performance improves significantly with larger datasets.
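
For reference, the variants listed above can be summarized as a small set of configurations (Small and Tiny follow the commonly used DeiT-style scaling; treat the exact values as assumptions rather than lecture content).

```python
# Approximate reference configurations for common ViT variants.
vit_configs = {
    "base":  dict(layers=12, hidden=768, mlp=3072, heads=12),
    "small": dict(layers=12, hidden=384, mlp=1536, heads=6),
    "tiny":  dict(layers=12, hidden=192, mlp=768,  heads=3),
}
```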

Applications and Performance

  • Segmentation Models:
    • SETR and SegFormer use vision transformers to achieve state-of-the-art results.
  • Hybrid Architectures:
    • Models such as TransUNet and ViTSTR combine CNN and transformer approaches for improved performance.

Limitations and Future Directions

  • Challenges:
    • Cost of computing attention grows quadratically with the number of tokens.
    • High parameter counts make deployment in resource-constrained environments challenging.
  • Future Research:
    • Investigate hybrid architectures and depthwise convolutions in attention.

Broader Implications

  • Transformers as general-purpose networks applicable to various data types.
  • Perceiver IO demonstrates that a transformer-style model can be trained without a fixed input/output structure.

Implementation

  • Open-source implementations available for practical use.
  • Libraries such as PyTorch simplify building Vision Transformer architectures (an example follows).
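
For example, the timm library (named here as an assumption, not something cited in the lecture) exposes pre-trained Vision Transformers through a single call:

```python
import timm
import torch

# Create a pre-trained ViT-Base with 16x16 patches at 224x224 resolution.
model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))   # (1, 1000) ImageNet class logits
print(logits.shape)
```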

Closing

  • Speaker's GitHub profile contains open-source implementations and lecture notes.