Understanding Vision Transformers in Computer Vision
Aug 12, 2024
Notes on Vision Transformer Lecture
Introduction
Speaker: Rowel Atienza, Professor at the University of the Philippines
Topic: A recent breakthrough in computer vision: the Vision Transformer (ViT)
Outline:
Importance of Vision Transformer
Concept of Attention
Model Architecture of Vision Transformer
Applications
Limitations and Future Directions
Why Vision Transformer Matters
Transformer Architecture:
High-capacity network architecture capable of approximating complex functions.
Dominant in natural language processing (NLP) but less prevalent in computer vision (CV).
CNNs have been the traditional model for CV.
Breakthrough:
Google researchers made transformers applicable to vision.
Vision Transformer outperforms CNN-based models in various tasks (recognition, detection, segmentation).
General-purpose Architecture:
Can process different data formats (text, audio, image, video).
Promotes multimodal learning, reflecting the inherently multimodal nature of the world.
Attention Mechanism
Concept:
Attention highlights relevant features in images.
Example with a bird photo: Patches from the bird have high attention; background patches have low attention.
Mathematical Expression:
Patches converted into high-dimensional feature vectors.
Attention computed using dot products between these vectors.
NLP Perspective:
High attention between related words (e.g., "brown" and "fox"); low attention otherwise (e.g., "brown" and "dog").
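A minimal toy sketch of this idea (illustrative vectors, not the lecture's code): attention scores are dot products between patch feature vectors, normalized with a softmax so that related patches receive high weight.

```python
import torch
import torch.nn.functional as F

# Toy example: 4 patch feature vectors of dimension 8 (e.g., bird patches vs. background).
patches = torch.randn(4, 8)

# Pairwise dot products measure similarity between patches.
scores = patches @ patches.T                                  # shape: (4, 4)

# Softmax over each row turns scores into attention weights that sum to 1.
attention = F.softmax(scores / patches.shape[-1] ** 0.5, dim=-1)

print(attention)  # row i: how strongly patch i attends to every other patch
```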
Building the Vision Transformer Model
Image Pre-processing
Image to Patches:
Input image is split into patches (9 in the lecture example; 196 for a 224×224 ImageNet image with 16×16 patches).
Linear Projection:
Each patch reshaped into a 1D vector, multiplied by a weight matrix to produce feature vectors.
Can be performed by dense layers or strided convolution.
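A minimal sketch of this step, assuming a 224×224 RGB input and 16×16 patches; the class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and linearly project each one."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution with kernel = stride = patch_size is equivalent to
        # reshaping each patch into a 1D vector and multiplying by a weight matrix.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, 768) -- one feature vector per patch
        return x

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```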
Position Embedding:
Necessary to provide the model with positional information, since self-attention has no built-in notion of patch order (it is permutation-equivariant).
Different algorithms available (sinusoidal, learnable embeddings, rotary embedding).
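A sketch of the learnable-embedding option listed above; the class token shown here follows the original ViT recipe and is an assumption, since the notes do not mention it explicitly:

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 768

# Learnable [class] token and learnable position embeddings (one per token).
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

def add_positions(patch_tokens):            # patch_tokens: (B, 196, 768)
    b = patch_tokens.shape[0]
    cls = cls_token.expand(b, -1, -1)       # prepend the class token
    tokens = torch.cat([cls, patch_tokens], dim=1)
    return tokens + pos_embed               # inject positional information
```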
Encoder Module
Encoder Structure:
Composed of self-attention and MLP blocks.
Attention Function:
Defined by query, key, and value.
Uses softmax to convert scaled dot products (query · key / √d) into attention weights.
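The attention function described above, written out as a single-head sketch (names are illustrative):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q · kᵀ / sqrt(d)) · v."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # scaled dot products
    weights = F.softmax(scores, dim=-1)           # probabilities over keys
    return weights @ v                            # weighted sum of values
```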
Improvements:
Layer normalization stabilizes training.
Skip connections enhance performance by propagating representations across layers.
Multi-head self-attention improves representations by capturing diverse features.
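A minimal pre-norm encoder block combining these pieces (layer normalization, multi-head self-attention, MLP, and skip connections); the exact layout is an assumption based on the standard ViT encoder:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                 # layer norm stabilizes training
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )

    def forward(self, x):                              # x: (B, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # skip connection 1
        x = x + self.mlp(self.norm2(x))                      # skip connection 2
        return x
```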
Model Architecture Summary
Overall structure from input image to linear projection, through stacked encoder modules, to MLP head for classification.
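A self-contained sketch of that overall structure, using PyTorch's built-in encoder layers as a stand-in for the hand-written block above (an assumption, not the lecture's reference code):

```python
import torch
import torch.nn as nn

class VisionTransformer(nn.Module):
    """Sketch: patch embedding -> position embedding -> stacked encoders -> MLP head."""
    def __init__(self, img_size=224, patch_size=16, num_classes=1000,
                 depth=12, dim=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        block = nn.TransformerEncoderLayer(dim, num_heads, dim_feedforward=mlp_dim,
                                           activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)        # MLP head for classification

    def forward(self, x):                              # x: (B, 3, 224, 224)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)     # (B, 196, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed        # add positional information
        x = self.norm(self.encoder(x))
        return self.head(x[:, 0])                              # classify from the class token

logits = VisionTransformer()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```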
Different Versions of Vision Transformer
Model Sizes:
Base: 12 encoder layers, hidden size 768, MLP size 3,072, 12 heads (~86 million parameters).
Smaller versions: Small (21 million parameters), Tiny (5 million parameters).
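The same configurations expressed as a small lookup table in code; the Base row matches the figures above, while the Small and Tiny hidden/MLP sizes follow common conventions and are not stated explicitly in the notes:

```python
# (encoder layers, hidden size, MLP size, heads) -- approximate parameter counts in comments
vit_configs = {
    "Base":  (12, 768, 3072, 12),   # ~86M parameters
    "Small": (12, 384, 1536, 6),    # ~21M parameters (common convention)
    "Tiny":  (12, 192, 768, 3),     # ~5M parameters (common convention)
}
```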
Training:
Pre-train on large datasets such as JFT-300M before fine-tuning on specific tasks.
Performance improves significantly with larger datasets.
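A hedged fine-tuning sketch: load an ImageNet-pre-trained ViT-Base (here via torchvision, assuming a recent version), replace the classification head for a hypothetical 10-class task, and train with a small learning rate:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load an ImageNet-pre-trained ViT-Base from torchvision (assumes torchvision >= 0.13).
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Replace the classification head for a hypothetical 10-class downstream task.
model.heads.head = nn.Linear(model.heads.head.in_features, 10)

# Fine-tune with a small learning rate, since the backbone is already well initialized.
optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
```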
Applications and Performance
Segmentation Models:
SETR and SegFormer use vision transformers to achieve state-of-the-art results.
Hybrid Architectures:
Models such as TransUNet and ViTSTR combine CNN and transformer approaches for improved performance.
Limitations and Future Directions
Challenges:
Cost of computing attention grows quadratically with the number of patches.
High parameter counts make deployment in resource-constrained environments difficult.
Future Research:
Investigate hybrid architectures and depthwise convolutions in attention.
Broader Implications
Transformers as general-purpose networks applicable to various data types.
Perceiver IO demonstrates training without strict assumptions about input/output structure.
Implementation
Open-source implementations available for practical use.
Simplified creation of Vision Transformer architectures using libraries like PyTorch.
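For example, the open-source `timm` library (one widely used implementation, not necessarily the one referenced in the lecture) builds a pre-trained ViT in one line:

```python
import timm
import torch

# Create a pre-trained ViT-Base; timm also exposes Small and Tiny variants.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 10])
```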
Closing
Speaker's GitHub profile contains open-source implementations and lecture notes.