Understanding Vision Transformers in Image Classification

Nov 19, 2024

Lecture on Vision Transformer (ViT)

Introduction to Vision Transformer

  • Vision Transformer (ViT) is a new state-of-the-art model for image classification.
  • Surpasses the accuracy of ConvNets like ResNet.
  • ViT leverages the Transformer architecture originally developed for NLP.

Image Classification

  • Task: Infer the content of an image (e.g., identify a dog in a picture).
  • Neural network outputs a vector p where each element represents the confidence for a class.
  • Example: if the dog class has the largest entry in p, say 0.4, the network is 40% confident the image shows a dog (as sketched below).
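
A minimal sketch of this output step, assuming PyTorch; the class names and scores are made up for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical raw scores (logits) from a classifier over four classes:
# index 0 = cat, 1 = dog, 2 = car, 3 = tree
logits = torch.tensor([0.2, 1.1, -0.5, 0.3])

p = F.softmax(logits, dim=-1)   # confidence vector p, entries sum to 1
print(p)                        # the dog entry is the largest here
print(p.argmax().item())        # predicted class index (1 = dog)
```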

Vision Transformer vs. ConvNets

  • ResNet was previously the best solution for image classification.
  • ViT surpasses ResNet when pre-trained on a sufficiently large dataset.
  • Advantage of ViT over ResNet increases with larger datasets.

Architecture of Vision Transformer

  • Partitioning of Images:

    • Input image is split into patches (e.g., 16x16), similar to tokens in NLP.
    • Patches can be non-overlapping or overlapping.
    • The stride of the sliding window determines whether patches overlap: a stride equal to the patch size gives non-overlapping patches, a smaller stride gives overlapping ones.
  • Vectorization:

    • Patches are reshaped into vectors.
    • Linear transformation (dense layer) is applied to these vectors without non-linear activation.
  • Positional Encoding:

    • Needed because self-attention is permutation-invariant: without positional information, swapping two patches would leave the output unchanged.
    • Various encoding methods can be used; all yield similar accuracy (a combined sketch of partitioning, vectorization, and positional encoding follows this list).
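
The three steps above (partitioning, vectorization, positional encoding) can be combined into a single patch-embedding module. The sketch below assumes PyTorch, non-overlapping patches, and a learnable positional embedding (one of the interchangeable options mentioned above); the class and parameter names are illustrative, not the paper's reference implementation:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, flatten them, project them linearly,
    and add a learnable positional encoding."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # Linear transformation of flattened patches (no non-linear activation)
        self.proj = nn.Linear(patch_size * patch_size * in_chans, embed_dim)
        # Learnable positional encoding, one vector per patch
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # Non-overlapping partitioning: stride equals the patch size
        x = x.unfold(2, p, p).unfold(3, p, p)  # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, self.num_patches, -1)
        z = self.proj(x)                       # vectors Z1 ... Zn
        return z + self.pos_embed              # add positional information
```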

Building the ViT Network

  • Vectors Z1 to Zn represent patches after linear transformation and positional encoding.

  • CLS Token:

    • A learnable classification token prepended to the patch sequence; the embedding layer outputs it as vector Z0, and its final representation is used for classification.
  • Transformer Layers:

    • Sequence Z0 to Zn processed by multi-head self-attention layers and dense layers.
    • Optional addition of skip connections and normalization.
  • Output:

    • The final output vector C0 at the CLS position is used for classification and fed into a Softmax classifier (see the sketch after this list).
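
A minimal sketch of this stack, assuming PyTorch and using nn.TransformerEncoder as a stand-in for the multi-head self-attention and dense layers (it already includes skip connections and layer normalization); names and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """CLS token + transformer encoder + classification head over the CLS output."""

    def __init__(self, embed_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        # Learnable CLS token, prepended to the patch sequence as Z0
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, z):                     # z: (B, num_patches, embed_dim) from the patch embedding
        B = z.size(0)
        cls = self.cls_token.expand(B, -1, -1)
        z = torch.cat([cls, z], dim=1)        # sequence Z0, Z1, ..., Zn
        c = self.encoder(z)                   # multi-head self-attention + dense layers
        c0 = c[:, 0]                          # output vector C0 at the CLS position
        return self.head(c0)                  # raw class scores
```

The head returns raw scores; applying softmax to them gives the confidence vector p from the classification section, while during training the raw scores go straight into a cross-entropy loss.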

Training the Vision Transformer

  • Pre-training and Fine-tuning Steps:

    • Pre-train on large dataset A (e.g., 300 million images).
    • Fine-tune on smaller, target dataset B (e.g., ImageNet).
    • Evaluate using test accuracy on dataset B (a training-loop sketch follows this list).
  • Datasets:

    • Small ImageNet (ImageNet-1k): 1.3 million images, 1,000 classes.
    • Big ImageNet (ImageNet-21k): 14 million images, 21,000 classes.
    • JFT (Google's internal dataset): 300 million images, 18,000 classes.
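
A hedged sketch of this two-stage recipe, assuming a PyTorch model that maps images to raw class scores and exposes its final linear layer as .head (the patch-embedding and encoder sketches above would be composed into such a model); the dataloaders, epoch counts, and learning rates are placeholders, not the lecture's settings:

```python
import torch
import torch.nn as nn

def run_epoch(model, loader, optimizer, device="cuda"):
    """One training pass over a dataloader with a cross-entropy objective."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)   # model returns raw class scores
        loss.backward()
        optimizer.step()

def pretrain_then_finetune(model, loader_A, loader_B, num_classes_B,
                           pretrain_epochs=7, finetune_epochs=3, device="cuda"):
    """Pre-train on the large dataset A, then fine-tune on the smaller target dataset B."""
    model = model.to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for _ in range(pretrain_epochs):              # stage 1: pre-training on dataset A
        run_epoch(model, loader_A, opt, device)
    # Stage 2: replace the classification head for dataset B's label set and fine-tune
    model.head = nn.Linear(model.head.in_features, num_classes_B).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(finetune_epochs):
        run_epoch(model, loader_B, opt, device)
    return model                                  # report test accuracy on dataset B
```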

Experimental Findings

  • ViT requires very large datasets for effective pre-training.
  • Outperforms ResNet with datasets over 100 million images.
  • On smaller pre-training datasets, ViT performs worse than ResNet.
  • Larger pre-training datasets yield better ViT performance.

Conclusion

  • Vision Transformer is advantageous over CNNs with sufficiently large datasets.
  • Has high data requirements for pre-training.
  • Links to the lecture slides are available below the video.