Understanding Vision Transformers in Image Classification

Nov 19, 2024

Lecture on Vision Transformer (ViT)

Introduction to Vision Transformer

  • Vision Transformer (ViT) is a new state-of-the-art model for image classification.
  • Surpasses the accuracy of ConvNets like ResNet.
  • ViT leverages the Transformer architecture originally developed for NLP.

Image Classification

  • Task: Infer the content of an image (e.g., identify a dog in a picture).
  • Neural network outputs a vector p where each element represents the confidence for a class.
  • Example: if the dog class has the largest entry in p, say 0.4, the network is 40% confident the image shows a dog (as sketched below).
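
A minimal sketch of this output step, assuming PyTorch; the class names and scores are made up for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical raw scores (logits) from a classifier over four classes:
# index 0 = cat, 1 = dog, 2 = car, 3 = tree
logits = torch.tensor([0.2, 1.1, -0.5, 0.3])

p = F.softmax(logits, dim=-1)   # confidence vector p, entries sum to 1
print(p)                        # the dog entry is the largest here
print(p.argmax().item())        # predicted class index (1 = dog)
```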

Vision Transformer vs. ConvNets

  • ResNet was previously the best solution for image classification.
  • ViT surpasses ResNet when pre-trained on a sufficiently large dataset.
  • Advantage of ViT over ResNet increases with larger datasets.

Architecture of Vision Transformer

  • Partitioning of Images:

    • Input image is split into patches (e.g., 16x16), similar to tokens in NLP.
    • Patches can be non-overlapping or overlapping.
    • The stride of the sliding window determines whether patches overlap: a stride equal to the patch size gives non-overlapping patches, a smaller stride gives overlapping ones.
  • Vectorization:

    • Patches are reshaped into vectors.
    • Linear transformation (dense layer) is applied to these vectors without non-linear activation.
  • Positional Encoding:

    • Needed because self-attention is permutation-invariant: without positional information, swapping two patches would leave the output unchanged.
    • Various encoding methods can be used; all yield similar accuracy (a combined sketch of partitioning, vectorization, and positional encoding follows this list).
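
The three steps above (partitioning, vectorization, positional encoding) can be combined into a single patch-embedding module. The sketch below assumes PyTorch, non-overlapping patches, and a learnable positional embedding (one of the interchangeable options mentioned above); the class and parameter names are illustrative, not the paper's reference implementation:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, flatten them, project them linearly,
    and add a learnable positional encoding."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # Linear transformation of flattened patches (no non-linear activation)
        self.proj = nn.Linear(patch_size * patch_size * in_chans, embed_dim)
        # Learnable positional encoding, one vector per patch
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # Non-overlapping partitioning: stride equals the patch size
        x = x.unfold(2, p, p).unfold(3, p, p)  # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, self.num_patches, -1)
        z = self.proj(x)                       # vectors Z1 ... Zn
        return z + self.pos_embed              # add positional information
```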

Building the ViT Network

  • Vectors Z1 to Zn represent patches after linear transformation and positional encoding.

  • CLS Token:

    • A learnable classification token prepended to the patch sequence; the embedding layer outputs it as vector Z0, and its final representation is used for classification.
  • Transformer Layers:

    • Sequence Z0 to Zn processed by multi-head self-attention layers and dense layers.
    • Optional addition of skip connections and normalization.
  • Output:

    • The final output vector C0 at the CLS position is used for classification and fed into a Softmax classifier (see the sketch after this list).
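
A minimal sketch of this stack, assuming PyTorch and using nn.TransformerEncoder as a stand-in for the multi-head self-attention and dense layers (it already includes skip connections and layer normalization); names and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """CLS token + transformer encoder + classification head over the CLS output."""

    def __init__(self, embed_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        # Learnable CLS token, prepended to the patch sequence as Z0
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, z):                     # z: (B, num_patches, embed_dim) from the patch embedding
        B = z.size(0)
        cls = self.cls_token.expand(B, -1, -1)
        z = torch.cat([cls, z], dim=1)        # sequence Z0, Z1, ..., Zn
        c = self.encoder(z)                   # multi-head self-attention + dense layers
        c0 = c[:, 0]                          # output vector C0 at the CLS position
        return self.head(c0)                  # raw class scores
```

The head returns raw scores; applying softmax to them gives the confidence vector p from the classification section, while during training the raw scores go straight into a cross-entropy loss.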

Training the Vision Transformer

  • Pre-training and Fine-tuning Steps:

    • Pre-train on large dataset A (e.g., 300 million images).
    • Fine-tune on smaller, target dataset B (e.g., ImageNet).
    • Evaluate using test accuracy on dataset B (a training-loop sketch follows this list).
  • Datasets:

    • Small ImageNet (ImageNet-1k): 1.3 million images, 1,000 classes.
    • Big ImageNet (ImageNet-21k): 14 million images, 21,000 classes.
    • JFT (Google's internal dataset): 300 million images, 18,000 classes.
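
A hedged sketch of this two-stage recipe, assuming a PyTorch model that maps images to raw class scores and exposes its final linear layer as .head (the patch-embedding and encoder sketches above would be composed into such a model); the dataloaders, epoch counts, and learning rates are placeholders, not the lecture's settings:

```python
import torch
import torch.nn as nn

def run_epoch(model, loader, optimizer, device="cuda"):
    """One training pass over a dataloader with a cross-entropy objective."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)   # model returns raw class scores
        loss.backward()
        optimizer.step()

def pretrain_then_finetune(model, loader_A, loader_B, num_classes_B,
                           pretrain_epochs=7, finetune_epochs=3, device="cuda"):
    """Pre-train on the large dataset A, then fine-tune on the smaller target dataset B."""
    model = model.to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for _ in range(pretrain_epochs):              # stage 1: pre-training on dataset A
        run_epoch(model, loader_A, opt, device)
    # Stage 2: replace the classification head for dataset B's label set and fine-tune
    model.head = nn.Linear(model.head.in_features, num_classes_B).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(finetune_epochs):
        run_epoch(model, loader_B, opt, device)
    return model                                  # report test accuracy on dataset B
```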

Experimental Findings

  • ViT requires very large datasets for effective pre-training.
  • Outperforms ResNet with datasets over 100 million images.
  • On smaller pre-training datasets, ViT performs worse than ResNet.
  • Larger pre-training datasets yield better ViT performance.

Conclusion

  • Vision Transformer is advantageous over CNNs with sufficiently large datasets.
  • Has high data requirements for pre-training.
  • Links to the lecture slides are available below the video.