Lecture Notes: Vision Transformers and CNNs Comparison

Jul 10, 2024

Introduction

  • Topic: Quick guide to Vision Transformers
  • Goal: Summarize Vision Transformers and highlight the key distinctions between ViTs and CNNs.

Vision Transformer (ViT) Overview

  • Architecture: Extension of the Transformer for image data
    • Focus on the encoder part, not the decoder
  • Origins: Introduced by Google Research in the paper "An Image is Worth 16x16 Words" (2020)

Key Components of ViT

1. Encoder Part

  • Embeddings: Transform the input image into a sequence of numeric vectors.
    • The image is split into 16x16-pixel patches.
    • Each patch is flattened and passed through a fully connected (linear) layer to obtain its embedding.
  • Patching: Efficiently rearrange the image into patches (see the sketch below).
    • The einops package is handy for reshaping multi-dimensional arrays and tensors.
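
A minimal sketch of the patch embedding step, assuming PyTorch and einops; the class name PatchEmbedding and the default sizes are illustrative, not taken from the original paper code:

```python
import torch.nn as nn
from einops import rearrange

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each patch to an embedding."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        # One linear ("fully connected") projection shared across all patches
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)

    def forward(self, images):
        # images: (batch, channels, height, width)
        # Rearrange into a sequence of flattened 16x16 patches: (batch, num_patches, patch_dim)
        patches = rearrange(
            images, "b c (h p1) (w p2) -> b (h w) (p1 p2 c)",
            p1=self.patch_size, p2=self.patch_size,
        )
        return self.proj(patches)   # (batch, num_patches, embed_dim)
```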

2. Position Embeddings

  • Purpose: Make the model understand the position of each patch.
  • Details: A learnable parameter vector per patch position, added on top of the input embedding vectors (sketch below).
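
A sketch of learnable position embeddings, assuming PyTorch; the sizes (196 patches for a 224x224 image, embedding dimension 768) are illustrative:

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 768   # e.g. a 224x224 image split into 16x16 patches

# One learnable vector per patch position, added element-wise to the patch embeddings
# so the otherwise order-agnostic attention layers can tell positions apart.
pos_embedding = nn.Parameter(torch.randn(1, num_patches, embed_dim))

def add_position_embeddings(patch_embeddings):
    # patch_embeddings: (batch, num_patches, embed_dim)
    return patch_embeddings + pos_embedding
```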

3. Additional Components

  • CLS Token: A learnable classification token prepended to the patch sequence; it gathers all information into a single representation.
    • Through attention, it is filled with information from all other inputs.
  • Transformer Encoder: Repeated N times (see the block sketch below), consists of:
    • Multi-head Attention: Allows information sharing between the input patches.
    • Layer Normalization: Normalizes the activations within a layer, separately for each sample.
    • Feed Forward Network: A small MLP transforming the attention-weighted vectors position-wise.
    • Residual Connections: Improve the flow of gradients (avoid vanishing gradients).
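
A minimal sketch of one encoder block in the pre-norm style, assuming PyTorch; EncoderBlock and the default sizes are illustrative:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """LayerNorm -> multi-head attention -> residual, then LayerNorm -> MLP -> residual."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, embed_dim),
        )

    def forward(self, x):
        # Residual connections keep gradients flowing through deep stacks
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```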

Implementing ViT

  • Implementation involves patch embedding and passing the resulting token sequence through the Transformer encoder layers (a minimal assembly sketch follows below).
  • Familiarize yourself with helper libraries such as einops (a small DSL for tensor reshaping) to simplify the implementation.
  • Key building blocks: multi-head attention wrapped in layer normalization, feed-forward blocks, and residual connections.
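
A minimal assembly sketch reusing the PatchEmbedding and EncoderBlock sketches above; the CLS token is prepended before the encoder stack and its final state feeds the classification head. Names and sizes are again illustrative:

```python
import torch
import torch.nn as nn

class ViT(nn.Module):
    """Minimal ViT: patch embedding + CLS token + position embeddings + encoder stack."""
    def __init__(self, num_patches=196, embed_dim=768, depth=12, num_classes=1000):
        super().__init__()
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)      # sketched above
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim))
        self.blocks = nn.Sequential(*[EncoderBlock(embed_dim) for _ in range(depth)])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, images):
        x = self.patch_embed(images)                                 # (batch, n, d)
        cls = self.cls_token.expand(x.shape[0], -1, -1)              # one CLS token per sample
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.norm(self.blocks(x))
        return self.head(x[:, 0])                                    # classify from the CLS token
```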

Differences between ViT and CNN

CNNs

  • Inductive Bias: Strong, due to locality and translation equivariance of sliding convolution kernels.
  • Data Efficiency: Less data-hungry compared to ViTs.
  • Learning: Hierarchical learning through growing receptive fields.

ViTs

  • Inductive Bias: Weak and flexible; few built-in assumptions about image structure.
  • Data Efficiency: More data-hungry; prefers large datasets (millions of images).
  • Learning: Global; self-attention gives every patch access to all other image components from the first layer.
  • Interpretability: Easier; attention weights can be visualized as an attention map (short example below).
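
A short sketch of extracting an attention map, assuming PyTorch; the layer, grid size (14x14), and token count (197 = 1 CLS + 196 patches) are illustrative:

```python
import torch
import torch.nn as nn

# Take the attention weights of one encoder layer and reshape the CLS-token row
# into a 2D map over the image patches.
mha = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
tokens = torch.randn(1, 197, 768)               # CLS token + 196 patch embeddings
_, attn = mha(tokens, tokens, tokens, need_weights=True, average_attn_weights=True)
cls_attention = attn[0, 0, 1:]                  # how strongly CLS attends to each patch
attention_map = cls_attention.reshape(14, 14)   # visualize e.g. with matplotlib's imshow
```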

Recommendations

  • Use CNNs for fewer data points, ViTs if you have access to large datasets.

Extensions and Variants

Swin Transformer

  • Hierarchy: Produces a hierarchical representation by merging patches as the network deepens.
  • Efficiency: Applies attention within local windows, which is more efficient than global attention (see the windowing sketch below).
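
A sketch of the window-partitioning idea using einops (the full Swin Transformer additionally shifts the windows between layers and adds relative position biases); window_size is illustrative:

```python
from einops import rearrange

def window_partition(x, window_size=7):
    # x: (batch, height, width, channels) feature map
    # Group tokens into non-overlapping windows; attention runs inside each window,
    # so the cost grows roughly linearly with image size instead of quadratically.
    return rearrange(
        x, "b (h w1) (w w2) c -> (b h w) (w1 w2) c",
        w1=window_size, w2=window_size,
    )
```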

Data Efficient Image Transformer (DeiT)

  • Efficiency: Trains ViTs data-efficiently using knowledge distillation (loss sketch below).
    • A teacher model (typically a CNN) guides the student to improve performance on smaller datasets.
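
A generic soft-distillation loss sketch, assuming PyTorch (DeiT itself adds a dedicated distillation token and also supports hard-label distillation); alpha and temperature are illustrative hyperparameters:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=3.0):
    # Hard loss against the true labels plus a soft loss matching the teacher's distribution
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - alpha) * hard + alpha * soft
```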

More Variants

  • Resources: A GitHub collection of further ViT variants is available.

Conclusion

  • Overview: Summarized key concepts and implementations of Vision Transformers.
  • Practical Note: Adapt based on dataset size and specific needs.