Lecture Notes: Vision Transformers and CNNs Comparison

Jul 10, 2024

Introduction

  • Topic: Quick guide to Vision Transformers
  • Goal: Summarize Vision Transformers and highlight the key distinctions between ViTs and CNNs.

Vision Transformer (ViT) Overview

  • Architecture: Extension of the Transformer for image data
    • Focus on the encoder part, not the decoder
  • Origins: Introduced by Google Research in the paper "An Image is Worth 16x16 Words" (2020)

Key Components of ViT

1. Encoder Part

  • Embeddings: Transform the input image into a sequence of numeric vectors.
    • The image is split into 16x16-pixel patches.
    • Each patch is flattened and passed through a fully connected (linear) layer to obtain its embedding.
  • Patching: Efficiently rearrange the image into patches (see the sketch below).
    • The einops package is handy for reshaping multi-dimensional arrays and tensors.
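
A minimal sketch of the patch embedding step, assuming PyTorch and einops; the class name PatchEmbedding and the default sizes are illustrative, not taken from the original paper code:

```python
import torch.nn as nn
from einops import rearrange

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each patch to an embedding."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        # One linear ("fully connected") projection shared across all patches
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)

    def forward(self, images):
        # images: (batch, channels, height, width)
        # Rearrange into a sequence of flattened 16x16 patches: (batch, num_patches, patch_dim)
        patches = rearrange(
            images, "b c (h p1) (w p2) -> b (h w) (p1 p2 c)",
            p1=self.patch_size, p2=self.patch_size,
        )
        return self.proj(patches)   # (batch, num_patches, embed_dim)
```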

2. Position Embeddings

  • Purpose: Make the model understand the position of each patch.
  • Details: A learnable parameter vector per patch position, added on top of the input embedding vectors (sketch below).
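
A sketch of learnable position embeddings, assuming PyTorch; the sizes (196 patches for a 224x224 image, embedding dimension 768) are illustrative:

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 768   # e.g. a 224x224 image split into 16x16 patches

# One learnable vector per patch position, added element-wise to the patch embeddings
# so the otherwise order-agnostic attention layers can tell positions apart.
pos_embedding = nn.Parameter(torch.randn(1, num_patches, embed_dim))

def add_position_embeddings(patch_embeddings):
    # patch_embeddings: (batch, num_patches, embed_dim)
    return patch_embeddings + pos_embedding
```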

3. Additional Components

  • CLS Token: A learnable classification token prepended to the patch sequence; it gathers all information into a single representation.
    • Through attention, it is filled with information from all other inputs.
  • Transformer Encoder: Repeated N times (see the block sketch below), consists of:
    • Multi-head Attention: Allows information sharing between the input patches.
    • Layer Normalization: Normalizes the activations within a layer, separately for each sample.
    • Feed Forward Network: A small MLP transforming the attention-weighted vectors position-wise.
    • Residual Connections: Improve the flow of gradients (avoid vanishing gradients).
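
A minimal sketch of one encoder block in the pre-norm style, assuming PyTorch; EncoderBlock and the default sizes are illustrative:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """LayerNorm -> multi-head attention -> residual, then LayerNorm -> MLP -> residual."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, embed_dim),
        )

    def forward(self, x):
        # Residual connections keep gradients flowing through deep stacks
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```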

Implementing ViT

  • Implementation involves patch embedding and passing the resulting token sequence through the Transformer encoder layers (a minimal assembly sketch follows below).
  • Familiarize yourself with helper libraries such as einops (a small DSL for tensor reshaping) to simplify the implementation.
  • Key building blocks: multi-head attention wrapped in layer normalization, feed-forward blocks, and residual connections.
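
A minimal assembly sketch reusing the PatchEmbedding and EncoderBlock sketches above; the CLS token is prepended before the encoder stack and its final state feeds the classification head. Names and sizes are again illustrative:

```python
import torch
import torch.nn as nn

class ViT(nn.Module):
    """Minimal ViT: patch embedding + CLS token + position embeddings + encoder stack."""
    def __init__(self, num_patches=196, embed_dim=768, depth=12, num_classes=1000):
        super().__init__()
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)      # sketched above
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim))
        self.blocks = nn.Sequential(*[EncoderBlock(embed_dim) for _ in range(depth)])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, images):
        x = self.patch_embed(images)                                 # (batch, n, d)
        cls = self.cls_token.expand(x.shape[0], -1, -1)              # one CLS token per sample
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.norm(self.blocks(x))
        return self.head(x[:, 0])                                    # classify from the CLS token
```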

Differences between ViT and CNN

CNNs

  • Inductive Bias: Strong, due to locality and translation equivariance of sliding convolution kernels.
  • Data Efficiency: Less data-hungry compared to ViTs.
  • Learning: Hierarchical learning through growing receptive fields.

ViTs

  • Inductive Bias: Weak and flexible; few built-in assumptions about image structure.
  • Data Efficiency: More data-hungry; prefers large datasets (millions of images).
  • Learning: Global; self-attention gives every patch access to all other image components from the first layer.
  • Interpretability: Easier; attention weights can be visualized as an attention map (short example below).
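
A short sketch of extracting an attention map, assuming PyTorch; the layer, grid size (14x14), and token count (197 = 1 CLS + 196 patches) are illustrative:

```python
import torch
import torch.nn as nn

# Take the attention weights of one encoder layer and reshape the CLS-token row
# into a 2D map over the image patches.
mha = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
tokens = torch.randn(1, 197, 768)               # CLS token + 196 patch embeddings
_, attn = mha(tokens, tokens, tokens, need_weights=True, average_attn_weights=True)
cls_attention = attn[0, 0, 1:]                  # how strongly CLS attends to each patch
attention_map = cls_attention.reshape(14, 14)   # visualize e.g. with matplotlib's imshow
```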

Recommendations

  • Use CNNs for fewer data points, ViTs if you have access to large datasets.

Extensions and Variants

Swin Transformer

  • Hierarchy: Produces a hierarchical representation by merging patches as the network deepens.
  • Efficiency: Applies attention within local windows, which is more efficient than global attention (see the windowing sketch below).
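
A sketch of the window-partitioning idea using einops (the full Swin Transformer additionally shifts the windows between layers and adds relative position biases); window_size is illustrative:

```python
from einops import rearrange

def window_partition(x, window_size=7):
    # x: (batch, height, width, channels) feature map
    # Group tokens into non-overlapping windows; attention runs inside each window,
    # so the cost grows roughly linearly with image size instead of quadratically.
    return rearrange(
        x, "b (h w1) (w w2) c -> (b h w) (w1 w2) c",
        w1=window_size, w2=window_size,
    )
```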

Data Efficient Image Transformer (DeiT)

  • Efficiency: Trains ViTs data-efficiently using knowledge distillation (loss sketch below).
    • A teacher model (typically a CNN) guides the student to improve performance on smaller datasets.
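
A generic soft-distillation loss sketch, assuming PyTorch (DeiT itself adds a dedicated distillation token and also supports hard-label distillation); alpha and temperature are illustrative hyperparameters:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=3.0):
    # Hard loss against the true labels plus a soft loss matching the teacher's distribution
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - alpha) * hard + alpha * soft
```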

More Variants

  • Resources: A GitHub collection of further ViT variants is available.

Conclusion

  • Overview: Summarized key concepts and implementations of Vision Transformers.
  • Practical Note: Adapt based on dataset size and specific needs.