Lecture Notes: Vision Transformers and CNNs Comparison
Jul 10, 2024
Introduction
Topic: Quick guide to Vision Transformers
Goal: Summarize Vision Transformers and the key distinctions between them and CNNs.
Vision Transformer (ViT) Overview
Architecture: An extension of the Transformer architecture to image data.
Focus is on the encoder part of the Transformer, not the decoder.
Origins: Introduced by Google Research in the paper "An Image is Worth 16x16 Words" (2020).
Key Components of ViT
1. Encoder Part
Embeddings: Transform inputs into numeric vectors. The image is split into 16x16-pixel patches, and each patch is passed through a fully connected layer to obtain its embedding.
Patching: Efficiently rearrange the image into patches. The einops package is used for reshaping multi-dimensional arrays and tensors (see the sketch below).
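A minimal sketch of the patching and embedding step, assuming PyTorch and einops; the input size (224x224) and embedding dimension (768) are illustrative, not from the lecture:

```python
import torch
import torch.nn as nn
from einops import rearrange

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches and project each patch to an embedding vector."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)

    def forward(self, x):                      # x: (batch, channels, height, width)
        p = self.patch_size
        # Rearrange the image into a sequence of flattened patches, then project.
        patches = rearrange(x, "b c (h p1) (w p2) -> b (h w) (p1 p2 c)", p1=p, p2=p)
        return self.proj(patches)              # (batch, num_patches, embed_dim)

x = torch.randn(2, 3, 224, 224)                # dummy batch of two images
print(PatchEmbedding()(x).shape)               # torch.Size([2, 196, 768])
```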
2. Position Embeddings
Purpose: Give the model information about the position of each patch.
Details: A learnable parameter vector per patch, added on top of the input embedding vectors (sketched below).
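A minimal sketch of learnable position embeddings, continuing the PyTorch assumption; the extra +1 slot is for the CLS token introduced in the next subsection, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 768              # illustrative sizes (14x14 grid of patches)

# One learnable vector per position (+1 for the CLS token), shared across the batch
# and simply added on top of the input embedding vectors.
pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim) * 0.02)

tokens = torch.randn(8, num_patches + 1, embed_dim)   # hypothetical patch + CLS embeddings
tokens = tokens + pos_embedding                        # broadcast over the batch dimension
```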
3. Additional Components
CLS Token: A classification token that gathers information from all other inputs into a single representation.
Transformer Encoder: Repeated N times; each block consists of the following (see the sketch below):
Multi-head Attention: Allows information sharing between inputs.
Layer Normalization: Normalizes the inputs within a layer for each sample.
Feed Forward Network: Linear layers that transform the attention-weighted vectors.
Residual Connections: Improve the flow of information and help avoid vanishing gradients.
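A minimal sketch of one encoder block with these four components, assuming PyTorch's built-in nn.MultiheadAttention; the pre-norm layout and hyperparameters are illustrative choices:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: multi-head attention and a feed-forward
    network, each wrapped in layer normalization and a residual connection."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, embed_dim),
        )

    def forward(self, x):                      # x: (batch, tokens, embed_dim)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)       # self-attention shares information between tokens
        x = x + attn_out                       # residual connection around attention
        x = x + self.mlp(self.norm2(x))        # residual connection around the feed-forward network
        return x
```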
Implementing ViT
Implementation involves patch embedding, then passing the resulting token sequence through the Transformer encoder layers.
Familiarize yourself with libraries such as einops and your deep learning framework's building blocks to facilitate implementation.
Some key blocks: multi-headed attention wrapped in normalization, feed-forward blocks, residual blocks (an assembly sketch follows below).
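A minimal end-to-end sketch assembling the pieces above (PatchEmbedding and EncoderBlock are taken from the earlier sketches); the class name SimpleViT and all hyperparameters are illustrative, not from the lecture:

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """Patch embedding + CLS token + position embeddings + N encoder blocks + classification head."""
    def __init__(self, num_classes=10, num_patches=196, embed_dim=768, depth=12):
        super().__init__()
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)     # from the sketch above
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim) * 0.02)
        self.blocks = nn.Sequential(*[EncoderBlock(embed_dim) for _ in range(depth)])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                                          # x: (batch, 3, 224, 224)
        tokens = self.patch_embed(x)                               # (batch, 196, embed_dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)            # one CLS token per sample
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.blocks(tokens)
        return self.head(self.norm(tokens[:, 0]))                  # classify from the CLS token
```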
Differences between ViT and CNN
CNNs
Inductive Bias: Strong, due to the translation equivariance of sliding kernels.
Data Efficiency: Less data-hungry than ViTs.
Learning: Hierarchical learning through growing receptive fields.
ViTs
Inductive Bias: Flexible, with no strong built-in biases.
Data Efficiency: More data-hungry; prefers large datasets (millions of images).
Learning: Global learning, with access to all image components.
Interpretability: Easier; it is possible to visualize the attention weights as an attention map (see the sketch after this list).
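A minimal sketch of visualizing an attention map; here attn_weights is a placeholder for the weights returned by one attention layer (e.g. the second return value of nn.MultiheadAttention), not values computed in the lecture:

```python
import torch
import matplotlib.pyplot as plt

# Placeholder attention weights, shaped (batch, tokens, tokens), with the CLS
# token at index 0 and 14x14 = 196 patch tokens after it.
attn_weights = torch.rand(1, 197, 197)

cls_to_patches = attn_weights[0, 0, 1:]          # how strongly CLS attends to each patch
attention_map = cls_to_patches.reshape(14, 14)   # back onto the patch grid

plt.imshow(attention_map, cmap="viridis")
plt.title("CLS-token attention over image patches")
plt.show()
```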
Recommendations
Use CNNs when you have relatively few data points; use ViTs if you have access to large datasets.
Extensions and Variants
Swin Transformer
Hierarchy: Produces a hierarchical representation by merging patches as the network deepens.
Efficiency: Applies attention within windows, which is more efficient (sketched below).
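A minimal sketch of the window partitioning idea behind Swin's efficiency, again using einops; the 56x56x96 feature map and 7x7 windows are illustrative sizes, not values from the lecture:

```python
import torch
from einops import rearrange

window = 7
features = torch.randn(2, 56, 56, 96)       # (batch, height, width, channels), illustrative sizes

# Partition the feature map into non-overlapping 7x7 windows; attention is then
# computed inside each window rather than across all patches, which is cheaper.
windows = rearrange(features, "b (h w1) (w w2) c -> (b h w) (w1 w2) c", w1=window, w2=window)
print(windows.shape)                         # torch.Size([128, 49, 96]): one attention call per window
```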
Data Efficient Image Transformer (DeiT)
Efficiency: Makes training data-efficient by using knowledge distillation.
Uses a teacher model (typically a CNN) to improve performance on smaller datasets (see the sketch below).
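A minimal sketch of a soft knowledge-distillation loss to illustrate the teacher/student idea; note that DeiT's full recipe additionally uses a dedicated distillation token (and often hard teacher labels), which is not shown here:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=3.0):
    """Blend hard-label cross-entropy with a soft KL term against the teacher's predictions."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - alpha) * hard + alpha * soft
```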
More Variants
Resources: A GitHub collection of various other ViT models is available.
Conclusion
Overview: Summarized key concepts and implementations of Vision Transformers.
Practical Note: Adapt your choice of architecture based on dataset size and specific needs.