Understanding Vision Transformers in Computer Vision
Aug 12, 2024
Notes on Vision Transformer Lecture
Introduction
Speaker: Rowel Atienza, Professor at the University of the Philippines
Topic: A recent breakthrough in computer vision: the Vision Transformer (ViT)
Outline:
Importance of Vision Transformer
Concept of Attention
Model Architecture of Vision Transformer
Applications
Limitations and Future Directions
Why Vision Transformer Matters
Transformer Architecture:
High-capacity network architecture capable of approximating complex functions.
Dominant in natural language processing (NLP) but less prevalent in computer vision (CV).
CNNs have been the traditional model for CV.
Breakthrough:
Google researchers made transformers applicable to vision.
Vision Transformer outperforms CNN-based models in various tasks (recognition, detection, segmentation).
General-purpose Architecture:
Can process different data formats (text, audio, image, video).
Promotes multimodal learning, reflecting the inherently multimodal nature of the world.
Attention Mechanism
Concept:
Attention highlights relevant features in images.
Example with a bird photo: Patches from the bird have high attention; background patches have low attention.
Mathematical Expression:
Patches converted into high-dimensional feature vectors.
Attention computed using dot products between these vectors.
NLP Perspective:
High attention between related words (e.g., "brown" and "fox"); low attention otherwise (e.g., "brown" and "dog").
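A minimal toy sketch of this idea (illustrative vectors, not the lecture's code): attention scores are dot products between patch feature vectors, normalized with a softmax so that related patches receive high weight.

```python
import torch
import torch.nn.functional as F

# Toy example: 4 patch feature vectors of dimension 8 (e.g., bird patches vs. background).
patches = torch.randn(4, 8)

# Pairwise dot products measure similarity between patches.
scores = patches @ patches.T                                  # shape: (4, 4)

# Softmax over each row turns scores into attention weights that sum to 1.
attention = F.softmax(scores / patches.shape[-1] ** 0.5, dim=-1)

print(attention)  # row i: how strongly patch i attends to every other patch
```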
Building the Vision Transformer Model
Image Pre-processing
Image to Patches:
Input image is split into patches (9 in the lecture example; 196 for a 224×224 ImageNet image with 16×16 patches).
Linear Projection:
Each patch reshaped into a 1D vector, multiplied by a weight matrix to produce feature vectors.
Can be performed by dense layers or strided convolution.
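A minimal sketch of this step, assuming a 224×224 RGB input and 16×16 patches; the class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and linearly project each one."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution with kernel = stride = patch_size is equivalent to
        # reshaping each patch into a 1D vector and multiplying by a weight matrix.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, 768) -- one feature vector per patch
        return x

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```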
Position Embedding:
Necessary to provide the model with positional information, since self-attention has no built-in notion of patch order (it is permutation-equivariant).
Different algorithms available (sinusoidal, learnable embeddings, rotary embedding).
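A sketch of the learnable-embedding option listed above; the class token shown here follows the original ViT recipe and is an assumption, since the notes do not mention it explicitly:

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 768

# Learnable [class] token and learnable position embeddings (one per token).
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

def add_positions(patch_tokens):            # patch_tokens: (B, 196, 768)
    b = patch_tokens.shape[0]
    cls = cls_token.expand(b, -1, -1)       # prepend the class token
    tokens = torch.cat([cls, patch_tokens], dim=1)
    return tokens + pos_embed               # inject positional information
```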
Encoder Module
Encoder Structure:
Composed of self-attention and MLP blocks.
Attention Function:
Defined by query, key, and value.
Uses softmax to convert scaled dot products (query · key / √d) into attention weights.
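The attention function described above, written out as a single-head sketch (names are illustrative):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q · kᵀ / sqrt(d)) · v."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # scaled dot products
    weights = F.softmax(scores, dim=-1)           # probabilities over keys
    return weights @ v                            # weighted sum of values
```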
Improvements:
Layer normalization stabilizes training.
Skip connections enhance performance by propagating representations across layers.
Multi-head self-attention improves representations by capturing diverse features.
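A minimal pre-norm encoder block combining these pieces (layer normalization, multi-head self-attention, MLP, and skip connections); the exact layout is an assumption based on the standard ViT encoder:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                 # layer norm stabilizes training
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )

    def forward(self, x):                              # x: (B, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # skip connection 1
        x = x + self.mlp(self.norm2(x))                      # skip connection 2
        return x
```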
Model Architecture Summary
Overall structure from input image to linear projection, through stacked encoder modules, to MLP head for classification.
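A self-contained sketch of that overall structure, using PyTorch's built-in encoder layers as a stand-in for the hand-written block above (an assumption, not the lecture's reference code):

```python
import torch
import torch.nn as nn

class VisionTransformer(nn.Module):
    """Sketch: patch embedding -> position embedding -> stacked encoders -> MLP head."""
    def __init__(self, img_size=224, patch_size=16, num_classes=1000,
                 depth=12, dim=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        block = nn.TransformerEncoderLayer(dim, num_heads, dim_feedforward=mlp_dim,
                                           activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)        # MLP head for classification

    def forward(self, x):                              # x: (B, 3, 224, 224)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)     # (B, 196, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed        # add positional information
        x = self.norm(self.encoder(x))
        return self.head(x[:, 0])                              # classify from the class token

logits = VisionTransformer()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```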
Different Versions of Vision Transformer
Model Sizes:
Base: 12 encoder layers, hidden size 768, MLP size 3,072, 12 heads (~86 million parameters).
Smaller versions: Small (21 million parameters), Tiny (5 million parameters).
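The same configurations expressed as a small lookup table in code; the Base row matches the figures above, while the Small and Tiny hidden/MLP sizes follow common conventions and are not stated explicitly in the notes:

```python
# (encoder layers, hidden size, MLP size, heads) -- approximate parameter counts in comments
vit_configs = {
    "Base":  (12, 768, 3072, 12),   # ~86M parameters
    "Small": (12, 384, 1536, 6),    # ~21M parameters (common convention)
    "Tiny":  (12, 192, 768, 3),     # ~5M parameters (common convention)
}
```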
Training:
Pre-train on large datasets such as JFT-300M before fine-tuning on specific tasks.
Performance improves significantly with larger datasets.
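A hedged fine-tuning sketch: load an ImageNet-pre-trained ViT-Base (here via torchvision, assuming a recent version), replace the classification head for a hypothetical 10-class task, and train with a small learning rate:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load an ImageNet-pre-trained ViT-Base from torchvision (assumes torchvision >= 0.13).
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Replace the classification head for a hypothetical 10-class downstream task.
model.heads.head = nn.Linear(model.heads.head.in_features, 10)

# Fine-tune with a small learning rate, since the backbone is already well initialized.
optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
```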
Applications and Performance
Segmentation Models:
SETR and SegFormer use vision transformers to achieve state-of-the-art results.
Hybrid Architectures:
Models such as TransUNet and ViTSTR combine CNN and transformer approaches for improved performance.
Limitations and Future Directions
Challenges:
Cost of computing attention grows quadratically with the number of patches.
High parameter counts make deployment in resource-constrained environments difficult.
Future Research:
Investigate hybrid architectures and depthwise convolutions in attention.
Broader Implications
Transformers as general-purpose networks applicable to various data types.
Perceiver IO demonstrates training without strict assumptions about input/output structure.
Implementation
Open-source implementations available for practical use.
Simplified creation of Vision Transformer architectures using libraries like PyTorch.
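For example, the open-source `timm` library (one widely used implementation, not necessarily the one referenced in the lecture) builds a pre-trained ViT in one line:

```python
import timm
import torch

# Create a pre-trained ViT-Base; timm also exposes Small and Tiny variants.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 10])
```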
Closing
Speaker's GitHub profile contains open-source implementations and lecture notes.