Time Series Forecasting with Transformers: Informer Architecture Lecture
Introduction
- The lecture explores applying the Transformer architecture to time series forecasting through the Informer model.
- Personal analogy: using historical data for personal finance management.
- The lecture is divided into three main parts: the What, Why, and How of time series forecasting with Transformers.
Transformer Network Overview
- Components: Encoder and Decoder.
- Originally for sequence-to-sequence problems (like language translation).
- Process:
- Input sequence (e.g., English words) passed to the encoder.
- Encoder generates vector representations.
- Decoder uses these vectors and a start token to generate the output sequence one token at a time.
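For concreteness, here is a minimal PyTorch sketch of this encoder-decoder flow, using the stock `nn.Transformer` as a stand-in; the dimensions and vocabulary sizes are illustrative assumptions, not values from the lecture:

```python
import torch
import torch.nn as nn

d_model, src_vocab, tgt_vocab = 64, 1000, 1000
src_embed = nn.Embedding(src_vocab, d_model)      # embed source tokens (e.g. English words)
tgt_embed = nn.Embedding(tgt_vocab, d_model)      # embed target tokens (begins with a start token)
model = nn.Transformer(d_model=d_model, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
to_vocab = nn.Linear(d_model, tgt_vocab)          # map decoder states to output-token scores

src = torch.randint(0, src_vocab, (1, 10))        # input sequence: 10 token ids
tgt = torch.randint(0, tgt_vocab, (1, 7))         # output tokens generated so far
dec_states = model(src_embed(src), tgt_embed(tgt))  # encoder vectors feed the decoder internally
logits = to_vocab(dec_states)                     # (1, 7, tgt_vocab): scores for the next tokens
```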
Adapting to Time Series Data: Informer Architecture
- Forward Pass:
- Time series data is fed to the encoder.
- Encoder generates embedding vectors.
- Embeddings are passed to the decoder, along with a sample of the input series, so data for all output timestamps is generated simultaneously.
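A minimal sketch of how an Informer-style decoder input could be assembled; the sequence lengths are assumed for illustration, and `encoder` / `decoder` stand in for the actual Informer modules:

```python
import torch

seq_len, label_len, pred_len, d_feat = 96, 48, 24, 7   # assumed lengths / feature count

x_enc = torch.randn(1, seq_len, d_feat)        # historical time series fed to the encoder
x_token = x_enc[:, -label_len:, :]             # known input samples used in place of a start token
x_zeros = torch.zeros(1, pred_len, d_feat)     # placeholders for the timestamps to be predicted
x_dec = torch.cat([x_token, x_zeros], dim=1)   # decoder input: samples + placeholders

# enc_out = encoder(x_enc)                     # embedding vectors from the encoder
# y_hat = decoder(x_dec, enc_out)              # all pred_len timestamps generated at once
```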
Key Differences from Original Transformer
- Encoder Vectors: The number of encoder output vectors may differ from the number of input time steps.
- Decoder Input: Informer uses input samples instead of a start token.
- Output Generation: Informer generates data for all timestamps at once.
Challenges with Transformers in Time Series Data
- Quadratic Computation of Self-attention:
- Self-attention compares every pair of data points, so the number of operations grows quadratically with sequence length.
- Informer uses ProbSparse self-attention to reduce the complexity from O(n²) to O(n log n).
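A simplified, single-head sketch of the ProbSparse idea; the score matrix is computed densely here purely for readability (the real method samples keys to avoid exactly that cost), and the shapes are assumptions:

```python
import math
import torch

def probsparse_attention(Q, K, V, factor=5):
    """Keep only the u most 'active' queries; lazy queries get the mean of V."""
    B, L, d = Q.shape
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)          # (B, L, L), dense only for clarity

    # Sparsity measurement M(q, K): max score minus mean score per query.
    M = scores.max(dim=-1).values - scores.mean(dim=-1)      # (B, L)
    u = min(L, int(factor * math.ceil(math.log(L))))         # u ≈ c·ln(L) active queries
    top_idx = M.topk(u, dim=-1).indices                      # (B, u)

    out = V.mean(dim=1, keepdim=True).expand(B, L, d).clone()  # lazy queries -> mean of V
    for b in range(B):
        idx = top_idx[b]
        attn = torch.softmax(scores[b, idx], dim=-1)         # full attention for active queries
        out[b, idx] = attn @ V[b]
    return out

# x = torch.randn(2, 96, 64)
# out = probsparse_attention(x, x, x)                        # (2, 96, 64)
```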
- Memory Bottleneck:
- Stacking encoder layers consumes significant memory.
- A distillation step extracts the active data points and shortens the sequence between layers, reducing memory usage.
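A sketch of what such a distilling step can look like: a 1-D convolution plus ELU and max-pooling that halves the sequence length between encoder layers (the layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillingLayer(nn.Module):
    """Distilling step between encoder layers: convolution, ELU, and max-pooling
    that halves the sequence length, so each stacked layer works on fewer points."""
    def __init__(self, d_model):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(d_model)
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)              # Conv1d expects (batch, channels, seq_len)
        x = F.elu(self.norm(self.conv(x)))
        x = self.pool(x)                   # halves seq_len, cutting memory per layer
        return x.transpose(1, 2)           # (batch, seq_len // 2, d_model)

# DistillingLayer(64)(torch.randn(2, 96, 64)).shape   # torch.Size([2, 48, 64])
```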
- Speed Plunge in Long Output Prediction:
- The original Transformer predicts one time step at a time, which becomes slow for long outputs.
- Informer uses generative inference to produce all predictions in a single forward pass.
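A toy contrast of the two decoding styles, using PyTorch's stock `nn.TransformerDecoder` as a stand-in for the Informer decoder; the shapes are illustrative:

```python
import torch
import torch.nn as nn

d_model, pred_len = 64, 24
layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
dec = nn.TransformerDecoder(layer, num_layers=2)
memory = torch.randn(1, 96, d_model)               # encoder output (illustrative)

# Step-by-step decoding: pred_len sequential decoder calls (slow for long outputs).
y = torch.zeros(1, 1, d_model)                     # start token
for _ in range(pred_len):
    step = dec(y, memory)[:, -1:]                  # predict one timestamp
    y = torch.cat([y, step], dim=1)

# Generative inference: placeholders for every timestamp, one decoder call.
x_dec = torch.zeros(1, pred_len, d_model)
y_hat = dec(x_dec, memory)                         # all pred_len timestamps at once
```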
Informer Architecture In-Depth
- Encoding Process:
- Multi-head ProbSparse self-attention detects correlations between time steps.
- Distillation reduces the size of the active data set, making it easier to stack layers.
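A rough sketch of how such stacking could look, reusing the hypothetical `DistillingLayer` from the sketch above and letting `nn.MultiheadAttention` stand in for multi-head ProbSparse attention:

```python
import torch
import torch.nn as nn

d_model, n_layers = 64, 3
attn_layers = nn.ModuleList(
    [nn.MultiheadAttention(d_model, num_heads=4, batch_first=True) for _ in range(n_layers)]
)
distill_layers = nn.ModuleList([DistillingLayer(d_model) for _ in range(n_layers - 1)])

x = torch.randn(2, 96, d_model)                    # encoder input embeddings
for i, attn in enumerate(attn_layers):
    x = x + attn(x, x, x)[0]                       # self-attention with a residual connection
    if i < len(distill_layers):
        x = distill_layers[i](x)                   # shrink the sequence before the next layer
# x.shape -> (2, 24, d_model): 96 -> 48 -> 24 as layers are stacked
```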
- Decoding Process:
- Uses masked multi-head ProbSparse attention so that positions cannot attend to future timestamps.
- Outputs for all predicted timestamps are produced through a generative inference process.
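The masking idea can be illustrated with a standard causal mask, shown here as a simplified stand-in for the masked ProbSparse attention described in the lecture:

```python
import torch

L = 6                                               # illustrative decoder length
scores = torch.randn(L, L)                          # raw attention scores
mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
attn = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
# Row i of `attn` now puts zero weight on positions j > i.
```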
Training the Informer Model
- Loss Calculation:
- Mean squared error between the predicted and actual time steps is calculated.
- Loss is backpropagated through the network for model updates.
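A minimal training-step sketch; `model` is just a stand-in `nn.Linear`, not the Informer, and the batch is random data:

```python
import torch
import torch.nn as nn

pred_len, d_feat = 24, 7                            # illustrative sizes
model = nn.Linear(d_feat, d_feat)                   # stand-in for the Informer
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x_dec = torch.randn(8, pred_len, d_feat)            # decoder-side input batch
y_true = torch.randn(8, pred_len, d_feat)           # ground-truth future timestamps

y_pred = model(x_dec)                               # predicted timestamps
loss = criterion(y_pred, y_true)                    # mean squared error over predicted steps
optimizer.zero_grad()
loss.backward()                                     # backpropagate the loss
optimizer.step()                                    # update the model's parameters
```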
Summary
- Transformer Limitations: quadratic self-attention computation, encoder memory bottleneck when stacking layers, and slow step-by-step prediction of long outputs.
- Informer Solutions:
- ProbSparse attention for reduced computation.
- Distillation for memory efficiency.
- Generative inference for fast predictions.
Closing Remarks
- Introduction to coding the Informer from scratch in future lectures.
- Encouragement to review foundational Transformer concepts.
Quizzes: Engage students in identifying the differences and advantages of the Informer over the traditional Transformer.