📈

Transformers for Time Series Forecasting

Mar 15, 2025

Time Series Forecasting with Transformers: Informer Architecture Lecture

Introduction

  • The lecture explores how the Transformer architecture can be applied to time series forecasting through the Informer model.
  • Personal analogy: Using historical data for personal finance management.
  • The lecture is divided into three main parts: the What, Why, and How of time series forecasting with Transformers.

Transformer Network Overview

  • Components: Encoder and Decoder.
  • Originally for sequence-to-sequence problems (like language translation).
  • Process:
    • Input sequence (e.g., English words) passed to the encoder.
    • Encoder generates vector representations.
    • Decoder uses these vectors and a start token to generate the output sequence one token at a time (see the sketch below).
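
As a concrete reference, here is a minimal sketch of this encoder-decoder flow using PyTorch's nn.Transformer. The vocabulary sizes, model dimension, and tensor shapes are illustrative placeholders, not values from the lecture.

```python
import torch
import torch.nn as nn

# Minimal encoder-decoder sketch (sizes and shapes are illustrative).
d_model, src_vocab, tgt_vocab = 64, 1000, 1000
embed_src = nn.Embedding(src_vocab, d_model)
embed_tgt = nn.Embedding(tgt_vocab, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)

src = torch.randint(0, src_vocab, (1, 10))  # e.g. 10 English tokens
tgt = torch.randint(0, tgt_vocab, (1, 1))   # decoding begins from a start token

memory = transformer.encoder(embed_src(src))       # encoder vector representations
out = transformer.decoder(embed_tgt(tgt), memory)  # decoder attends to the encoder memory
print(out.shape)  # (1, 1, 64): one representation per decoded position
```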

Adapting to Time Series Data: Informer Architecture

  • Forward Pass:
    • Time series data inputted to the encoder.
    • Encoder generates embedding vectors.
    • Embeddings are passed to the decoder together with a segment of known input samples, so data for all output timestamps is generated simultaneously (see the sketch below).
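
A minimal sketch of this data flow is shown below. It uses standard PyTorch encoder/decoder layers rather than the real Informer blocks, so the attention here is ordinary full attention (not ProbSparse) and positional/timestamp embeddings are omitted; the lengths L_in, L_token, and L_pred are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of the Informer-style data flow (not the full model): the decoder is
# fed a segment of known samples plus zero placeholders for the horizon, and
# every future timestamp is produced in a single forward pass.
d_model, n_features = 64, 1
L_in, L_token, L_pred = 96, 48, 24   # input length, start-token segment, horizon

value_proj = nn.Linear(n_features, d_model)   # value embedding (time features omitted)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=1)
head = nn.Linear(d_model, n_features)

x_enc = torch.randn(1, L_in, n_features)                  # historical series
x_token = x_enc[:, -L_token:, :]                          # last known samples
x_dec = torch.cat([x_token, torch.zeros(1, L_pred, n_features)], dim=1)

memory = encoder(value_proj(x_enc))                       # embedding vectors from the encoder
out = head(decoder(value_proj(x_dec), memory))            # all timestamps generated at once
y_hat = out[:, -L_pred:, :]                               # predictions for the horizon
print(y_hat.shape)  # (1, 24, 1)
```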

Key Differences from Original Transformer

  1. Encoder Vectors: The number of encoder output vectors may differ from the input sequence length.
  2. Decoder Input: The Informer feeds the decoder a segment of input samples rather than a single start token.
  3. Output Generation: The Informer generates predictions for all output timestamps at once.

Challenges with Transformers in Time Series Data

  1. Quadratic Computation of Self-attention:

    • Self-attention compares all data points, leading to quadratic operations.
    • Informer uses ProbSparse self-attention to reduce complexity from O(n²) to O(n log n) (see the sketch after this list).
  2. Memory Bottleneck:

    • Stacking encoder layers consumes significant memory.
    • A distillation step keeps only the dominant (active) features between layers, reducing memory usage.
  3. Speed Plunge in Long Output Prediction:

    • Original Transformers predict one time step at a time.
    • Informer uses generative inference for simultaneous predictions.
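
To make the first point concrete, below is a simplified, single-head sketch of the ProbSparse idea: estimate a sparsity measurement for each query on a random subset of keys, run full attention only for the top-u most "active" queries, and let the remaining "lazy" queries fall back to the mean of the values. The function name, the factor hyperparameter, and the shared key sample are simplifications, not the reference implementation.

```python
import math
import torch

def probsparse_self_attention(Q, K, V, factor=5):
    # Q, K, V: (L, d). Simplified single-head ProbSparse attention (no masking).
    L, d = Q.shape
    u = min(L, factor * math.ceil(math.log(L)))         # c * ln(L) active queries
    sample_k = min(L, factor * math.ceil(math.log(L)))  # keys sampled for the measurement

    # 1) Sparsity measurement M(q, K) = max(q.k/sqrt(d)) - mean(q.k/sqrt(d)),
    #    estimated on a shared random subset of keys: O(L log L), not O(L^2).
    idx = torch.randint(0, L, (sample_k,))
    sampled = Q @ K[idx].T / math.sqrt(d)                # (L, sample_k)
    M = sampled.max(dim=-1).values - sampled.mean(dim=-1)

    # 2) Full attention only for the top-u queries.
    top = M.topk(u).indices
    scores = Q[top] @ K.T / math.sqrt(d)                 # (u, L)
    active_ctx = torch.softmax(scores, dim=-1) @ V       # (u, d)

    # 3) Lazy queries receive the mean of V as their context.
    out = V.mean(dim=0, keepdim=True).expand(L, -1).clone()
    out[top] = active_ctx
    return out

x = torch.randn(96, 64)
print(probsparse_self_attention(x, x, x).shape)  # torch.Size([96, 64])
```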

Informer Architecture In-Depth

  • Encoding Process:

    • Multi-head ProbSparse self-attention detects correlations across time steps.
    • Distillation shrinks the active feature maps between layers, making them easier to stack (see the sketch after this list).
  • Decoding Process:

    • Utilizes masked multi-head ProbSparse self-attention.
    • Outputs for all timestamps are produced through a single generative inference pass.
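
The distilling step between encoder layers can be sketched as a Conv1d + ELU + max-pooling block that halves the sequence length before the next attention layer. The kernel size and padding below are assumptions for illustration and may differ from the official implementation.

```python
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    # Self-attention distilling sketch: convolve, activate, and max-pool so the
    # next encoder layer sees a sequence half as long (and uses less memory).
    def __init__(self, d_model):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):            # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)        # Conv1d expects (batch, channels, seq_len)
        x = self.pool(self.act(self.conv(x)))
        return x.transpose(1, 2)     # (batch, seq_len // 2, d_model)

x = torch.randn(1, 96, 64)
print(DistillingLayer(64)(x).shape)  # torch.Size([1, 48, 64])
```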

Training the Informer Model

  • Loss Calculation:
    • The mean squared error over the predicted time steps is calculated.
    • The loss is backpropagated through the network to update the model (see the sketch below).
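
A minimal training-step sketch follows, with a placeholder model and synthetic tensors standing in for the real Informer and data loader.

```python
import torch
import torch.nn as nn

model = nn.Linear(96, 24)            # placeholder for the Informer model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

x = torch.randn(32, 96)              # batch of input windows
y_true = torch.randn(32, 24)         # ground-truth future timestamps

optimizer.zero_grad()
y_pred = model(x)                    # all predicted timestamps at once
loss = criterion(y_pred, y_true)     # mean squared error over the horizon
loss.backward()                      # backpropagate through the network
optimizer.step()                     # update the model parameters
print(loss.item())
```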

Summary

  • Transformer Limitations: quadratic self-attention computation, the memory bottleneck of stacked layers, and slow step-by-step decoding for long outputs.
  • Informer Solutions:
    • ProbSparse attention for reduced computation.
    • Distillation for memory efficiency.
    • Generative inference for fast predictions.

Closing Remarks

  • Introduction to coding the Informer from scratch in future lectures.
  • Encouragement to review foundational Transformer concepts.

Quizzes: Engage students in identifying the differences and advantages of the Informer over the original Transformer.