📈

Transformers for Time Series Forecasting

Mar 15, 2025

Time Series Forecasting with Transformers: Informer Architecture Lecture

Introduction

  • The lecture explores how the Transformer architecture can be applied to time series forecasting through the Informer model.
  • Personal analogy: Using historical data for personal finance management.
  • The lecture is divided into three main parts: the What, Why, and How of time series forecasting with Transformers.

Transformer Network Overview

  • Components: Encoder and Decoder.
  • Originally for sequence-to-sequence problems (like language translation).
  • Process:
    • Input sequence (e.g., English words) passed to the encoder.
    • Encoder generates vector representations.
    • Decoder uses these vectors and a start token to generate the output sequence one token at a time (see the sketch below).
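
As a concrete reference, here is a minimal sketch of this encoder-decoder flow using PyTorch's nn.Transformer. The vocabulary sizes, model dimension, and tensor shapes are illustrative placeholders, not values from the lecture.

```python
import torch
import torch.nn as nn

# Minimal encoder-decoder sketch (sizes and shapes are illustrative).
d_model, src_vocab, tgt_vocab = 64, 1000, 1000
embed_src = nn.Embedding(src_vocab, d_model)
embed_tgt = nn.Embedding(tgt_vocab, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)

src = torch.randint(0, src_vocab, (1, 10))  # e.g. 10 English tokens
tgt = torch.randint(0, tgt_vocab, (1, 1))   # decoding begins from a start token

memory = transformer.encoder(embed_src(src))       # encoder vector representations
out = transformer.decoder(embed_tgt(tgt), memory)  # decoder attends to the encoder memory
print(out.shape)  # (1, 1, 64): one representation per decoded position
```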

Adapting to Time Series Data: Informer Architecture

  • Forward Pass:
    • Time series data inputted to the encoder.
    • Encoder generates embedding vectors.
    • Embeddings are passed to the decoder together with a segment of known input samples, so data for all output timestamps is generated simultaneously (see the sketch below).
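
A minimal sketch of this data flow is shown below. It uses standard PyTorch encoder/decoder layers rather than the real Informer blocks, so the attention here is ordinary full attention (not ProbSparse) and positional/timestamp embeddings are omitted; the lengths L_in, L_token, and L_pred are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of the Informer-style data flow (not the full model): the decoder is
# fed a segment of known samples plus zero placeholders for the horizon, and
# every future timestamp is produced in a single forward pass.
d_model, n_features = 64, 1
L_in, L_token, L_pred = 96, 48, 24   # input length, start-token segment, horizon

value_proj = nn.Linear(n_features, d_model)   # value embedding (time features omitted)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=1)
head = nn.Linear(d_model, n_features)

x_enc = torch.randn(1, L_in, n_features)                  # historical series
x_token = x_enc[:, -L_token:, :]                          # last known samples
x_dec = torch.cat([x_token, torch.zeros(1, L_pred, n_features)], dim=1)

memory = encoder(value_proj(x_enc))                       # embedding vectors from the encoder
out = head(decoder(value_proj(x_dec), memory))            # all timestamps generated at once
y_hat = out[:, -L_pred:, :]                               # predictions for the horizon
print(y_hat.shape)  # (1, 24, 1)
```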

Key Differences from Original Transformer

  1. Encoder Vectors: The number of encoder output vectors may differ from the input sequence length.
  2. Decoder Input: The Informer feeds the decoder a segment of input samples rather than a single start token.
  3. Output Generation: The Informer generates predictions for all output timestamps at once.

Challenges with Transformers in Time Series Data

  1. Quadratic Computation of Self-attention:

    • Self-attention compares all data points, leading to quadratic operations.
    • Informer uses ProbSparse self-attention to reduce complexity from O(n²) to O(n log n) (see the sketch after this list).
  2. Memory Bottleneck:

    • Stacking encoder layers consumes significant memory.
    • A distillation step keeps only the dominant (active) features between layers, reducing memory usage.
  3. Speed Plunge in Long Output Prediction:

    • Original Transformers predict one time step at a time.
    • Informer uses generative inference for simultaneous predictions.
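
To make the first point concrete, below is a simplified, single-head sketch of the ProbSparse idea: estimate a sparsity measurement for each query on a random subset of keys, run full attention only for the top-u most "active" queries, and let the remaining "lazy" queries fall back to the mean of the values. The function name, the factor hyperparameter, and the shared key sample are simplifications, not the reference implementation.

```python
import math
import torch

def probsparse_self_attention(Q, K, V, factor=5):
    # Q, K, V: (L, d). Simplified single-head ProbSparse attention (no masking).
    L, d = Q.shape
    u = min(L, factor * math.ceil(math.log(L)))         # c * ln(L) active queries
    sample_k = min(L, factor * math.ceil(math.log(L)))  # keys sampled for the measurement

    # 1) Sparsity measurement M(q, K) = max(q.k/sqrt(d)) - mean(q.k/sqrt(d)),
    #    estimated on a shared random subset of keys: O(L log L), not O(L^2).
    idx = torch.randint(0, L, (sample_k,))
    sampled = Q @ K[idx].T / math.sqrt(d)                # (L, sample_k)
    M = sampled.max(dim=-1).values - sampled.mean(dim=-1)

    # 2) Full attention only for the top-u queries.
    top = M.topk(u).indices
    scores = Q[top] @ K.T / math.sqrt(d)                 # (u, L)
    active_ctx = torch.softmax(scores, dim=-1) @ V       # (u, d)

    # 3) Lazy queries receive the mean of V as their context.
    out = V.mean(dim=0, keepdim=True).expand(L, -1).clone()
    out[top] = active_ctx
    return out

x = torch.randn(96, 64)
print(probsparse_self_attention(x, x, x).shape)  # torch.Size([96, 64])
```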

Informer Architecture In-Depth

  • Encoding Process:

    • Multi-head ProbSparse self-attention detects correlations across time steps.
    • Distillation shrinks the active feature maps between layers, making them easier to stack (see the sketch after this list).
  • Decoding Process:

    • Utilizes masked multi-head ProbSparse self-attention.
    • Outputs for all timestamps are produced through a single generative inference pass.
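
The distilling step between encoder layers can be sketched as a Conv1d + ELU + max-pooling block that halves the sequence length before the next attention layer. The kernel size and padding below are assumptions for illustration and may differ from the official implementation.

```python
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    # Self-attention distilling sketch: convolve, activate, and max-pool so the
    # next encoder layer sees a sequence half as long (and uses less memory).
    def __init__(self, d_model):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):            # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)        # Conv1d expects (batch, channels, seq_len)
        x = self.pool(self.act(self.conv(x)))
        return x.transpose(1, 2)     # (batch, seq_len // 2, d_model)

x = torch.randn(1, 96, 64)
print(DistillingLayer(64)(x).shape)  # torch.Size([1, 48, 64])
```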

Training the Informer Model

  • Loss Calculation:
    • The mean squared error over the predicted time steps is calculated.
    • The loss is backpropagated through the network to update the model (see the sketch below).
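
A minimal training-step sketch follows, with a placeholder model and synthetic tensors standing in for the real Informer and data loader.

```python
import torch
import torch.nn as nn

model = nn.Linear(96, 24)            # placeholder for the Informer model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

x = torch.randn(32, 96)              # batch of input windows
y_true = torch.randn(32, 24)         # ground-truth future timestamps

optimizer.zero_grad()
y_pred = model(x)                    # all predicted timestamps at once
loss = criterion(y_pred, y_true)     # mean squared error over the horizon
loss.backward()                      # backpropagate through the network
optimizer.step()                     # update the model parameters
print(loss.item())
```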

Summary

  • Transformer Limitations: quadratic self-attention computation, the memory bottleneck of stacked layers, and slow step-by-step decoding for long outputs.
  • Informer Solutions:
    • ProbSparse attention for reduced computation.
    • Distillation for memory efficiency.
    • Generative inference for fast predictions.

Closing Remarks

  • Introduction to coding the Informer from scratch in future lectures.
  • Encouragement to review foundational Transformer concepts.

Quizzes: Engage students in identifying the differences and advantages of the Informer over the original Transformer.