LLM Inference Efficiency Techniques

Jul 10, 2024

Lecture Notes on LLM Inference Efficiency

Introduction

  • Focus on techniques for speeding up inference, not tied to any specific model.
  • Goal: improve both latency and throughput.
  • Topics covered: KV cache management (paged attention), flash decoding, and look-ahead decoding (related to speculative decoding).

Regular Inference Procedures

  • Key Steps:
    1. Map the input token to its embedding x.
    2. Multiply x by the learned weight matrices W_Q, W_K, W_V.
    3. This yields the query, key, and value vectors Q, K, V.
    4. Compute attention as softmax(QK^T / sqrt(d_k)) · V.
  • KV Cache: Stores the K and V vectors of previously processed tokens so they are not recomputed at every decoding step (a minimal sketch follows this list).
    • Grows linearly with sequence length.
    • Managed through memory-efficient techniques (covered below).
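
A minimal single-head sketch of these steps in NumPy, purely illustrative: the head dimension, the random weights, and the decode_step helper are assumptions for the example, not any particular model's code.

```python
import numpy as np

d = 64                                  # head dimension (illustrative)
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

K_cache, V_cache = [], []               # KV cache: one K and one V row per past token

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decode_step(x):
    """x: embedding of the newest token, shape (d,)."""
    q, k, v = x @ W_Q, x @ W_K, x @ W_V # steps 2-3: project to query, key, value
    K_cache.append(k)                   # cache the new K and V instead of recomputing them later
    V_cache.append(v)
    K, V = np.stack(K_cache), np.stack(V_cache)
    w = softmax(q @ K.T / np.sqrt(d))   # step 4: softmax(QK^T / sqrt(d_k))
    return w @ V                        # weighted sum of cached values

for _ in range(5):                      # each step only projects the newest token
    out = decode_step(rng.standard_normal(d))
print(out.shape)                        # (64,)
```

The cache grows by one K row and one V row per generated token (per layer and head in a real model), which is exactly the memory growth the techniques below manage.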

KV Cache Management

  • Internal and external fragmentation waste KV cache memory.
    • Internal: memory reserved for a request (e.g., up to the maximum sequence length) that its actual tokens never fill.
    • External: unusable gaps between allocations, because each request's cache must be contiguous and requests have different lengths.
  • Memory Allocation: static pre-allocation for the worst case is inefficient compared to dynamic, on-demand allocation.
  • KV Cache Reimplementation (see the sketch after this list):
    • Logical vs. physical KV blocks, analogous to virtual and physical memory pages.
    • Paged Attention: stores K and V in small fixed-size blocks that need not be contiguous in GPU memory.
    • New physical blocks are allocated only as new tokens are generated.
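
A toy sketch of the logical-to-physical mapping, assuming a hypothetical block size and a simple free-list allocator (illustrative only; vLLM's actual data structures and APIs differ).

```python
from collections import defaultdict

BLOCK_SIZE = 16                         # tokens per KV block (illustrative)
free_blocks = list(range(1024))         # pool of fixed-size physical blocks

block_tables = defaultdict(list)        # request id -> list of physical block ids
lengths = defaultdict(int)              # request id -> number of tokens stored

def append_token(request_id):
    """Reserve a KV slot for one new token; allocate a block only when needed."""
    if lengths[request_id] % BLOCK_SIZE == 0:        # last block is full (or first token)
        block_tables[request_id].append(free_blocks.pop())
    lengths[request_id] += 1

def slot(request_id, position):
    """Map a logical token position to (physical block id, offset within block)."""
    block_id = block_tables[request_id][position // BLOCK_SIZE]
    return block_id, position % BLOCK_SIZE

for _ in range(20):
    append_token("req-0")
print(block_tables["req-0"], slot("req-0", 17))
# waste per request is bounded by the unfilled tail of its last block
```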

Improving Memory Utilization

  • Fragmentation Problems: without paging, a large fraction of allocated KV cache memory can sit unused.
  • A block table maps each request's logical blocks to physical memory blocks.
  • Paged Attention Implementation (see the sketch after this list):
    • The attention computation is modified to gather K and V from scattered physical blocks via the block table.
    • Enables efficient memory use and handling of larger context lengths.
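
A NumPy sketch of the gather step: attention for one query follows the block table to the physical K/V blocks, respecting how many tokens each block actually holds. Shapes and the pool layout are assumptions; a real kernel fuses this gather with the attention math on the GPU.

```python
import numpy as np

BLOCK_SIZE, d = 16, 64
rng = np.random.default_rng(0)

k_pool = rng.standard_normal((8, BLOCK_SIZE, d))   # physical K block pool
v_pool = rng.standard_normal((8, BLOCK_SIZE, d))   # physical V block pool

block_table = [5, 2, 7]      # logical blocks 0,1,2 -> physical blocks 5,2,7
seq_len = 40                 # tokens actually stored for this request

def paged_attention(q):
    scores, values = [], []
    for logical, physical in enumerate(block_table):
        n = min(BLOCK_SIZE, seq_len - logical * BLOCK_SIZE)   # filled slots in this block
        if n <= 0:
            break
        K = k_pool[physical, :n]                              # gather via the block table
        V = v_pool[physical, :n]
        scores.append(q @ K.T / np.sqrt(d))
        values.append(V)
    s = np.concatenate(scores)
    w = np.exp(s - s.max())
    w /= w.sum()                                              # softmax over valid tokens only
    return w @ np.concatenate(values)

print(paged_attention(rng.standard_normal(d)).shape)          # (64,)
```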

Flash Decoding

  • Derived from the Flash Attention technique.
    • Flash Attention: speeds up training by tiling the attention computation and keeping intermediates in fast on-chip memory, reducing reads and writes to GPU HBM.
    • Flash Decoding: applies the same idea to inference so long context lengths are handled efficiently.
  • Key Insight: memory movement, not arithmetic, is the bottleneck during decoding.
  • Implementation (see the sketch after this list):
    • Split the KV cache along the sequence dimension, compute partial attention for each split in parallel, then combine the partial results with a log-sum-exp rescaling.
    • This keeps the GPU fully utilized even when the batch is small, so performance holds up at larger context lengths.
  • Performance: the largest gains appear at small batch sizes with long prompts.
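
A NumPy sketch of the split-and-combine idea: partial attention is computed per chunk of the KV cache, then the partials are merged with a log-sum-exp rescaling so the result matches ordinary attention. The split count, shapes, and single-query setup are assumptions for the example.

```python
import numpy as np

d, T, n_splits = 64, 1000, 4
rng = np.random.default_rng(0)
q = rng.standard_normal(d)
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

def partial_attention(q, K_chunk, V_chunk):
    """Unnormalised attention over one chunk: (output, chunk max, chunk sum)."""
    s = q @ K_chunk.T / np.sqrt(d)
    m = s.max()
    e = np.exp(s - m)
    return e @ V_chunk, m, e.sum()

# 1) independent partials over KV chunks -> can run in parallel across the GPU
partials = [partial_attention(q, Kc, Vc)
            for Kc, Vc in zip(np.array_split(K, n_splits), np.array_split(V, n_splits))]

# 2) combine the partials with a log-sum-exp style rescaling
m_glob = max(m for _, m, _ in partials)
num = sum(o * np.exp(m - m_glob) for o, m, _ in partials)
den = sum(z * np.exp(m - m_glob) for _, m, z in partials)
out_split = num / den

# reference: ordinary attention over the full cache gives the same answer
s = q @ K.T / np.sqrt(d)
w = np.exp(s - s.max())
w /= w.sum()
print(np.allclose(out_split, w @ V))    # True
```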

Look-Ahead Decoding (LAD)

  • Goal: reduce the number of sequential decoding steps by generating and verifying several tokens per step, with a high token acceptance rate.
  • Inspired by speculative decoding, but without a separate draft model.
  • Jacobi Iteration Method: adapts the iterative method for solving systems of equations to decoding (see the sketch after this list).
    • Start with a guess for the next few tokens and refine all of them in parallel each step.
    • More computation per step, but lower end-to-end latency.
    • Verified n-grams collected from the Jacobi trajectories are used to accept multiple tokens at once.
  • Visualization: the process alternates between parallel guesses, iterative updates, and verification.
  • Scaling: exponentially increasing the per-step window size (and FLOPs) yields a roughly linear reduction in the number of decoding steps.
  • Results: applied to LLaMA Chat models, showing speedups and latency reductions.
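
A toy sketch of the guess-verify-accept loop. Everything here is a simplification for illustration: the next_token stand-in (whose output depends only on position, so guesses can be verified quickly), the window size, and the random refill of guesses stand in for a real LLM and for the n-gram pool used by actual look-ahead decoding.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, WINDOW = 50, 4
TARGET = list(rng.integers(0, VOCAB, 200))   # the sequence the toy "model" wants to emit

def next_token(prefix):
    """Stand-in for one greedy LLM call (here it depends only on the position)."""
    return TARGET[len(prefix)]

def lookahead_decode(prompt, n_new):
    seq = list(prompt)
    guesses = list(rng.integers(0, VOCAB, WINDOW))   # initial guess for the next WINDOW tokens
    steps = 0
    while len(seq) < len(prompt) + n_new:
        steps += 1
        # one parallel forward pass: a prediction at every position of the window
        preds = [next_token(seq + guesses[:i]) for i in range(WINDOW)]
        # accept the longest verified prefix of the guesses, plus one corrected token
        accepted = 0
        while accepted < WINDOW and guesses[accepted] == preds[accepted]:
            accepted += 1
        seq += preds[:accepted + 1]
        # Jacobi update: reuse the remaining predictions as next step's guesses
        guesses = (preds[accepted + 1:] + list(rng.integers(0, VOCAB, WINDOW)))[:WINDOW]
    return seq[:len(prompt) + n_new], steps

prompt = [0, 1, 2]
out, steps = lookahead_decode(prompt, n_new=24)
print(f"generated {len(out) - len(prompt)} tokens in {steps} steps")   # fewer steps than tokens
```

Even this toy finishes in fewer steps than tokens generated, because each step accepts every verified guess plus one corrected token.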

Conclusion

  • Efficiency Techniques:
    • Improved KV cache management and memory utilization.
    • Flash attention and flash decoding for inference speedups.
    • Look-ahead decoding for better token acceptance and reduced latency.
  • Key focus: reducing wasted memory and computation and improving GPU utilization.