LLM Inference Efficiency Techniques
Jul 10, 2024
Lecture Notes on LLM Inference Efficiency
Introduction
Focus: techniques for speeding up LLM inference.
Not specific to any particular model.
Goal: improve both latency and throughput.
Topics covered include speculative decoding and KV cache management.
Regular Inference Procedures
Key Steps:
Input token → embedding (x).
Multiply x with the learned weight matrices (W_K, W_Q, W_V) to compute K, Q, V.
Compute attention: softmax(QK^T / sqrt(d)) · V.
KV Cache: stores the computed K and V of past tokens to avoid recomputation (minimal sketch after this list).
Grows as sequence length increases.
Managed through memory-efficient techniques.
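A minimal sketch of these steps for single-head greedy decoding with a growing KV cache, in NumPy; dimensions, weights, and inputs are illustrative placeholders rather than anything from the lecture.

```python
import numpy as np

d = 64                                   # head dimension (illustrative)
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

# KV cache: one row of K and V is appended per generated token.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

def decode_step(x):
    """One decoding step for a single token embedding x of shape (d,)."""
    global K_cache, V_cache
    q, k, v = x @ W_Q, x @ W_K, x @ W_V
    # Append the new K, V rows instead of recomputing them for past tokens.
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    # Attention: softmax(q K^T / sqrt(d)) V over all cached positions.
    scores = K_cache @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache

for _ in range(5):                       # decode 5 dummy tokens
    out = decode_step(rng.standard_normal(d))
print(K_cache.shape)                     # (5, 64): the cache grows with sequence length
```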
KV Cache Management
Internal and external fragmentation issues.
Internal: memory reserved for a request's future tokens is not fully utilized.
External: caused by padding for variable sequence lengths.
Memory Allocation: inefficient static pre-allocation vs. on-demand dynamic allocation.
KV Cache Reimplementation:
Logical vs. Physical KV Blocks: analogous to virtual vs. physical memory in an operating system (block-table sketch after this list).
Paged Attention: breaks the K, V matrices into smaller, non-contiguous chunks (blocks).
Blocks grow as new tokens are added.
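A minimal sketch of the block-table idea: each sequence's logical KV blocks are mapped on demand to physical blocks from a shared pool, much like page tables map virtual to physical memory. The block size, class name, and free-list allocator are illustrative assumptions, not the actual vLLM implementation.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class PagedKVCache:
    """Maps each sequence's logical blocks to physical blocks on demand."""

    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))  # simple free list
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, num_tokens_so_far):
        """Allocate a new physical block only when the last one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:      # last block full (or first token)
            table.append(self.free_blocks.pop())     # grab any free physical block
        # Return (physical block, slot within block) where this token's K, V go.
        return table[-1], num_tokens_so_far % BLOCK_SIZE

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_physical_blocks=1024)
for t in range(40):                      # 40 tokens -> ceil(40/16) = 3 blocks
    block, slot = cache.append_token(seq_id=0, num_tokens_so_far=t)
print(cache.block_tables[0])             # the blocks backing sequence 0; need not be contiguous
```

Because blocks are allocated only as tokens arrive and freed back to the pool when a sequence finishes, almost no reserved memory sits idle.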
Improving Memory Utilization
Fragmentation Problems: a high percentage of allocated memory remains unused.
Use of block tables for mapping logical to physical memory blocks.
Paged Attention Implementation:
Traditional attention computation is modified to read K, V from the non-contiguous blocks (sketch below).
Enables efficient memory use and support for larger context lengths.
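A minimal sketch of the attention side: K and V rows are gathered in logical order from non-contiguous physical blocks via the block table before the usual softmax. This extends the NumPy examples above and is purely illustrative; a real paged-attention kernel reads the blocks in place instead of copying them into a contiguous buffer.

```python
import numpy as np

def paged_attention(q, k_blocks, v_blocks, block_table, seq_len, d=64):
    """Attention over a KV cache stored in non-contiguous physical blocks.

    k_blocks, v_blocks: (num_physical_blocks, block_size, d) storage pools.
    block_table: physical block ids for this sequence, in logical order.
    """
    # Gather this sequence's K, V rows in logical order, then trim to seq_len.
    K = np.concatenate([k_blocks[b] for b in block_table])[:seq_len]
    V = np.concatenate([v_blocks[b] for b in block_table])[:seq_len]
    scores = K @ q / np.sqrt(d)                     # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                              # (d,)

# Toy usage with random storage pools and a block table like the one above.
rng = np.random.default_rng(0)
k_pool = rng.standard_normal((1024, 16, 64))
v_pool = rng.standard_normal((1024, 16, 64))
out = paged_attention(rng.standard_normal(64), k_pool, v_pool,
                      block_table=[1023, 1022, 1021], seq_len=40)
print(out.shape)  # (64,)
```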
Flash Decoding
Derived from the Flash Attention technique.
Flash Attention: improves training efficiency through better memory usage.
Flash Decoding: applies the same idea to inference so that long context lengths are handled efficiently.
Key Insight: memory operations, not computation, are the bottleneck.
Implementation:
Break the computation over the KV cache into smaller chunks (splits) for better GPU utilization (sketch at the end of this section).
Maintains performance even for long context lengths.
Performance: most beneficial at low batch sizes and with longer prompts.
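A minimal sketch of the split-KV idea: each chunk of the key/value cache yields a partial attention output plus its log-sum-exp, and a final reduction combines them into the exact softmax result. Chunk count and shapes are illustrative, and a real flash-decoding kernel launches the chunks in parallel on the GPU rather than looping over them.

```python
import numpy as np

def flash_decode(q, K, V, num_splits=4):
    """Attention for one query over a long KV cache, computed chunk by chunk."""
    d = q.shape[0]
    partial_out, partial_lse = [], []
    for K_c, V_c in zip(np.array_split(K, num_splits), np.array_split(V, num_splits)):
        scores = K_c @ q / np.sqrt(d)
        m = scores.max()
        w = np.exp(scores - m)
        partial_out.append(w @ V_c / w.sum())     # chunk-local attention output
        partial_lse.append(m + np.log(w.sum()))   # chunk-local log-sum-exp
    # Combine: weight each chunk's output by its share of the global softmax mass.
    lse = np.array(partial_lse)
    alpha = np.exp(lse - lse.max())
    alpha /= alpha.sum()
    return sum(a * o for a, o in zip(alpha, partial_out))

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((4096, 64))
V = rng.standard_normal((4096, 64))

# Reference: monolithic attention over the full cache gives the same result.
scores = K @ q / np.sqrt(64)
w = np.exp(scores - scores.max())
print(np.allclose(flash_decode(q, K, V), (w / w.sum()) @ V))  # True
```

Splitting keeps all GPU compute units busy even when the batch holds only a few long sequences, which is why the gains show up at low batch sizes with long prompts.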
Look-Ahead Decoding (LAD)
Goal: improve the token acceptance rate by predicting multiple tokens per step.
Inspired by speculative decoding but without needing a draft model.
Jacobi Iteration Method: applies the iterative method used for solving systems of equations to decoding (sketch at the end of this section).
Start from an initial guess for the future tokens and iteratively solve for better guesses.
Higher computational cost per step, but improved latency.
Uses token-acceptance verification and n-grams for better prediction.
Visualization: the process involves initial guesses, iterative updates, and verification.
Scaling: a larger N allows the window size to scale exponentially, giving a linear reduction in the number of decoding steps.
Results: applied to Llama Chat models, showing speedups and latency reductions.
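A minimal sketch of the plain Jacobi decoding step that lookahead decoding builds on: guess a window of future tokens, run one forward pass over prefix + guesses, and commit the longest run of predictions that is already consistent with the guesses. The `model` interface, greedy acceptance rule, and shrinking window are illustrative assumptions; real lookahead decoding adds the 2-D lookahead window, the n-gram pool, and a separate verification branch.

```python
def jacobi_decode_step(model, prefix, guesses):
    """One Jacobi iteration over a window of guessed future tokens.

    `model(tokens)` is a hypothetical interface taking a list of token ids and
    returning the greedy next-token prediction (argmax of the logits) at every
    position, so a single forward pass refines all guessed positions at once.
    """
    preds = model(prefix + guesses)
    window = preds[len(prefix) - 1:]      # predictions for the guessed positions

    # The first prediction depends only on the committed prefix (causal attention),
    # so it is always the correct greedy token. Keep committing later predictions
    # as long as the guesses they were conditioned on turned out to be correct.
    committed = [window[0]]
    for i in range(1, len(window)):
        if guesses[i - 1] != window[i - 1]:
            break
        committed.append(window[i])

    # Unverified predictions become the refined guesses for the next iteration
    # (a real implementation tops the window back up, e.g. from an n-gram pool).
    new_guesses = list(window[len(committed):])
    return committed, new_guesses
```

Each step costs one wider, more expensive forward pass, but it can commit several tokens at once, which is where the latency improvement comes from.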
Conclusion
Efficiency Techniques:
Improved KV cache management and memory utilization.
Flash attention and flash decoding for inference speedups.
Look-ahead decoding for better token acceptance and reduced latency.
Key focus: reducing wasted computation and improving GPU utilization.