LLM Inference Efficiency Techniques
Jul 10, 2024
Lecture Notes on LLM Inference Efficiency
Introduction
Focus: techniques for speeding up LLM inference.
Not specific to any particular model.
Goal: improve both latency and throughput.
Topics covered include speculative decoding and KV cache management.
Regular Inference Procedures
Key Steps:
Input token → embedding (x).
Multiply x with the learned weight matrices (W_K, W_Q, W_V) to compute K, Q, V.
Compute attention: softmax(QK^T / sqrt(d)) · V.
KV Cache: stores the computed K and V of past tokens to avoid recomputation (minimal sketch after this list).
Grows as sequence length increases.
Managed through memory-efficient techniques.
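A minimal sketch of these steps for single-head greedy decoding with a growing KV cache, in NumPy; dimensions, weights, and inputs are illustrative placeholders rather than anything from the lecture.

```python
import numpy as np

d = 64                                   # head dimension (illustrative)
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

# KV cache: one row of K and V is appended per generated token.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

def decode_step(x):
    """One decoding step for a single token embedding x of shape (d,)."""
    global K_cache, V_cache
    q, k, v = x @ W_Q, x @ W_K, x @ W_V
    # Append the new K, V rows instead of recomputing them for past tokens.
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    # Attention: softmax(q K^T / sqrt(d)) V over all cached positions.
    scores = K_cache @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache

for _ in range(5):                       # decode 5 dummy tokens
    out = decode_step(rng.standard_normal(d))
print(K_cache.shape)                     # (5, 64): the cache grows with sequence length
```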
KV Cache Management
Internal and external fragmentation issues.
Internal: memory reserved for a request's future tokens is not fully utilized.
External: caused by padding for variable sequence lengths.
Memory Allocation: inefficient static pre-allocation vs. on-demand dynamic allocation.
KV Cache Reimplementation:
Logical vs. Physical KV Blocks: analogous to virtual vs. physical memory in an operating system (block-table sketch after this list).
Paged Attention: breaks the K, V matrices into smaller, non-contiguous chunks (blocks).
Blocks grow as new tokens are added.
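A minimal sketch of the block-table idea: each sequence's logical KV blocks are mapped on demand to physical blocks from a shared pool, much like page tables map virtual to physical memory. The block size, class name, and free-list allocator are illustrative assumptions, not the actual vLLM implementation.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class PagedKVCache:
    """Maps each sequence's logical blocks to physical blocks on demand."""

    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))  # simple free list
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, num_tokens_so_far):
        """Allocate a new physical block only when the last one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:      # last block full (or first token)
            table.append(self.free_blocks.pop())     # grab any free physical block
        # Return (physical block, slot within block) where this token's K, V go.
        return table[-1], num_tokens_so_far % BLOCK_SIZE

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_physical_blocks=1024)
for t in range(40):                      # 40 tokens -> ceil(40/16) = 3 blocks
    block, slot = cache.append_token(seq_id=0, num_tokens_so_far=t)
print(cache.block_tables[0])             # the blocks backing sequence 0; need not be contiguous
```

Because blocks are allocated only as tokens arrive and freed back to the pool when a sequence finishes, almost no reserved memory sits idle.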
Improving Memory Utilization
Fragmentation Problems: a high percentage of allocated memory remains unused.
Use of block tables for mapping logical to physical memory blocks.
Paged Attention Implementation:
Traditional attention computation is modified to read K, V from the non-contiguous blocks (sketch below).
Enables efficient memory use and support for larger context lengths.
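A minimal sketch of the attention side: K and V rows are gathered in logical order from non-contiguous physical blocks via the block table before the usual softmax. This extends the NumPy examples above and is purely illustrative; a real paged-attention kernel reads the blocks in place instead of copying them into a contiguous buffer.

```python
import numpy as np

def paged_attention(q, k_blocks, v_blocks, block_table, seq_len, d=64):
    """Attention over a KV cache stored in non-contiguous physical blocks.

    k_blocks, v_blocks: (num_physical_blocks, block_size, d) storage pools.
    block_table: physical block ids for this sequence, in logical order.
    """
    # Gather this sequence's K, V rows in logical order, then trim to seq_len.
    K = np.concatenate([k_blocks[b] for b in block_table])[:seq_len]
    V = np.concatenate([v_blocks[b] for b in block_table])[:seq_len]
    scores = K @ q / np.sqrt(d)                     # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                              # (d,)

# Toy usage with random storage pools and a block table like the one above.
rng = np.random.default_rng(0)
k_pool = rng.standard_normal((1024, 16, 64))
v_pool = rng.standard_normal((1024, 16, 64))
out = paged_attention(rng.standard_normal(64), k_pool, v_pool,
                      block_table=[1023, 1022, 1021], seq_len=40)
print(out.shape)  # (64,)
```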
Flash Decoding
Derived from the Flash Attention technique.
Flash Attention: improves training efficiency through better memory usage.
Flash Decoding: applies the same idea to inference so that long context lengths are handled efficiently.
Key Insight: memory operations, not computation, are the bottleneck.
Implementation:
Break the computation over the KV cache into smaller chunks (splits) for better GPU utilization (sketch at the end of this section).
Maintains performance even for long context lengths.
Performance: most beneficial at low batch sizes and with longer prompts.
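A minimal sketch of the split-KV idea: each chunk of the key/value cache yields a partial attention output plus its log-sum-exp, and a final reduction combines them into the exact softmax result. Chunk count and shapes are illustrative, and a real flash-decoding kernel launches the chunks in parallel on the GPU rather than looping over them.

```python
import numpy as np

def flash_decode(q, K, V, num_splits=4):
    """Attention for one query over a long KV cache, computed chunk by chunk."""
    d = q.shape[0]
    partial_out, partial_lse = [], []
    for K_c, V_c in zip(np.array_split(K, num_splits), np.array_split(V, num_splits)):
        scores = K_c @ q / np.sqrt(d)
        m = scores.max()
        w = np.exp(scores - m)
        partial_out.append(w @ V_c / w.sum())     # chunk-local attention output
        partial_lse.append(m + np.log(w.sum()))   # chunk-local log-sum-exp
    # Combine: weight each chunk's output by its share of the global softmax mass.
    lse = np.array(partial_lse)
    alpha = np.exp(lse - lse.max())
    alpha /= alpha.sum()
    return sum(a * o for a, o in zip(alpha, partial_out))

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((4096, 64))
V = rng.standard_normal((4096, 64))

# Reference: monolithic attention over the full cache gives the same result.
scores = K @ q / np.sqrt(64)
w = np.exp(scores - scores.max())
print(np.allclose(flash_decode(q, K, V), (w / w.sum()) @ V))  # True
```

Splitting keeps all GPU compute units busy even when the batch holds only a few long sequences, which is why the gains show up at low batch sizes with long prompts.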
Look-Ahead Decoding (LAD)
Goal: improve the token acceptance rate by predicting multiple tokens per step.
Inspired by speculative decoding but without needing a draft model.
Jacobi Iteration Method: applies the iterative method used for solving systems of equations to decoding (sketch at the end of this section).
Start from an initial guess for the future tokens and iteratively solve for better guesses.
Higher computational cost per step, but improved latency.
Uses token-acceptance verification and n-grams for better prediction.
Visualization: the process involves initial guesses, iterative updates, and verification.
Scaling: a larger N allows the window size to scale exponentially, giving a linear reduction in the number of decoding steps.
Results: applied to Llama Chat models, showing speedups and latency reductions.
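A minimal sketch of the plain Jacobi decoding step that lookahead decoding builds on: guess a window of future tokens, run one forward pass over prefix + guesses, and commit the longest run of predictions that is already consistent with the guesses. The `model` interface, greedy acceptance rule, and shrinking window are illustrative assumptions; real lookahead decoding adds the 2-D lookahead window, the n-gram pool, and a separate verification branch.

```python
def jacobi_decode_step(model, prefix, guesses):
    """One Jacobi iteration over a window of guessed future tokens.

    `model(tokens)` is a hypothetical interface taking a list of token ids and
    returning the greedy next-token prediction (argmax of the logits) at every
    position, so a single forward pass refines all guessed positions at once.
    """
    preds = model(prefix + guesses)
    window = preds[len(prefix) - 1:]      # predictions for the guessed positions

    # The first prediction depends only on the committed prefix (causal attention),
    # so it is always the correct greedy token. Keep committing later predictions
    # as long as the guesses they were conditioned on turned out to be correct.
    committed = [window[0]]
    for i in range(1, len(window)):
        if guesses[i - 1] != window[i - 1]:
            break
        committed.append(window[i])

    # Unverified predictions become the refined guesses for the next iteration
    # (a real implementation tops the window back up, e.g. from an n-gram pool).
    new_guesses = list(window[len(committed):])
    return committed, new_guesses
```

Each step costs one wider, more expensive forward pass, but it can commit several tokens at once, which is where the latency improvement comes from.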
Conclusion
Efficiency Techniques:
Improved KV cache management and memory utilization.
Flash attention and flash decoding for inference speedups.
Look-ahead decoding for better token acceptance and reduced latency.
Key focus: reducing wasted computation and improving GPU utilization.