Low-Rank Adaptation for Fine-Tuning Large Models

Jul 11, 2024

Introduction

  • Topic: Low-Rank Adaptation (LoRA)
  • Context: Fine-tuning very large deep learning models
  • Applications: Large language models (e.g., GPT-4, reportedly ~1.8 trillion parameters)
  • Challenges:
    • Tuning many parameters is time-consuming
    • Massive GPU requirements

Precision and Quantization

Weight Matrices & Data Types

  • Weight Matrices: Composed of floating-point numbers, typically float32
  • Precision Types:
    • float32 (32 bits)
    • Half Precision (float16)

Lowering Precision

  • Trade-offs: Lower memory usage at the cost of precision
  • Effects:
    • Loss of precision and rounding errors can accumulate over many operations
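
The accumulation effect is easy to demonstrate; a minimal NumPy sketch (values chosen for illustration):

```python
import numpy as np

# Accumulate 0.01 ten thousand times (true sum: 100.0) in float16 vs
# float32. float16 has ~3 decimal digits of precision, so once the
# running sum grows, adding 0.01 rounds away and error accumulates.
steps = 10_000
inc16 = np.float16(0.01)
inc32 = np.float32(0.01)

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for _ in range(steps):
    acc16 = np.float16(acc16 + inc16)
    acc32 = np.float32(acc32 + inc32)

print("float16 sum:", acc16)  # far below 100.0 (the sum stalls)
print("float32 sum:", acc32)  # very close to 100.0
```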

Memory Calculation

  • Formula: Bytes per parameter * number of parameters
  • Additional Memory: Training also requires storage for gradients and optimizer state
  • Example: BLOOM (176 billion parameters) requires ~350GB for inference in float16
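
A back-of-the-envelope version of the formula, using the BLOOM figures above:

```python
# Rough inference memory estimate: bytes per parameter * number of
# parameters. BLOOM has ~176e9 parameters; float16 is 2 bytes each.
def inference_memory_gb(num_params: float, bytes_per_param: int) -> float:
    return num_params * bytes_per_param / 1e9

bloom_params = 176e9
print(inference_memory_gb(bloom_params, 2))  # ~352 GB in float16
print(inference_memory_gb(bloom_params, 4))  # ~704 GB in float32
```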

Mixed Precision

  • Concept: Different parts of the network use different data types
  • Performance Impact: Varies by model and dataset, but accuracy typically stays close to full precision

Quantization

  • Definition: Reducing the precision of model weights, even down to integer types (e.g., int8)
  • Advantage: Can largely maintain model performance despite the lower precision
  • Techniques: Various methods exist for quantization
  • Optimal Precision: Recent papers suggest that 4-bit quantization is near optimal
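
One simple quantization scheme is absmax scaling to int8; a minimal sketch (illustrative, not any particular library's implementation):

```python
import numpy as np

# Absmax int8 quantization: scale weights so the largest magnitude maps
# to 127, round to integers, and keep the scale for dequantization.
def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max())  # rounding error, at most ~scale/2
```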

GPU Performance

  • FLOPS: Floating Point Operations Per Second metric
  • Lower-Precision Models: Train faster because they require less memory and compute time

Parameter-Efficient Fine-Tuning Techniques

Traditional Transfer Learning

  • Method: Freeze all weights, add task-specific head
  • Limitation: Only utilizes output embeddings

Adapter Layers

  • Concept: Add new modules between existing layers
  • Pros/Cons: Improved access to the model's internal representations, but adds inference latency
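
A bottleneck adapter can be sketched in a few lines (all names and shapes here are illustrative):

```python
import numpy as np

# Hypothetical bottleneck adapter: down-projection, nonlinearity,
# up-projection, plus a residual connection, inserted between frozen
# transformer layers.
def adapter(h: np.ndarray, w_down: np.ndarray, w_up: np.ndarray) -> np.ndarray:
    z = np.maximum(h @ w_down, 0.0)  # ReLU bottleneck
    return h + z @ w_up              # residual keeps the original signal

d, r = 8, 2
h = np.random.randn(3, d)
w_down = np.zeros((d, r))            # zero init: adapter starts as identity
w_up = np.random.randn(r, d)
print(np.allclose(adapter(h, w_down, w_up), h))  # True
```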

Prefix Tuning

  • Method: Optimizes input vectors (prefixes) for language models
  • Limitation: Limited control over model behavior

Low-Rank Adaptation (LoRA)

  • Concept: Fine-tune by learning a low-rank decomposition of the weight updates rather than the full matrices
  • Motivation: Fine-tuning in a low-dimensional subspace can match tuning the full parameter space
  • Application: Typically applied to attention weights in Transformers

LoRA in Detail

Matrix Rank

  • Definition: Number of linearly independent rows or columns of a matrix
  • Importance: Captures essential data properties
  • Low-Rank Matrices: Compact representation, reduced complexity
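
The compactness of low-rank matrices is easy to see with an outer product:

```python
import numpy as np

# A 4x4 matrix built as the outer product of two vectors has rank 1:
# every row is a multiple of the same row vector.
u = np.array([[1.0], [2.0], [3.0], [4.0]])   # 4x1
v = np.array([[1.0, 0.5, -1.0, 2.0]])        # 1x4
m = u @ v                                    # 4x4, one independent direction
print(np.linalg.matrix_rank(m))              # 1

# Storing u and v (8 numbers) instead of m (16 numbers) is the same
# compression idea LoRA applies to weight updates.
```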

LoRA Process

  • Foundational Paper: LoRA (Hu et al., 2021), motivated by earlier Facebook research on the intrinsic dimensionality of fine-tuning
  • Process: Decompose the weight update ΔW as the product of low-rank matrices, ΔW = BA
  • Initialization: B = 0 and A drawn from a normal distribution, so ΔW starts at zero

Efficiency

  • Implications: Efficient training, since only the B and A matrices receive gradient updates
  • Implementation: Adds the low-rank term BAx to the output Wx of the original frozen layer
  • Scaling Factor: α/r balances the learned update against the original model weights
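
Putting these pieces together, a minimal LoRA forward pass might look like this (shapes and values are illustrative):

```python
import numpy as np

# Frozen weight W (d_out x d_in); trainable B (d_out x r) and A (r x d_in).
# Per the LoRA paper, B starts at zero and A from a normal distribution,
# so training begins exactly at the pretrained model's behavior.
d_in, d_out, r, alpha = 16, 16, 4, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, normal init
B = np.zeros((d_out, r))                    # trainable, zero init

def lora_forward(x: np.ndarray) -> np.ndarray:
    scaling = alpha / r  # balances the update against W
    return x @ W.T + (x @ A.T @ B.T) * scaling

x = rng.standard_normal((2, d_in))
print(np.allclose(lora_forward(x), x @ W.T))  # True: B = 0, no change yet
```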

Hyperparameters

  • Rank (r): Intrinsic dimensionality of the update; typically ranges from 1-64
  • Alpha (α): Scales the weight update (effective factor α/r), balancing original and altered weights
  • Optimal Rank: Depends on dataset; empirical selection needed
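
A quick parameter count shows how the rank trades capacity for size (dimensions chosen for illustration):

```python
# Trainable parameters for a single weight matrix: full fine-tuning
# vs. LoRA at several ranks.
d = 4096                      # e.g. a square attention projection
full = d * d                  # parameters updated by full fine-tuning
for r in (1, 4, 16, 64):
    lora = r * d + d * r      # A is (r x d), B is (d x r)
    print(f"r={r}: {lora} params, {100 * lora / full:.3f}% of full")
```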

Benefits

  • Reduced Computational Requirements: Efficient training
  • Performance: Maintains high performance with fewer parameters
  • Modularity: Easy to swap LoRA weights for different tasks

Practical Implementation

  • Library: HuggingFace peft library
  • Functions: get_peft_model for easy application
  • Hyperparameter Configuration: Specify target modules, alpha, and rank
  • Parameters: Reports a small percentage of trainable parameters (e.g., 0.19%)
  • Quantization Combination: LoRA can be combined with quantization techniques (e.g., QLoRA with 4-bit quantization)

Conclusion

  • Summary: LoRA is an efficient and practical method for fine-tuning large models
  • Future Work: Further research on combining LoRA with quantization and other techniques