Low-Rank Adaptation for Fine-Tuning Large Models

Jul 11, 2024

Introduction

  • Topic: Low-Rank Adaptation (LoRA)
  • Context: Fine-tuning very large deep learning models
  • Applications: Large language models (e.g., GPT-4, reportedly ~1.8 trillion parameters)
  • Challenges:
    • Tuning many parameters is time-consuming
    • Massive GPU requirements

Precision and Quantization

Weight Matrices & Data Types

  • Weight Matrices: Composed of floating-point numbers, typically float32
  • Precision Types:
    • float32 (32 bits)
    • Half Precision (float16)

Lowering Precision

  • Trade-offs: Lower memory usage at the cost of precision
  • Effects:
    • Loss of precision and rounding errors can accumulate over many operations
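
The accumulation effect is easy to demonstrate; a minimal NumPy sketch (values chosen for illustration):

```python
import numpy as np

# Accumulate 0.01 ten thousand times (true sum: 100.0) in float16 vs
# float32. float16 has ~3 decimal digits of precision, so once the
# running sum grows, adding 0.01 rounds away and error accumulates.
steps = 10_000
inc16 = np.float16(0.01)
inc32 = np.float32(0.01)

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for _ in range(steps):
    acc16 = np.float16(acc16 + inc16)
    acc32 = np.float32(acc32 + inc32)

print("float16 sum:", acc16)  # far below 100.0 (the sum stalls)
print("float32 sum:", acc32)  # very close to 100.0
```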

Memory Calculation

  • Formula: Bytes per parameter * number of parameters
  • Additional Memory: Training also requires storage for gradients and optimizer state
  • Example: BLOOM (176 billion parameters) requires ~350GB for inference in float16
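
A back-of-the-envelope version of the formula, using the BLOOM figures above:

```python
# Rough inference memory estimate: bytes per parameter * number of
# parameters. BLOOM has ~176e9 parameters; float16 is 2 bytes each.
def inference_memory_gb(num_params: float, bytes_per_param: int) -> float:
    return num_params * bytes_per_param / 1e9

bloom_params = 176e9
print(inference_memory_gb(bloom_params, 2))  # ~352 GB in float16
print(inference_memory_gb(bloom_params, 4))  # ~704 GB in float32
```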

Mixed Precision

  • Concept: Different parts of the network use different data types
  • Performance Impact: Varies by model and dataset, but accuracy typically stays close to full precision

Quantization

  • Definition: Reducing the precision of model weights, even down to integer types (e.g., int8)
  • Advantage: Can largely maintain model performance despite the lower precision
  • Techniques: Various methods exist for quantization
  • Optimal Precision: Recent papers suggest that 4-bit quantization is near optimal
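
One simple quantization scheme is absmax scaling to int8; a minimal sketch (illustrative, not any particular library's implementation):

```python
import numpy as np

# Absmax int8 quantization: scale weights so the largest magnitude maps
# to 127, round to integers, and keep the scale for dequantization.
def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max())  # rounding error, at most ~scale/2
```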

GPU Performance

  • FLOPS: Floating Point Operations Per Second metric
  • Lower-Precision Models: Train faster because they require less memory and compute time

Parameter-Efficient Fine-Tuning Techniques

Traditional Transfer Learning

  • Method: Freeze all weights, add task-specific head
  • Limitation: Only utilizes output embeddings

Adapter Layers

  • Concept: Add new modules between existing layers
  • Pros/Cons: Improved access to the model's internal representations, but adds inference latency
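
A bottleneck adapter can be sketched in a few lines (all names and shapes here are illustrative):

```python
import numpy as np

# Hypothetical bottleneck adapter: down-projection, nonlinearity,
# up-projection, plus a residual connection, inserted between frozen
# transformer layers.
def adapter(h: np.ndarray, w_down: np.ndarray, w_up: np.ndarray) -> np.ndarray:
    z = np.maximum(h @ w_down, 0.0)  # ReLU bottleneck
    return h + z @ w_up              # residual keeps the original signal

d, r = 8, 2
h = np.random.randn(3, d)
w_down = np.zeros((d, r))            # zero init: adapter starts as identity
w_up = np.random.randn(r, d)
print(np.allclose(adapter(h, w_down, w_up), h))  # True
```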

Prefix Tuning

  • Method: Optimizes input vectors (prefixes) for language models
  • Limitation: Limited control over model behavior

Low-Rank Adaptation (LoRA)

  • Concept: Fine-tune by learning a low-rank decomposition of the weight updates rather than the full matrices
  • Motivation: Fine-tuning in a low-dimensional subspace can match tuning the full parameter space
  • Application: Typically applied to attention weights in Transformers

LoRA in Detail

Matrix Rank

  • Definition: Number of linearly independent rows or columns of a matrix
  • Importance: Captures essential data properties
  • Low-Rank Matrices: Compact representation, reduced complexity
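
The compactness of low-rank matrices is easy to see with an outer product:

```python
import numpy as np

# A 4x4 matrix built as the outer product of two vectors has rank 1:
# every row is a multiple of the same row vector.
u = np.array([[1.0], [2.0], [3.0], [4.0]])   # 4x1
v = np.array([[1.0, 0.5, -1.0, 2.0]])        # 1x4
m = u @ v                                    # 4x4, one independent direction
print(np.linalg.matrix_rank(m))              # 1

# Storing u and v (8 numbers) instead of m (16 numbers) is the same
# compression idea LoRA applies to weight updates.
```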

LoRA Process

  • Foundational Paper: LoRA (Hu et al., 2021), motivated by earlier Facebook research on the intrinsic dimensionality of fine-tuning
  • Process: Decompose the weight update ΔW as the product of low-rank matrices, ΔW = BA
  • Initialization: B = 0 and A drawn from a normal distribution, so ΔW starts at zero

Efficiency

  • Implications: Efficient training, since only the B and A matrices receive gradient updates
  • Implementation: Adds the low-rank term BAx to the output Wx of the original frozen layer
  • Scaling Factor: α/r balances the learned update against the original model weights
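
Putting these pieces together, a minimal LoRA forward pass might look like this (shapes and values are illustrative):

```python
import numpy as np

# Frozen weight W (d_out x d_in); trainable B (d_out x r) and A (r x d_in).
# Per the LoRA paper, B starts at zero and A from a normal distribution,
# so training begins exactly at the pretrained model's behavior.
d_in, d_out, r, alpha = 16, 16, 4, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, normal init
B = np.zeros((d_out, r))                    # trainable, zero init

def lora_forward(x: np.ndarray) -> np.ndarray:
    scaling = alpha / r  # balances the update against W
    return x @ W.T + (x @ A.T @ B.T) * scaling

x = rng.standard_normal((2, d_in))
print(np.allclose(lora_forward(x), x @ W.T))  # True: B = 0, no change yet
```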

Hyperparameters

  • Rank (r): Intrinsic dimensionality of the update; typically ranges from 1-64
  • Alpha (α): Scales the weight update (effective factor α/r), balancing original and altered weights
  • Optimal Rank: Depends on dataset; empirical selection needed
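
A quick parameter count shows how the rank trades capacity for size (dimensions chosen for illustration):

```python
# Trainable parameters for a single weight matrix: full fine-tuning
# vs. LoRA at several ranks.
d = 4096                      # e.g. a square attention projection
full = d * d                  # parameters updated by full fine-tuning
for r in (1, 4, 16, 64):
    lora = r * d + d * r      # A is (r x d), B is (d x r)
    print(f"r={r}: {lora} params, {100 * lora / full:.3f}% of full")
```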

Benefits

  • Reduced Computational Requirements: Efficient training
  • Performance: Maintains high performance with fewer parameters
  • Modularity: Easy to swap LoRA weights for different tasks

Practical Implementation

  • Library: HuggingFace peft library
  • Functions: get_peft_model for easy application
  • Hyperparameter Configuration: Specify target modules, alpha, and rank
  • Parameters: Reports a small percentage of trainable parameters (e.g., 0.19%)
  • Quantization Combination: LoRA can be combined with quantization techniques (e.g., QLoRA with 4-bit quantization)

Conclusion

  • Summary: LoRA is an efficient and practical method for fine-tuning large models
  • Future Work: Further research on combining LoRA with quantization and other techniques