Axolotl Fine-Tuning Techniques Overview

Aug 11, 2024

Lecture Notes on Axolotl and Fine-Tuning Techniques

Plan for Today

  • Discussion of Axolotl and its broad usage
  • Review of the Honeycomb example introduced previously
  • Interactive Q&A session
  • Overview by Zack on parallelism and Hugging Face Accelerate
  • Quick run-through of fine-tuning on Modal
  • Closing Q&A session

Model Capacity Questions

Key Questions When Fine-Tuning

  1. Which base model should I fine-tune from?
    • Model size: 7B, 13B, 70B, etc.
    • Model family: Llama 2, Llama 3, Mistral, etc.
  2. Should I use LoRA or full fine-tune?
    • Recommendation: Use LoRA for efficiency unless specific circumstances justify a full fine-tune.

Model Size Insights

  • Experience with Different Sizes:
    • 7B and 13B models often yield comparable results.
    • 70B models can be complex to manage and require more parallelism.
    • 7B models are popular and easier to work with, especially when sufficient.

Choosing a Model Family

  • Use recent models like Llama 3 for good performance.
  • Check platforms like Hugging Face to find trending models.
  • Running multiple models for comparison can be beneficial, but often the most popular models suffice.

LoRA vs. Full Fine-Tuning

  • Understanding LoRA:

    • LoRA adds small low-rank matrices alongside the original weight matrices and trains only those, significantly reducing the number of trainable parameters (roughly 128,000 vs. 16 million for a single weight matrix in the lecture's example).
    • LoRA requires much less GPU RAM and is simpler to run.
    • Practitioners are encouraged to start with LoRA and move to a full fine-tune only if results warrant it.
  • Quantized LoRA (QLoRA):

    • QLoRA saves further memory by quantizing the frozen base-model weights (typically to 4-bit) while training LoRA adapters on top.
    • Commonly used in practice; the quantization usually affects final quality less than one might expect.
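
The parameter savings behind LoRA are easy to verify with quick arithmetic. The sketch below assumes an illustrative 4096×4096 weight matrix and a LoRA rank of 16 (values chosen for illustration, not taken from the lecture):

```python
# Trainable-parameter count: full fine-tune vs. LoRA on one weight matrix.
# Assumes an illustrative 4096x4096 matrix and LoRA rank r=16.
d_in, d_out, rank = 4096, 4096, 16

full_params = d_in * d_out           # every weight is trainable
lora_params = rank * (d_in + d_out)  # adapter A (d_in x r) plus B (r x d_out)

print(f"full fine-tune: {full_params:,} trainable parameters")
print(f"LoRA (r={rank}): {lora_params:,} trainable parameters")
print(f"reduction: {full_params / lora_params:.0f}x")
```

With these numbers the full matrix has about 16.8 million weights while the rank-16 adapters train about 131,000, close to the 128,000-vs-16-million comparison quoted in the notes.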

Importance of Data Quality

  • Emphasis on Data Improvement:
    • Prioritize improving data quality over micro-optimizing model and training parameters.
    • Enhancing data can lead to significant performance gains.

Getting Started with Axolotl

Initial Steps

  • Visit the Axolotl GitHub repository for examples and quick-start documentation.
  • Work with YAML config files and modify existing examples to fit your dataset and needs.

Configuring Datasets

  • Axolotl supports many dataset formats; specify the format correctly for best results.
  • Prepare sample data and training examples carefully to match the chosen format.
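
As a concrete illustration, one widely supported format is the alpaca-style instruction format: one JSON object per line with instruction/input/output fields. The record below is entirely made up:

```python
import json

# One hypothetical training record in alpaca-style instruction format,
# serialized as a single JSONL line.
record = {
    "instruction": "Summarize the error message.",
    "input": "TypeError: 'NoneType' object is not iterable",
    "output": "The code tried to loop over a value that was None.",
}

line = json.dumps(record)
with open("train.jsonl", "w") as f:
    f.write(line + "\n")

# Round-trip check: the serialized line parses back to the same record.
assert json.loads(line) == record
```

The dataset section of the YAML config then names the file and its format so Axolotl knows how to template each example into a prompt.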

Training Process in Axolotl

  1. Run the Pre-Processing Command: Ensure data is in the correct format.
  2. Run the Training Command: Start training based on the configured settings.
  3. Sanity Check the Model: Verify outputs by loading the model with Hugging Face tooling or by running inference through Axolotl directly.
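
In command-line terms, the three steps above map onto Axolotl's CLI entry points. The module paths below follow recent Axolotl releases and the config path is hypothetical; check them against your installed version:

```python
# Hedged sketch of the three-step Axolotl workflow as CLI invocations.
# Module paths reflect recent Axolotl releases; the config path is made up.
config = "examples/lora.yml"  # hypothetical YAML config

preprocess = ["python", "-m", "axolotl.cli.preprocess", config]
train = ["accelerate", "launch", "-m", "axolotl.cli.train", config]
inference = ["python", "-m", "axolotl.cli.inference", config]

# Each list can be handed to subprocess.run(cmd, check=True) in order:
for cmd in (preprocess, train, inference):
    print(" ".join(cmd))
```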

Honeycomb Case Study Overview

  • Use Case: Honeycomb is an observability platform that lets users pose queries in natural language instead of its domain-specific query language.
  • Focus on fine-tuning models to improve user query effectiveness.
  • Data evaluation is critical and includes unit tests and A/B testing.
  • Synthetic data generation is a method used to expand training datasets when real examples are limited.

Custom Evaluations

  • Evaluations of model performance during training can be wired in through custom configurations.
  • There are various levels of evaluation from unit tests to A/B testing.
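
The lowest level of evaluation can be as simple as assertions run over model outputs. The validator below is a hypothetical example for the Honeycomb use case: it checks that a generated query is valid JSON and carries the expected fields (the key names here are illustrative, not Honeycomb's real schema):

```python
import json

# Illustrative required fields, NOT Honeycomb's actual query schema.
REQUIRED_KEYS = {"calculations", "time_range"}

def passes_level_one(generated: str) -> bool:
    """Unit-test-style check: output parses as JSON and has required keys."""
    try:
        query = json.loads(generated)
    except json.JSONDecodeError:
        return False
    return isinstance(query, dict) and REQUIRED_KEYS <= query.keys()

good = '{"calculations": [{"op": "COUNT"}], "time_range": 7200}'
bad = 'SELECT * FROM events'  # not JSON at all

print(passes_level_one(good))  # True
print(passes_level_one(bad))   # False
```

Checks like this are cheap enough to run over every generation, which makes them a good first filter before more expensive human review or A/B testing.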

Debugging and Troubleshooting in Axolotl

  • Best practices include:
    • Use the latest version of Axolotl.
    • Reduce complexity while debugging, e.g., by running on a single GPU.
    • Regularly clear caches to avoid unexpected behavior.

Scaling Model Training with Hugging Face Accelerate (by Zack)

Understanding GPU Usage

  • Different models require varying amounts of GPU resources for training.
  • Distributed training techniques can manage resources more effectively across multiple GPUs.

Fully Sharded Data Parallelism (FSDP)

  • FSDP allows splitting model parameters across multiple GPUs, enabling training of larger models than a single GPU can handle.
  • Important strategies include sharding model states and parameters to optimize memory usage.
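
To see why sharding matters, it helps to estimate per-GPU training memory. The sketch below uses a standard rule of thumb for full fine-tuning with Adam in mixed precision (fp16 weights and gradients plus fp32 master weights and optimizer states); real usage also depends on activations, batch size, and sequence length:

```python
def training_memory_gb(n_params: float, n_gpus: int = 1) -> float:
    """Rough memory estimate for full fine-tuning with Adam in mixed precision.

    Rule of thumb (excluding activations):
      fp16 weights (2 B) + fp16 grads (2 B)
      + fp32 master weights, momentum, variance (4 + 4 + 4 B) = 16 B/param.
    With FSDP these states are sharded roughly evenly across GPUs.
    """
    bytes_per_param = 16
    return n_params * bytes_per_param / n_gpus / 1e9

print(f"7B model, 1 GPU:  {training_memory_gb(7e9):.0f} GB")        # beyond any single card
print(f"7B model, 8 GPUs: {training_memory_gb(7e9, 8):.0f} GB/GPU")  # fits on 24-80 GB cards
```

This is why a 7B full fine-tune that cannot fit on one GPU becomes tractable once FSDP spreads the parameters, gradients, and optimizer states across several.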

Using Accelerate with Axolotl

  • Configuration involves defining environment settings and memory estimations to ensure efficient training.
  • Key commands include accelerate config and accelerate launch.
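
A minimal version of that workflow looks like the sequence below. The subcommand names are real Accelerate CLI entry points in recent versions; the model name and config path are illustrative:

```python
# Hedged sketch of the Accelerate workflow around an Axolotl run.
# Subcommands exist in recent Accelerate releases; arguments are illustrative.
steps = [
    "accelerate config",                            # interactive: pick FSDP, GPU count, precision
    "accelerate estimate-memory meta-llama/Meta-Llama-3-8B",  # rough memory check first
    "accelerate launch -m axolotl.cli.train config.yml",      # train under the saved config
]
for step in steps:
    print(step)
```

Running `accelerate config` once saves a default launch configuration, so subsequent `accelerate launch` calls pick up the same distributed settings automatically.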

Hyperparameter Tuning with Modal

  • Modal is a cloud-native platform that simplifies running Python code for model training.
  • Useful for hyperparameter tuning; it provides real-time feedback and supports fast iterative development.
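
The fan-out pattern behind such tuning can be sketched framework-agnostically: define a grid of hyperparameters and map a training function over it in parallel. This plain-Python version only enumerates the grid (on Modal, the calls would be distributed across containers; no Modal APIs are used here, and the search space is hypothetical):

```python
from itertools import product

# Hypothetical search space for a LoRA fine-tune.
grid = {
    "learning_rate": [1e-4, 2e-4],
    "lora_r": [16, 32],
    "lora_alpha": [16, 32],
}

# Cartesian product of all hyperparameter values -> one config per run.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(f"{len(configs)} runs to launch")

# On Modal, something like trainer_fn.map(configs) would fan these runs
# out across containers; here we only build the list of configurations.
```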

Final Thoughts and Recommendations

  • Focus on data quality and structuring before diving deep into hyperparameter optimization.
  • Regularly use tools like Weights & Biases for logging metrics and tracking performance.
  • Stay updated with community resources and documentation to get the most out of Axolotl and other frameworks.