
Understanding Large Language Models (LLMs)

Dec 18, 2024

Lecture: Building Large Language Models (LLMs)

Introduction

  • Focus on building and understanding Large Language Models (LLMs).
  • Examples of LLMs: ChatGPT, Claude, Gemini, Llama.
  • Overview of key components: architecture, training loss, training algorithm, data, evaluation, and systems.

Key Components of LLMs

Architecture

  • LLMs are neural networks, specifically based on transformers.
  • Architecture is discussed only briefly, since it was covered in previous lectures and is well documented online.

Training Loss and Algorithm

  • Covers how models are trained: the loss being optimized and the algorithm that optimizes it.
  • Distinguishes pre-training (modeling raw language) from post-training (turning the model into an AI assistant).

Data

  • Critical for training models.
  • Involves large internet datasets and careful curation/cleaning.

Evaluation

  • Evaluation through perplexity and NLP benchmarks.
  • Challenges in evaluation are acknowledged.

Systems

  • Concerns running models efficiently on modern hardware.
  • Increasingly important because of the sheer size of the models.

Pre-Training

Language Modeling

  • LLMs model a probability distribution over sequences of words (tokens).
  • Autoregressive models factor this distribution with the chain rule of probability and predict one token at a time, as shown below.
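
In symbols, the chain rule factors the joint probability of a sequence into next-token conditionals; the model is trained to predict each token from its prefix:

```latex
% Autoregressive factorization via the chain rule of probability
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```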

Generative Models

  • As generative models, language models can generate new text by repeatedly sampling from the learned next-token distribution (sketched below).
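
A minimal sketch of temperature sampling in plain Python; the function name and setup are illustrative, not from the lecture:

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Sample a token id from a vector of logits using temperature scaling."""
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [l / temperature for l in logits]
    # Softmax (subtract the max for numerical stability).
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token id according to the resulting distribution.
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

print(sample_next_token([2.0, 1.0, 0.1], temperature=0.7))
```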

Tokenization

  • Tokenizers segment raw text into tokens, typically subword units.
  • They matter because they generalize beyond whole words, handle typos, and shorten sequence length; see the sketch after this list.
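
A quick look at subword tokenization, assuming the `tiktoken` library is installed (`pip install tiktoken`); the choice of encoding is illustrative:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

tokens = enc.encode("Tokenizers handle typos gracefully")
print(tokens)                              # list of integer token ids
print([enc.decode([t]) for t in tokens])   # the subword pieces behind those ids
```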

Evaluation of LLMs

Perplexity

  • Measures how well the model predicts held-out text: the exponentiated average negative log-likelihood per token, i.e., the number of tokens the model is effectively hesitating between (sketch below). Depends on the tokenizer and dataset.
  • Not typically used in academic benchmarking, since it is not comparable across setups, but important during development.
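
A minimal sketch of the definition; the helper name is ours, not from the lecture:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token).

    `token_log_probs` holds the model's log p(x_t | x_<t) for each
    token of a held-out text (natural log assumed).
    """
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns every token probability 1/2 has perplexity 2:
print(perplexity([math.log(0.5)] * 10))  # 2.0
```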

Academic Benchmarks

  • HELM and the Hugging Face Open LLM Leaderboard aggregate model performance across many tasks.

Evaluation Challenges

  • Benchmarks can be inconsistent across implementations and suffer from train-test contamination.
  • Open-ended generations are hard to score automatically.

Data Collection and Processing

  • Involves crawling the web and filtering it down to high-quality data.
  • Steps include text extraction, deduplication, and heuristic quality filtering (a toy version is sketched after this list).
  • Synthetic data and multimodal data are active areas of progress.
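
A toy version of heuristic filtering and exact deduplication for web text; the thresholds are illustrative, not those of any particular pipeline:

```python
import hashlib

def keep_document(text, seen_hashes):
    # Exact dedup: drop documents we have already seen.
    h = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if h in seen_hashes:
        return False
    seen_hashes.add(h)
    # Heuristic quality filters: very short pages and pages that are
    # mostly non-alphabetic tend to be boilerplate or markup residue.
    if len(text.split()) < 50:
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    return alpha_ratio >= 0.6

seen = set()
doc = "word " * 100                     # stand-in for an extracted page
print(keep_document(doc, seen))         # True: kept
print(keep_document(doc, seen))         # False: exact duplicate dropped
```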

Scaling Laws

  • Larger models trained on more data perform predictably better.
  • Scaling laws fit this trend, letting you forecast performance from compute, model size, and data before training.
  • The Chinchilla paper shows how to split a fixed compute budget between model size and training tokens (sketch below).
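
A back-of-the-envelope sketch of the Chinchilla-style rule of thumb: training compute C ≈ 6·N·D (N = parameters, D = tokens), with the compute-optimal choice roughly D ≈ 20·N. The constants are the paper's approximations; treat this as illustrative only:

```python
def chinchilla_optimal(compute_flops):
    # Substitute D = 20 N into C = 6 N D  =>  C = 120 N^2.
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(5.76e23)  # roughly Chinchilla's training budget
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # ~7e10 params, ~1.4e12 tokens
```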

Post-Training

Supervised Fine-Tuning (SFT)

  • Fine-tunes the pre-trained model on human-written examples of desired answers (loss sketched below).
  • Data requirements are modest compared with pre-training.
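
A minimal sketch of the SFT loss, assuming PyTorch and a causal LM that returns per-position logits; the function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Next-token cross-entropy, computed only on the response tokens."""
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Mask out prompt tokens: we only train on the human-written answer.
    shift_labels[:, : prompt_len - 1] = -100  # ignored by cross_entropy
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```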

Reinforcement Learning from Human Feedback (RLHF)

  • Overcomes the imitation ceiling of SFT by directly maximizing human preferences.
  • Trains a reward model on human preference comparisons, then optimizes the LLM against it (see the sketch below).
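
A sketch of the pairwise (Bradley-Terry style) reward-model loss commonly used in RLHF, assuming PyTorch; `r_chosen` / `r_rejected` stand for scalar reward-model scores of preferred and dispreferred answers:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    # Push the preferred answer's score above the dispreferred one's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

loss = reward_model_loss(torch.tensor([1.3, 0.2]), torch.tensor([0.4, 0.9]))
print(loss)  # smaller when chosen answers consistently outscore rejected ones
```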

Evaluation of Post-Training

  • Hard because answers are open-ended, so perplexity no longer applies.
  • Relies on human pairwise comparisons (e.g., Chatbot Arena) and LLM-as-judge evaluations; one way to turn pairwise votes into ratings is sketched below.
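
Chatbot Arena ranks models from pairwise human votes; a standard Elo update is one simple way to turn such votes into ratings (the Arena itself has used Elo / Bradley-Terry-style estimates; this sketch is ours):

```python
def elo_update(rating_a, rating_b, a_wins, k=32):
    # Expected score of A under the Elo model.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b

print(elo_update(1000, 1000, a_wins=True))  # (1016.0, 984.0)
```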

Miscellaneous

Systems Optimization

  • Low-precision computation and operator fusion reduce GPU memory traffic and improve utilization (sketched below).
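
A minimal sketch of both ideas, assuming PyTorch 2.x on a CUDA GPU: automatic mixed precision for low-precision matmuls, and torch.compile, which can fuse elementwise operators into fewer GPU kernels:

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()
compiled = torch.compile(model)  # graph capture enables kernel fusion

x = torch.randn(8, 4096, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = compiled(x)  # matmul runs in bfloat16 instead of float32
print(y.dtype)  # torch.bfloat16
```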

Future Topics

  • Unexplored areas include architecture advances, inference-time optimizations, and the legality of data use.

Further Reading

  • Recommended related courses for deeper study: CS224N, CS324, and CS336.