Understanding Prompting and Fine-Tuning Methods

Aug 8, 2024

Lecture Notes: Prompting, Instruction Fine-Tuning, and RLHF

Introduction

  • Lecturer: Jesse Mu, PhD student in the CS Department (NLP group)
  • Topic: Prompting, Instruction Fine-Tuning, and RLHF (Reinforcement Learning from Human Feedback)
  • Relevance: Key concepts behind the training of modern chatbots like ChatGPT and Bing Chat

Course Logistics

  • Project Proposals: Due recently; mentors are being assigned.
  • Assignment 5: Due Friday at midnight; suggested tools include Colab, AWS, Azure, or Kaggle for GPU access.
  • Course Feedback Survey: Posted on Ed; due Sunday by 11:59 pm.

Lecture Overview

Large Language Models (LLMs)

  • Increase in compute and data for LLMs over the years.
  • Pre-training helps LLMs learn features like syntax, co-reference, sentiment, and more.
  • LLMs act as rudimentary world models due to vast internet data.
  • Examples of abilities: math reasoning, code generation, medical text comprehension.

Zero-shot and Few-shot Learning

  • Zero-shot learning: Perform tasks without explicit training, e.g., question answering by predicting the next token.
  • Few-shot learning: Specify tasks by giving example inputs/outputs in the prompt, which improves performance (see the prompt-assembly sketch after this list).
  • Models:
    • GPT (2018): 117 million parameters, trained on books.
    • GPT-2 (2019): 1.5 billion parameters, trained on 40GB web text.
    • GPT-3 (2020): 175 billion parameters, enables few-shot learning.
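
A minimal sketch of how a few-shot prompt can be assembled. The sentiment-classification task, the example reviews, and the commented-out model call are illustrative assumptions, not material from the lecture:

    # Minimal sketch of few-shot (in-context) learning: the task is specified
    # entirely in the prompt via input/output demonstrations, with no gradient
    # updates. The sentiment task and the commented-out model call are
    # illustrative placeholders, not a specific API.

    def build_few_shot_prompt(demonstrations, query):
        """Concatenate labeled demonstrations, then append the new input."""
        blocks = [f"Review: {text}\nSentiment: {label}" for text, label in demonstrations]
        blocks.append(f"Review: {query}\nSentiment:")
        return "\n\n".join(blocks)

    demonstrations = [
        ("The movie was a delight from start to finish.", "positive"),
        ("I walked out halfway through.", "negative"),
    ]

    prompt = build_few_shot_prompt(demonstrations, "A forgettable, by-the-numbers sequel.")
    # completion = language_model.complete(prompt)  # hypothetical model call
    print(prompt)

The model is expected to continue the pattern and emit a label for the final review, which is what makes few-shot specification work without any parameter updates.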

Prompt Engineering

  • Chain of Thought Prompting: Demonstrate reasoning steps in prompts to improve task performance.
  • Zero-shot Chain of Thought Prompting: Simple instructions like "let's think step by step" can improve results (see the sketch after this list).
  • Prompt Engineering: Emerging field, involves constructing effective prompts for various tasks.
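
As a concrete illustration of zero-shot chain-of-thought prompting, the sketch below appends the generic reasoning trigger to a question. The arithmetic question and the commented-out generation calls are placeholders, not the lecture's examples:

    # Sketch of zero-shot chain-of-thought prompting: a generic trigger phrase
    # asks the model to produce intermediate reasoning before the final answer.
    # The question and the commented-out generate() calls are hypothetical.

    question = "A cafeteria had 23 apples. It used 20 and bought 6 more. How many are left?"
    cot_prompt = f"Q: {question}\nA: Let's think step by step."

    # reasoning = language_model.generate(cot_prompt)          # hypothetical call
    # answer = language_model.generate(cot_prompt + reasoning +
    #                                  "\nTherefore, the answer (a number) is")
    print(cot_prompt)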

Instruction Fine-Tuning

  • Objective: Align language models with user intent by fine-tuning on instruction-output pairs (a minimal fine-tuning sketch follows this list).
  • Data Sets: Large datasets like Super-NaturalInstructions (over 1.6k tasks, ~3 million total instances) are used.
  • Evaluation Benchmarks: MMLU and BIG-Bench for assessing performance on diverse tasks.
  • Benefits: Generalizes to unseen tasks; smaller instruction-tuned models can outperform much larger models that are not fine-tuned.
  • Challenges: Human data collection is expensive; creative/open-ended tasks have no single correct answer; the token-level objective penalizes all mistakes equally, regardless of how much they matter.
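
A minimal sketch of what instruction fine-tuning looks like in code, assuming the Hugging Face transformers and PyTorch libraries. The gpt2 checkpoint, the two-example dataset, and the prompt template are placeholders; in practice one trains on a large instruction dataset and often masks the loss on the instruction tokens:

    # Minimal instruction fine-tuning sketch: supervised next-token training
    # on (instruction, output) pairs. Checkpoint, data, and template are
    # illustrative placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in for a pre-trained LM
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    pairs = [
        ("Translate to French: Hello, how are you?", "Bonjour, comment allez-vous ?"),
        ("Summarize: The cat sat on the mat all afternoon.", "A cat lounged on a mat."),
    ]

    model.train()
    for instruction, output in pairs:
        # Standard language-modeling loss over the concatenated instruction + output.
        text = f"Instruction: {instruction}\nResponse: {output}{tokenizer.eos_token}"
        batch = tokenizer(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()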

Reinforcement Learning from Human Feedback (RLHF)

  • Objective: Maximize expected reward (human preference) for language model outputs.
  • Method:
    • Train a reward model to predict human preferences.
    • Use policy gradient methods to optimize language model parameters.
    • Include a penalty term to keep the policy from diverging too far from the pre-trained model (the objective is written out after this list).
  • Challenges: Human feedback is expensive, noisy, and miscalibrated; reward hacking; over-optimization.
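
In standard RLHF notation (a generic reconstruction; the lecture's exact symbols may differ), the policy p_theta^RL is trained to maximize the learned reward while staying close to the pre-trained model p^PT:

    \max_{\theta}\; \mathbb{E}_{\hat{s} \sim p_{\theta}^{\mathrm{RL}}(\cdot \mid s)}
      \Big[ \mathrm{RM}_{\phi}(s, \hat{s})
            \;-\; \beta \log \frac{p_{\theta}^{\mathrm{RL}}(\hat{s} \mid s)}{p^{\mathrm{PT}}(\hat{s} \mid s)} \Big]

where RM_phi is the learned reward model and beta controls the strength of the divergence penalty. The reward model itself is typically fit to pairwise human comparisons:

    \mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(s,\,\hat{s}^{w},\,\hat{s}^{l})}
      \Big[ \log \sigma\big( \mathrm{RM}_{\phi}(s, \hat{s}^{w}) - \mathrm{RM}_{\phi}(s, \hat{s}^{l}) \big) \Big]

where s^w is the human-preferred output, s^l the less-preferred one, and sigma the logistic sigmoid.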

Advanced Concepts and Future Directions

  • Constitutional AI: Using AI feedback to critique and improve language model outputs (a critique-and-revise sketch follows this list).
  • Self-Improvement: Fine-tuning models on their own outputs, particularly for Chain of Thought reasoning.
  • Challenges: High data requirements, reward hacking, hallucination, and security (jailbreaking issues).
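
A rough sketch of the critique-and-revise loop behind Constitutional-AI-style feedback. The principle text, prompt format, and the generate callable are illustrative assumptions, not the published method's exact prompts:

    # Rough sketch of an AI-feedback critique/revise loop: the model critiques
    # its own answer against a written principle, then revises it. The
    # principle, prompt format, and generate() callable are illustrative.

    PRINCIPLE = "Identify any ways the response is harmful, unethical, or inaccurate."

    def critique_and_revise(generate, prompt):
        """generate(text) -> str is a hypothetical call to a language model."""
        answer = generate(prompt)
        critique = generate(f"{prompt}\n{answer}\nCritique request: {PRINCIPLE}")
        revised = generate(f"{prompt}\n{answer}\nCritique: {critique}\n"
                           "Rewrite the response to address the critique:")
        return revised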

Conclusion

  • Current State: Instruction fine-tuning and RLHF have significantly improved LLM capabilities but still face challenges.
  • Future Work: Exploring safer and more efficient methods to align AI models with human values and preferences.
  • Open Questions: Addressing fundamental limitations like hallucination and data efficiency for RLHF.

Final Remarks

  • Exciting Time: Fast-paced developments in LLM research, requiring continual updates and innovations.

End of Notes