Exploring Context Windows in LLMs

Feb 8, 2025

Lecture on Large Language Models (LLMs) and Context Windows

Understanding the Context Window

  • Definition: The LLM's equivalent of working memory; it determines how much of a conversation the model can retain without forgetting earlier details.
  • Functionality: Everything inside the window is available to the model, allowing it to reference previous exchanges when generating responses.
  • Limitations: When a conversation exceeds the context window, the earliest details fall out of it and are forgotten, which can lead to inaccuracies (see the sketch below).
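
A minimal sketch of this forgetting behavior, assuming a fixed token budget and a hypothetical count_tokens() helper (real systems count with the model's own tokenizer): older messages are dropped first once the budget is exceeded.

```python
# Sketch: once a conversation exceeds the window, the oldest messages
# are the first to fall out -- which is why the model "forgets" them.
def count_tokens(text: str) -> int:
    # Hypothetical stand-in: ~1.5 tokens per word (see the average below).
    return int(len(text.split()) * 1.5)

def fit_to_window(messages: list[str], window_size: int) -> list[str]:
    """Keep only the most recent messages that fit in the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):        # walk newest to oldest
        cost = count_tokens(msg)
        if used + cost > window_size:
            break                         # everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))           # restore chronological order
```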

Measuring Context Windows

  • Tokens: Whereas humans read text as characters and words, LLMs process text as tokens, the smallest unit of language the model works with.
    • Tokens can be individual characters, parts of words, whole words, or short phrases.
    • Example: the word "a" in "Martin drove a car" is a single token, while "amoral" may be split into two tokens ("a" and "moral"), depending on the tokenizer.
  • Tokenization: Process of converting language to tokens using a tokenizer.
    • Different tokenizers may produce varying results for the same text.
    • A typical English word averages about 1.5 tokens.
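
To see tokenization in practice, here is a small sketch using OpenAI's tiktoken library (one tokenizer among many; models such as IBM Granite ship their own and may split the same text differently):

```python
# pip install tiktoken -- OpenAI's tokenizer library, used here purely
# as an example; other models' tokenizers can produce different splits.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Martin drove a car", "amoral"]:
    ids = enc.encode(text)                      # text -> token IDs
    pieces = [enc.decode([i]) for i in ids]     # each ID back to its text
    print(f"{text!r} -> {len(ids)} tokens: {pieces}")
```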

Context Window Size and Processing

  • Self-Attention Mechanism: Used by transformer models to determine relationships and dependencies between tokens.
    • Each token's relevance to every other token is computed as a set of attention weights (sketched below).
  • Window Size Increases: Early LLMs had context windows of around 2,000 tokens; newer models such as IBM Granite 3 support 128,000 tokens.
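
For intuition, a minimal numpy sketch of scaled dot-product self-attention, the computation behind those weights (toy dimensions; no batching, masking, or multiple heads):

```python
# Toy scaled dot-product self-attention in numpy.
import numpy as np

def self_attention(x: np.ndarray, Wq, Wk, Wv) -> np.ndarray:
    """x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projections."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Every token scores every other token: an n x n matrix.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mix of values

rng = np.random.default_rng(0)
n, d = 6, 8                                 # 6 tokens, model width 8
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)  # (6, 8): one vector per token
```

The n x n scores matrix is the source of the quadratic scaling discussed under the challenges below.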

Components Within a Context Window

  • User Input and Model Responses: Both contribute to filling the context window.
  • System Prompts: Hidden instructions conditioning model behavior.
  • Supplementary Information: Documents, source code, and other external data can be included in the window, as in retrieval-augmented generation (RAG).
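
A sketch of how these components share a single token budget (the message layout and the 1.5-tokens-per-word estimate are illustrative, not any particular chat API):

```python
# Everything below competes for the same context window.
WINDOW_SIZE = 128_000   # e.g., IBM Granite 3's context length

system_prompt = "You are a helpful assistant. Answer concisely."
retrieved_docs = ["<contents of policy.pdf>"]          # supplementary info
history = ["user: What does the refund policy say?",   # prior exchanges
           "assistant: Refunds are allowed within 30 days."]
new_message = "user: And for digital purchases?"

prompt = "\n".join([system_prompt, *retrieved_docs, *history, new_message])

# Rough estimate at ~1.5 tokens per word; large documents shrink the room
# left for conversation history and for the model's own response.
est_tokens = int(len(prompt.split()) * 1.5)
assert est_tokens <= WINDOW_SIZE, "prompt would overflow the context window"
```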

Challenges with Large Context Windows

  • Increased Compute Requirements: Self-attention processing scales quadratically with sequence length (see the arithmetic sketch after this list).
    • Doubling the input tokens requires roughly four times the processing power.
  • Performance Issues: Models often struggle to use information buried in the middle of long contexts, effectively taking shortcuts past it (the "lost in the middle" effect).
    • Performance is best when the relevant information appears at the start or end of the context.
  • Safety Concerns: Longer windows increase vulnerability to adversarial prompts and jailbreaking.
    • Embedded malicious content is harder to detect.
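
Back-of-the-envelope arithmetic for the quadratic term above, counting only attention-matrix entries:

```python
# Each layer/head scores every token against every other: n^2 entries,
# so doubling the context length quadruples this cost.
for n in [2_000, 4_000, 128_000]:
    print(f"{n:>7,} tokens -> {n * n:>18,} attention scores")
# Doubling 2,000 -> 4,000 tokens quadruples the count
# (4,000,000 -> 16,000,000); at 128,000 tokens it is 16,384,000,000.
```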

Conclusion

  • Balancing Act: Choosing a context window size means weighing information needs against computational cost and the performance and safety issues above.