Lecture on the Biology of Large Language Models
Introduction
- Focus on the internal workings of transformer language models.
- Explore capabilities that emerge without explicit programming.
- Analogy to biology: like organisms, models have internal mechanisms that were not explicitly designed and must be discovered empirically.
Anthropic's Investigation
- Anthropic explores the internal circuits of language models.
- Use of a method called circuit tracing to understand model features.
- Critique of Anthropic's framing of their findings.
Circuit Tracing Methodology
- Involves the creation of a replacement model using a transcoder.
- The transcoder is trained to reproduce the output of each MLP layer in the original model.
- Generates attribution graphs showing feature contributions.
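The replacement idea can be sketched as a tiny forward pass: an encoder maps a layer's input to sparse, interpretable features, and a decoder maps those features back to mimic the original MLP's output. The dimensions, weights, and function names below are invented for illustration and are not Anthropic's implementation.

```python
# Minimal transcoder sketch: sparse features in, MLP-like output out.
# All weights here are toy values chosen for a hand-checkable example.

def relu(xs):
    return [max(0.0, x) for x in xs]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def transcoder(x, W_enc, W_dec):
    """Encode the layer input into sparse features, then decode the
    features to approximate what the original MLP would have output."""
    features = relu(matvec(W_enc, x))    # sparse, interpretable features
    output = matvec(W_dec, features)     # stands in for the MLP's output
    return features, output

# Toy example: 2-dim input, 3 candidate features (only one fires).
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
W_dec = [[0.5, 0.5, 0.0], [0.0, 1.0, 0.0]]
feats, out = transcoder([2.0, -1.0], W_enc, W_dec)
```

Because the encoder's output passes through a ReLU, most features are exactly zero on any given input, which is what makes the active ones readable.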
Construction of the Transcoder
- Focus on multilayer perceptron (MLP) features, retaining attention layers.
- Unlike a standard MLP, the transcoder reads features from all previous layers (a cross-layer design).
- Trained with a sparsity penalty so that only a few features activate per input.
- Sparsity encourages independent, non-overlapping features that are easier to interpret.
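The training objective described above combines two terms: a reconstruction loss against the original layer's output, and a sparsity penalty on the feature activations. The L1 form and the coefficient below are common choices for sparse dictionary training, shown as an assumption rather than the exact loss Anthropic used.

```python
def transcoder_loss(pred, target, features, l1_coeff=0.01):
    """Reconstruction (MSE vs. the original MLP output) plus an L1
    sparsity penalty that pushes most feature activations to zero."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity

# Toy values: perfect reconstruction, one active feature.
loss = transcoder_loss(pred=[1.0, 0.0], target=[1.0, 0.0],
                       features=[2.0, 0.0, 0.0])
```

With perfect reconstruction the loss is purely the sparsity term, so training trades a little reconstruction accuracy for far fewer active features.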
Use of Attribution Graphs
- Depict which features are activated and their contributions.
- Examples include acronym completion and multi-step reasoning.
- Allows visualization of model reasoning and decision-making processes.
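An attribution graph's edges weight how much each active feature contributes to each downstream node, and weak edges are pruned so the graph stays readable. A minimal sketch of that edge computation, with a hypothetical linear readout and an invented pruning threshold:

```python
def attribution_edges(features, W_out, threshold=0.1):
    """Edge weight from feature i to output j ~ activation_i * weight.
    Only edges above the pruning threshold are kept, mirroring how
    attribution graphs are pruned for readability."""
    edges = []
    for j, row in enumerate(W_out):
        for i, (f, w) in enumerate(zip(features, row)):
            contrib = f * w
            if abs(contrib) > threshold:
                edges.append((i, j, contrib))
    return edges

# Toy case: three features, one output; only one edge survives pruning.
edges = attribution_edges(features=[2.0, 0.0, 0.5],
                          W_out=[[0.5, 1.0, 0.01]])
```

Inactive features (activation 0) contribute nothing, so sparsity directly keeps the graph small.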
Multi-step Reasoning Example
- Explores how models solve multi-step problems (e.g., naming the capital of the state containing a given city).
- Reveals internal paths for reasoning and word association shortcuts.
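The two-hop path the attribution graphs reveal can be caricatured as chained lookups: the model first activates an intermediate "state" feature, then uses it to retrieve the capital. The lookup tables below are an illustrative toy, not anything extracted from a real model.

```python
# Toy two-hop "circuit": city -> state -> capital.
city_to_state = {"Dallas": "Texas"}
state_to_capital = {"Texas": "Austin"}

def capital_of_state_containing(city):
    state = city_to_state[city]       # intermediate-step feature ("Texas")
    return state_to_capital[state]    # final answer ("Austin")

answer = capital_of_state_containing("Dallas")
```

The interesting finding is that the intermediate step is genuinely represented inside the model, rather than the answer being a single memorized association.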
Poetry and Rhymes
- Investigation into how models plan rhymes in poetry.
- Models tend to choose the rhyming word early and write toward it, rather than improvising word by word.
- Demonstrates planning features in the model’s processing.
Multilingual Circuitry
- Examines how language models manage multilingual tasks.
- Finds language-specific features in early layers but more language-agnostic features in middle layers.
- Suggests an internal bias towards English due to training data.
- Middle layers show higher overlap in multilingual feature activation.
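The "higher overlap in middle layers" claim can be quantified by comparing which features fire for the same prompt in two languages. A simple sketch using Jaccard overlap of the active-feature sets (the metric choice and toy activations are assumptions for illustration):

```python
def active_set(features, eps=1e-6):
    """Indices of features that fire above a small threshold."""
    return {i for i, f in enumerate(features) if f > eps}

def overlap(feats_a, feats_b):
    """Jaccard overlap of the active-feature sets in two languages."""
    a, b = active_set(feats_a), active_set(feats_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Toy middle-layer activations for the same prompt in two languages:
# two of three active features are shared.
score = overlap([1.0, 0.0, 2.0, 0.0], [0.5, 0.0, 1.0, 3.0])
```

On this measure, language-agnostic middle layers show scores near 1, while language-specific early layers score near 0.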
Implications and Observations
- Models show features of planning and reasoning.
- Internal multilingual processing with a bias towards English.
- Insights into better design and training of future models.
Conclusion
- Anthropic provides methods to peek inside transformer models.
- Insights can help improve language model design and understanding.
This lecture provided a detailed look at how large language models process information and make decisions. The methods discussed offer ways to interpret complex model behavior, shedding light on the intricate inner workings of these AI systems.