
LRMs vs LLMs in Reasoning Tasks

Jul 4, 2025

Overview

This lecture examines the strengths and limitations of Large Reasoning Models (LRMs) compared with standard Large Language Models (LLMs), focusing on how both model types handle rising problem complexity in controlled puzzle environments. The key findings cover the LRMs' advantages and scaling limits, as well as inefficiencies and inconsistencies in their reasoning processes.

Large Reasoning Models (LRMs) and Evaluation Paradigms

  • LRMs are specialized LLMs designed for reasoning tasks, using mechanisms like Chain-of-Thought (CoT) and self-reflection.
  • Existing evaluations based on math and coding benchmarks suffer from data contamination and offer little insight into the quality of reasoning traces.
  • Controlled puzzle environments provide adjustable complexity and clean measurement of both final solutions and the internal reasoning that produced them (a minimal environment is sketched below).
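
As a concrete illustration (a minimal sketch, not the paper's actual harness; all class and method names here are hypothetical), a controlled puzzle environment for Tower of Hanoi reduces complexity to a single knob, the number of disks N:

```python
class HanoiEnv:
    """Minimal Tower of Hanoi environment with one complexity knob: n_disks."""

    def __init__(self, n_disks: int):
        self.n = n_disks
        # Pegs hold stacks of disk sizes; all disks start on peg 0,
        # largest at the bottom.
        self.pegs = [list(range(n_disks, 0, -1)), [], []]

    def is_valid(self, src: int, dst: int) -> bool:
        # A move is legal iff src has a disk and its top disk is
        # smaller than dst's top disk (or dst is empty).
        if not self.pegs[src]:
            return False
        return not self.pegs[dst] or self.pegs[src][-1] < self.pegs[dst][-1]

    def apply(self, src: int, dst: int) -> None:
        assert self.is_valid(src, dst), f"illegal move {src}->{dst}"
        self.pegs[dst].append(self.pegs[src].pop())

    def is_solved(self) -> bool:
        return len(self.pegs[2]) == self.n

    def optimal_length(self) -> int:
        return 2 ** self.n - 1  # minimal number of moves for n disks
```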

Experimental Design and Puzzle Descriptions

  • Four puzzles are used: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World, each with a tunable complexity parameter.
  • Evaluation measures both final-answer accuracy and the coherence of intermediate reasoning steps, checked with custom simulators (one such check is sketched after this list).
  • Access to "thinking tokens" allows analysis of intermediate steps, not just final outputs.
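
One way such a simulator could score a trace, assuming the model's moves have already been parsed into (src, dst) pairs and reusing the hypothetical HanoiEnv sketched above (the parsing step is assumed, not shown):

```python
def score_trace(env: HanoiEnv, moves: list[tuple[int, int]]) -> dict:
    """Replay a parsed move sequence on the simulator, separating
    trace validity (first illegal move) from final-answer accuracy."""
    executed, first_invalid = 0, None
    for i, (src, dst) in enumerate(moves):
        if not env.is_valid(src, dst):
            first_invalid = i
            break
        env.apply(src, dst)
        executed += 1
    return {
        "solved": env.is_solved(),
        "moves_executed": executed,
        "first_invalid_move": first_invalid,
    }

# Example: the optimal 7-move solution for 3 disks.
env = HanoiEnv(3)
trace = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(score_trace(env, trace))
# {'solved': True, 'moves_executed': 7, 'first_invalid_move': None}
```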

Key Experimental Findings

  • Three complexity regimes emerge: standard LLMs do better at low complexity, LRMs are superior at medium complexity, and both model types collapse to near-zero accuracy at high complexity.
  • LRMs increase reasoning effort (thinking tokens) as complexity rises, but counterintuitively scale that effort back as they approach the collapse point, despite having token budget remaining.
  • Reasoning traces show regime-dependent patterns: on simple tasks, correct solutions appear early but the model keeps reasoning anyway (overthinking); on moderately hard tasks, correct answers emerge late, after many wrong attempts; on hard tasks, correct solutions are absent entirely (a sketch for locating the first correct solution follows this list).
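
A hedged sketch of how these trace patterns could be quantified, assuming candidate solutions have already been extracted from the thinking tokens (the extraction step and the check_solution callback are assumptions for illustration):

```python
from typing import Callable, Optional

def first_correct_position(candidates: list,
                           check_solution: Callable) -> Optional[int]:
    """Index of the first extracted candidate that actually solves the
    puzzle, or None if no candidate does. Reasoning spent after this
    index on an easy task is one possible proxy for overthinking; a
    None result on a hard task reflects the absence of any solution."""
    for i, cand in enumerate(candidates):
        if check_solution(cand):
            return i
    return None

# check_solution could reuse the earlier simulator sketch, e.g.:
# lambda mv: score_trace(HanoiEnv(3), mv)["solved"]
```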

Limitations and Open Questions

  • Even when the explicit solution algorithm is provided in the prompt, LRMs still fail at similar complexity thresholds, indicating a failure of exact step-by-step execution rather than of solution discovery (for Tower of Hanoi, the standard recursive solver is sketched after this list).
  • LRMs sometimes perform many correct moves in one puzzle before failing early in a different, simpler puzzle—suggesting issues with generalizable reasoning and symbolic manipulation.
  • Controlled puzzles may not generalize to broader, real-world reasoning tasks.
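
For concreteness, the kind of explicit algorithm referred to above is, for Tower of Hanoi, the standard recursive solver (this is the textbook version, not necessarily the exact pseudocode the paper supplied):

```python
def hanoi_moves(n: int, src: int = 0, dst: int = 2, aux: int = 1):
    """Yield the optimal (src, dst) move sequence for n disks.

    The sequence has length 2**n - 1: solve n-1 disks onto the spare
    peg, move the largest disk, then solve n-1 disks onto the target.
    """
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, aux, dst)
    yield (src, dst)
    yield from hanoi_moves(n - 1, aux, dst, src)

assert list(hanoi_moves(3)) == [
    (0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)
]
```

Executing this faithfully only requires tracking the recursion, which is why failures even with the algorithm in hand point to an execution gap rather than a knowledge gap.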

Key Terms & Definitions

  • LLM (Large Language Model) — A neural network trained to predict and generate text based on large-scale datasets.
  • LRM (Large Reasoning Model) — An LLM variant explicitly optimized for step-by-step reasoning and problem-solving tasks.
  • Chain of Thought (CoT) — A technique where models generate intermediate reasoning steps before an answer.
  • Thinking Tokens — Generated tokens that represent the model’s intermediate reasoning process.
  • Compositional Depth — The minimum number of sequential operations required to solve a problem (a worked example follows this list).
  • Overthinking — When a model continues to generate unnecessary reasoning steps after finding a correct solution.
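
As a worked example of compositional depth: for Tower of Hanoi with N disks, the minimal solution provably takes 2^N - 1 moves, so the required depth grows exponentially in the complexity parameter:

```python
# Compositional depth of Tower of Hanoi as a function of N.
for n in (3, 5, 7, 10):
    print(f"N={n}: minimal moves = {2 ** n - 1}")
# N=3: minimal moves = 7
# N=5: minimal moves = 31
# N=7: minimal moves = 127
# N=10: minimal moves = 1023
```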

Action Items / Next Steps

  • Review model-specific performance on each puzzle and reasoning regime.
  • Consider implications for future model design and evaluation criteria.
  • For deeper understanding, consult the appendix for the full puzzle specifications and the computational-complexity characterization.