This document explores the challenges and breakthroughs in applying AI coding agents to large, complex codebases, drawing on recent research and hands-on experience.
By applying "frequent intentional compaction" and spec-driven workflows, the team achieved rapid, high-quality contributions to substantial codebases, even as non-experts in those codebases.
While AI tools often struggle or create extra work in production environments, carefully engineered context management and human review can produce results comparable to or better than traditional development, especially in brownfield projects.
The document details practical techniques for managing agent context and highlights the importance of team alignment, high-leverage human review, and constantly evolving workflows in an AI-first development world.
Action Items
Join CodeLayer waitlist – Interested developers/teams: Sign up for CodeLayer private beta at https://humanlayer.dev/ for early access to new agentic coding workflows.
Pair with OSS maintainers (in SF) – Author: Offer to spend 7 hours on a Saturday pairing with Bay Area OSS maintainers on shipping a major contribution using context engineering workflows.
Forward-deploy with engineering teams – HumanLayer team: Partner with 10-25 person engineering orgs to help lead the cultural and workflow transformation to AI-first coding.
Inquire about the "thoughts tool" – Interested readers: Start a Claude session in the humanlayer/humanlayer repo to learn how internal context-sharing tools work.
Getting AI to Work in Complex Codebases
AI coding tools struggle with large, mature codebases, often creating rework and tech debt, according to research and field experience.
Core context engineering principles can enable current AI models to deliver high productivity and code quality even in demanding environments.
The "frequent intentional compaction" strategy involves consistently structuring and distilling the context presented to AI agents throughout the development process (a minimal sketch appears below).
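As a rough, hypothetical illustration of a single compaction step, assuming a generic run_agent helper (the prompt wording and file layout are illustrative, not the team's actual tooling): the current session's progress is distilled into a small structured artifact, and a fresh session is seeded from that artifact rather than the full transcript.

```python
from pathlib import Path

def run_agent(prompt: str, context_files=()) -> str:
    """Hypothetical stand-in for a call to a coding agent (CLI or SDK).
    Replace with whatever agent interface you actually use."""
    return f"[agent response to: {prompt[:60]}...]"

# Assumed prompt: asks the agent to distill the session into a compact summary.
COMPACTION_PROMPT = """Summarize this working session for a fresh agent:
- what we are trying to accomplish, and why
- what has been done so far (files touched, key decisions)
- what remains, in priority order, plus known pitfalls
Reference file paths instead of pasting code."""

def compact_and_restart(artifact_dir: Path) -> str:
    """Distill progress into an artifact, then reseed a new session from it."""
    artifact_dir.mkdir(parents=True, exist_ok=True)
    artifact = artifact_dir / "progress.md"
    artifact.write_text(run_agent(COMPACTION_PROMPT))
    # The next session starts from the compact artifact, not the full transcript.
    return run_agent("Continue the work described in progress.md.",
                     context_files=[artifact])
```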
Grounding Context and Prior Research
Talks and studies cited (Sean Grove, Stanford study) highlight that AI coding can lead to rework and is less effective in brownfield and complex systems.
Maintaining detailed specs and context artifacts, rather than relying on ephemeral prompts, is stressed as a way to future-proof the work and preserve team understanding.
Practical Techniques: Managing Agent Context
Naive use of AI agents (chat-like, ad hoc) quickly runs into context window limitations, leading to inefficiency.
Incremental improvements include session restarts, intentional compaction (summarizing progress and direction before refreshing context), and leveraging commit messages.
Sub-agents are used for discrete tasks (searching, summarizing) without polluting the main agent’s context window.
The team structures every project around three phases: Research, Plan, and Implement, each with tailored prompts and sub-agent tools (a sketch of this workflow appears below).
Maintaining context utilization in the 40-60% range is recommended, balancing information sufficiency with noise and window limits.
High-leverage human review is built into each phase, especially in research and planning, ensuring correctness upstream.
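A minimal orchestration sketch of the Research/Plan/Implement loop described above, under stated assumptions: the helpers (run_agent, run_subagent, context_utilization, compact_session), the thoughts/ artifact location, the threshold, and the prompt wording are all hypothetical, not the team's actual tooling. Each phase writes a reviewable artifact, sub-agents keep broad code searches out of the main context window, and the main session is compacted when utilization climbs past roughly 60%.

```python
from pathlib import Path

# Hypothetical stubs: wire these to a real coding agent and its session API.
def run_agent(prompt: str, context_files=()) -> str:
    return f"[agent output for: {prompt[:50]}...]"

def run_subagent(task: str) -> str:
    # A sub-agent gets its own context window; only its summary comes back.
    return f"[sub-agent summary for: {task[:50]}...]"

def context_utilization() -> float:
    return 0.45  # placeholder; a real implementation would ask the agent runtime

def compact_session() -> None:
    pass  # placeholder for the compaction step sketched earlier

ARTIFACTS = Path("thoughts")  # assumed location for research/plan artifacts

def run_phase(name: str, prompt: str, inputs: list) -> Path:
    """Run one phase, persist a reviewable artifact, compact if context is noisy."""
    if context_utilization() > 0.6:  # stay near the ~40-60% sweet spot
        compact_session()
    output = ARTIFACTS / f"{name}.md"
    output.write_text(run_agent(prompt, context_files=inputs))
    return output  # a human reviews this artifact before the next phase begins

def ship_change(ticket: str) -> None:
    ARTIFACTS.mkdir(exist_ok=True)
    # Sub-agents do the broad searching; only distilled findings enter main context.
    findings = run_subagent(f"Find code relevant to: {ticket}; summarize with file paths.")
    research = run_phase("research",
                         f"Explain how the relevant code currently works.\n{findings}", [])
    plan = run_phase("plan",
                     "Write a step-by-step implementation plan based on the research.",
                     [research])
    run_phase("implement",
              "Implement the plan; stop and ask if reality diverges from it.",
              [plan])
```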
Example Applications and Outcomes
The methods enabled non-experts to contribute significant, high-quality code to a 300k LOC Rust codebase (BAML), including major features and bug fixes that were approved quickly.
Two implementation plans were developed for the same bug, one with a preceding research phase and one without; the plan grounded in research produced more maintainable, better-aligned results.
Complex upgrades (e.g., WASM, cancellation support) were delivered by small teams using these workflows at speeds far exceeding traditional estimates.
Lessons Learned and Limitations
Problems still require deep engagement—there is no universal magic prompt.
Collaboration between AI agents and humans, with expert oversight and iterative context refinement, is key to tackling hard problems.
Failures (e.g., an unsuccessful Hadoop-removal attempt) show the boundary conditions: knowledgeable contributors and sufficiently deep research are sometimes indispensable.
High-leverage review at the research and plan stages is more critical than reviewing code after the fact.
Team Alignment, Code Review, and Scaling
Speed and volume increase with AI agents, but maintaining shared understanding and alignment becomes harder.
The move from code reviews of large PRs to reviews of compact, well-written specs and plans is crucial for scaling productivity and onboarding.
Each team must adopt processes that keep members up to date on unfamiliar code areas, whether via PRs and docs or through research/plan/implement specs.
Recap and Future Directions
"Frequent intentional compaction" and spec-driven development deliver on the goals: works on brownfield codebases, solves complex problems, reduces slop, and preserves team alignment.
Challenges remain, and some issues require specialized expertise or deeper investigation, but the approach scales well for the majority of problems.
The team is building and launching CodeLayer, a new platform to enable these workflows; the broader industry will need to adapt processes as AI coding agents become ubiquitous.
Decisions
Adopt frequent intentional compaction workflow — Proven to improve productivity and code quality for AI-driven coding in complex environments, based on direct experience and experimentation.
Open Questions / Follow-Ups
How can frequent intentional compaction and context management best be scaled across very large, distributed engineering teams?
What are the edge cases or classes of engineering problems where context engineering may not suffice, and where expert intervention is always required?
How will current workflows (e.g., code review, onboarding) need to evolve as the proportion of AI-generated code increases further?