Reliable Coding Agents Need Better Codebases

Better models and better harnesses matter. But agents also need codebases where the right context is cheap to find, trust, and use.

Leric Zhang · v0.1

TL;DR

How can we make coding agents more reliable in real codebases?

The common answers are better models and better harness engineering. Both matter, and both are improving fast. But they are not the whole answer. The codebase itself also determines how reliably an agent can work.

AI has made code cheaper to produce, but it has not made software cheaper to change correctly. The hard part of real modification is not typing the patch. It is acquiring enough context to know what the patch must preserve, where related decisions live, and whether the change is complete.

This hidden cost is context cost. Human teams have always paid it through seniority, onboarding, code review, and tribal knowledge. Coding agents make it visible: they fail when the required context is implicit, scattered, unreliable, or hard to discover.

The Context Minimization Principle reframes software design around this cost: for the modifications a system must realistically support, a design is better when the sufficient context required for correct modification is cheaper to acquire.

A programmer picks up a bug. The actual fix is small: maybe two lines, one condition, a missing field in a serializer, a wrong assumption in a background job, or an edge case in a permission check. But the patch does not take two minutes. It takes three days.

The first day is spent reconstructing how the system is supposed to behave and confirming that the bug is really a bug. The second is spent following the call chain, reading tests, checking old pull requests, and asking why the logic exists in three different places. The third is spent building enough confidence that the change is complete and will not break something somewhere else. When the commit finally lands, the diff looks almost trivial. The work was not.

This gap has a name: context cost.

Context cost is the cost of acquiring enough context to modify a system correctly. Writing the patch itself — editing files, changing tests, committing the diff — is the small, visible part. The rest is context cost: understanding what behavior is intended, where the relevant decisions live, which assumptions must be preserved, which other places must change together, and where the boundaries of the change actually are. In mature codebases, this is usually most of the work.

Experienced programmers feel this tax even without naming it. They call it onboarding cost, ramp-up time, system familiarity, legacy complexity, technical debt, tribal knowledge. Each phrase captures part of the pain, but none of them names the underlying cost directly. The hidden tax is context cost.

For decades, human programmers have paid this tax so routinely that we stopped seeing it. We call it experience. We call it seniority. We call it knowing the codebase. A senior engineer can often make a change safely not because the system is simple, but because years of accumulated modification context have been cached in their head.

Then AI coding agents arrived, and context cost became visible.

Agents make the bill visible

An AI coding agent does not experience a codebase the way a long-tenured human engineer does. It does not remember the hallway conversation from two years ago. It does not know that the billing job is special because a large customer depends on an undocumented export format. It does not remember the incident that made the team afraid to touch a certain module. It does not infer, from organizational memory, that two similar-looking code paths must never be merged.

An agent can search, read, call tools, run tests, and generate patches very quickly. But what it knows must be supplied, retrieved, or encoded in explicit artifacts. What is not reachable is, for practical purposes, absent.

This is why agentic programming produces such a strange mix of power and fragility. The agent can write competent code: implement a function, update an API, generate tests, or refactor a file faster than most humans. But in a real codebase, it may still miss the one place where the same decision is duplicated. It may trust a boundary that is nominal but not meaningful. It may update the obvious test while missing the invariant that was never written down.

When this happens, the easy explanation is that the model is not smart enough. That explanation is too shallow.

Current agents have many limitations, and future agents will become much stronger. Context windows will grow. Retrieval will improve. Tool use will become more reliable. Agents will learn better strategies for navigating repositories, interpreting tests, and asking for missing information. But the underlying problem will not disappear, because correct modification still requires sufficient context.

A larger context window changes how much information can be carried; it does not automatically decide which information is relevant, trustworthy, complete, or sufficient. A smarter agent can search better, but it still needs something to search through. It can reason more deeply, but it still needs reliable signals from the system.

A more capable AI does not abolish context cost. It raises the level at which context cost must be managed.

This is the real significance of AI coding agents. They are not important to software design merely because they can write code. They are important because they turn an indirect tax into a direct one — a cost that human teams have always paid quietly is now charged in plain sight.

We can now see which files an agent retrieved, where it searched, which tests failed, which edits it missed, and how many attempts it needed before the patch became correct. These traces are not a perfect measurement of design quality. They depend on the agent, the prompt, the tools, and the task. But they reveal something that used to be hidden inside human cognition: how much context a correct modification actually requires, and how hard that context is to acquire. The agent is not failing at coding. It is failing at context acquisition.
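As a rough illustration of how such a trace might be summarized, here is a minimal sketch. The `AgentTrace` record and the `context_ratio` metric are hypothetical names invented for this example, not the output format of any real agent harness, and the ratio is only a crude proxy that inherits every caveat above.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTrace:
    """Hypothetical record of one agent run on one task."""
    files_read: list[str] = field(default_factory=list)
    files_edited: list[str] = field(default_factory=list)
    attempts: int = 1  # patch attempts before the tests passed


def context_ratio(trace: AgentTrace) -> float:
    """Distinct files read per distinct file edited (a rough proxy only)."""
    read = set(trace.files_read)
    edited = set(trace.files_edited)
    if not edited:
        return float("inf")  # nothing was changed, so the ratio is undefined
    return len(read) / len(edited)


# Illustrative data: the agent read four distinct files to edit one.
trace = AgentTrace(
    files_read=[
        "billing/job.py",
        "billing/export.py",
        "tests/test_job.py",
        "billing/job.py",  # repeated reads count once
        "docs/export_format.md",
    ],
    files_edited=["billing/job.py"],
    attempts=3,
)
print(context_ratio(trace))  # 4.0
```

A high ratio on a small change does not prove the design is bad, but a ratio that stays high across many tasks in the same area is a hint about where context is expensive.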

When an agent fails this way, it is often showing us the same tax human engineers have been paying all along.

This is not an AI problem

It is tempting to treat agent failures as a temporary weakness of today’s models. That would miss the deeper lesson.

When an agent modifies one field but misses the schema migration, the validator, the UI form, the analytics query, and the test fixture, the problem is not only that the agent failed to search widely enough. The system encoded one decision across multiple places without making the relationship reliably discoverable.
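One way to make that relationship discoverable is to let a single declaration drive the dependent pieces. The sketch below is an assumption-laden illustration, not a recommended framework: `FieldSpec`, `ORDER_FIELDS`, and the three derived helpers are hypothetical names, and a real system would also need to cover migrations and analytics queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    type_: type
    required: bool = True


# Hypothetical schema: the one place where "what fields an order has" is decided.
ORDER_FIELDS = (
    FieldSpec("customer_id", int),
    FieldSpec("amount_cents", int),
    FieldSpec("note", str, required=False),
)


def validate(payload: dict) -> list[str]:
    """Return validation errors, derived from ORDER_FIELDS."""
    errors = []
    for spec in ORDER_FIELDS:
        if spec.name not in payload:
            if spec.required:
                errors.append(f"missing field: {spec.name}")
            continue
        if not isinstance(payload[spec.name], spec.type_):
            errors.append(f"wrong type for {spec.name}")
    return errors


def form_fields() -> list[str]:
    """Field names a UI form would render, derived from the same spec."""
    return [spec.name for spec in ORDER_FIELDS]


def example_payload() -> dict:
    """A minimal valid payload for tests (required fields, default values)."""
    return {spec.name: spec.type_() for spec in ORDER_FIELDS if spec.required}
```

With this shape, adding a field means editing `ORDER_FIELDS`; the validator, the form, and the fixture follow without a repository-wide search.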

When an agent reads an interface but still has to inspect three implementations to understand what the method really means, the problem is not only that the agent lacks judgment. The boundary did not express a trustworthy contract.
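A boundary becomes easier to trust when its contract is both stated and checkable. The following is a minimal sketch under stated assumptions: `RateLimiter`, `InMemoryLimiter`, and `check_contract` are hypothetical, and the contract itself is invented for the example.

```python
from typing import Protocol


class RateLimiter(Protocol):
    def allow(self, key: str) -> bool:
        """Contract (hypothetical, assumes limit >= 1):
        - Returns True for the first `limit` calls per key, then False.
        - Keys are independent of one another.
        """
        ...


class InMemoryLimiter:
    """One implementation; callers should only need the contract above."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self._counts: dict[str, int] = {}

    def allow(self, key: str) -> bool:
        count = self._counts.get(key, 0)
        if count >= self.limit:
            return False
        self._counts[key] = count + 1
        return True


def check_contract(limiter: RateLimiter, limit: int) -> None:
    """Contract test that every implementation is expected to pass."""
    assert all(limiter.allow("a") for _ in range(limit))
    assert limiter.allow("a") is False  # the limit is enforced
    assert limiter.allow("b") is True   # keys are independent


check_contract(InMemoryLimiter(limit=2), limit=2)
```

If every implementation runs through `check_contract`, a reader can stop at the interface instead of inspecting each class behind it.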

When an agent breaks behavior because the key invariant existed only in a senior engineer’s memory, the problem is not only that the agent lacked access to that memory. The system depended on implicit context to preserve correctness.
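The remedy is to move the invariant out of memory and into something executable. As a small sketch, assume a hypothetical `export_row` function whose output format an external consumer relies on; the function, the format, and the test name are all invented for illustration.

```python
def export_row(invoice_id: int, amount_cents: int, currency: str) -> str:
    """Format one invoice line for a (hypothetical) downstream export."""
    # INVARIANT: the column order, the ';' separator, and the two-decimal
    # amount are relied on by an external consumer. Change them only
    # together with that consumer.
    return f"{invoice_id};{amount_cents / 100:.2f};{currency.upper()}"


def test_export_row_format_is_stable() -> None:
    # Pins the invariant so that a well-meaning refactor fails loudly.
    assert export_row(42, 1999, "usd") == "42;19.99;USD"
    assert export_row(7, 5, "eur") == "7;0.05;EUR"


test_export_row_format_is_stable()
```

The comment tells the next modifier why the format matters, and the test makes sure the constraint holds even if nobody reads the comment.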

Humans suffer from the same conditions, but they have one form of access agents typically lack: information that was never written down. A human engineer can ask the teammate who built the system, remember a hallway conversation about why a job is special, or recall the incident that made the team quietly stop touching a module. The durable asymmetry is access to context that lives only in heads and conversations, not in files.

This compensation is powerful, but it has a cost. It makes onboarding slow. It makes refactoring frightening. It makes review depend on the right person noticing the right omission. It makes teams afraid to change old modules. It makes “simple” tasks expand into days of investigation. It makes productivity uneven, concentrated in the few people who have cached enough context in their heads.

For humans, the pain is diffuse. It appears as hesitation, fatigue, long ramp-up, vague risk, and a general sense that the system is hard to change. Agents make the pain sharper. They turn invisible context cost into visible failure.

AI did not create the context problem. It made the bill impossible to ignore.

Software is made of modifications

To understand why this matters, we need to step back from AI. Software engineering is often pictured as two phases — development, where the system is built, and maintenance, where it is changed — as if writing new code and modifying existing code were different jobs. The split is a fiction. From the second commit onward, every change lands on a system that already exists. Modification is not what happens after development; it is development itself.

Adding a feature modifies the current system. Fixing a bug modifies behavior. Refactoring modifies structure while trying to preserve behavior. Reviewing a pull request evaluates whether a modification is safe. Testing checks whether a modification preserved the intended constraints. Useful software is not written once. It lives by changing.

If every change is a modification, and every modification has a context cost, then naturally we want that cost to be smaller. Where decisions live, which boundaries can be trusted, what is written down and what is left implicit — all of these shape how expensive the next modification will be. Software design is the name for the choices behind them.

We usually describe good design with words like beautiful, elegant, or clean. These words point at something real — well-designed systems do feel different to work with — but they are aesthetic reactions, not explanations. The underlying factor they gesture at, without ever naming, is how cheaply the next modification can acquire the context it needs.

A design is good when it makes realistic future modifications easier, safer, and more reliable. A design is bad when it forces the modifier to search too widely, follow too many unreliable paths, or depend too heavily on implicit knowledge before making a correct change.

A small module can be hard to modify if its behavior hides in configuration and framework magic; a large system can be easy if its decisions are well bounded and its tests make the important assumptions executable. The difficulty of changing a codebase is not set by how much code it has, but by how expensive it is to acquire enough context to change it correctly.

The Context Minimization Principle

This is the idea behind the Context Minimization Principle, or CMP.

For the modifications a system must realistically support, a design is better when the sufficient context required for correct modification is cheaper to acquire.

CMP is not a new aesthetic. It is a way to name the trade-off software engineers have always been making. A boundary is useful when it lets us stop reading. A test is useful when it preserves a behavioral fact we should not have to remember. An architecture is useful when it tells us where a decision should live. A convention is useful when it turns search into navigation.

These practices look different on the surface, but underneath they do the same kind of work: they make the context required for future modification cheaper to acquire.
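To make the convention case concrete, here is a minimal sketch of a layout rule written down as code. The rule itself (tests for `src/<pkg>/<mod>.py` live at `tests/<pkg>/test_<mod>.py`) and the helper name `expected_test_path` are assumptions chosen for the example, not a standard.

```python
from pathlib import PurePosixPath


def expected_test_path(source: str) -> str:
    """Map src/<pkg>/<mod>.py to tests/<pkg>/test_<mod>.py (assumed convention)."""
    path = PurePosixPath(source)
    if not path.parts or path.parts[0] != "src" or path.suffix != ".py":
        raise ValueError(f"not a source module: {source}")
    # Keep the package directories, swap the root, and prefix the file name.
    return str(PurePosixPath("tests", *path.parts[1:-1], f"test_{path.name}"))


print(expected_test_path("src/billing/export.py"))  # tests/billing/test_export.py
```

Once the rule is explicit, a human or an agent can compute where the relevant test should be rather than search for it, and a lint step could flag modules that break the rule.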

This also explains why they fail. A boundary that cannot be trusted adds indirection without reducing understanding. An architecture that no longer predicts where decisions live becomes ceremony. An abstraction built for a future that never arrives becomes over-engineering. A convention that exists only in someone’s head becomes tribal knowledge.

CMP does not remove judgment. It sharpens the question. Instead of asking whether code is clean, simple, or elegant in the abstract, we can ask whether this design makes the context required for realistic modifications cheaper to acquire.

That question is small enough to use in practice and large enough to connect many old debates. It is also enough for a first principle. The point of CMP is not that every design decision can already be measured with a perfect number. The point is that software design has always lacked a shared currency for its trade-offs. Context cost is that currency.

The work that does not get cheaper

AI will write most of the code. That part is no longer a question — it is already happening on routine changes and will spread to harder ones. Whether the timeline is two years or ten does not change the direction. The cost of producing code is collapsing.

The new question is how to turn that cheapness into reliable systems quickly. Producing code is not the same thing as building a system — each batch of cheap code still has to land on a codebase that can absorb it without losing coherence. Which decisions a change must touch, which assumptions it preserves, which boundaries hold, which tests carry the important constraints — all of these decide whether AI’s speed becomes reliable systems or a faster mess. The faster code production gets, the more design skill matters.

Software design has always been valuable, but it has been hard to argue for. Beautiful, elegant, clean, simple — these are aesthetic claims, and aesthetic claims tend to lose in rooms where shipping is on the other side. CMP changes that. It replaces the aesthetic question with one a team can actually argue about: does this design make context cheaper for the modifications the system will have to support? That question is small enough to apply to a single pull request, large enough to defend an architectural call, and grounded enough that it does not dissolve into preference.

CMP gives design a question that has an answer.

That answer is rarely a precise number. But it is concrete enough to evaluate a design against, and concrete enough that a proposal can lose the argument on it. Design stops being something senior engineers gesture at and becomes something teams can reason about.

For individual programmers, this changes what is worth practicing. The established design principles and patterns (SOLID, DDD, clean architecture, and others) do not become less valuable; each of them is a tested way of managing context cost. Beyond the principles, one habit compounds: notice the context you are silently carrying while you make a change (the assumption you held in your head, the dependency you remembered, the prior decision you trusted without re-deriving) and push as much of it as possible into the artifact, so the next modifier does not have to rediscover what you already worked out. It is a small habit, but over time it is how good design stops being an abstract discipline and becomes a daily reflex.

The future does not bypass context

One possible objection is that future AI may solve this problem automatically. Perhaps agents will become so capable that they can read the whole codebase, infer the architecture, reconstruct the missing decisions, and make correct changes without humans designing for context.

Perhaps they will. But that would not make CMP irrelevant. It would mean the agent has learned to perform context acquisition more effectively. More intelligence does not remove the need for context management. It changes the form of context management.

A future AI might invent better ways to organize software than the patterns we use today. It might generate new indexes, maintain living documentation, derive tests from observed behavior, or continuously map decision relationships across a codebase. It might make context acquisition cheaper in ways we have not yet imagined. But whatever those mechanisms turn out to be, the software they produce will still be beautiful, clean, and elegant in a way we can now explain rather than just feel: organized so that whoever reads it next — a programmer or an agent — needs less context to understand it and to change it correctly.

The bill is on the table

Programmers have always known that some systems are easier to change than others. We know the feeling of a codebase where the next step is obvious: the boundary is clear, the test tells us what matters, the naming leads us to the right place, and the architecture narrows the search.

We also know the opposite feeling: every change opens another file, every rule has another exception, every abstraction leaks, every test is either brittle or irrelevant, and no one is sure where the real decision lives.

That difference is not a matter of taste. It is the central economic fact of software work: software must change, and every correct change requires sufficient context.

Good software design is the discipline of making the context for future modifications cheaper to acquire.

The hidden tax has always been there. AI did not invent it. AI exposed it. Now that the bill is on the table, we can start designing against it — and software design stops being a sensibility a few engineers happen to have, and starts being a discipline a team can build.