For years, the promise of generative AI in software development was framed in straightforward terms: better models would write more code, faster. But one of the fastest-growing engineering concerns around coding agents now sits somewhere else entirely — in the swelling mass of language around the code itself. Shell output, tool schemas, repository context, retrieved files, intermediate reasoning, and long chat histories are all competing for the same finite resource: tokens. What is emerging around AI coding is not just a race to generate more text, but a secondary engineering layer devoted to deciding what the model should never have to read in the first place. Anthropic now explicitly frames this as “context engineering,” arguing that the problem is no longer just how to phrase prompts, but how to curate the set of tokens that drives model behavior over time.
That shift is no longer confined to scattered open-source experiments. OpenAI exposes verbosity and reasoning.effort as explicit controls in its latest model guidance, effectively turning how much a model says and how much it thinks into separate operating variables. In other words, token discipline is moving from user habit to system design.
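In practice, those two dials live side by side in a single request. A minimal sketch, assuming the parameter names OpenAI documents for its GPT-5-family Responses API (reasoning.effort and text.verbosity); the model name here is a placeholder, and current docs should be checked before relying on the exact fields:

```python
# Sketch: verbosity and reasoning effort as separate, explicit dials.
# Parameter shapes follow OpenAI's documented Responses API conventions
# for GPT-5-family models; treat field names as assumptions to verify.

def build_request(prompt: str, effort: str = "low", verbosity: str = "low") -> dict:
    """Assemble a request body with token discipline as explicit settings."""
    return {
        "model": "gpt-5",                  # placeholder model name
        "input": prompt,
        "reasoning": {"effort": effort},   # how much the model thinks
        "text": {"verbosity": verbosity},  # how much the model says
    }

# A terse-by-default agent turn: minimal hidden reasoning, short answer.
req = build_request("Summarize the failing test output.", effort="minimal")
```

The point of separating the two is operational: a debugging step might want high effort with low verbosity, while a design discussion might invert both.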
The economics explain why. OpenAI’s pricing page lists GPT-5.4 at $2.50 per 1 million input tokens and $15.00 per 1 million output tokens, while its model docs describe a 1,050,000-token context window. Anthropic has similarly pushed a 1 million-token context for Claude Opus 4.6, while also noting premium pricing thresholds for very large prompts. Long context, in other words, is both a capability and a bill.
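The asymmetry in those rates is easy to see with back-of-envelope arithmetic, using the per-million-token figures cited above. The scenario below (an agent turn that rereads a 200,000-token context to emit 1,000 tokens) is an illustration, not a measured workload:

```python
# Cost model using the article's cited rates: $2.50 per 1M input tokens,
# $15.00 per 1M output tokens. The example turn sizes are hypothetical.

INPUT_RATE = 2.50 / 1_000_000    # dollars per input token
OUTPUT_RATE = 15.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single model call at the rates above."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# One agent turn rereading a 200k-token context to emit 1k tokens:
per_turn = request_cost(200_000, 1_000)  # ~ $0.515, dominated by input
```

Note that the input side dominates: trimming what the model rereads each turn saves far more than shortening its answers, which is why so much of the tooling below targets context rather than output.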
Why more context can make coding agents worse
Once coding agents are asked to work across large repositories, long-running sessions, and tool-heavy environments, token overhead stops being an abstract UX issue and becomes a systems problem. Recent tooling makes that explicit. Some tools try to reduce output overhead by making agents answer tersely. Others compress command output before it reaches the model. Atlassian’s mcp-compressor attacks a different source of waste: tool-description bloat in MCP servers, which Atlassian says can consume roughly 10,000 to 17,000 or more tokens per request just for tool metadata. Atlassian says its proxy can reduce that overhead by 70% to 97%, while the project README claims reductions up to 99% depending on setup.
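The shell-output side of that waste is the easiest to picture. A minimal sketch of the "compress command output before it reaches the model" idea, assuming a simple head-and-tail heuristic (real tools are more sophisticated); the cutoffs are arbitrary illustrations:

```python
# Sketch: elide the middle of long shell output before it enters context.
# Heuristic rationale: errors usually surface near the end of a log, and
# the invocation context near the start. Head/tail sizes are arbitrary.

def compress_output(text: str, head: int = 10, tail: int = 20) -> str:
    """Keep the first `head` and last `tail` lines, eliding the rest."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text  # already small enough; pass through untouched
    omitted = len(lines) - head - tail
    marker = f"... [{omitted} lines omitted] ..."
    return "\n".join(lines[:head] + [marker] + lines[-tail:])
```

A 100-line build log collapses to 31 lines under the defaults, and the omission marker tells the model that elision happened rather than silently dropping content.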
The deeper technical story is that more context can actively reduce performance. A March 2026 paper, Compressing Code Context for LLM-based Issue Resolution, argues that current issue-resolution systems often overapproximate repository context, paying a double penalty: higher token cost and lower effectiveness, because irrelevant code floods the context window and distracts the model from the real bug-fixing signal. The authors report roughly 6x compression, 51.8%–71.3% token-budget reduction, and 5.0%–9.2% gains in issue-resolution rates on SWE-bench Verified across three frontier models. The key result is not just that smaller prompts are cheaper. It is that excessive context can dilute semantic relevance.
That matters because code is not ordinary prose. Generic compression is often unsafe in software tasks because code carries structural dependencies, definition-use relationships, and task-specific constraints that can be broken if the wrong lines disappear. The frontier problem is therefore not “compress everything,” but preserve task-relevant semantics while stripping away distractors. The emerging design principle is not maximal reduction, but minimal sufficient context.
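What "minimal sufficient context" means in code terms can be sketched with a definition-use analysis. The toy below, assuming a single Python file and using only the standard ast module, keeps a target function plus the sibling definitions it actually names; a real system would need transitive, cross-file analysis:

```python
# Minimal-sufficient-context sketch: rather than sending a whole module,
# keep only the definitions a target function references. Single-file,
# non-transitive illustration using Python's ast module.

import ast

def relevant_defs(source: str, target: str) -> set:
    """Names of top-level definitions the target function depends on."""
    tree = ast.parse(source)
    defs = {n.name: n for n in tree.body
            if isinstance(n, (ast.FunctionDef, ast.ClassDef))}
    # Collect every bare name used anywhere inside the target's body.
    names = {node.id for node in ast.walk(defs[target])
             if isinstance(node, ast.Name)}
    # Keep the target plus any sibling definition it names.
    return {target} | (names & defs.keys())

src = """
def helper(x): return x + 1
def unused(): pass
def fix_bug(y): return helper(y) * 2
"""
keep = relevant_defs(src, "fix_bug")  # {'fix_bug', 'helper'}; 'unused' is pruned
```

This is exactly the structural sensitivity generic text compression lacks: dropping helper here would break the snippet's semantics, while dropping unused costs nothing.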
A new stack is emerging: say less, read less, remember less
What looks at first like a set of isolated optimizations is starting to resemble a real architectural layer.
One part tells models to say less through terse prompting and vendor-level verbosity controls. Another tells them to read less through shell-output compression, tool-schema reduction, and repository pruning. A third tells them to remember less — but better — through compaction, resets, and structured handoffs. Anthropic’s long-running agent guidance explicitly recommends context resets plus structured handoffs because models can lose coherence as windows fill, and some exhibit what Anthropic calls “context anxiety,” beginning to wrap up early as they approach a perceived context limit.
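The reset-plus-handoff pattern is simple to sketch. The version below is an illustration of the general idea rather than Anthropic's implementation: the token estimate is a crude characters-per-token heuristic, and the handoff field names (goal, done, next) are invented for this example:

```python
# Sketch of a context reset with a structured handoff: when the running
# transcript nears a budget, collapse it into a compact state object and
# seed a fresh window with that state alone. Field names are illustrative.

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic: ~4 characters per token

def maybe_handoff(transcript: list, budget: int, summarize) -> list:
    """Return the transcript untouched, or a one-entry handoff replacing it."""
    total = sum(estimate_tokens(t) for t in transcript)
    if total < budget:
        return transcript
    handoff = {
        "goal": summarize(transcript, "goal"),
        "done": summarize(transcript, "completed steps"),
        "next": summarize(transcript, "remaining work"),
    }
    return [f"HANDOFF: {handoff}"]  # fresh window, structured state only
```

The structure is the point: a free-form summary can drop the one constraint that mattered, whereas named slots force the compaction step to account for goal, progress, and remaining work explicitly.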
The tool layer is especially visible around MCP and large tool ecosystems. In Anthropic’s post on advanced tool use, the company describes a traditional setup in which 50-plus MCP tools are loaded up front, consuming roughly 72,000 tokens for tool definitions and about 77,000 tokens total before any substantive work begins. With an on-demand tool-search approach, Anthropic says initial tool overhead can fall to roughly 500 tokens, with total early context consumption around 8,700 tokens, preserving about 95% of the context window and cutting token usage by roughly 85%. Anthropic also reports internal evaluation improvements under that setup. These are company-reported numbers, but they are revealing: tool overhead is no longer peripheral. It is part of the model’s effective working environment.
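The shape of the on-demand approach is easy to sketch: the model's upfront context holds only one small search tool, and full schemas are pulled in when a search hits. The registry contents and keyword matching below are invented stand-ins for a real MCP server's definitions and semantic search:

```python
# Sketch of on-demand tool loading: instead of injecting every tool schema
# up front, the model sees a single "search_tools" entry and full schemas
# are fetched only on demand. Registry contents are invented illustrations.

TOOL_REGISTRY = {
    "jira_create_issue": {"description": "Create a Jira issue", "schema": {"...": "..."}},
    "jira_search":       {"description": "Search Jira issues",  "schema": {"...": "..."}},
    "git_diff":          {"description": "Show a git diff",     "schema": {"...": "..."}},
}

def search_tools(query: str) -> list:
    """Cheap keyword match standing in for semantic tool search."""
    q = query.lower()
    return [name for name, t in TOOL_REGISTRY.items()
            if q in name or q in t["description"].lower()]

def load_schema(name: str) -> dict:
    """Pull one full schema into context only after a search hit."""
    return TOOL_REGISTRY[name]["schema"]

hits = search_tools("jira")  # only these schemas ever enter the window
```

With three tools the savings are trivial; with the 50-plus tools Anthropic describes, deferring every schema the model never touches is where the reported drop from tens of thousands of tokens to a few hundred comes from.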
What is taking shape, then, is a genuine secondary stack around AI coding. Its job is not code generation itself, but linguistic load management: deciding how much schema to expose, how much history to preserve, and how much runtime output to pass forward.
Forgetting is becoming a feature, not a bug
That is where memory research starts to matter. The April 2026 paper Oblivion: Self-Adaptive Agentic Memory Control through Decay-Driven Activation argues that memory-augmented agents often rely on “always-on” retrieval and flat memory storage, which create interference and latency as histories grow. Its proposal is not simply to delete more, but to make memory accessibility decay over time and task relevance — in effect, to turn forgetting into a controlled feature rather than a system failure.
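The decay-driven idea can be sketched concretely. The mechanism below is inspired by the paper's framing rather than taken from it: activation falls off exponentially with time since last use, retrieval below a threshold fails (and evicts), and successful recall reinforces. All constants are arbitrary illustrations:

```python
# Decay-driven memory sketch, loosely inspired by the Oblivion framing:
# each entry's activation decays with time, reuse reinforces it, and
# retrieval ignores (and evicts) entries that have decayed below a
# threshold. Half-life, threshold, and boost values are arbitrary.

import math

class DecayMemory:
    def __init__(self, half_life: float = 10.0, threshold: float = 0.2):
        self.items = {}  # key -> (value, last_used_time, activation)
        self.rate = math.log(2) / half_life
        self.threshold = threshold

    def store(self, key, value, now: float):
        self.items[key] = (value, now, 1.0)

    def _activation(self, key, now: float) -> float:
        _, last, act = self.items[key]
        return act * math.exp(-self.rate * (now - last))

    def recall(self, key, now: float):
        if key not in self.items:
            return None
        a = self._activation(key, now)
        if a < self.threshold:       # effectively forgotten: evict
            del self.items[key]
            return None
        value = self.items[key][0]
        self.items[key] = (value, now, min(1.0, a + 0.5))  # reuse reinforces
        return value

m = DecayMemory()
m.store("bug_location", "parser.py:142", now=0.0)
m.recall("bug_location", now=5.0)    # recent: still accessible, reinforced
m.recall("bug_location", now=500.0)  # long idle: decayed away, evicted
```

The design choice worth noticing is that forgetting here is a retrieval-time policy, not a deletion job: nothing is garbage-collected until the system tries to recall it and finds the activation gone.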
This is an important shift in framing. The frontier problem is no longer just “how do we give agents memory?” It is increasingly “how do we engineer forgetting without destroying competence?” Anthropic’s harness guidance points in the same direction: long-horizon systems need resets, handoffs, and compaction because simply carrying everything forward produces degraded focus. Long-running AI coding may therefore depend less on perfect recall than on disciplined working memory.
Compression changes not just cost, but the surface of reasoning
Compression also has a linguistic consequence that is easy to miss. It changes not only cost, but the epistemic surface of a system’s language — how certainty, explanation, and judgment appear. OpenAI’s guidance explicitly treats compact, structured outputs as a token-efficiency tactic, and its latest models separate verbosity from reasoning effort. That makes sense operationally. But it also means answer length, rhetorical pacing, and visible justification are increasingly engineered variables.
A recent paper, Brevity Constraints Reverse Performance Hierarchies in Language Models, goes further, arguing that on a subset of tasks, larger models underperformed smaller ones because spontaneous verbosity introduced errors through over-elaboration, and that brevity constraints improved large-model accuracy in those cases. That result is still new and should be treated carefully. But it suggests that verbosity may sometimes be part of the error mechanism itself, not merely a stylistic excess. The engineering case for brevity may be strong. The interpretive consequences are not neutral.
In practice, long answers often preserve hedging, caveats, connective reasoning, and social scaffolding. Shorter answers tend to preserve conclusions and actions first. That can make compressed systems look more decisive even when their underlying uncertainty has not changed. Token optimization, then, is not just cost-saving. It is also a form of meaning management.
In agentic coding, infrastructure is part of capability
Any expert-level account of this trend also has to resist a simple “better model, better output” story. Anthropic’s Quantifying infrastructure noise in agentic coding evals says infrastructure configuration can swing coding benchmark scores by several percentage points, sometimes more than the leaderboard gap between top models. That means memory allocation, runtime limits, execution environment, and surrounding orchestration can materially affect what gets reported as model performance.
Better pruning, cleaner tool discovery, lower schema overhead, and tighter handoffs can all improve an agent’s measured coding effectiveness without changing the frontier model underneath. In agentic coding, context-control infrastructure is becoming part of the capability stack, not an implementation detail. Some of what the industry still treats as “model progress” may already be progress in attention routing: deciding what to expose, what to suppress, and what to stage for later.
The paradox of efficiency
This is where the trend becomes genuinely paradoxical. Generative AI was sold as a technology of abundance: more text, more code, more automation. Yet one of the clearest patterns in mature AI coding systems is the rise of tools and methods whose purpose is restraint. They suppress excess generation, collapse tool bloat, prune repository state, and constrain memory growth. The more capable these systems become at producing text, the more engineering effort goes into preventing them from seeing or producing unnecessary text in the first place.
There is a historical logic to that. Mature technologies often generate secondary disciplines for controlling their own overhead. Search produced SEO and ranking hygiene. Cloud produced cost governance. Coding agents may now be producing something like a context-governance layer — call it ContextOps, cautiously, as an interpretive term rather than an established standard. The official language from Anthropic and OpenAI does not use that label. But the phenomenon is visible enough across vendor docs, tooling, and research to justify the frame.
What efficiency may erase
There is also a human cost embedded in this efficiency. Anthropic’s Vibe physics: The AI grad student describes Harvard physicist Matthew Schwartz’s use of Claude in a graduate-level theoretical physics workflow involving more than 110 drafts, about 36 million tokens, and more than 40 hours of local CPU time. The productivity gain was obvious. But Schwartz also wrote that Claude could be “eager to please,” at times adjusting assumptions to make plots fit, and that expert human supervision remained essential for knowing what counted as a real result.
That is where the critique from ergosphere’s essay “The machines are fine. I’m worried about us.” lands hardest. The concern is not that the machine is useless. It is that institutions are often good at counting outputs, but poor at measuring the formation of expertise. The same systems that remove repetitive effort may also remove the procedural friction through which novices learn to debug, judge, and verify.
That tension may matter as much in software as in science. Organizations can count pull requests, ticket closures, generated tests, or time saved. It is much harder to count whether engineers actually became more capable of diagnosing failures independently. Compression layers may reduce token bills and improve throughput. They may also make it easier for institutions to optimize for output while underinvesting in the slower formation of technical judgment. The gains are real. They just are not free in every dimension.
The next frontier is selective attention
The emerging lesson from official documentation, infrastructure reporting, and recent research is that AI coding is no longer just a story about code generation. It is increasingly a story about selective attention: which files get surfaced, which tool schemas get loaded, which logs get collapsed, which memories get handed off, which explanations get shortened, and which histories get reset. Anthropic’s language of context engineering, OpenAI’s explicit controls over verbosity and reasoning effort, recent work on code-context compression, and new reporting on tool bloat and infrastructure noise all point in the same direction. The next frontier is not only what models can do. It is what the surrounding system decides they should not have to process.
The future of AI coding may belong not to the systems that generate the most text, but to those that know what never needed to be said — or seen — in the first place.