The Next AI Upgrade Is the Environment, Not the Model

There is a perception, increasingly common among engineering teams using AI coding agents, that the latest generation of models has become noticeably more expensive to operate. Sessions feel heavier. Context fills faster. The auto-compact warning arrives sooner. The natural inference is that the models are doing more reasoning, and reasoning is expensive.

The data does not support that inference. After instrumenting two independent AI coding agents — Claude Code and Codex CLI — across more than 7,500 turns of real engineering work, a different pattern emerged. The cost is not reasoning. The cost is the agentic loop replaying context the environment has failed to summarize, and the inability of most installations to expose that fact to the people paying for it.

The Measurement

The audit was conducted on a self-hosted operations stack used to manage three small businesses and a homelab. Two coding agents run in parallel: Claude Code (Anthropic, Opus 4.6 and 4.7) and Codex CLI (OpenAI, GPT-5). Both expose session telemetry — token counts, cache statistics, tool-use events, rate-limit windows — through local log files. A small ingestion pipeline was built to consolidate that telemetry into a single SQLite store, surfaced through an internal cost dashboard.

The dashboard is unremarkable in itself. What it revealed was not.

Across 4,184 Claude Code turns over fourteen days, cache-read tokens accounted for 95.9% of total input. Across 135 Codex sessions in the same window, cached input accounted for 92.7% of input. The pattern is not specific to one vendor or one model family. It is structural.

For non-practitioners, “cache-read tokens” require a brief explanation. Modern AI coding agents do not send fresh context with every turn. They cache the conversation history, the system prompt, and any persistent files (such as project documentation), and on each subsequent turn the model re-reads the cache rather than re-processing the raw input. Cache reads are billed at a fraction of the cost of fresh input — typically one-tenth — but they accumulate. A long agentic session with many tool calls will replay the cache hundreds of times.
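The billing arithmetic can be made concrete with a minimal sketch. The one-tenth cache discount below is an illustrative figure, not a quote of any vendor's actual price sheet:

```python
def input_cost_usd(fresh_tokens, cache_read_tokens, price_per_mtok,
                   cache_discount=0.1):
    """Blended input cost for one session, assuming cache reads are billed
    at a tenth of the fresh-input rate (illustrative; actual discounts vary
    by vendor and model)."""
    fresh_cost = fresh_tokens / 1e6 * price_per_mtok
    cached_cost = cache_read_tokens / 1e6 * price_per_mtok * cache_discount
    return fresh_cost + cached_cost

# At a 95.9% cache-read share -- say 41M fresh and 959M cached tokens at a
# hypothetical $10/Mtok -- the cached side (~$959) still exceeds the fresh
# side (~$410), even with the 10x discount.
```

This is the counterintuitive part: even billed at a tenth of the fresh rate, a cache share above roughly 90% means cache reads are the larger line item on the bill.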

This is not waste in the conventional sense. The cache is necessary. Without it, every turn would re-process the entire history at full cost, and agentic work would be economically impossible. The cache is doing its job. The problem is that 95% cache-read share is the signature of an environment that is asking the model to operate as an agent without giving it the infrastructure an agent requires.

What the Numbers Are Actually Saying

The 95% figure is not a measure of model inefficiency. It is a measure of how much of an agent’s work is spent re-reading what it already knows because the surrounding system has not given it a place to put that knowledge. When an agent must call a tool, receive its output, integrate that output into a plan, and continue — every step of that loop replays the prior context. Multiply by the depth of an agentic task, and the per-turn cost compounds even when the model is not doing additional reasoning.
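The compounding is easy to see in miniature. A toy model, assuming each turn appends a fixed number of tokens and replays everything before it from cache:

```python
def cumulative_cache_reads(turns, tokens_added_per_turn):
    """Total cache-read tokens across a session in which every turn
    re-reads all prior context. The total grows quadratically with turn
    count even though no individual turn reasons harder than the last."""
    context = 0
    total = 0
    for _ in range(turns):
        total += context                   # replay everything accumulated so far
        context += tokens_added_per_turn   # this turn's tool output, messages, etc.
    return total
```

Doubling the session length roughly quadruples the replay cost: at 1,000 tokens added per turn, a 50-turn session replays 1,225,000 cached tokens, while a 100-turn session replays 4,950,000.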

The same audit measured tool-use prevalence directly: 57.7% of Claude Code turns invoked at least one tool, while 12.5% involved explicit reasoning blocks. Tool use is more than four times as common as reasoning. The architectural reality of contemporary AI work is not “the model thinks harder.” It is “the model orchestrates a sequence of calls, and the loop is the cost.”

This finding is consistent across vendors. Codex CLI’s session logs expose a different schema, but the same shape: cached input dominates, cumulative tokens scale with turn count rather than with task complexity, and rate-limit windows fill long before the user has done anything that feels like heavy reasoning.

The Environment Hypothesis

If the cost is not the model, the natural question is where the cost actually lies. The answer, supported by the same instrumentation, is the absence of an environment that lets the model operate as the agent it has already become.

The current generation of frontier models is capable of sustained, multi-step reasoning across long horizons. Their architecture supports tool use, structured output, agentic decomposition, and persistent memory. But the environments most teams deploy them into are configured for the previous generation of use cases — short conversational exchanges with manual context management. The model is operating as an agent inside a chat interface. The chat interface assumes the user will summarize, re-enter, and curate context. The agent does none of that. The cache fills. The cost compounds.

The plateau in perceived AI capability is real, but it is not a plateau in the model. It is a plateau in the surrounding system. The model has continued to improve. The environment has not.

This reframes the optimization question. Teams concerned about rising AI costs typically respond by asking which model to use, what context to send, or whether to switch vendors. The data suggests these are second-order questions. The first-order question is whether the environment provides the agent with the infrastructure agentic work requires: durable memory outside the conversation, deliberate checkpointing instead of lossy compaction, on-demand context retrieval rather than always-loaded files, and visibility into what is actually consuming the budget.

What an Enhanced Environment Looks Like

The audit was not purely diagnostic. As the measurements accumulated, four interventions were implemented and evaluated against the same telemetry.

Intervention 1: Pre-Compaction Checkpointing

AI coding agents auto-compact at approximately 95% of the context window. Compaction is the model’s own attempt to summarize the session into a smaller form. It is lossy, expensive (the summarization itself costs tokens), and frequently produces a post-compact context that retains 30-40% of the pre-compact size, meaning the next turn pays the cache-read tax on a summary that is itself imprecise.

A pre-compact hook was installed that fires before compaction runs. It captures the relevant decisions, file paths, and outcomes from the session into an external durable store, not as a model-generated summary but as a deterministic record. The pre-compact event becomes a checkpoint the user can return to, independent of whatever the model decides is worth retaining. Three checkpoints were captured automatically over the first 24 hours of operation, each preserving context the compaction would have summarized into prose.
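A hook of this shape is only a few lines. The sketch below assumes the agent delivers session metadata to the hook as JSON; the field names (`session_id`, `transcript_path`) and the checkpoint directory are assumptions for illustration, not any agent's documented interface:

```python
import json
import time
from pathlib import Path

def write_checkpoint(event, out_dir):
    """Persist a deterministic record of the session before compaction runs,
    independent of whatever the model's summary chooses to keep.
    The event fields read here are hypothetical."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    path = out_dir / f"{event.get('session_id', 'unknown')}-{stamp}.json"
    path.write_text(json.dumps({
        "captured_at": stamp,
        "session_id": event.get("session_id"),
        # Pointer to the raw transcript on disk, not a lossy model summary.
        "transcript_path": event.get("transcript_path"),
    }, indent=2))
    return path
```

Wired as the agent's pre-compaction hook command, a small entry point would parse the event from stdin and call `write_checkpoint` with a durable directory. The point is that the record is deterministic: the same session state always produces the same checkpoint.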

Intervention 2: Memory Pruning

Most AI coding agent installations include a project-level memory file that loads on every session. These files accumulate over time as users add reminders, project notes, and historical context. The audit found this file had grown to 187 lines, contributing approximately 2,700 tokens of cache-creation cost on the first turn of every session and 2,700 tokens of cache-read cost on every subsequent turn.

The file was reduced to 73 lines containing only stable identity, infrastructure addresses, and operational invariants. Project-specific and session-specific content was relocated to discrete sub-files retrieved on demand rather than always loaded. The deterministic byte difference yielded a per-turn saving of approximately 1,740 tokens. Applied across the 4,184 turns logged in the prior fourteen days, that rate corresponds to 7.27 million tokens of avoidable spend.

The savings are not theoretical. They are the direct mathematical result of reducing the size of a file that loads on every turn. The session-shape variance still dominates per-turn averages at small sample sizes, but the file-level saving is deterministic and persistent.
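The arithmetic behind that claim is a one-liner. A sketch, taking token figures as inputs rather than hard-coding the audit's numbers:

```python
def memory_file_saving(tokens_before, tokens_after, turns):
    """Per-turn and cumulative cache-read saving from shrinking a file that
    loads on every turn. Deterministic: no model behavior is involved."""
    per_turn = tokens_before - tokens_after
    return per_turn, per_turn * turns
```

At roughly 1,740 tokens saved per turn, fourteen days of the audit's turn volume works out to about 7.3 million tokens, which is why a one-time pruning of an always-loaded file keeps paying off on every subsequent turn.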

Intervention 3: Cross-Session Visibility

The instrumentation that revealed the 95% cache-read share is itself the third intervention. Most teams using AI coding agents have no view into per-session, per-model, or per-project token consumption beyond what the vendor’s billing dashboard exposes — and vendor dashboards aggregate at a level that obscures the agentic loop signature entirely. Without local instrumentation, the diagnosis of “the cost is the loop, not the reasoning” is unavailable.

The local cost dashboard reads session JSONL files (which both Claude Code and Codex CLI write to disk in plaintext), parses them on a cron, and exposes per-day, per-model, per-project, and per-session breakdowns. The dashboard is a single Flask blueprint. It took an evening to build. It revealed a problem nobody had previously had the data to articulate.
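A minimal version of the parsing step might look like the following. The record fields (`timestamp`, `model`, `usage.input_tokens`, `usage.cache_read_input_tokens`) are assumptions standing in for whatever schema a given agent actually writes, which differs between agents and versions:

```python
import json
from collections import defaultdict

def per_day_breakdown(jsonl_lines):
    """Aggregate fresh vs. cached input tokens by (day, model) from session
    JSONL records. Malformed lines are skipped rather than aborting the parse."""
    totals = defaultdict(lambda: {"fresh": 0, "cached": 0})
    for line in jsonl_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue
        usage = rec.get("usage") or {}
        # "YYYY-MM-DD" prefix of an ISO-8601 timestamp buckets by day.
        key = (rec.get("timestamp", "")[:10], rec.get("model", "unknown"))
        totals[key]["fresh"] += usage.get("input_tokens", 0)
        totals[key]["cached"] += usage.get("cache_read_input_tokens", 0)
    return dict(totals)
```

A cron job feeding files into a function of this shape, plus a thin view layer over the aggregates, is the entire dashboard. The value is not in the code; it is in finally having the per-session view the vendor dashboard does not expose.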

Intervention 4: Cross-Platform Confirmation

The final intervention was simply running the same instrumentation against a second AI coding agent. The Codex CLI ingester was added in parallel with the Claude Code ingester. The result confirmed the architectural point: 92.7% cached input share on Codex versus 95.9% on Claude Code is the same story told twice. The pattern is not a Claude-specific quirk or a Codex-specific quirk. It is the shape of agentic AI work in environments that have not been upgraded to host it.

The Strategic Implication

The implication for organizations using AI in operational settings is not that AI costs are lower than expected. They are roughly what is reported. The implication is that the lever for reducing them is not where most teams look.

Switching models will not change the cache-read share materially, because the share is a function of the agentic loop and not the model. Reducing prompt size will not change it, because the cache is doing its job — fresh input is already a small fraction. Asking the model to “be more efficient” is not actionable, because the efficiency is already high; the loop is the structural cost.

The lever is the environment. Durable memory outside the conversation. Checkpointing before compaction. On-demand context retrieval instead of always-loaded files. Visibility into what is actually consuming the budget. These are infrastructure decisions, not model decisions.

This is consistent with a broader pattern in technology adoption that organizations seeking to deploy AI thoughtfully will recognize. The first generation of any new capability tends to plateau not because the capability has stopped improving but because the surrounding system was designed for the prior generation. The plateau breaks when the surrounding system is upgraded to match.

For AI specifically, that upgrade is not glamorous. It does not involve switching to a more capable model. It involves treating the AI agent as a service with operational requirements — memory, observability, checkpointing, durable state — and building the modest infrastructure required to provide them. The cost savings are real, measurable, and largely deterministic. More importantly, the agent becomes capable of work it could not do before, because the work was previously bounded by the environment rather than by the model.

What Organizations Should Measure

The first concrete step for any organization using AI coding agents at scale is to establish local visibility into per-session token consumption. Vendor billing dashboards aggregate at the wrong level. The signature of an environment problem — high cache-read share, high tool-use prevalence, frequent auto-compaction — is invisible at the billing layer and unmistakable at the session layer. The data is already in the local log files; surfacing it requires only a modest amount of effort.
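Once turn records are parsed, the three signals of that signature reduce to a few ratios. A sketch, where the record fields are assumptions about what a local parser produces:

```python
def environment_signature(turns):
    """Compute the three signals of an environment problem: cache-read share,
    tool-use prevalence, and compaction frequency. Each turn record is assumed
    to carry fresh/cached token counts and two booleans."""
    cached = sum(t["cached"] for t in turns)
    fresh = sum(t["fresh"] for t in turns)
    n = len(turns)
    return {
        "cache_read_share": cached / (cached + fresh),
        "tool_use_rate": sum(t["used_tool"] for t in turns) / n,
        "compaction_rate": sum(t["compacted"] for t in turns) / n,
    }
```

A cache-read share above 90% combined with tool use on most turns is the loop signature described above; reviewing these three numbers weekly is enough to tell whether an environmental intervention is working.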

Once that visibility exists, the priorities tend to order themselves. Files that load on every turn become candidates for pruning. Auto-compaction events become candidates for pre-compaction checkpointing. Long-running sessions become candidates for explicit handoff to fresh sessions with curated context. Each of these is a small, specific intervention with a measurable outcome.

The organizations that adopt this practice will discover, as the audit discovered, that their AI costs are dominated by environmental decisions they have not yet made. The model is not the problem. The model has been ready for some time. The environment is what is still catching up.

Exploring similar questions?

WBA works with organizations navigating operational complexity. If this analysis resonates with challenges you're facing, let's start a conversation.
