AI Anxiety to Agency

Talk to engineers about AI coding tools today and the honeymoon phase of “10x productivity” is over. Skepticism replaced it, mixed with quiet dread.

A sentiment keeps surfacing: “I spent two hours reviewing the AI’s pull request, and I realized I could have written it myself in 45 minutes.” A controlled study by METR tracked sixteen experienced open-source developers on their own mature repositories and found they took 19% longer to complete tasks with AI tools compared to working without them, while believing they were 24% faster. Writing code puts you in a flow state. Reviewing hundreds of lines of code you don’t trust is a different kind of exhausting. You’re babysitting, not building.

On my team, we hit this wall too. We use an internal CLI agent powered by Claude, and early on the experience was demoralizing. Some days it felt like a brilliant pair programmer. Other days it felt like handing a task to a contractor who’s never seen your codebase but won’t admit it.

I sat down with engineers on my team to figure out what was going wrong. The frustrations were specific:

“It hallucinated an API that doesn’t exist in our codebase. I spent twenty minutes debugging a build failure before I realized the function was made up.”

“I asked it to fix a bug in one module and it started referencing error messages from a debugging session I’d had an hour earlier on a different feature.” We started calling this contaminated context: the AI treating stale conversation history as if it were still relevant.

“I spent more time fighting to get it to find the right files than it would have taken me to write the fix myself.”

These all point to the same root cause. AI isn’t stupid. It has a memory problem.

It’s a Memory Problem

The instinct is to blame the model. But in most cases we investigated, the culprit was what the AI knew, or thought it knew, at the moment it started writing code.

Throw everything at it and the output degrades. Research into the “lost-in-the-middle” phenomenon shows that LLMs exhibit a U-shaped attention curve: they attend to information at the beginning and end of their context window, while ignoring the middle. Dump 50,000 tokens of context and your critical architectural constraint on line 847 might as well not exist. Give it too little and it writes generic boilerplate that doesn’t fit your project. Correct syntax, useless output.

We called this the context paradox: too much and the AI drowns, too little and it guesses. Expecting engineers to hit the sweet spot on each prompt is unrealistic.

What Changed

The fix wasn’t “write better prompts.” We tried that. We ran workshops on prompt structure, shared templates, and wrote internal guides. None of it stuck because the problem wasn’t user discipline. It was an engineering problem, and it needed an engineering solution.

We stopped treating context as a single dump. Instead of loading everything the agent might need into one massive prompt, we built a layered system where information gets pulled in only when it’s relevant. Three changes made the biggest difference.

Layered context. At the base layer, the agent knows the non-negotiable stuff: repository conventions, security constraints, the team’s coding standards. We keep this slim. If the AI doesn’t need it on a given interaction, it doesn’t belong here. Lines in this layer compete for the AI’s finite attention, so we treat them like premium real estate.

Above that, we built “skills,” specialized knowledge documents that the agent only sees the title and a one-line description of until it needs them. The selection is semantic: the agent matches the intent of the engineer’s message against skill descriptions and pulls in the relevant guide. If an engineer asks for help triaging a CI failure, it pulls in the CI triage skill. If they’re doing a code review, it pulls in the review standards. The rest stays out of the context window. This was a direct response to the “mixing up memories” problem. The AI was confusing unrelated context because we were feeding it everything at once.

For searching large documentation or unfamiliar code, the agent uses a retrieval layer, embedding-based semantic search over chunked documents, to pull only the specific snippets it needs rather than ingesting entire files. The AI reads a few hundred tokens of targeted context instead of tens of thousands of tokens of everything.

Task isolation. The second change addressed contaminated context. Instead of one long-running conversation where context from a morning debugging session bleeds into an afternoon code review, the agent spins up isolated sub-conversations, separate API calls with independent context windows, for distinct tasks. Need a PR reviewed? That review happens in a clean room: the diff and the review criteria, with no leftover tokens from whatever you were working on before. The parent conversation gets back a summary, not the full exploration. Cross-contamination dropped to near zero.

Automatic environment injection. The third change was the simplest and the most impactful. We added automatic hooks, small scripts that run before each interaction, that inject environmental state. Each time an engineer sends a message, the agent already knows which Git branch they’re on, which files are modified, what the working directory looks like, and the current timestamp. This eliminates a constant source of friction: the AI editing the wrong file, or not realizing you’d switched branches. Engineers stopped narrating their own environment to the tool.

Beyond What the AI Reads

Even with better context, AI-generated code is harder to review than a colleague’s because you can’t trust the author’s intent. We added a constraint: before writing any code, the agent outputs a short execution plan:

Read the existing validator class to understand the interface
Add a new validation method for the date range constraint
Register the validator in the Spring config
Update the existing test suite with two new cases

Reviewing a plan takes thirty seconds. Catching a wrong assumption in a plan saves you from reviewing five hundred lines of code built on that wrong assumption.

We also drew a hard line: the AI cannot declare a task finished until it has proven the code works. If the agent writes a test, the agent runs the test. If the test fails, the agent fixes it. The engineer reviews results, not compiler errors.

What’s Still Hard

The context layers are real work. Our steering files went through a team review that generated close to ninety comments, catching inaccuracies, removing stale information, trimming content that was burning the AI’s attention on things that didn’t matter. Two engineers led this effort over two months, and maintenance continues as the codebase evolves. Keeping the AI’s knowledge accurate is an ongoing burden, similar to documentation, except stale context produces wrong code instead of confused readers.

Context management also doesn’t fix the moments where the AI is wrong about how a system works. No amount of prompt architecture replaces domain judgment. The plan-then-execute pattern catches some bad assumptions early, but it’s not foolproof. Sometimes the plan looks right and the implementation is wrong in ways only a domain expert catches.

The numbers moved. Before we invested in context management and ran learning sessions, half to sixty percent of engineers were using AI, and most of that was casual: chatting with Gemini, running default out-of-the-box agents. After the context architecture was in place, adoption climbed above ninety percent. The nature of usage changed too. Engineers started creating custom agents tailored to their feature areas, building advanced workflows with task isolation and domain-specific steering. They went from passive consumers to active authors in a single month.

There are still days where an engineer will say “I could have done this faster myself.” Sometimes that’s true. The goal was never to make AI faster on every task. It was to make the average interaction reliable enough that reaching for the tool feels natural.

Where This Applies

None of this is specific to a particular agent or model. Layered context, task isolation, environment injection, plan-before-execute: these apply whether you’re using Copilot, Cursor, or a homegrown LLM wrapper. Treat the AI’s context window as a finite engineering resource, not a dumping ground.

The anxiety around AI in software engineering is real. Most of it comes from poor tooling. A tool that hallucinates feels like an adversary. A tool that mixes up context feels like it’s not listening. Those are engineering problems with engineering solutions.

The best days with AI don’t feel like supervision. They feel like a good day of focus where the boring parts got handled.

Resources

METR, “We are Changing our Developer Productivity Experiment Design” — The controlled study finding experienced developers were 19% slower with AI tools despite believing they were faster.

Liu et al., “Lost in the Middle: How Language Models Use Long Contexts” (2023) — The research demonstrating U-shaped attention degradation in LLM context windows.