Uber Burned a Year's AI Coding Budget in Four Months. The Fix Is Boring.
Uber handed its engineers agentic coding tools and burned through its entire 2026 budget for them in about four months. So they capped every employee at $1,500 a month per tool, Claude Code and Cursor each. Two tools at that cap run to about $36,000 a year, something like 11% of a median Uber engineer’s total comp going to a pair of terminals.
They’d been running an internal leaderboard ranking teams by how much AI tooling they used, so the spend was no accident. The cap isn’t the part that stuck with me, though. Asked whether it was worth it, Andrew Macdonald, Uber’s president and COO, said the link between all that spend and anything a rider or driver would feel “is not there yet.” The person running the company couldn’t say what a year’s budget bought.
I build AI harnesses for fun — the routing layer that decides which model handles which task. I didn’t read this as an AI story. I read it as a team that aimed its most expensive model at every job, ran it wide open, and got the predictable bill. This is plumbing, well understood, and most teams just never build it.
Where all the tokens go
A normal API call is one request in, one response out. An agent isn’t. It reads a file, makes a plan, calls a tool, second-guesses the result, calls another, checks again. The one task a human kicked off fans out into ten or twenty model calls and burns somewhere between 5 and 30 times the tokens of a plain completion. That multiplier is most of where the money goes. You don’t feel it on your laptop; you feel it when a few thousand engineers do it all day and the autocomplete you’d filed under “rounding error” costs like a junior contractor.
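To make the multiplier concrete, here is a back-of-envelope sketch. Every input (headcount, tasks per day, tokens per completion, the blended price) is a made-up assumption for illustration, not a figure from Uber; only the 5x to 30x range comes from the paragraph above.

```python
def monthly_agent_cost_usd(
    engineers: int,
    tasks_per_day: int,
    tokens_per_completion: int,
    fan_out_multiplier: int,
    usd_per_million_tokens: float,
    workdays: int = 21,
) -> float:
    """Rough monthly spend when each task burns `fan_out_multiplier` times
    the tokens of a single plain completion."""
    tokens = (
        engineers
        * tasks_per_day
        * workdays
        * tokens_per_completion
        * fan_out_multiplier
    )
    return tokens * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    # Hypothetical fleet: 3,000 engineers, 40 tasks a day, 2,000 tokens per
    # plain completion, $10 per million tokens blended.
    for multiplier in (1, 5, 30):
        cost = monthly_agent_cost_usd(3000, 40, 2000, multiplier, 10.0)
        print(f"{multiplier:>2}x -> ${cost:,.0f} per month")
```

Under those assumptions the same workload costs about $50k a month as plain completions, $252k at 5x, and $1.5M at 30x. The workload did not change, only the fan-out.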
And the default is always the top model: it’s what gets demoed, what the docs reach for, what works on the first try. Nobody gets paged for running Opus where Haiku would have worked. They get paged when the cheap model ships a bug. So everything defaults to the expensive option.
Most of your spend buys nothing
Across enterprises, 60 to 80% of LLM spend comes from 20 to 30% of use cases, and those high-volume cases are the boring ones: classify a ticket, summarize a diff, pull three fields out of an email. A small model does that work just as well as a big one and charges a tenth as much.
Menlo Ventures put enterprise LLM API spend at $8.4 billion by the middle of last year, doubled in six months and projected to hit around $15 billion in 2026. Industry estimates for how much current inference spend is removable, through routing and caching and smaller models, run from 50 to 90% — even the low end halves the bill.
And this isn’t hypothetical. Factory, one of the agentic-coding vendors, says automatic model selection alone cuts its customers’ costs by about 25%.
The boring levers, in rough order of payoff
None of this is clever, which is exactly why it doesn’t get done. Uber’s bill ran away because nobody puts “wrote a routing table” on a promo packet.
Use the effort dial, but know what it’s for. The current top-tier Claude models take an effort setting, low through max. Haiku doesn’t, so on the cheap tier you control cost by picking the model, not turning a dial. Effort governs how much the model thinks before it answers, so turn it down on easy mechanical steps. Do not turn it down on real coding or agentic work. There you trade a little token savings for worse reasoning and more retries, and the retries cost more than the effort ever saved. Dropping effort on simple tasks and routing hard tasks to a cheaper model are two different levers, and people confuse them.
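One way to keep the two levers apart is to write them down as separate columns in a routing table. The sketch below is illustrative only: the task names, tier labels, and effort values are hypothetical, and it makes no provider call at all. It just records the decision a harness would look up before building a request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    tier: str               # hypothetical labels: "small" or "top"
    effort: Optional[str]   # None where the tier has no effort setting


# Lever 1 is the tier column, lever 2 is the effort column.
ROUTES = {
    "classify_ticket": Route("small", None),    # commodity work: pick the model
    "summarize_diff": Route("small", None),
    "rename_symbol": Route("top", "low"),       # easy mechanical step: turn effort down
    "implement_feature": Route("top", "high"),  # real coding: leave effort up
    "agentic_refactor": Route("top", "high"),
}

# Assumption: unknown work gets the safe, expensive route until someone
# classifies it. A stricter shop might default lower and watch the failures.
DEFAULT_ROUTE = Route("top", "high")


def route(task_kind: str) -> Route:
    """Look up the tier and effort for a task kind."""
    return ROUTES.get(task_kind, DEFAULT_ROUTE)
```

Keeping effort as `None` on the small tier mirrors the point above: there is no dial to turn there, so the only cost decision is which model gets the job.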
Caching is free money, if your prompt prefix actually holds still. Your system prompt and tool definitions should be byte-identical on every call. Put them first, set a cache breakpoint, and reads of that prefix inside the cache window (five minutes by default) drop to roughly a tenth of the price. Two catches nobody mentions. The first call pays a write premium (1.25x, or 2x for the one-hour TTL), so caching only pays if the prefix gets reused before the window closes, and it does nothing for spiky, low-volume traffic. And one moving token in the cached region, a timestamp or a request ID, silently busts the whole thing while your dashboard says nothing is wrong. I lost a week of caching once to a date string somebody helpfully wired into a system prompt. Check that your logs actually show cache reads.
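The break-even arithmetic is short enough to write out. This sketch uses the multipliers quoted above (1.25x or 2x to write, roughly 0.1x to read) and assumes, as a simplification, that one call writes the cache and every other call in the window reads it. The fingerprint helper is a hypothetical logging aid, not part of any SDK.

```python
import hashlib

WRITE_PREMIUM = {"5m": 1.25, "1h": 2.0}  # multipliers quoted in the text
READ_MULTIPLIER = 0.10                   # "roughly a tenth of the price"


def cached_cost_ratio(calls_in_window: int, ttl: str = "5m") -> float:
    """Prefix cost with caching divided by prefix cost without it.

    Below 1.0 caching is saving money; above 1.0 it is costing money.
    """
    if calls_in_window < 1:
        raise ValueError("need at least one call")
    cached = WRITE_PREMIUM[ttl] + READ_MULTIPLIER * (calls_in_window - 1)
    return cached / calls_in_window


def prefix_fingerprint(system_prompt: str, tool_definitions: str) -> str:
    """Hash of the region meant to be cached. Log it per request: if the
    value changes between calls, something in the prefix is moving."""
    joined = system_prompt + "\x00" + tool_definitions
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

A single call in the window costs 1.25x, two calls cost about 0.68x each, and ten cost about 0.22x each. With the one-hour TTL, two calls still lose slightly (1.05x), which is the spiky-traffic catch in numbers.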
The cascade: cheap model writes, expensive model grades. I lean on this one harder than anything else here. Don’t escalate by re-running the whole task on a bigger model. Generate cheap, then pass the candidate to an expensive model to judge. The judge reads the candidate the generator wrote, but pays input price for it, and emits only a short verdict at output price. Re-running the task on the bigger model would instead pay its output rate, roughly 5x its input rate, for a whole new answer. Two things keep it honest: cap escalation at one retry, and calibrate the judge against a labeled set before you trust it. A judge that shares the generator’s blind spot waves bad output straight through; a miscalibrated one escalates everything and quietly doubles your bill. When the escalation rate creeps up, that’s your signal a task’s default tier is set too low.
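A minimal sketch of the control flow, with the model calls passed in as plain functions so nothing here depends on a particular SDK. It reads "one retry" as a single escalation to a stronger generator after the judge rejects the cheap candidate; that reading, and every name below, is an assumption.

```python
from typing import Callable, Iterable, Tuple

Generate = Callable[[str], str]    # task -> candidate answer
Judge = Callable[[str, str], bool]  # (task, candidate) -> accept?


def cascade(
    task: str,
    cheap_generate: Generate,
    strong_generate: Generate,
    judge: Judge,
) -> Tuple[str, bool]:
    """Return (answer, escalated). Escalation happens at most once."""
    candidate = cheap_generate(task)
    if judge(task, candidate):
        return candidate, False
    # Single capped retry: the stronger answer is returned without another
    # judging round, so one task can never loop.
    return strong_generate(task), True


def escalation_rate(escalated_flags: Iterable[bool]) -> float:
    """Share of tasks that needed the retry. A rising value suggests the
    task's default tier is set too low."""
    flags = list(escalated_flags)
    return sum(flags) / len(flags) if flags else 0.0


if __name__ == "__main__":
    # Stub models, for illustration only.
    cheap = lambda task: "" if "hard" in task else f"cheap answer to {task}"
    strong = lambda task: f"strong answer to {task}"
    accept_non_empty = lambda task, candidate: bool(candidate)

    results = [cascade(t, cheap, strong, accept_non_empty)
               for t in ("easy one", "hard one")]
    print(results, escalation_rate(flag for _, flag in results))
```

The judge here is a stub. In practice it is the piece to calibrate against a labeled set before its verdicts are allowed to gate anything.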
Batch anything that isn’t waiting on a human. Nightly repo scans, backlog enrichment, eval runs. None of it needs an answer in under a second, and batch inference is a flat 50% off the same tokens. No reason to pay real-time rates for something that runs at 2am.
Put a hard governor on it, and enforce it at the access layer. A dashboard only tells you the budget blew after it already blew. Wire spend into the routing decision instead: past 80% of the month’s allotment, drop non-critical work down a tier automatically and log it loudly. Then cap which models a role can even call. On a cloud provider that’s a model-invocation deny policy (InvokeModel on Bedrock, the predict permissions on Vertex, an RBAC role on Foundry); on the first-party Anthropic API it’s workspace and key-scoped allowlists. One caveat so this doesn’t blow up in prod: a hard deny is a hard failure. Pair the ceiling with a downgrade-and-retry so a wrongly-escalated call drops to an allowed tier instead of erroring out, and scope your service roles tightly, because the deny only binds the principal making the call. Uber’s $1,500 cap is this same lever, slammed on by hand after the damage rather than built into the routing.
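Here is roughly what wiring spend into the routing decision could look like. The tier names, the shape of the allowlist, and the function itself are hypothetical; the 80% threshold is the one from the paragraph above. Real enforcement still belongs in the provider's access policy, and this only decides what the harness asks for.

```python
import logging
from typing import FrozenSet

log = logging.getLogger("governor")

TIERS = ("small", "mid", "top")  # cheapest first; hypothetical tier names


def govern(
    requested: str,
    spent_usd: float,
    budget_usd: float,
    critical: bool,
    allowed: FrozenSet[str],
    soft_limit: float = 0.80,
) -> str:
    """Pick the tier a call may actually use.

    Past the soft limit, non-critical work drops one tier. A tier the role
    may not call is downgraded rather than failed, so a wrongly escalated
    request degrades instead of erroring out.
    """
    idx = TIERS.index(requested)
    if not critical and spent_usd >= soft_limit * budget_usd and idx > 0:
        idx -= 1
        log.warning("budget at %.0f%%: downgrading non-critical call from %s",
                    100 * spent_usd / budget_usd, requested)
    while idx >= 0 and TIERS[idx] not in allowed:
        idx -= 1
    if idx < 0:
        raise PermissionError(f"no allowed tier at or below {requested!r}")
    if TIERS[idx] != requested:
        log.warning("routing %s request to %s", requested, TIERS[idx])
    return TIERS[idx]
```

For example, a non-critical "top" request at 85% of budget comes back as "mid", while the same request marked critical stays on "top". A role whose allowlist lacks "top" gets "mid" at any spend level.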
Fable 5 just made the cheap tier matter more
Anthropic shipped Fable 5 on June 9th, its most capable public model. It costs roughly double Opus per token, and in high-risk domains it refuses to answer at all; the recommended pattern there is to fall back to something like Opus 4.8.
A pricier ceiling doesn’t raise your bill on its own. It widens the gap between your cheapest workable model and your priciest, so routing lazily costs more this month than last.
That gap is the argument for a local tier. Open-weight models run maybe three to six months behind the frontier, which is fine for the commodity majority of your traffic. Self-host those and, north of 10 to 30 million tokens a day, you can save 40 to 60% against API pricing. Microsoft is making the loud version of this case: its AI chief called Anthropic “extremely expensive” and said the plan is to build their own models and kill the line item.
Here’s what the “just run it locally” crowd leaves out, though. The engineer is the cheap part. A senior inference engineer to babysit that stack runs $250k to $360k a year, but the real bill is GPUs running 24/7, where you pay for idle silicon and an API customer doesn’t. That 40-to-60% saving assumes you keep the hardware busy; below steady utilization the math inverts fast. And it only pays at scale: if your API bill is $8k a month, standing up a $25k-a-month in-house stack costs roughly three times as much.
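The utilization point is easy to check with a toy model. It assumes the self-hosted stack is a pure fixed cost and the API bill scales linearly with traffic, which is a simplification. The 50% starting point is the midpoint of the range quoted above, and the $8k and $25k figures are the ones from the paragraph.

```python
def self_host_saving(saving_at_full_utilization: float, utilization: float) -> float:
    """Fractional saving versus the API when the hardware is only
    `utilization` busy (0 < utilization <= 1).

    Negative values mean self-hosting costs more than the API would.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    # Stack cost as a share of the API bill at full load. It stays fixed
    # while the avoided API bill shrinks with traffic.
    fixed_cost_share = 1.0 - saving_at_full_utilization
    return 1.0 - fixed_cost_share / utilization


def stack_to_api_ratio(stack_usd_per_month: float, api_usd_per_month: float) -> float:
    """How many times the current API bill the in-house stack would cost."""
    return stack_usd_per_month / api_usd_per_month


if __name__ == "__main__":
    for u in (1.0, 0.5, 0.25):
        print(f"utilization {u:.0%}: saving {self_host_saving(0.5, u):+.0%}")
    print(f"$25k stack vs $8k API bill: {stack_to_api_ratio(25_000, 8_000):.2f}x")
```

Under this model a 50% saving at full load is gone at half utilization and becomes a bill twice the API's at a quarter. The $25k stack against an $8k API bill works out to about 3.1x.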
So run a hybrid. Small or local models for the high-volume floor, frontier API for the genuinely hard 10%, the expensive ceiling locked down for the handful of jobs that earn it.
What Uber got wrong
Uber didn’t really overspend. Overspending means you knew the price and paid it on purpose. Uber didn’t know what the spend was buying. That’s a measurement failure, and a spending cap doesn’t fix it. The cap stops the bleeding and tells you nothing about whether the spend was doing anything.
Every lever above is a couple of weeks of work for a competent platform team: a routing table, an effort policy, a cache breakpoint, the cascade, a governor wired to whatever your access layer happens to be. But the teams that bother run the same agents Uber runs for a fraction of the money, and when someone asks what the AI spend bought, they can answer. The agents work. Whether you can afford to run them is a routing problem most teams never sit down and solve.
Related
- Building Sidekick — the AI harness project these routing and cost levers come out of
- AI Anxiety to Agency — the companion piece, the same shift from the human side