Managing Enterprise IT Development in the Era of Token Scarcity

Question (2026-06-11): "How do we think of managing IT development work for enterprise IT in the era of token scarcity? Guardrails, incentives and model choices etc."

The inversion: code became free, tokens became the budget line

The starting fact is an inversion the vault has tracked across five sources. Code Is Free — production, refactoring, and deletion of code are no longer scarce (Harness Engineering (Ryan Lopopolo, AI Engineer)). What is scarce now: token budgets, GPU capacity, and human attention. Software development just became a variable-cost activity, and the meter is the token.

The pathology that follows is Token Maxing. Praveen Akkiraju reports enterprises blowing through annual AI budgets in ~90 days: "as many tokens as you provide will get consumed as quickly as possible." The supply side is pushing in the same direction: Mythos-Class Models launched at $10/$50 per million tokens (~2× Opus), there's no observed ceiling on the thinking-tokens → quality curve, and a single Claude Code workflow session can go from ~1,500 tokens to ~1.5M in seconds. If quality never caps, spend discipline can't come from the model; it has to come from the operating model.
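To make the scale of that session jump concrete, here is a minimal arithmetic sketch using the prices quoted above. Reading "$10/$50" as input/output price per million tokens is an assumption, as is bounding the cost by billing every token at one rate or the other; the function name is hypothetical.

```python
# Prices from the text. Treating "$10/$50" as input/output USD per million
# tokens is an assumption, not something the source spells out.
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def session_cost_bounds(total_tokens: int) -> tuple[float, float]:
    """Return (low, high) USD cost for a session.

    low  = every token billed at the input rate
    high = every token billed at the output rate
    The real cost sits somewhere in between, depending on the mix.
    """
    millions = total_tokens / 1_000_000
    return millions * INPUT_USD_PER_MTOK, millions * OUTPUT_USD_PER_MTOK


small = session_cost_bounds(1_500)       # the ~1,500-token starting point
large = session_cost_bounds(1_500_000)   # the ~1.5M-token end point
```

Under those assumptions the same session moves from a few cents at most to somewhere between $15 and $75, a 1,000× swing that no per-request approval step would ever see.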

So "token scarcity" isn't really a shortage. Tokens have become the unit in which IT development work is priced, and most IT operating models have no muscle for managing that spend. Three muscles to build: guardrails, incentives, model choices.

1. Guardrails — scope capabilities, phase autonomy, encode policy as input

The guardrail conversation is not "write a usage policy." Five vault patterns, in priority order:

Capabilities, not instructions. "Instructions ≠ capabilities" (Capabilities vs Instructions (Agent Keys)) — telling an agent "never send email" is soft; not giving it the send-email key is hard. Scope what dev agents can touch (repos, environments, outbound actions) at the harness level, not the prompt level.
Autonomy is earned in phases, not granted. The Bike Method and GNP's Blitzy playbook agree: start with high-effort, low-risk bounded work — language upgrades, doc gen, test gen, vuln remediation — where autonomy hit 80–100% in production (Autonomous Software Development with Blitzy (CXOTalk)). Bounded vs Unbounded Tasks is the triage rubric: bounded → automate aggressively; unbounded → keep the Human in the Loop dial high and revisit per model generation ("risk is too high for now", never a flat no).
Encode governance as input, not review. GNP feeds corporate technical guidelines, security guidelines, and test requirements into the prompt; Praveen's version is policy .md files. Post-hoc review of agent output doesn't scale to agent volume — pre-loaded constraints do.
Govern the fleet, not the prompt. The AWARE Framework (identity, context, guardrails, risk scoring, observability) is the technical-controls reference once you have hundreds of dev agents — sophisticated enterprises already run 1,000+ in production. Watch the Zombie AI Agent failure mode: Cvent created ~6,000 agents and only ~1,300 stayed actively used; the inactive ones still hold keys.
Encourage Shadow AI inside the fence. Three sources converge: banning loses the demand signal and the visibility, while the risk stays real (1 in 5 orgs report a breach caused by shadow AI). Pre-defined non-negotiables + sandboxes beat 200-page policies.
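The first pattern above, capabilities rather than instructions, can be sketched as a harness that simply never hands the agent a tool it was not granted. This is a minimal illustration, not any particular agent framework's API: the `Harness` class, the `CapabilityError` exception, and the tool names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


class CapabilityError(PermissionError):
    """Raised when an agent requests a tool it was never granted."""


@dataclass
class Harness:
    # Every tool the platform knows about (hypothetical names).
    registry: dict[str, Callable[..., str]]
    # Tools actually handed to this agent; anything absent is unreachable,
    # no matter what the prompt says.
    granted: frozenset[str] = field(default_factory=frozenset)

    def call(self, tool: str, *args: str) -> str:
        if tool not in self.granted:
            raise CapabilityError(f"agent holds no key for {tool!r}")
        return self.registry[tool](*args)


registry = {
    "read_repo": lambda path: f"contents of {path}",
    "send_email": lambda to: f"sent to {to}",
}

# The dev agent can read repos but was never given the send-email key.
dev_agent = Harness(registry, granted=frozenset({"read_repo"}))
```

Calling `dev_agent.call("send_email", ...)` raises `CapabilityError` regardless of how the agent was prompted, which is the difference between a hard and a soft guardrail. The same `granted` set is also the natural thing to revoke when an agent goes inactive.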

2. Incentives — measure feedback-per-token, never tokens

This is where most orgs will get it wrong first. Two of the four root causes of token maxing are incentive failures: teams measuring each other by token consumption, and the absence of anything equivalent to the friction that historically slowed spending. If your dashboards celebrate tokens-per-engineer, you have built a token-maxing machine.

The corrected scoreboard, from the Effective Feedback Compute research (Scaling Laws for Agent Harnesses via Effective Feedback Compute): the pathology isn't high spend, it's low feedback-per-token. Matched-budget experiments show that holding raw cost fixed and improving feedback quality raises success — spend isn't the lever; conversion efficiency is. Ryan Lopopolo spends >$1,000/day in output tokens and calls it investment, because his harness converts tokens into accepted PRs faster than human time would. Same spend, opposite verdict — the difference is the conversion rate.
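The "same spend, opposite verdict" point can be illustrated with a simple conversion-rate proxy. This is a sketch only: it is not the Effective Feedback Compute metric from the paper, the `WeeklyUsage` type and function name are hypothetical, and the team figures are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeeklyUsage:
    team: str
    tokens: int        # total tokens consumed in the week
    accepted_prs: int  # outcome signal: PRs accepted after review


def accepted_per_million_tokens(u: WeeklyUsage) -> float:
    """Crude proxy for conversion efficiency (not the paper's EFC metric)."""
    if u.tokens == 0:
        return 0.0
    return u.accepted_prs / (u.tokens / 1_000_000)


# Invented numbers: identical token spend, very different conversion.
team_a = WeeklyUsage("platform", tokens=200_000_000, accepted_prs=120)
team_b = WeeklyUsage("payments", tokens=200_000_000, accepted_prs=15)

rate_a = accepted_per_million_tokens(team_a)
rate_b = accepted_per_million_tokens(team_b)
```

A tokens-per-engineer dashboard would score the two teams identically; a conversion-rate view separates them, which is the reframing the scoreboard needs.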

Practically, for an enterprise IT scoreboard:

Tie spend to outcome metrics, not activity metrics — PRs accepted, tickets deflected, reconciliations completed, cycle time (Token Maxing CIO playbook). "Value = opportunity − cost. Stop optimizing for cost line items" (CIO Agenda 2026 (CXOTalk) — where 88% of companies use AI and <6% get measurable value, largely because they measure the wrong things).
Decompose unit costs — token + tool + platform cost per invocation, weighed against what the invocation produced.
Institutionalize the conversion ritual. Lopopolo's Garbage Collection Day: every Friday, take the week's agent slop, ask why it happened, and encode the fix as a lint, test, doc, or reviewer-agent prompt. That's the incentive loop that raises feedback-per-token over time — harness investment compounds across model releases.
Reward judgment, not throughput. The bottleneck moved to defining work, prioritizing it, and accepting output. Incentives should land on the people who write good specs and good evals (Binary Eval Assertions), because those are what let agents run cheaply unattended.
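The unit-cost decomposition in the list above (token + tool + platform cost per invocation, weighed against what the invocation produced) might look like the following. It is a sketch under stated assumptions: the `Invocation` fields, the boolean outcome flag, and the sample figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Invocation:
    token_usd: float
    tool_usd: float
    platform_usd: float
    produced_outcome: bool  # e.g. PR accepted, ticket deflected

    @property
    def unit_cost(self) -> float:
        """Fully loaded cost of one invocation."""
        return self.token_usd + self.tool_usd + self.platform_usd


def cost_per_outcome(invocations: list[Invocation]) -> Optional[float]:
    """Total spend divided by the number of invocations that produced an outcome.

    Returns None when nothing was produced, so a zero-outcome week cannot
    masquerade as a cheap one.
    """
    outcomes = sum(1 for i in invocations if i.produced_outcome)
    if outcomes == 0:
        return None
    return sum(i.unit_cost for i in invocations) / outcomes


# Invented sample: three invocations, two of which produced accepted work.
week = [
    Invocation(1.0, 0.25, 0.25, True),
    Invocation(0.5, 0.25, 0.25, False),
    Invocation(2.0, 0.5, 0.5, True),
]
```

Note that the failed invocation still counts toward spend: the cost of waste is carried by the outcomes that did land, which is what makes the number useful for the weekly conversion ritual.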

3. Model choices — routing is the cost lever, frontier is the exception

The model-choice principle is one line: match each task to the cheapest sufficient model (Model Routing). With a ~50× output-price spread between Haiku and Fable-class models, routing — not negotiation with the vendor — is where the money is.

Default down-tier, escalate by exception. Reserve the frontier model for your genuinely hardest problems; mid-tier models nail the vast majority of dev tasks (Matthew Berman's prescription at the Fable 5 launch).
Use Effort levels as routing within a model. Start at the low setting: Fable on low is roughly on par with Opus 4.8 on X-high, and frontier defaults are usually overkill. Diminishing returns at the top are measured, not hypothetical.
Let the buy-side pricing models work for you. Fractional-FTE pricing is emerging for bounded agent work (Agentic AI in the Enterprise (Praveen Akkiraju, CXOTalk)) — bounded tasks price cleanly; unbounded ones resist unit pricing, which is itself a signal of where autonomy is real.
Re-run the routing table every model generation. Each generation is materially more efficient; yesterday's frontier task is tomorrow's mid-tier task. This is a quarterly decision, not a one-time architecture choice.
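The "cheapest sufficient model, escalate by exception" rule above can be sketched as a small routing table. Everything specific here is an assumption for illustration: the tier names, the prices (chosen only to reflect a roughly 50× spread), and the 1 to 3 capability score, which in practice would come from your own evals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelTier:
    name: str
    output_usd_per_mtok: float  # illustrative prices, not vendor quotes
    capability: int             # hypothetical 1-3 score from your own evals


# The routing table. Re-derive these rows each model generation.
TIERS: tuple[ModelTier, ...] = (
    ModelTier("small", 1.0, 1),
    ModelTier("mid", 10.0, 2),
    ModelTier("frontier", 50.0, 3),
)


def route(required_capability: int,
          tiers: tuple[ModelTier, ...] = TIERS) -> ModelTier:
    """Return the cheapest tier whose capability meets the requirement."""
    sufficient = [t for t in tiers if t.capability >= required_capability]
    if not sufficient:
        raise ValueError("no tier is sufficient; treat as human-led work")
    return min(sufficient, key=lambda t: t.output_usd_per_mtok)


def escalate(current: ModelTier,
             tiers: tuple[ModelTier, ...] = TIERS) -> Optional[ModelTier]:
    """Return the next pricier tier, or None if already at the top."""
    pricier = [t for t in tiers
               if t.output_usd_per_mtok > current.output_usd_per_mtok]
    return min(pricier, key=lambda t: t.output_usd_per_mtok) if pricier else None
```

A task is routed with `route(required_capability)` and only moves up via `escalate(...)` after a failed attempt, so frontier usage stays an exception that shows up in logs rather than a default. Keeping the table as data rather than code is what makes the quarterly re-tiering cheap.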

The one-paragraph answer

Treat tokens the way manufacturing treats energy: a real, managed input — not free, not feared. Guardrails: scope capabilities at the harness level, phase autonomy bounded-first, encode policy as prompt input, govern the agent fleet with AWARE-style controls, and keep shadow experimentation inside a fence rather than banning it. Incentives: never measure token consumption as progress; measure feedback-per-token via outcome metrics, and institutionalize a weekly ritual that converts observed waste into durable harness rules. Model choices: route every task to the cheapest sufficient model, start at low effort, reserve frontier for the hardest 5%, and re-tier quarterly. The teams that win the token-scarcity era won't be the ones that spend least — they'll be the ones whose harnesses convert the most tokens into accepted, verified work.

Open tension to watch

Whether harness/routing discipline stays load-bearing is the vault's standing Harness (LLM Agents) contradiction: Praveen says the harness is decisive, Boris Cherny says it shrinks as models improve, Lopopolo says it compounds. Effective Feedback Compute makes this empirically testable: if the EFC gap between good and bad harnesses closes over the next 1–2 model generations, the routing/guardrail overhead described here gets lighter. Revisit this page then.

2026-06-13 — independent corroboration (Whittemore / AI Daily Brief)

A second, independently-sourced voice now reaches the same enterprise conclusion. The New Dumbest Chart in AI (AI Daily Brief) frames the moment as Token Scarcity proper — "every AI company is now in the token-efficiency business" — and lands on the same operating prescriptions this page derived:

Spend caps per seat (e.g. Uber's $1,500/mo cap — the same datapoint, now reinforced as a pattern rather than an outlier).
Mixed-basket model routing — match each task to the cheapest sufficient model, exactly the §3 lever.
ROI scrutiny as agentic demand outpaces a constrained token supply — the §2 "measure feedback-per-token, never tokens" discipline, arrived at independently.

This matters for confidence: the page's earlier evidence skewed toward vendor/investor-aligned conference talks. Whittemore is a distinct named source converging on the same answer, which is why triangulation was raised 4 → 5 on re-judging (2026-06-13). It is corroboration by convergent expert commentary, not yet by independent before/after FinOps data — so the evidence and groundedness scores hold.
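The per-seat spend cap from the list above is simple to enforce at the harness level. A minimal sketch, assuming the harness can estimate a request's cost before sending it; the `SeatBudget` class and its method are hypothetical, and only the $1,500 figure comes from the source.

```python
MONTHLY_CAP_USD = 1_500.0  # the per-seat figure cited in the text


class SeatBudget:
    """Tracks one seat's spend for the current month against a hard cap."""

    def __init__(self, cap_usd: float = MONTHLY_CAP_USD) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def try_charge(self, estimated_usd: float) -> bool:
        """Record the charge and return True if it fits under the cap.

        Returns False, and records nothing, if the charge would exceed it.
        """
        if self.spent_usd + estimated_usd > self.cap_usd:
            return False
        self.spent_usd += estimated_usd
        return True


seat = SeatBudget()
first = seat.try_charge(1_400.0)   # fits under the cap
second = seat.try_charge(200.0)    # would exceed the cap, so it is refused
```

A cap like this limits the damage from token maxing but says nothing about conversion, so it complements the feedback-per-token scoreboard rather than replacing it.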

Brand fodder

LinkedIn/Medium angle for a senior IT leader: "Your AI budget didn't run out. Your operating model leaked." Lead with the 90-day budget-burn stat, pivot to feedback-per-token as the real metric (the EFC reframe is genuinely novel for this audience), close with the three-muscle model: guardrails / incentives / routing. On-thesis: positions the author as someone managing the economics, not just the demos.

Cross-links

Concepts · Token Maxing · Token Scarcity · Code Is Free · Effective Feedback Compute · Model Routing · Effort levels · Bounded vs Unbounded Tasks · Human in the Loop · AWARE Framework · Shadow AI · Bike Method · Capabilities vs Instructions (Agent Keys) · Binary Eval Assertions · Zombie AI Agent · Harness (LLM Agents)
People · Praveen Akkiraju · Ryan Lopopolo · Boris Cherny · Enrique Ibarra · Matthew Berman
Sources · Agentic AI in the Enterprise (Praveen Akkiraju, CXOTalk) · Harness Engineering (Ryan Lopopolo, AI Engineer) · Autonomous Software Development with Blitzy (CXOTalk) · CIO Agenda 2026 (CXOTalk) · Scaling Laws for Agent Harnesses via Effective Feedback Compute · Mythos-Class Models · The New Dumbest Chart in AI (AI Daily Brief)