Token Scarcity
The rationing regime that follows the token-subsidy era. As usage shifts from assisted (a human in the loop, bounded turns) to agentic (continuous loops, fan-out sub-agents), AI demand outpaces a physically constrained token supply — GPUs, power, and data-center capacity can't scale as fast as agent-driven consumption. The subsidy era ends, prices rise, and every AI company is "now in the token-efficiency business."
The thesis
- Demand outruns supply. Agentic use multiplies token consumption per task; supply is gated by physical infrastructure. The gap closes by rationing, not by more subsidy.
- The subsidy era ends. Labs have been selling tokens below cost to win adoption. As demand hardens, that stops — prices rise toward true cost.
- Firms ration. The enterprise response is a managed-input posture: mixed-basket model routing (cheapest sufficient model per task), token budgets, and per-seat caps.
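The managed-input posture can be sketched in a few lines. This is a minimal illustration, not a description of any vendor's router: the `Model` basket, its prices, the integer `capability` scale, and the `TokenBudget` class are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    price_per_mtok: float  # USD per 1M output tokens (hypothetical figure)
    capability: int        # higher handles harder tasks (hypothetical scale)


# Hypothetical mixed basket; the names and prices are placeholders.
BASKET = [
    Model("small", 0.50, 1),
    Model("medium", 5.00, 2),
    Model("large", 50.00, 3),
]


def route(required_capability: int, basket=BASKET) -> Model:
    """Return the cheapest model whose capability meets the requirement."""
    sufficient = [m for m in basket if m.capability >= required_capability]
    if not sufficient:
        raise ValueError("no model in the basket is sufficient for this task")
    return min(sufficient, key=lambda m: m.price_per_mtok)


class TokenBudget:
    """Tracks one seat's spend in USD against a fixed cap."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, model: Model, tokens: int) -> float:
        """Record the cost of a call, refusing it if the cap would be exceeded."""
        cost = model.price_per_mtok * tokens / 1_000_000
        if self.spent_usd + cost > self.cap_usd:
            raise RuntimeError("budget cap exceeded")
        self.spent_usd += cost
        return cost
```

A caller would pick a model with `route(2)` and then pass it to `budget.charge(model, tokens)` before dispatching the request. The design choice worth noting is that routing and budgeting are separate checks: routing minimizes the price per task, while the budget bounds total spend regardless of which model was chosen.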
Distinct from Token Maxing
Token Maxing is the prior behavior — the abundance/subsidy-era pathology of burning tokens because they were cheap and consumption was the scoreboard. Token scarcity is the regime that follows: once tokens are priced at cost and supply is constrained, the maximalist behavior is no longer affordable and rationing discipline becomes mandatory. Maxing is the disease of cheap tokens; scarcity is the economics that forces the cure.
2026-06-20 — "Token Reckoning" Economist coverage
Companies Are Scrambling to Curtail Soaring AI Costs (Economist) is the Business-section confirmation that the regime is now operating in the wild:
- Ramp spend data: AI spending up 13× year-on-year. Token-heavy applications (reasoning models, agents that build agents) are the growth driver.
- Uber spent its annual AI budget in four months. One unnamed firm spent $500m on tokens in a single month.
- The distribution is bimodal — top-1% spenders at ~$7,450/month per employee, median client at $11. The Advantage Gap expressed financially.
- Three corporate response patterns now visible: (1) Meta + Amazon killed their token-usage leaderboards; (2) Routing down-tier: Sonnet costs ~1/20 of Opus and Kimi ~1/20 of Sonnet, so the routing decision spans roughly 400×, between two and three orders of magnitude; (3) Per-seat / per-task caps (Uber's $1,500/month per coding tool is the canonical case).
- Outcome-based pricing emerging: Intercom charges customers only for queries actually resolved by its IT-support agent — the SaaS pricing model that survives the agentic era.
- The lab-subsidy era ends with the IPOs. Sam Altman called mounting customer costs "a huge issue." OpenAI's strategy for winning customers from Anthropic reportedly involves drastic price cuts — but once both labs IPO later in 2026, prices have to rise toward true cost.
- The geography axis: AI bills are "low compared with hiring a developer in San Francisco, but high compared with employing one in Delhi" — token-cost is now a variable in the on/offshore equation. See Indian IT and AI.
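The arithmetic behind several of these figures is easy to check. The sketch below uses only numbers quoted in the bullets above; the helper names `burn_multiple` and `within_cap` are hypothetical, and the ~1/20 ratios are treated as exact for illustration.

```python
import math
from fractions import Fraction

# Price ratios quoted in the text (approximate in the source).
SONNET_VS_OPUS = Fraction(1, 20)
KIMI_VS_SONNET = Fraction(1, 20)

# Compounded price ratio when routing from the top tier to the bottom tier.
kimi_vs_opus = SONNET_VS_OPUS * KIMI_VS_SONNET
orders_of_magnitude = math.log10(1 / kimi_vs_opus)


def burn_multiple(months_to_exhaust: int, budget_months: int = 12) -> Fraction:
    """How many times faster than planned a budget is being consumed."""
    return Fraction(budget_months, months_to_exhaust)


def within_cap(spend_usd: float, cap_usd: float = 1500.0) -> bool:
    """Check one month's spend on a coding tool against a per-seat cap."""
    return spend_usd <= cap_usd


print(kimi_vs_opus)                   # 1/400
print(round(orders_of_magnitude, 1))  # 2.6
print(burn_multiple(4))               # 3
print(round(7450 / 11))               # 677
```

Read together: two ~1/20 steps compound to about 1/400; an annual budget gone in four months is a 3× overrun on the planned run rate; and the top-1% spender outspends the median client by a factor of roughly 677 per employee.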
2026-06-27 — Rationing is now operational at the labs (Economist)
America's Data-Centre Backlash Puts the AI Boom at Risk (Economist) gives the top-of-funnel evidence that inference capacity is genuinely constrained:
- "Anthropic has throttled model usage."
- "OpenAI has scrapped its compute-intensive video tool."
- "Microsoft has repriced its coding assistant so steeply that some programmers are returning to the lost art of writing software themselves."
The physical numbers behind it: roughly 12 GW of US AI compute today; roughly 10 GW dedicated to inference across the majors; and demand growing faster than the ~30 GW new-build queue through 2028, with training alone potentially absorbing 5–16 GW per frontier model.
2026-06-27 — Total-cost-not-per-token is the buried lede on Chinese "cheap AI"
China Is Having Another AI Moment (Economist) adds the second-order refinement that stops the "we'll just switch to DeepSeek/Zhipu" reflex:
- DeepSeek v4: $0.87 per 1M output tokens; Anthropic Fable 5: $50 per 1M output tokens. ~57× per-token gap.
- Du Zheng (Georgia Tech) et al., June 2026: DeepSeek used 23× more tokens than an OpenAI rival to achieve basically the same result on the same tasks.
- Correct comparison metric = total cost of tokens used, not price per token.
- On a software-engineering benchmark, GLM 5.2 ended up costing more than systems from Anthropic and OpenAI.
The Token Scarcity regime therefore extends: even the "escape hatch to open-source Chinese models" has token-efficiency limits. Whichever model you route to, tokens are the variable cost that matters, not seat licenses or per-token headline prices.
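The total-cost comparison reduces to price per token times tokens consumed. The sketch below uses the two prices and the 23× multiplier quoted above, with two loud assumptions: the 1M-token baseline is hypothetical, and the 23× figure was measured against an OpenAI rival whose price the source does not give, so pairing it with the Anthropic price is illustrative only.

```python
def total_cost_usd(price_per_mtok: float, tokens: float) -> float:
    """Total spend for a task: price per 1M tokens times tokens consumed."""
    return price_per_mtok * tokens / 1_000_000


def break_even_price(cheap_price_per_mtok: float, token_multiplier: float) -> float:
    """Rival price per 1M tokens above which the 'cheap' model still wins."""
    return cheap_price_per_mtok * token_multiplier


DEEPSEEK_PRICE = 0.87        # USD per 1M output tokens, from the text
FABLE_PRICE = 50.0           # USD per 1M output tokens, from the text
TOKEN_MULTIPLIER = 23        # from the text (measured against an OpenAI rival)
BASELINE_TOKENS = 1_000_000  # hypothetical task size for the pricier model

cheap_total = total_cost_usd(DEEPSEEK_PRICE, BASELINE_TOKENS * TOKEN_MULTIPLIER)
pricey_total = total_cost_usd(FABLE_PRICE, BASELINE_TOKENS)

print(round(FABLE_PRICE / DEEPSEEK_PRICE, 1))  # 57.5  headline per-token gap
print(round(cheap_total, 2))                   # 20.01 total cost at 23x the tokens
print(round(pricey_total / cheap_total, 1))    # 2.5   gap left on total cost
print(round(break_even_price(DEEPSEEK_PRICE, TOKEN_MULTIPLIER), 2))  # 20.01
```

Under these assumptions the ~57× headline gap shrinks to about 2.5× on total cost, and any rival priced below roughly $20 per 1M output tokens would come out cheaper for the same task.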
Cross-references
- Token Maxing — the maximalist behavior this regime supersedes
- Managing Enterprise IT Development in the Era of Token Scarcity — the enterprise-IT operating playbook for this regime
- Agentic Loop — the usage shift (assisted → agentic) that drives demand past supply
- Indian IT and AI — the on/offshore axis Token Scarcity compresses
- SaaSpocalypse — outcome-based pricing as the survivable SaaS landing spot
Sources
- The New Dumbest Chart in AI (AI Daily Brief) — "every AI company is now in the token-efficiency business"; the assisted → agentic demand-vs-supply framing
- Companies Are Scrambling to Curtail Soaring AI Costs (Economist) — 2026-06-20; the operating data (Ramp 13×, Uber $1,500/seat cap, $7,450 top-1%, Sonnet 1/20 routing factor, Intercom outcome-pricing)
- America's Data-Centre Backlash Puts the AI Boom at Risk (Economist) — 2026-06-27; the top-of-funnel evidence (Anthropic throttling, OpenAI killing video tool, Microsoft repricing Copilot)
- China Is Having Another AI Moment (Economist) — 2026-06-27; the 23× token overuse rebuttal to "we'll just switch to Chinese models"