Updated Sat Jun 13 2026 08:00:00 GMT+0800 (Philippine Standard Time)

Token Maxing

Tags: token-maxing, token-scarcity, cost, enterprise-ai, governance, scaling-laws, edge-ai, advantage-gap

Praveen Akkiraju's term for the enterprise pathology where teams burn through annual AI budgets in ~3 months.

"As many tokens as you provide will get consumed as quickly as possible."

Cited via a Goldman Sachs study (uncorroborated in the source — would be worth tracking down before quoting).
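
To make the "annual budget gone in ~3 months" claim concrete, here is a minimal arithmetic sketch. The function names and the dollar figures are hypothetical and exist only for illustration; the only input taken from the note is the ~3 month exhaustion time.

```python
def runway_months(annual_budget_usd, monthly_burn_usd):
    """Months until the annual budget is exhausted at a constant burn rate."""
    if monthly_burn_usd <= 0:
        raise ValueError("monthly_burn_usd must be positive")
    return annual_budget_usd / monthly_burn_usd


def overspend_factor(months_to_exhaust):
    """How many times faster than plan a 12-month budget is being spent."""
    if months_to_exhaust <= 0:
        raise ValueError("months_to_exhaust must be positive")
    return 12.0 / months_to_exhaust


# Hypothetical budget: $1.2M for the year, burning $400k per month.
print(runway_months(1_200_000, 400_000))  # 3.0 months of runway
print(overspend_factor(3))                # 4.0x the planned burn rate
```

Exhausting a yearly budget in about three months means running at roughly four times the planned rate.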

Why it happens

  • Software is now variable cost. Every prompt, tool call, reasoning step, and sub-agent burns tokens.
  • Teams measure each other by token consumption (a recurring meme on X/Twitter), which incentivizes more consumption, not better use.
  • Multi-agent architectures multiply the cost — each agent's tool calls compound.
  • No equivalent of historical cost containment (CPU/storage budgets had natural friction; tokens don't).

What CIOs should do (per Praveen)

  • Prioritize: not every team should build every agent. Old-school project prioritization still applies.
  • Decompose costs: tokens + tools + platform. Each agent invocation has stack-cost, not just model-cost.
  • Tie to ROI metrics: calls deflected, AP reconciliations completed, time saved — concrete business measures, not vanity tokens-per-engineer.
  • Lean on model efficiency: each model generation is materially more efficient (Opus 4.5 → 4.7; GPT 5.5).
  • Design for memory and context reuse: the LLM Wiki Pattern is itself a token-management strategy — compile knowledge once instead of re-deriving on every query.
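
The "decompose costs" and "tie to ROI" items above can be sketched as a small per-invocation cost model. This is an illustrative sketch, not Praveen's method: the `Invocation` fields, the function names, and every price are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Invocation:
    """One agent invocation. All fields are illustrative assumptions."""
    input_tokens: int
    output_tokens: int
    tool_calls: int


def stack_cost(inv, usd_per_m_input, usd_per_m_output,
               usd_per_tool_call, platform_usd):
    """Split one invocation's cost into tokens, tools, and platform."""
    tokens = (inv.input_tokens * usd_per_m_input
              + inv.output_tokens * usd_per_m_output) / 1_000_000
    tools = inv.tool_calls * usd_per_tool_call
    parts = {"tokens": tokens, "tools": tools, "platform": platform_usd}
    parts["total"] = tokens + tools + platform_usd
    return parts


def cost_per_outcome(total_cost_usd, outcomes):
    """Tie spend to a business measure, e.g. calls deflected."""
    if outcomes <= 0:
        return float("inf")  # spend with no outcomes has unbounded unit cost
    return total_cost_usd / outcomes


# Hypothetical prices: $3 / $15 per million input / output tokens,
# $0.50 per tool call, $1.00 of platform overhead per invocation.
inv = Invocation(input_tokens=1_000_000, output_tokens=0, tool_calls=2)
print(stack_cost(inv, 3.0, 15.0, 0.5, 1.0))
```

Reporting the three components separately shows whether the model, the tools, or the platform dominates, and dividing the total by a business outcome gives a number a CIO can compare across teams.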

Cross-references

  • CLI vs API vs MCP / CLI vs MCP (IBM Technology) — picking the right tool interface is itself a token-maxing mitigation. CLI returns ~200 tokens of summary; an API can return 100K of JSON. Multiplied across many calls, this is real money.
  • Context Engineering — precision retrieval directly reduces tokens by sending less noise to the model.
  • Edge inference (Nvidia RTX Spark): the infrastructure mitigation. Nvidia's AI PC Push (Economist) argues agentic AI will explode token volume and that offloading inference from the cloud to local devices is cheaper. This is the physical answer to token-maxing: move the work off the cloud meter rather than just spending less on it.
  • Boris Cherny runs hundreds of agents in parallel and dozens of nightly loops — would be interesting to see his token-cost discipline; not addressed in his interview.
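
The CLI-versus-API point above is easy to check with rough arithmetic. In this sketch the ~200 and ~100K token figures come from the note; the call volume and the per-token price are hypothetical assumptions chosen only to show the scale.

```python
def payload_cost_usd(tokens_per_call, calls, usd_per_million_input_tokens):
    """Cost of feeding tool output back to the model as input tokens."""
    return tokens_per_call * calls * usd_per_million_input_tokens / 1_000_000


CLI_TOKENS = 200       # ~200-token summary (figure from the note)
API_TOKENS = 100_000   # ~100K tokens of JSON (figure from the note)

CALLS = 1_000          # hypothetical number of tool calls per day
PRICE = 3.0            # hypothetical USD per million input tokens

cli = payload_cost_usd(CLI_TOKENS, CALLS, PRICE)
api = payload_cost_usd(API_TOKENS, CALLS, PRICE)
print(f"CLI: ${cli:.2f}/day, API: ${api:.2f}/day, ratio: {api / cli:.0f}x")
```

The absolute dollar figures depend entirely on the assumed price and volume, but the 500x ratio between the two interfaces does not.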

The "token billionaire" counterpoint

Ryan Lopopolo proudly identifies as a "token billionaire" — >$1,000/day in personal output token spend (Harness Engineering (Ryan Lopopolo, AI Engineer)). He frames this as investment, not waste:

"I am a token billionaire and I believe that in order for us to get into our AGI future, we want everybody to be token billionaires to use the models to do the full job."

His tokens split roughly into thirds: planning/ticket curation, implementation, and CI work (reviewer agents). The implicit thesis: tokens are worthwhile when the harness converts them into accepted PRs faster than human time would. Praveen's pathology framing applies when there is no harness and no ROI metrics, that is, when tokens are burned for theatre.

The reconciliation: token consumption is a problem if uncoupled from outcome metrics; it's an investment if your harness reliably converts tokens → merged code → product value. Praveen's CIO advice (decompose costs, tie to ROI metrics) is the missing discipline that makes Lopopolo's lifestyle sustainable at organizational scale.

The EFC reframing

Scaling Laws for Agent Harnesses via Effective Feedback Compute (arXiv 2605.29682, 2026) argues the deeper problem is that tokens were always the wrong coordinate. Raw expenditure (tokens, tool calls, ops, wall time, cost) doesn't distinguish useful feedback from redundant or unstable interaction. The paper's empirical finding: in matched-budget interventions, better feedback quality raises success even while raw cost and tool calls are held fixed.

Rewriting token-maxing through that lens:

  • The pathology isn't high token spend, it's low EFC-per-token (Effective Feedback Compute) — burning budget on redundant, uninformative, or un-retained feedback.
  • Lopopolo's token billionaire status is fine if his harness keeps the EFC-per-token rate high (and his rituals — lint enforcement, reviewer agents, garbage-collection day — are exactly the kind of EFC-raising interventions the paper would predict).
  • Praveen's enterprise pathology is the case where teams spend without measuring EFC at all — the equivalent of measuring engineering productivity in keystrokes.

The corrected CIO advice composes with Praveen's original list: prioritize, decompose costs, tie to ROI metrics, plus measure feedback-per-token, not tokens alone. ROI metrics (deflected calls, completed reconciliations) are the proxy; EFC is the underlying mechanism that explains why some token spend converts and other token spend doesn't.
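
One way to start measuring "feedback-per-token" is a crude proxy like the sketch below. This is not the paper's EFC definition, which this note does not reproduce. The `Step` fields and the rule that a step counts as useful only when it is both informative and retained are assumptions, loosely based on the "redundant, uninformative, or un-retained" wording above.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One harness interaction. Fields are hypothetical labels, not the paper's."""
    tokens: int
    informative: bool  # did the feedback tell the agent something new?
    retained: bool     # was it kept and used in later steps?


def useful_feedback_per_million_tokens(steps):
    """Crude proxy: useful feedback events per million tokens spent."""
    total_tokens = sum(s.tokens for s in steps)
    if total_tokens == 0:
        return 0.0
    useful = sum(1 for s in steps if s.informative and s.retained)
    return useful * 1_000_000 / total_tokens


run = [
    Step(tokens=500_000, informative=True, retained=True),   # counts
    Step(tokens=500_000, informative=True, retained=False),  # discarded
]
print(useful_feedback_per_million_tokens(run))  # 1.0
```

Two runs with identical token spend can score very differently on a measure like this, which is the distinction raw token counts hide.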

This also rewrites the Harness (LLM Agents) importance contradiction. If newer models keep the EFC-vs-raw-compute gap open (Lopopolo's investment-compounds prediction), token-maxing-as-pathology stays a permanent enterprise discipline. If the gap closes (Boris's prediction), the next generation of models will tolerate sloppy harnessing and the discipline relaxes.

The "no token ceiling" claim (Fable 5 launch)

MYTHOS MYTHOS MYTHOS (Matthew Berman) sharpens the pathology from the supply side. Citing OpenAI's Noam Brown, Berman relays that there's no observed limit in the thinking-tokens → quality relationship — keep throwing tokens and output keeps improving (even past ~100M tokens). Fable 5 is "incredibly willing and eager to use all those tokens," and Claude Code Workflows can take a session from ~1,500 tokens to ~1.5M in seconds by fanning out to dozens of sub-agents.

This is the bull case for token-maxing as investment, but it makes Model Routing and EFC discipline more urgent, not less: if quality never caps out, only routing + feedback-per-token keep spend coupled to value. Berman's own prescription is explicit: route down-tier for everything that isn't your hardest problem.
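
The "route down-tier" prescription can be sketched as a threshold router. This is a minimal illustration, not Berman's implementation: the tier names, the difficulty score, and both thresholds are hypothetical, and producing a reliable difficulty score is the hard part that this sketch leaves out.

```python
def route(difficulty, hardest_threshold=0.9, mid_threshold=0.5):
    """Pick a model tier from a difficulty score in [0, 1].

    Only tasks at or above hardest_threshold reach the top tier;
    everything else is routed down-tier.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    if difficulty >= hardest_threshold:
        return "frontier"  # hypothetical name for the most expensive tier
    if difficulty >= mid_threshold:
        return "mid"
    return "small"


for score in (0.95, 0.6, 0.1):
    print(score, "->", route(score))
```

Setting the top threshold high makes the expensive tier the exception rather than the default, which is the behavior the prescription asks for.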

User-side evidence from the Fable 5 launch (Fable 5 Raises the Bar for AI Ambition)

  • Theo: "Current pace has me out of fable usage in about an hour. Do I make a second account or do I pay API prices?" Power-user plans collapsing under Fable token-hunger.
  • Chubby: literally hit end-of-max-plan in hours. "When you're having too much fun with Fable 5."
  • Counter-evidence — Fabio Jonathan: "Fable is cheaper than Opus in practice. Costs more per token, but one-shots way more often, so I'm not burning time." Tyler Willis and Alex Vulkoff broadly agree.
  • Stripe: months → days on a 50M-line Ruby codebase migration — the token spend that converts at high EFC.
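
The "costs more per token, but one-shots way more often" claim above is an expected-cost argument. A minimal sketch, assuming each attempt succeeds independently with a fixed probability (so the expected number of attempts is 1 / success rate); all prices and success rates are hypothetical:

```python
def expected_cost_per_success(cost_per_attempt_usd, success_rate):
    """Expected spend to get one accepted result, retrying until success."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt_usd / success_rate


# Hypothetical: the pricier model costs 2x per attempt but one-shots far more often.
pricey = expected_cost_per_success(cost_per_attempt_usd=2.00, success_rate=0.8)
cheap = expected_cost_per_success(cost_per_attempt_usd=1.00, success_rate=0.3)
print(f"pricey: ${pricey:.2f}, cheap: ${cheap:.2f}")
```

Under these assumed numbers the model with the higher per-attempt price is cheaper per accepted result, and that is before counting the human time spent on retries.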

Whittemore's framing closes the loop: in the usage-pricing era, individuals will need to become token-efficiency optimizers themselves and develop the muscle to know which class of model fits which use case. Model Routing is no longer only a CIO concern; it's an individual literacy.

Enterprise-side moves (Uber + Glasswing partner subsidies)

  • Uber capped per-employee token spend at $1,500/month (The Next Wave of Enterprise AI). First clean per-seat-rationing example. Whittemore promised a follow-up episode on what does/doesn't work.
  • Project Glasswing partners running through millions of dollars of tokens fast — Anthropic still subsidizing per The Information; firms aligning their budgets for GA. The token-cost reality at the Mythos tier puts the lower-tier worry in perspective.
  • 30-day data retention for Mythos-class (Mythos-Class Models) is a second-order cost — it makes Mythos unusable for NDA-bound enterprises, forcing premium-but-mid-tier usage on non-Mythos models for sensitive work.
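
A per-employee monthly cap like the one described above can be sketched as a simple budget guard. This is a hypothetical design, not a description of how Uber enforces its cap: the class, its method names, and the reject-on-overrun policy are assumptions; only the $1,500/month figure comes from the note.

```python
from collections import defaultdict


class MonthlyTokenBudget:
    """Track per-employee spend against a monthly cap (hypothetical design)."""

    def __init__(self, cap_usd=1500.0):
        self.cap_usd = cap_usd
        self.spent = defaultdict(float)  # keyed by (employee, "YYYY-MM")

    def try_charge(self, employee, month, cost_usd):
        """Record the charge and return True if it fits under the cap."""
        key = (employee, month)
        if self.spent[key] + cost_usd > self.cap_usd:
            return False  # reject: the request would exceed the monthly cap
        self.spent[key] += cost_usd
        return True

    def remaining(self, employee, month):
        """Budget left for this employee in this month."""
        return self.cap_usd - self.spent[(employee, month)]


budget = MonthlyTokenBudget()
print(budget.try_charge("alice", "2026-06", 1000.0))  # True
print(budget.try_charge("alice", "2026-06", 600.0))   # False, would exceed cap
print(budget.remaining("alice", "2026-06"))           # 500.0
```

A hard reject is the bluntest policy; a real system might instead downgrade the request to a cheaper tier or ask for approval.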

Seat → usage shift is now structural

Whittemore data point (The Way We Use AI is Changing): Anthropic at $3M run rate last year → $47B run rate now. Seat-based pricing gave way because seat math couldn't capture the spread across casual users (~7 turns/day), Pro users (~11× free), and lab-vanguard users running continuous loops. The shift is the financial expression of the Advantage Gap. Token-maxing discipline is now what enables fair billing, not just cost containment.

2026-06-13 — The pivot to token scarcity

The center of gravity has flipped from token maxing (abundance/subsidy era) to Token Scarcity (the rationing regime). The framing crystallized in The New Dumbest Chart in AI (AI Daily Brief) — the explicit sequel to the original token-maxing coverage — with the line that "every AI company is now in the token-efficiency business." As use shifts assisted → agentic, demand outruns a physically constrained token supply, the subsidy era ends, and the maximalist behavior catalogued above stops being affordable.

  • Volume growth dwarfs price-mix savings. Ramp spend data: an $11.38 median monthly AI bill, ~$610/mo for the top 10% of spenders, and ~$75k/yr for the top 1%. The distribution is exploding at the top, so the savings from routing down-tier or negotiating better per-token prices are swamped by the sheer growth in tokens consumed. You can optimize price-and-mix all you want; volume is the term that dominates.
  • This reframes the page's earlier "token billionaire" and "no token ceiling" sections: in an abundance era those were investment theses; under Token Scarcity they become the exact behavior that has to be rationed.
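
The "volume dominates" point above follows from treating spend as volume times blended price. A minimal sketch of that arithmetic; the 5x growth and 30% savings figures are hypothetical and are not taken from the Ramp data:

```python
def spend_multiplier(volume_growth, price_mix_savings):
    """New spend as a multiple of old spend.

    volume_growth: tokens consumed now / tokens consumed before (e.g. 5.0)
    price_mix_savings: fractional cut in blended price per token (e.g. 0.30)
    """
    return volume_growth * (1.0 - price_mix_savings)


def break_even_savings(volume_growth):
    """Fractional price cut needed to hold spend flat at this volume growth."""
    return 1.0 - 1.0 / volume_growth


# Hypothetical: tokens grow 5x while routing and negotiation cut price by 30%.
print(f"{spend_multiplier(5.0, 0.30):.2f}x spend")         # 3.50x spend
print(f"{break_even_savings(5.0):.0%} cut to stay flat")   # 80% cut to stay flat
```

With volume growing 5x, even a 30% price cut leaves spend at 3.5x, and holding spend flat would take an 80% cut.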

The supporting Economist threads from the same window: Apple's New Siri Bets on Google Models (Economist) (efficiency/cost-driven model sourcing), Fear of the SaaSpocalypse (Economist) (the per-seat → per-usage business-model upheaval token economics force), and How to Share AI Riches (Economist) (the macro stakes of who captures AI's value).

Sources