Scaling Laws for Agent Harnesses via Effective Feedback Compute
arXiv 2605.29682 (cs.CL, submitted 2026-05-28). The first paper in this vault to propose a trace-level scaling coordinate for agent harnesses: not raw tokens, tool calls, or wall time, but feedback that is informative, valid, non-redundant, and retained. The paper's core claim:
"Harness scaling is governed less by how much computation is spent than by how efficiently raw budget is converted into durable, task-sufficient feedback."
Captured into the vault via a Telegram message (#3106) asking to save the arxiv link.
The core argument
Current test-time scaling analyses parameterize agent performance by raw expenditure — tokens, tool calls, operations, wall time, cost. The authors argue these coordinates conflate useful feedback with redundant or unstable interaction, which is why raw-compute scaling laws don't generalize well across harness designs.
Their answer is Effective Feedback Compute (EFC): a trace-level coordinate that only credits feedback when it is informative, valid, non-redundant, and retained for subsequent decisions, normalized by task demand.
Four criteria for "effective" feedback
A feedback event in the trace counts toward EFC only if it is:
- Informative — it carries signal the agent didn't already have
- Valid — it's correct / well-grounded (not a hallucinated tool result or spurious self-criticism)
- Non-redundant — it isn't a near-duplicate of earlier feedback in the same trace
- Retained — it influences a subsequent decision (not received-and-forgotten)
The count of qualifying events is then normalized by task demand, so that EFC values are comparable across tasks with different feedback budgets.
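A minimal sketch of how the four criteria and the normalization could combine, assuming each feedback event has already been labeled with four booleans. The paper's actual formula is not captured in this note; the `FeedbackEvent` type, the simple count, and the scalar `task_demand` divisor are all hypothetical illustrations, not the authors' definition.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FeedbackEvent:
    """One feedback event in a trace, with hypothetical per-criterion labels."""
    informative: bool
    valid: bool
    non_redundant: bool
    retained: bool

    def is_effective(self) -> bool:
        # An event is credited only if it passes all four criteria.
        return (self.informative and self.valid
                and self.non_redundant and self.retained)


def effective_feedback_compute(events: Iterable[FeedbackEvent],
                               task_demand: float) -> float:
    """Count effective events and divide by an assumed task-demand scalar."""
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    effective = sum(1 for event in events if event.is_effective())
    return effective / task_demand


# Toy trace: five feedback events, only two of which pass all four criteria.
trace = [
    FeedbackEvent(True, True, True, True),    # effective
    FeedbackEvent(True, True, False, True),   # near-duplicate of earlier feedback
    FeedbackEvent(True, False, True, True),   # invalid (e.g. hallucinated result)
    FeedbackEvent(True, True, True, True),    # effective
    FeedbackEvent(True, True, True, False),   # received but never used
]

efc = effective_feedback_compute(trace, task_demand=4.0)
print(efc)  # 0.5
```

In this toy trace the raw coordinate would report five feedback events, while the sketch credits two. Assigning the four labels is the hard part and is left out here.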
The empirical claim
Tested across:
- Synthetic controllable tasks (varied feedback budgets)
- Executable code tasks
- Real benchmark traces
- Held-out splits
- A prospective validation batch
Result: EFC-based coordinates predict failure rates better than raw-compute baselines and a strong multivariate SAS baseline. In controlled scaling experiments, raw tokens and tool calls explain limited variation while EFC-based measures explain substantially more.
The matched-budget intervention is the punchline: improving feedback quality raises success while raw cost and tool calls are held fixed. That decouples performance from spend.
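The shape of a matched-budget comparison can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's protocol: the `Run` record, the pairing of runs, and the numbers are all hypothetical. The one thing it tries to capture is that the raw budget is checked to be identical before any success difference is reported.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Run:
    """One agent run: raw budget spent and whether the task succeeded."""
    tokens: int
    tool_calls: int
    success: bool


def success_rate(runs: Sequence[Run]) -> float:
    return sum(run.success for run in runs) / len(runs)


def matched_budget_effect(control: Sequence[Run],
                          treated: Sequence[Run]) -> float:
    """Return success-rate difference (treated minus control).

    Raises if any paired runs differ in raw tokens or tool calls, so the
    reported difference cannot be explained by extra spend.
    """
    if not control or len(control) != len(treated):
        raise ValueError("need equal-length, non-empty paired runs")
    for c, t in zip(control, treated):
        if (c.tokens, c.tool_calls) != (t.tokens, t.tool_calls):
            raise ValueError("raw budgets are not matched")
    return success_rate(treated) - success_rate(control)


# Hypothetical data: same tokens and tool calls per pair, different outcomes.
control = [Run(8000, 12, False), Run(8000, 12, False),
           Run(6000, 9, True), Run(6000, 9, False)]
treated = [Run(8000, 12, True), Run(8000, 12, False),
           Run(6000, 9, True), Run(6000, 9, True)]

print(matched_budget_effect(control, treated))  # 0.5
```

Rejecting unmatched pairs outright, rather than adjusting for budget statistically, keeps the comparison easy to audit. A real replication would need whatever matching rule the paper itself uses.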
Why this matters to the vault
- Directly contradicts the implicit "more tokens = more capability" assumption woven through Token Maxing. Token-maxing pathology is about budget burn; EFC argues budget burn was the wrong axis all along — the right axis is informative-feedback rate per dollar. The two pages now compose: Token Maxing names the failure mode, EFC names the correct optimization target.
- Operationalizes Harness (LLM Agents). The vault already has the harness components (tools / context / memory / guardrails / observability) and the harness-engineering practice (Harness Engineering (Ryan Lopopolo, AI Engineer), Context Development Lifecycle). EFC adds a measurable scaling coordinate — turning "is my harness improving?" from a vibe into a number.
- The "retained" criterion lines up with Context Engineering's four-pillar frame. Retention-of-feedback is the agentic-memory pillar; non-redundancy is the compression pillar; validity is the retrieval pillar. EFC is essentially asking: did the context engineering actually convert this trace's information into durable state?
- Resolves the open contradiction on the Harness (LLM Agents) page about harness importance over time. Praveen says harness is decisive, Boris predicts it gets less important, Lopopolo says investment compounds. EFC offers a measurement that could decide the debate empirically — if EFC predicts more variance than raw-compute on newer models too, Lopopolo wins; if the gap narrows, Boris wins.
Connections to existing wiki concepts
- Harness (LLM Agents) — EFC is the trace-level coordinate that the harness's job is to raise. New "Harness scaling" section added to that page.
- Token Maxing — EFC is the antidote framing: it isn't tokens that matter, it's feedback-per-token. New "EFC reframing" subsection added.
- Harness Engineering (Ryan Lopopolo, AI Engineer) — Lopopolo's reviewer agents, lint feedback, and garbage-collection-day rituals are all EFC-raising interventions in this paper's vocabulary. The vault's most operational harness source maps cleanly onto this paper's coordinate.
- Context Development Lifecycle — Debois's Generate → Evaluate → Distribute → Observe loop is the process for raising EFC over time; this paper is the measurement.
- LLM as Judge — implicitly the mechanism for assessing the criteria at scale (you need a judge to mark each feedback event as valid or not, informative or not, and so on).
- Agentic Loop — the ReAct cycle is exactly where feedback gets retained-or-dropped. EFC is the per-loop quality score.
Open questions / things to verify
- The paper's "informative / valid / non-redundant / retained" labels presumably need a per-trace annotator (LLM judge or human). The cost of computing EFC itself isn't reported in the abstract — could be prohibitive at scale.
- "Normalized by task demand" is doing a lot of work — the difficulty of comparing EFC across tasks is exactly the cross-benchmark problem that broke prior coordinates. Worth reading the paper to see if their normalization holds up.
- The SAS baseline ("strong multivariate") isn't named in the abstract — pin it before citing the comparison.
- Authors are from a Chinese institution (likely HIT / Harbin Institute of Technology given the author overlap — Qingfu Zhu, Wanxiang Che are HIT NLP); worth confirming. No venue beyond arxiv as of capture.
Brand fodder
The headline that pops for the user's senior-IT-leader audience: "Stop measuring AI by tokens. Start measuring it by useful feedback." That's a near-ready Medium post — pairs cleanly with the Token Maxing pathology and the Lopopolo token-billionaire counterpoint already in the vault. The reframing — budget is fine, redundant-feedback budget is the problem — is the on-thesis insight.
Cross-links
- Concepts · Effective Feedback Compute · Harness (LLM Agents) · Token Maxing · Context Engineering · Agentic Loop · LLM as Judge · Context Development Lifecycle
- Adjacent sources · Harness Engineering (Ryan Lopopolo, AI Engineer) · Context Is the New Code (Patrick Debois, AI Engineer) · Agentic AI in the Enterprise (Praveen Akkiraju, CXOTalk)
Source
- arXiv abstract: https://arxiv.org/abs/2605.29682
- PDF: https://arxiv.org/pdf/2605.29682
- DOI: https://doi.org/10.48550/arXiv.2605.29682
- Local capture
- Captured-by trigger: Telegram message #3106 (2026-05-30 19:33)