Effective Feedback Compute
A trace-level scaling coordinate for agent harnesses introduced in Scaling Laws for Agent Harnesses via Effective Feedback Compute (arxiv 2605.29682, May 2026). EFC's premise: raw expenditure (tokens, tool calls, ops, wall time, cost) is the wrong axis for predicting agent success because it doesn't distinguish useful feedback from redundant or unstable interaction.
The definition
EFC credits a feedback event only when all four criteria hold:
| Criterion | What it rules out |
|---|---|
| Informative | Repeats of what the agent already knew |
| Valid | Hallucinated tool results, spurious self-criticism, broken observations |
| Non-redundant | Near-duplicate feedback already credited earlier in the same trace |
| Retained | Feedback that didn't influence any subsequent decision (received-and-forgotten) |
The credited total is then normalized by task demand, so that EFC values are comparable across tasks with different feedback budgets.
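A minimal sketch of the crediting rule above, assuming each feedback event has already been labeled on the four criteria (for example by a judge) and that normalization is a plain division by a task-demand scalar. The `FeedbackEvent` fields, the `effective_feedback_compute` function, and the division-based normalization are illustrative assumptions, not the paper's formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    """One feedback event in a trace, with hypothetical per-criterion labels."""
    informative: bool
    valid: bool
    non_redundant: bool
    retained: bool

    def credited(self) -> bool:
        # An event earns credit only when all four criteria hold.
        return (self.informative and self.valid
                and self.non_redundant and self.retained)


def effective_feedback_compute(events: list[FeedbackEvent],
                               task_demand: float) -> float:
    """Credited-event count divided by task demand (assumed normalization)."""
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    credited = sum(1 for event in events if event.credited())
    return credited / task_demand


trace = [
    FeedbackEvent(True, True, True, True),    # credited
    FeedbackEvent(True, True, False, True),   # near-duplicate: not credited
    FeedbackEvent(True, False, True, True),   # invalid observation: not credited
    FeedbackEvent(True, True, True, False),   # never used later: not credited
    FeedbackEvent(True, True, True, True),    # credited
]
efc = effective_feedback_compute(trace, task_demand=4.0)
print(efc)  # 0.5 (2 credited events / task demand of 4.0)
```

Five events were spent but only two are credited, which is the gap between raw expenditure and EFC that the definition is meant to expose.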
What it predicts
Per the paper:
- Better failure-rate prediction than raw-compute baselines across synthetic controllable tasks, executable code tasks, real benchmark traces, held-out splits, and a prospective validation batch.
- More explained variance than a strong multivariate SAS baseline in controlled scaling.
- Matched-budget interventions confirm the causal direction: holding raw cost and tool calls fixed, improving feedback quality raises success. Spend isn't the lever; feedback efficiency is.
Why it's a load-bearing concept for this vault
EFC is the first source here that gives the harness a measurable scaling coordinate. Until now, the vault had:
- Components of a harness (Harness (LLM Agents))
- Practices for improving one (Harness Engineering (Ryan Lopopolo, AI Engineer), Context Development Lifecycle)
- Failure modes of unbounded compute (Token Maxing)
But no answer to: how do you know your harness is getting better? EFC proposes the metric.
Independent confirmation — MAC meta-agent dynamics (June 2026)
Meta-Agent Challenge (Autonomous Agent Development Benchmark) (Ant Group + CAS) ran a meta-evaluation framework completely separate from the EFC paper and surfaced the same axis empirically from the behavioral side. Across 39 (model × domain × run) configurations, the two dominant predictors of final reward were mean inter-call interval and total runtime — not the number of eval calls, not eval success rate, not first-call delay, not temporal centroid. The paper's own summary:
"Successful meta-agents do not treat the evaluation endpoint as a high-frequency feedback signal. They think longer between calls, invest more total compute in artifact design, and probe the scorer sparingly."
Restated in EFC's terms: winners burn fewer feedback events but extract more informative + retained signal per event. This is the EFC thesis lifted from harness-trace data onto meta-agent development dynamics. Two independent papers, six weeks apart, converging on the same coordinate.
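A rough sketch of how the two dominant predictors could be computed from a run log, assuming the log gives a timestamp in seconds for each eval call plus the run's start and end times. The `call_dynamics` function and these definitions are assumptions for illustration; the MAC paper's exact definitions may differ.

```python
def call_dynamics(call_times: list[float],
                  run_start: float,
                  run_end: float) -> dict[str, float]:
    """Summarize how a run spaced its evaluation calls (assumed definitions)."""
    if run_end < run_start:
        raise ValueError("run_end must not precede run_start")
    times = sorted(call_times)
    # Gaps between consecutive eval calls; undefined with fewer than two calls.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else float("nan")
    return {
        "n_calls": float(len(times)),
        "mean_inter_call_interval": mean_gap,
        "total_runtime": run_end - run_start,
    }


# Synthetic timestamps, for illustration only.
sparse = call_dynamics([600.0, 1800.0, 3600.0], run_start=0.0, run_end=4000.0)
frequent = call_dynamics([10.0, 20.0, 30.0, 40.0], run_start=0.0, run_end=60.0)
print(sparse["mean_inter_call_interval"])    # 1500.0
print(frequent["mean_inter_call_interval"])  # 10.0
```

The `sparse` profile is the pattern the quoted summary describes: fewer calls, longer gaps between them, and more total runtime.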
Compositions with existing pages
- Token Maxing (Praveen Akkiraju) — token-maxing is the pathology of optimizing the wrong axis. EFC names the right one. The reconciliation is on the Token Maxing page's new "EFC reframing" subsection: tokens are fine if EFC-per-token is rising; the pathology is redundant-feedback burn, not raw burn.
- Harness Engineering (Ryan Lopopolo, AI Engineer): Lopopolo's operational rituals (lint enforcement, reviewer agents, garbage collection day, package privacy) are best read as EFC-raising interventions. Each one increases the informativeness or retention of feedback in the agent's trace. His "token billionaire" framing fits EFC's logic: tokens are worth spending if they convert into informative, retained feedback at a high rate.
- Context Development Lifecycle (Patrick Debois) — the four steps (Generate → Evaluate → Distribute → Observe) are the process for raising EFC over time. EFC is what Evaluate measures; Observe is the longitudinal series.
- Context Engineering — the four pillars (agentic RAG, GraphRAG, memory, compression) all serve specific EFC criteria. Agentic RAG → informativeness. GraphRAG → non-redundancy via structural traversal. Memory → retention. Compression → non-redundancy via deduplication.
- LLM as Judge — implicit machinery for the valid criterion at scale. You can't compute EFC over real traces without a judge marking each feedback event.
- Agentic Loop — the ReAct cycle is the unit where feedback either gets retained or dropped. EFC is the per-loop quality score.
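The "EFC-per-token is rising" test from the Token Maxing item above can be sketched as follows. This is a hedged illustration: it uses the raw credited-event count as a stand-in for EFC, and `efc_per_token`, `is_rising`, and the run figures are all hypothetical.

```python
def efc_per_token(credited_events: int, tokens: int) -> float:
    """Credited feedback events per token spent (unnormalized stand-in for EFC)."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return credited_events / tokens


def is_rising(values: list[float]) -> bool:
    """True when every value is strictly greater than the one before it."""
    return all(later > earlier for earlier, later in zip(values, values[1:]))


# Synthetic (credited_events, tokens) pairs for successive harness versions.
healthy_runs = [(12, 40_000), (30, 80_000), (70, 160_000)]
token_maxing_runs = [(12, 40_000), (14, 80_000), (15, 160_000)]

healthy = [efc_per_token(c, t) for c, t in healthy_runs]
token_maxing = [efc_per_token(c, t) for c, t in token_maxing_runs]

print(is_rising(healthy))       # True: spend doubles, credited feedback more than doubles
print(is_rising(token_maxing))  # False: spend doubles, credited feedback barely moves
```

Both series double their token spend at each step; only the first converts that spend into proportionally more credited feedback.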
The three-way harness-importance debate this could decide
The Harness (LLM Agents) page tracks an open contradiction:
- Praveen Akkiraju — harness is decisive today
- Boris Cherny — harness importance shrinks as models improve
- Ryan Lopopolo — harness investment compounds across model releases
EFC is a measurement that could resolve it empirically. The test is the predictive gap between EFC-based and raw-compute coordinates. If that gap closes on newer models, raw spend predicts success about as well as feedback quality does, and Cherny is right. If it stays open or grows, Lopopolo is. Re-evaluate after the next 1–2 model generations.
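One way to make that comparison concrete, sketched under strong assumptions: explained variance is approximated by the squared Pearson correlation between a coordinate and the observed failure rate, and the "gap" is the difference between the EFC-based and raw-compute values. The `r_squared` and `efc_gap` helpers and all numbers below are synthetic illustrations, not results from the paper.

```python
def r_squared(x: list[float], y: list[float]) -> float:
    """Squared Pearson correlation, i.e. R^2 of a one-variable linear fit."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("inputs must not be constant")
    return (sxy * sxy) / (sxx * syy)


def efc_gap(efc: list[float], raw: list[float],
            failure_rate: list[float]) -> float:
    """How much more variance the EFC coordinate explains than raw compute."""
    return r_squared(efc, failure_rate) - r_squared(raw, failure_rate)


# Synthetic per-task values for two hypothetical model generations.
failure = [0.8, 0.6, 0.4, 0.2]
efc = [1.0, 2.0, 3.0, 4.0]
raw_old = [1.0, 3.0, 3.0, 1.0]   # raw compute unrelated to failure rate
raw_new = [2.0, 4.0, 6.0, 8.0]   # raw compute tracks failure rate as well as EFC

gap_old = efc_gap(efc, raw_old, failure)
gap_new = efc_gap(efc, raw_new, failure)
print(round(gap_old, 6), round(gap_new, 6))
```

In this toy setup the gap shrinks from the older generation to the newer one, which is the pattern that would favor the "harness importance shrinks" position; a stable or growing gap would favor the compounding view.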
Implementation caveats
- EFC computation has a cost. The paper presumably uses an LLM judge to label each feedback event against the four criteria. At scale, that judge cost becomes a budget item in its own right. One option is to make it EFC-recursive: use a smaller-model judge to credit a larger-model trace.
- The "task demand" normalization is the hard part. Cross-benchmark scaling-law generalization has historically broken on incomparable difficulty. Whether the paper's normalization holds up across truly different task families is worth reading the full text for.
- Retention is observational, not causal. Marking feedback as "retained" because a later decision references it doesn't prove the decision was better for it. The paper's matched-budget interventions help close that gap, but the labeling itself remains correlational.
Open follow-ups for this vault
- Read the full PDF and pin: the SAS baseline definition, the judge model used to label criteria, the normalization formula, the held-out task families.
- Watch for replications on harnesses we already track: Claude Code, Codex, OpenClaw, Blitzy. EFC would give a directly comparable productivity metric across them.
- This is the first arxiv paper in the vault — worth tracking the citation chain when it lands at a venue.
Adjacent work landing in the vault
- MLEvolve (Self-Evolving ML Algorithm Discovery) (arxiv 2606.06473, Jun 2026) — Retrospective Memory and Progressive MCGS are mechanism-level moves on EFC's retained and non-redundant criteria respectively. MCGS graph-edges share information across branches (kills cross-branch redundancy); the cold-start KB + dynamic global memory ensures task-specific feedback is retained run-to-run. A natural empirical counterpart — if MLEvolve's gains correlate with EFC measured on its traces, that's mechanistic evidence for the metric.
Sources
- Scaling Laws for Agent Harnesses via Effective Feedback Compute (canonical — coined EFC)
- MLEvolve (Self-Evolving ML Algorithm Discovery) (mechanism-level analog)