Updated Sat May 30 2026 08:00:00 GMT+0800 (Philippine Standard Time)

Harness (LLM Agents)

Tags: harness, agents, llm, architecture, harness-engineering, scaling-laws

The scaffolding around an LLM that turns it from a stateless next-token machine into a useful agent. Roughly: tools + context + memory + guardrails + observability.

The most-cited line, from Praveen Akkiraju in Agentic AI in the Enterprise (Praveen Akkiraju, CXOTalk):

"The agent IS the harness."

LLMs are stateless and steerable; the agent's behavior — and its safety — emerges almost entirely from the harness around it.

What's in a harness (per Praveen)

  • Tools — APIs, MCP servers, CLIs, file/browser/code execution
  • Context — conversation, scoped data access, role-based filtering
  • Memory — long-term, structured (e.g. wiki pages, entity stores)
  • Guardrails — prompt-level constraints, security/compliance, sandboxing
  • Observability — tracing at every step, not just final output (errors compound in multi-agent setups)
  • Eval — well-defined criteria for "good," supports continuous tuning
  • Agent identity: who the agent is acting as and what has been delegated to it (per AWARE Framework / Governing AI Agents at Scale (Glean + Cvent, CXOTalk)). Traditional IAM doesn't model this; agents reason and delegate, and classical access control was designed for neither.
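
As a rough illustration of how some of the components above fit together, the sketch below wires tools, guardrails, memory, and a trace log into one object. The `Harness` class, its fields, and `call_tool` are hypothetical names invented for this note, not an API from any of the cited sources; context scoping and eval are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    """Hypothetical container for a few of the harness components listed above."""
    tools: dict[str, Callable[[str], str]]            # tool name -> callable
    guardrails: list[Callable[[str, str], bool]]      # each returns True if the call is allowed
    memory: dict[str, str] = field(default_factory=dict)  # long-term notes
    trace: list[dict] = field(default_factory=list)       # observability log

    def call_tool(self, agent_id: str, name: str, arg: str) -> str:
        # Guardrails run before any tool executes; unknown tools are denied.
        allowed = name in self.tools and all(g(name, arg) for g in self.guardrails)
        result = self.tools[name](arg) if allowed else "DENIED"
        # Every step is traced under the acting agent's identity, not just the final output.
        self.trace.append(
            {"agent": agent_id, "tool": name, "arg": arg, "allowed": allowed, "result": result}
        )
        return result
```

The point of the design is that the model never calls a tool directly: every call passes through the same guardrail and tracing path, which is where most of the agent's observable behavior is decided.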

For an enterprise lens on these components, see the AWARE Framework — its 5 pillars (identity, context, guardrails, risk scoring, ecosystem observability) overlap heavily with this list. AWARE is the most concrete technical-controls breakdown of the harness in this wiki.

Examples in this wiki

  • Claude Code's harness: CLAUDE.md schema + slash commands + sub-agents + permission modes — all decisions about what to expose and how
  • Codex's harness: apply-patch tool + bash semantics + custom ESLint packages + persona-based reviewer agents on every push (see Harness Engineering (Ryan Lopopolo, AI Engineer))
  • OpenClaw's harness: Gateway + Adapters + Skills + agents.md/soul.md (see OpenClaw)
  • Blitzy's harness: prompt-input governance + autonomous orchestration + CI/CD integration — see Blitzy
  • This Second Brain's harness for the LLM Wiki Pattern: CLAUDE.md (schema), index.md (retrieval), log.md (memory of operations), source-type frontmatter (structured metadata)

Harness Engineering as a discipline

Two AI Engineer London 2026 talks named the activity:

  • Harness Engineering (Ryan Lopopolo, AI Engineer): operational. Harness = surfacing instructions to the model at the right time, delivered just-in-time via lints, test failures, and reviewer agents. Heavy structural enforcement (750-package PNPM workspace, package privacy, dependency-edge lints). Garbage Collection Day is the weekly ritual: turn each piece of observed PR slop into a durable harness rule (lint, test, doc, reviewer-agent prompt). See also LLM as Fuzzy Compiler (the underlying mental model) and Code Is Free (the premise).
  • Context Is the New Code (Patrick Debois, AI Engineer): process. The Context Development Lifecycle (Generate → Evaluate → Distribute → Observe) is what wraps the harness over time. Patrick name-checked harness engineering on stage as a parallel discipline; the two talks are companion pieces.
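
To make "surfacing instructions just-in-time via lints" concrete, here is a minimal sketch of a dependency-edge lint whose failure message doubles as an instruction to the agent. The package names, the `ALLOWED_EDGES` table, and `lint_dependency_edges` are hypothetical; the talk's actual lints are custom ESLint packages and are not reproduced here.

```python
# Hypothetical allow-list of dependency edges: package -> packages it may import.
ALLOWED_EDGES = {
    "ui": {"core"},
    "api": {"core", "db"},
    "core": set(),
}


def lint_dependency_edges(package: str, imports: list[str]) -> list[str]:
    """Return one message per forbidden import, phrased so an agent can act on it."""
    allowed = ALLOWED_EDGES.get(package, set())
    messages = []
    for target in imports:
        if target != package and target not in allowed:
            messages.append(
                f"{package} must not import {target}. "
                f"Allowed dependencies: {sorted(allowed) or 'none'}. "
                "Move the shared code into an allowed package or request a new edge."
            )
    return messages
```

In this framing, a Garbage Collection Day outcome would be a new entry in the rule table or a new message template, so the correction reaches the model at the moment it makes the mistake rather than in a later review.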

These are the two most prescriptive sources in the wiki on how to do harness work day-to-day. Praveen, Karpathy, Boris, and Blitzy all describe the harness; Lopopolo and Debois describe how to build and improve one.

Open contradiction worth tracking

Three views in this wiki (Praveen's, Boris's, and Lopopolo's) disagree about the harness's long-term importance.

Reconciliation: Praveen describes the present (jagged models need scaffolding). Boris predicts shrinking safety scaffolding (prompt injection guards, permission modes). Lopopolo argues productivity scaffolding (context-surfacing lints, reviewer agents) grows because each new model release leverages the existing investment. All three can be simultaneously right — they're talking about different parts of the harness. Re-evaluate after the next 1–2 model generations land.

Source convergence

Across Praveen Akkiraju, Andrej Karpathy, and Enrique Ibarra (Blitzy), three independent voices arrive at the same playbook: encode governance and intent as harness inputs upfront, not as post-hoc review. Praveen calls them .md policy files; Karpathy calls them specs/docs the agent works against; Blitzy bakes them into the prompt itself.

Harness scaling — what the right coordinate is

Scaling Laws for Agent Harnesses via Effective Feedback Compute (arxiv 2605.29682, May 2026) is the first source here that proposes a measurable scaling axis for harnesses. The paper argues raw expenditure (tokens, tool calls, ops, wall time, cost) is the wrong coordinate because it doesn't distinguish useful feedback from redundant or unstable interaction. Their alternative, Effective Feedback Compute (EFC), only credits feedback that is informative, valid, non-redundant, and retained for subsequent decisions, normalized by task demand.
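
The crediting rule described above can be sketched as a simple counter. This is an assumption-laden illustration, not the paper's formula: the `FeedbackEvent` fields, the use of exact string matching to detect redundancy, and the plain division by `task_demand` are all choices made for this note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    signal: str        # e.g. a test failure message or tool result
    informative: bool  # did it change what the agent knows?
    valid: bool        # was the signal correct?
    retained: bool     # was it still available to later decisions?


def effective_feedback_compute(events: list[FeedbackEvent], task_demand: float) -> float:
    """Illustrative score: credited feedback events divided by task demand."""
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    seen: set[str] = set()
    credited = 0
    for e in events:
        redundant = e.signal in seen  # assumed redundancy test: exact repeat of a signal
        seen.add(e.signal)
        if e.informative and e.valid and e.retained and not redundant:
            credited += 1
    return credited / task_demand
```

Under this toy rule, a run with five feedback events can score the same as a run with two if the other three are repeated, invalid, or forgotten, which is the distinction raw token or tool-call counts cannot make.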

Empirically, EFC-based coordinates beat raw-compute baselines (and a strong multivariate SAS baseline) at predicting failure rates across synthetic, code, real-benchmark, held-out, and prospective-validation tasks. Matched-budget interventions — improving feedback quality with raw cost and tool calls held fixed — raise success. The conclusion the paper draws:

"Harness scaling is governed less by how much computation is spent than by how efficiently raw budget is converted into durable, task-sufficient feedback."

For this vault, EFC is the missing measurement that connects everything above to a number:

The "three views of harness importance" contradiction above is now empirically testable: if the EFC vs raw-compute gap closes on newer models, Boris's prediction wins; if it stays open, Lopopolo's investment-compounds thesis wins.

What top-performing agentic artifacts look like — MAC empirical, June 2026

Meta-Agent Challenge (Autonomous Agent Development Benchmark) asked a related question: when frontier code agents are tasked with autonomously designing an agent for SWE-Bench / Terminal-Bench / etc., what does the winning harness look like? Empirical answer, from qualitative artifact analysis:

  • Minimal ReAct loops over a small toolset — not tree search, not planner-worker pipelines
  • Prompt caching on every API call to minimize per-loop latency
  • Pre-search warming — populate context from issue symbols before the first LLM invocation
  • A singular verification nudge — force the model to verify all requirements before terminating
  • Adaptive time budgeting with checkpointing — failure mode in losers was timeout-with-no-partial-output
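
A minimal sketch of three of the traits above: a ReAct-style loop over a small toolset, a single verification nudge, and a time budget that falls back to a checkpoint. The `run_agent` function, the `tool: arg` / `FINAL:` reply convention, and the `model` callable are hypothetical stand-ins, not code from the benchmark's artifacts; prompt caching and pre-search warming are omitted.

```python
import time
from typing import Callable


def run_agent(model: Callable[[list[str]], str],
              tools: dict[str, Callable[[str], str]],
              task: str,
              budget_s: float = 60.0,
              max_steps: int = 20) -> str:
    """Minimal loop: the model replies 'tool: arg' or 'FINAL: answer'."""
    deadline = time.monotonic() + budget_s
    history = [f"TASK: {task}"]
    checkpoint = ""   # best partial output so far, returned if the budget runs out
    nudged = False
    for _ in range(max_steps):
        if time.monotonic() >= deadline:
            break     # budget exhausted: return the checkpoint instead of nothing
        reply = model(history)
        history.append(reply)
        if reply.startswith("FINAL:"):
            answer = reply[len("FINAL:"):].strip()
            checkpoint = answer
            if nudged:
                return answer
            # Single verification nudge before accepting termination.
            nudged = True
            history.append("VERIFY: confirm every requirement is met, then answer FINAL again.")
            continue
        name, _, arg = reply.partition(":")
        tool = tools.get(name.strip())
        observation = tool(arg.strip()) if tool else f"unknown tool {name.strip()!r}"
        history.append(f"OBSERVATION: {observation}")
    return checkpoint
```

The checkpoint is what separates this from the losing pattern described above: when the deadline or step limit hits, the caller still gets the last draft answer rather than an empty result.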

For reasoning artifacts: parallel sampling + majority voting + prompt diversification dominates. None of the top reasoning artifacts used the elaborate tree-search / planner-worker designs prevalent in the literature.
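
The reasoning recipe above (parallel sampling, majority voting, prompt diversification) can be sketched in a few lines. `PROMPT_VARIANTS`, `majority_vote`, and the `sample` callable are hypothetical; a real setup would call a model API inside `sample` and would likely need a task-specific answer normalizer instead of the lowercase-and-strip used here.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical prompt variants used for diversification.
PROMPT_VARIANTS = [
    "Solve step by step: {q}",
    "Solve, then double-check the result: {q}",
    "Give only the final answer: {q}",
]


def majority_vote(sample: Callable[[str], str], question: str, n: int = 6) -> str:
    """Draw n samples in parallel across rotated prompt variants and return the most common answer."""
    prompts = [PROMPT_VARIANTS[i % len(PROMPT_VARIANTS)].format(q=question) for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(sample, prompts))
    # Normalize lightly so trivially different spellings count as the same vote.
    normalized = [a.strip().lower() for a in answers]
    # On a tie, Counter.most_common keeps the answer that appeared first.
    return Counter(normalized).most_common(1)[0][0]
```

Nothing here searches or plans: the only structure is independent samples plus a vote, which matches the anti-elaboration finding reported for the top reasoning artifacts.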

The implication is anti-elaboration: across two independent papers (EFC and MAC), the empirical answer for current frontier models is that minimal-but-well-tuned beats elaborate. This aligns with Boris's "shrinking harness" prediction at the agentic-loop layer, while Lopopolo's "compounding harness" thesis still holds at the workspace-rules layer (lints, reviewer agents, package policy). The two camps may be talking about different parts of the harness — see the open contradiction above.

Sources