Harness (LLM Agents)

The scaffolding around an LLM that turns it from a stateless next-token machine into a useful agent. Roughly: tools + context + memory + guardrails + observability.

The most-cited line, from Praveen Akkiraju in Agentic AI in the Enterprise (Praveen Akkiraju, CXOTalk):

"The agent IS the harness."

LLMs are stateless and steerable; the agent's behavior — and its safety — emerges almost entirely from the harness around it.

What's in a harness (per Praveen)

Tools — APIs, MCP servers, CLIs, file/browser/code execution
Context — conversation, scoped data access, role-based filtering
Memory — long-term, structured (e.g. wiki pages, entity stores)
Guardrails — prompt-level constraints, security/compliance, sandboxing
Observability — tracing at every step, not just final output (errors compound in multi-agent setups)
Eval — well-defined criteria for "good," supports continuous tuning
Agent identity — who the agent is acting as, what's delegated to it (per AWARE Framework / Governing AI Agents at Scale (Glean + Cvent, CXOTalk)). Traditional IAM doesn't model this: agents reason and delegate, and classical access control was designed for neither.
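
As a rough illustration of how these components fit together, the sketch below wraps plain Python callables in a tool allowlist (guardrail), an acting identity, and a per-call trace (observability). The `Harness` class, its field names, and the `DENIED` convention are hypothetical, chosen for illustration; none of them come from Praveen's talk or the AWARE Framework.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    """Hypothetical container for harness components (illustration only)."""
    tools: dict[str, Callable[[str], str]]       # tools the agent could call
    allowed_tools: set[str]                      # guardrail: allowlist for this identity
    acting_as: str                               # agent identity: who the agent acts for
    memory: dict[str, str] = field(default_factory=dict)   # long-term notes
    trace: list[dict] = field(default_factory=list)        # observability: one record per step

    def call_tool(self, name: str, arg: str) -> str:
        # Check the guardrail first, then record the step whether or not it ran.
        allowed = name in self.allowed_tools and name in self.tools
        result = self.tools[name](arg) if allowed else f"DENIED: {name}"
        self.trace.append(
            {"actor": self.acting_as, "tool": name, "arg": arg,
             "allowed": allowed, "result": result}
        )
        return result


harness = Harness(
    tools={"echo": lambda s: s.upper(), "rm": lambda s: "deleted"},
    allowed_tools={"echo"},
    acting_as="alice",
)
ok = harness.call_tool("echo", "hi")
blocked = harness.call_tool("rm", "/tmp/x")
```

Denied calls are traced as well as allowed ones, which is the point of tracing every step rather than only the final output.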

For an enterprise lens on these components, see the AWARE Framework — its five pillars (identity, context, guardrails, risk scoring, ecosystem observability) overlap heavily with this list. AWARE is the most concrete technical-controls breakdown of the harness in this wiki.

Examples in this wiki

Claude Code's harness: CLAUDE.md schema + slash commands + sub-agents + permission modes — all decisions about what to expose and how
Codex's harness: apply-patch tool + bash semantics + custom ESLint packages + persona-based reviewer agents on every push (see Harness Engineering (Ryan Lopopolo, AI Engineer))
OpenClaw's harness: Gateway + Adapters + Skills + agents.md/soul.md — see OpenClaw
Blitzy's harness: prompt-input governance + autonomous orchestration + CI/CD integration — see Blitzy
This Second Brain's harness for the LLM Wiki Pattern: CLAUDE.md (schema), index.md (retrieval), log.md (memory of operations), source-type frontmatter (structured metadata)

Harness Engineering as a discipline

Two AI Engineer London 2026 talks named the activity:

Harness Engineering (Ryan Lopopolo, AI Engineer) — operational. Harness = surfacing instructions to the model at the right time, delivered just-in-time via lints, test failures, and reviewer agents. Heavy structural enforcement (750-package pnpm workspace, package privacy, dependency-edge lints). Garbage Collection Day is the weekly ritual: turn every observed instance of PR slop into a durable harness rule (lint, test, doc, reviewer-agent prompt). See also LLM as Fuzzy Compiler (the underlying mental model) and Code Is Free (the premise).
Context Is the New Code (Patrick Debois, AI Engineer) — process. The Context Development Lifecycle (Generate → Evaluate → Distribute → Observe) is what wraps the harness over time. Patrick name-checked harness engineering on stage as a parallel discipline; the two talks are companion pieces.
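
One way to read "surfacing instructions at the right time" is a lint whose failure message is itself the instruction. The sketch below is a minimal dependency-edge check under that assumption. The `ALLOWED_EDGES` table, the package names, and the message wording are invented for illustration and are not taken from Lopopolo's workspace.

```python
# Hypothetical policy: which packages each package is allowed to import.
ALLOWED_EDGES = {
    "app": {"ui", "core"},
    "ui": {"core"},
    "core": set(),
}


def lint_dependency_edges(package: str, imports: list[str]) -> list[str]:
    """Return agent-readable messages for imports that cross a forbidden edge."""
    allowed = ALLOWED_EDGES.get(package, set())
    messages = []
    for dep in imports:
        if dep != package and dep not in allowed:
            # The message states the rule and the fix, so the agent gets the
            # instruction at the moment it breaks the rule.
            messages.append(
                f"{package} may not depend on {dep}. "
                f"Allowed dependencies: {sorted(allowed) or 'none'}. "
                "Move the shared code into an allowed package instead."
            )
    return messages


violations = lint_dependency_edges("core", ["ui"])
clean = lint_dependency_edges("app", ["ui", "core"])
```

The design choice is that the rule lives in the lint rather than in a standing prompt, so the agent only sees it when it is relevant.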

These are the two most prescriptive sources in the wiki on how to do harness work day-to-day. Praveen, Karpathy, Boris, and Blitzy all describe the harness; Lopopolo and Debois describe how to build and improve one.

Open contradiction worth tracking

Three views in this wiki disagree about the harness's long-term importance:

Praveen Akkiraju (Agentic AI in the Enterprise (Praveen Akkiraju, CXOTalk)) — harness is the determinant; you have to do harness homework per use case; this is what separates production-ready from toy.
Boris Cherny (Boris Cherny on Coding Is Solved (Sequoia AI Ascent)) — "as the model gets better, the harness kind of gets less important. The model will just do the right thing."
Ryan Lopopolo (Harness Engineering (Ryan Lopopolo, AI Engineer)) — harness investment compounds across model releases. Each new model is a chance to re-leverage existing harness work; "my job can move up to thinking about differences in model behavior between releases rather than deeply understanding the nuts and bolts of the harness."

Reconciliation: Praveen describes the present (jagged models need scaffolding). Boris predicts shrinking safety scaffolding (prompt injection guards, permission modes). Lopopolo argues productivity scaffolding (context-surfacing lints, reviewer agents) grows because each new model release leverages the existing investment. All three can be simultaneously right — they're talking about different parts of the harness. Re-evaluate after the next 1–2 model generations land.

Source convergence

Across Praveen Akkiraju, Andrej Karpathy, and Enrique Ibarra (Blitzy), three independent voices arrive at the same playbook: encode governance and intent as harness inputs upfront, not as post-hoc review. Praveen calls them .md policy files; Karpathy calls them specs/docs the agent works against; Blitzy bakes them into the prompt itself.

Harness scaling — what the right coordinate is

Scaling Laws for Agent Harnesses via Effective Feedback Compute (arXiv 2605.29682, May 2026) is the first source here that proposes a measurable scaling axis for harnesses. The paper argues that raw expenditure (tokens, tool calls, ops, wall time, cost) is the wrong coordinate because it doesn't distinguish useful feedback from redundant or unstable interaction. Its alternative, Effective Feedback Compute (EFC), credits only feedback that is informative, valid, non-redundant, and retained for subsequent decisions, normalized by task demand.
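
The four crediting criteria can be made concrete with a toy calculation. This is a sketch only: the paper's actual EFC definition is not reproduced in this wiki, so the `FeedbackEvent` fields, the exact-match redundancy test, and the simple count-over-demand normalization are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    """One piece of feedback the harness returned to the model (hypothetical schema)."""
    signature: str     # what the feedback said; repeats count as redundant
    informative: bool  # did it tell the agent something it could act on?
    valid: bool        # was the feedback itself correct?
    retained: bool     # was it still in context for later decisions?


def effective_feedback(events: list[FeedbackEvent], task_demand: float) -> float:
    """Toy EFC: credited events divided by an assumed task-demand scalar."""
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    seen: set[str] = set()
    credited = 0
    for event in events:
        redundant = event.signature in seen
        seen.add(event.signature)
        if event.informative and event.valid and event.retained and not redundant:
            credited += 1
    return credited / task_demand


events = [
    FeedbackEvent("test_login fails: KeyError", True, True, True),
    FeedbackEvent("test_login fails: KeyError", True, True, True),   # repeat
    FeedbackEvent("flaky timeout", True, False, True),               # invalid
    FeedbackEvent("lint: unused import", True, True, True),
]
raw_count = len(events)                                # raw coordinate: 4 interactions
efc = effective_feedback(events, task_demand=4.0)      # toy EFC: 2 credited / 4 = 0.5
```

Both coordinates describe the same run, but only the second separates the two useful signals from the repeat and the invalid one.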

Empirically, EFC-based coordinates beat raw-compute baselines (and a strong multivariate SAS baseline) at predicting failure rates across synthetic, code, real-benchmark, held-out, and prospective-validation tasks. Matched-budget interventions — improving feedback quality with raw cost and tool calls held fixed — raise success. The conclusion the paper draws:

"Harness scaling is governed less by how much computation is spent than by how efficiently raw budget is converted into durable, task-sufficient feedback."

For this vault, EFC is the missing measurement that connects everything above to a number:

The rituals in Harness Engineering (Ryan Lopopolo, AI Engineer) (lint enforcement, reviewer agents, Garbage Collection Day) all read as EFC-raising interventions
Context Development Lifecycle (Debois) is the process for raising EFC longitudinally
Context Engineering's four pillars (agentic RAG, GraphRAG, memory, compression) each serve one of the four EFC criteria
Token Maxing (Praveen) names the pathology; EFC names the correct optimization target

The "three views of harness importance" contradiction above is now empirically testable — if the EFC vs raw-compute gap closes on newer models, Boris's prediction wins; if it stays open, Lopopolo's investment-compounds thesis wins.

What top-performing agentic artifacts look like — MAC empirical findings, June 2026

Meta-Agent Challenge (Autonomous Agent Development Benchmark) asked a related question: when frontier code agents are tasked with autonomously designing an agent for SWE-Bench / Terminal-Bench / etc., what does the winning harness look like? Empirical answer, from qualitative artifact analysis:

Minimal ReAct loops over a small toolset — not tree search, not planner-worker pipelines
Prompt caching on every API call to minimize per-loop latency
Pre-search warming — populate context from issue symbols before the first LLM invocation
A single verification nudge — force the model to verify all requirements before terminating
Adaptive time budgeting with checkpointing — failure mode in losers was timeout-with-no-partial-output
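
A minimal sketch of three of these traits (a plain ReAct loop, a single verification nudge, and a time budget with checkpointing) follows. It assumes a hypothetical `model(history)` callable that returns either a tool request or a final answer. Prompt caching and pre-search warming are omitted because they depend on the provider and the repository. The budget here is fixed rather than adaptive, a deliberate simplification.

```python
import time

NUDGE = "Before finishing, verify every requirement in the task."


def run_agent(model, tools, task, budget_s=60.0, clock=time.monotonic):
    """Run a minimal ReAct loop.

    model(history) must return ("tool", name, arg) or ("final", answer, "").
    """
    deadline = clock() + budget_s
    history = [("task", task)]
    checkpoint = None   # latest draft answer, returned if time runs out
    nudged = False
    while clock() < deadline:
        kind, first, second = model(history)
        if kind == "tool":
            # Act, then feed the observation back into the loop.
            history.append(("observation", tools[first](second)))
        elif not nudged:
            # First "final": keep the draft and ask once for verification.
            checkpoint = first
            nudged = True
            history.append(("nudge", NUDGE))
        else:
            return first
    return checkpoint   # timed out: partial output instead of nothing


def scripted_model(history):
    """Stand-in for an LLM: search once, then answer."""
    if not any(kind == "observation" for kind, _ in history):
        return ("tool", "grep", "TODO")
    return ("final", "patched", "")


tools = {"grep": lambda query: f"found {query}"}
answer = run_agent(scripted_model, tools, "fix the bug")
```

Storing the first draft before sending the nudge is what avoids the timeout-with-no-partial-output failure: if the budget runs out during verification, the draft is still returned.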

For reasoning artifacts: parallel sampling + majority voting + prompt diversification dominates. None of the top reasoning artifacts used the elaborate tree-search / planner-worker designs prevalent in the literature.
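
The reasoning recipe is simple enough to sketch in a few lines. The version below assumes a hypothetical `sample(prompt)` callable standing in for a model call; the prompt variants and the stub model are invented for illustration and are not drawn from the MAC artifacts.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def majority_vote(sample, question, prompt_variants):
    """Ask the same question under several prompts in parallel; return (answer, votes)."""
    # Prompt diversification: one rendered prompt per variant template.
    prompts = [variant.format(question=question) for variant in prompt_variants]
    # Parallel sampling: pool.map preserves the order of the prompts.
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        answers = list(pool.map(sample, prompts))
    # Majority voting: the most common answer wins.
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes


def stub_sample(prompt):
    """Stand-in for a model call: one prompt style yields a wrong answer."""
    return "5" if prompt.startswith("Guess") else "4"


variants = [
    "Q: {question}",
    "Think step by step. {question}",
    "Guess: {question}",
]
winner, votes = majority_vote(stub_sample, "What is 2 + 2?", variants)
```

With the stub above, two of the three prompts agree, so the vote returns "4" with two votes.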

The implication is anti-elaboration: across two independent papers (EFC and MAC), the empirical answer for current frontier models is that minimal-but-well-tuned beats elaborate. This aligns with Boris's "shrinking harness" prediction at the agentic-loop layer, while Lopopolo's "compounding harness" thesis still holds at the workspace-rules layer (lints, reviewer agents, package policy). The two camps may be talking about different parts of the harness — see the open contradiction above.

Sources

Agentic AI in the Enterprise (Praveen Akkiraju, CXOTalk) (canonical)
Andrej Karpathy on Agentic Engineering (Sequoia AI Ascent)
Boris Cherny on Coding Is Solved (Sequoia AI Ascent)
Autonomous Software Development with Blitzy (CXOTalk)
Governing AI Agents at Scale (Glean + Cvent, CXOTalk)
Harness Engineering (Ryan Lopopolo, AI Engineer) — names the discipline; operational recipe
Context Is the New Code (Patrick Debois, AI Engineer) — companion talk; lifecycle around the harness
Scaling Laws for Agent Harnesses via Effective Feedback Compute — first measurable scaling coordinate (EFC)