Harness Engineering (Ryan Lopopolo, AI Engineer)

Keynote on harness engineering by Ryan Lopopolo (member of technical staff at OpenAI) at AI Engineer London 2026. Companion to a Latent Space podcast episode and OpenAI's harness-engineering blog post. The most operationalized recipe in this wiki for what it looks like to drive a team of agents day-to-day, and the OpenAI counterpart to Boris Cherny on Coding Is Solved (Sequoia AI Ascent).

Key claims

Code Is Free. Implementation is no longer scarce — token budgets and GPU capacity are. Each engineer "has access to 5, 50, or 5,000 engineers' worth of capacity 24/7." Skill sets shift to systems thinking, system design, and delegation
He banned his team from touching their editors. Everything goes through Codex; the local dev environment is wired in as skills the agent invokes (app launch, observability, Chrome DevTools attach via a local daemon)
The scarce resources are: human time, human/model attention, and model context window. Harness work means moving synchronous human time into higher-leverage activities
Harness = surfacing instructions to the model at the right time. Don't front-load (you overwhelm); don't omit (the model can't comply with what it can't see). Just-in-time via lints, test failures, reviewer-agent comments
Reviewer agents on every push. Persona-based: front-end architect, scalability, reliability. Each judges through one lens, surfaces P2+ blockers. Implementation agent can acknowledge / defer / reject — don't make every comment a hard block or the agent gets bullied
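One way to picture the acknowledge / defer / reject loop is as a small data model. This is a minimal sketch, not the team's implementation: the `ReviewComment` type, the numeric priority scale (0 is most severe), the reading of "P2+" as "P2 or more severe", and the rule that a push is settled once every surfaced comment has a disposition are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Disposition(Enum):
    """The three responses the talk says the implementation agent may give."""
    ACKNOWLEDGE = "acknowledge"
    DEFER = "defer"
    REJECT = "reject"


@dataclass
class ReviewComment:
    persona: str                      # e.g. "reliability" (hypothetical label)
    priority: int                     # assumed scale: 0 = most severe
    body: str
    disposition: Optional[Disposition] = None


def surfaced(comments: List[ReviewComment], threshold: int = 2) -> List[ReviewComment]:
    """Keep only comments at or above the assumed severity threshold (P2+)."""
    return [c for c in comments if c.priority <= threshold]


def unanswered(comments: List[ReviewComment], threshold: int = 2) -> List[ReviewComment]:
    """Surfaced comments the implementation agent has not yet responded to."""
    return [c for c in surfaced(comments, threshold) if c.disposition is None]
```

Under these assumptions no comment is a hard block by itself: a `REJECT` settles a comment just as an `ACKNOWLEDGE` does, which matches the point that the implementation agent should not be bullied by its reviewers.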
Tests about source code, not just behavior. E.g. "no file > 350 lines", which adapts the codebase to the harness/model by keeping every file context-efficient
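A test about source code can be very small. The sketch below checks the 350-line limit quoted in the talk; everything else (the function names, the `.ts`/`.tsx` suffix filter, the wording of the failure message) is assumed for illustration. The failure message is phrased as an instruction, since a failing check is one of the places the harness gets to tell the agent what to do next.

```python
from pathlib import Path

MAX_LINES = 350  # the limit quoted in the talk; the rest of this sketch is assumed


def oversized_files(root: Path, suffixes=(".ts", ".tsx"), max_lines: int = MAX_LINES):
    """Return (path, line_count) for each matching source file longer than max_lines."""
    offenders = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            with path.open(encoding="utf-8", errors="replace") as handle:
                count = sum(1 for _ in handle)
            if count > max_lines:
                offenders.append((path, count))
    return offenders


def failure_message(offenders, max_lines: int = MAX_LINES) -> str:
    """Phrase each violation as an instruction the agent can act on."""
    return "\n".join(
        f"{path}: {count} lines (limit {max_lines}). Split it into smaller modules."
        for path, count in offenders
    )
```

A test runner would assert that `oversized_files(repo_root)` is empty and print `failure_message(...)` otherwise.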
Garbage Collection Day. Every Friday his team takes the week's PR slop, asks why it kept happening, encodes the fix as a doc / lint / test / reviewer-agent prompt. Closes the loop between human review feedback and durable harness rules. This is the single most concrete operational ritual in the talk
Skills: 5–10, deepened over time. Not hundreds. Hide infrastructure churn inside skills so humans don't track it (he didn't notice for 3 weeks that they had moved from CDP-direct to a daemon-based browser attach — Codex just kept working)
Heavy structural enforcement. 750 packages in a PNPM workspace; package privacy + dependency-edge lints; deduplicate zod schemas; single canonical async helper. Gives the agent file-system-level hooks to scope where it works
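The talk does not show how the package-privacy and dependency-edge lints are written. As a rough sketch of the idea, assume each workspace package declares whether it is private and which packages may consume it; the `Package` type, its fields, and the package names in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Package:
    name: str
    private: bool = False                         # private: only listed consumers may import
    allowed_consumers: FrozenSet[str] = frozenset()
    dependencies: FrozenSet[str] = frozenset()


def check_edges(packages: Iterable[Package]) -> List[str]:
    """Return one message per dependency edge that breaks the privacy rules."""
    packages = list(packages)
    by_name = {p.name: p for p in packages}
    violations = []
    for pkg in packages:
        for dep_name in sorted(pkg.dependencies):
            dep = by_name.get(dep_name)
            if dep is None:
                violations.append(f"{pkg.name} -> {dep_name}: unknown workspace package")
            elif dep.private and pkg.name not in dep.allowed_consumers:
                allowed = ", ".join(sorted(dep.allowed_consumers)) or "none"
                violations.append(
                    f"{pkg.name} -> {dep_name}: {dep_name} is private (allowed consumers: {allowed})"
                )
    return violations
```

For example, if a hypothetical `ui` package depends on a private `db-internal` package whose only allowed consumer is `api`, `check_edges` reports that single edge and names the packages that are allowed, so the agent learns where the boundary is at the moment it crosses it.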
Depend on first-party harnesses (Codex, Claude Code) directly. The labs post-train models in the context of these harnesses (apply-patch tool, bash invocation semantics) — depending on them via SDK/CLI rides the post-training leverage
LLM as Fuzzy Compiler. Code is a disposable build artifact. Constraints + lints + harness rules are the optimization passes the model "compiles" through. Swapping models is like swapping a compiler backend (LLVM → Cranelift) — should produce different bytes, same soundness
Token billionaire. Burns >$1,000/day in output tokens. Token spend splits roughly into thirds: planning/ticket curation, implementation, CI work
Scaling the codebase. Started from a blank Electron app; ended up at 750 packages. The leap was structural: adopting the heavyweight architecture of a 10,000-engineer organization even with a small team, so agents have hooks to scope work
Recommendations for getting started:
- Use agents to increase confidence in existing code (write the missing tests first; better-tested code is also better navigated by agents)
- Look at where you spend time and automate that part (CI, review, flaky tests, paging)
The dream: "50 agents running 24/7 and I don't have to interact with them at all." Every "continue" he types is a failure of the harness to encode "what done looks like"
Workflow vignette: kicks off a task, tethers laptop to phone, buckles laptop into the back seat of his car, lets it cook on the 30-min commute home

Quotable lines worth carrying forward

"Every time I have to type 'continue' to the agent is a failure of the harness."
"You can just simply say 'do not produce slop. Don't accept slop.' You won't get slop in your codebase."
"All the harness should do is surface instructions to the model at the right time."
"Code is a disposable build artifact."

Q&A highlights (with Vibhu Sapra, Latent Space)

On collaboration platforms — "largely just markdown files in the repo and GitHub." PR is the hub-and-spoke broadcast domain for humans + agents
On code review — moved from synchronous human review to garbage-collection-day-fed reviewer agents. Open PRs were causing merge-conflict misery; closing them faster (via reviewer agents instead of slow humans) was the real fix
On scaling org knowledge — same way of writing things wherever the agent looks (one bounded-concurrency helper, one ORM, one CI script style) so the model's tokens are predictable across the codebase
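To make "one bounded-concurrency helper" concrete, here is a minimal sketch of what such a single canonical helper could look like. The team's codebase is TypeScript and its actual helper is not shown in the talk; this Python `asyncio` version, including the name `map_bounded`, is an assumed illustration of the pattern only.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_bounded(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int,
) -> List[R]:
    """Apply worker to every item with at most `limit` calls in flight.

    Results are returned in input order.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: T) -> R:
        # The semaphore caps how many workers run at the same time.
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))
```

The value is less in the helper itself than in there being exactly one: every call site reads the same way, so the model never has to choose between competing concurrency idioms.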
On plan mode — "my expectation is I should be able to drop a ticket in and have it do the job anyway." Plans you don't review are dangerous instructions. If you do use plans, push them as PRs and review every line
Future: take a token budget + a quarter's prioritized work + success metrics, hand to machines, let them advance the product without "hands explicitly on the wheel." Software engineering outside of writing code (triage, page response, runbooks, user feedback) becomes the meta-programming layer

Cross-source resonance

Companion to Context Is the New Code (Patrick Debois, AI Engineer) — same conference. Patrick's Context Development Lifecycle is the process view of context as a discipline; Ryan's harness engineering is the operational view of running on top of it. Both arrive at the same primitives (LLM as Judge for testing/review, packaged skills/lints for distribution, logs and PR comments for feedback)
Mirrors Boris Cherny on Coding Is Solved (Sequoia AI Ascent) at OpenAI scale. Both have hundreds of agents in flight, both write 0% of their code by hand, both treat first-party harnesses (Codex / Claude Code) as the substrate. Lopopolo is more prescriptive about org structure (750-package workspace, garbage collection day); Boris is more about parallel/loop primitives (sub-agents, /loop, /batch)
Sharpens the harness-trajectory contradiction on Harness (LLM Agents). Boris predicted harness becomes less important as models improve. Lopopolo argues harness becomes more leveraged because each new model release is a chance to re-leverage existing harness investment. Reconciliation: Boris is talking about safety scaffolding (prompt injection guards, permission modes); Lopopolo is talking about productivity scaffolding (context-surfacing lints, reviewer agents). The former may shrink; the latter likely grows
Code Is Free triangulates with Boris ("coding is solved") and Karpathy ("10× is not the speedup")
LLM as Fuzzy Compiler is the deepest articulation in the wiki of why you should engineer the harness, not the diff
Token Maxing: Lopopolo's "token billionaire" lifestyle is the personal version of what Praveen Akkiraju calls a pathology at the enterprise level. Lopopolo's framing: tokens are an investment if your harness converts them into accepted PRs

Cross-links

Ryan Lopopolo · OpenAI · Codex · AI Engineer · Code Is Free · LLM as Fuzzy Compiler · LLM as Judge · Harness (LLM Agents) · Agentic Engineering · Context Is the New Code (Patrick Debois, AI Engineer) · Boris Cherny on Coding Is Solved (Sequoia AI Ascent)

Source

Harness Engineering How to Build Software When Humans Steer, Agents Execute — Ryan Lopopolo, OpenAI.md
Companion blog: https://openai.com/index/harness-engineering/