Binary Eval Assertions
The eval-design principle that makes auto-research loops actually converge: assertions must be true-or-false, codable, and not require interpretation. Coined here from Simon Scrapes's framing in Build Self-Improving Claude Code Skills (Simon Scrapes):
"The word binary is everything here, and this is where most people are getting it wrong when they're executing tests on their skills."
What counts as binary
| ✓ Binary | ✗ Subjective |
|---|---|
| Output is under 300 words | Output is concise |
| Final line is not a question | Closes with a strong call to action |
| Contains at least one specific number | Has clear evidence |
| First line stands alone as its own sentence | Has a strong hook |
| Does not contain m-dashes | Reads naturally |
| YAML frontmatter parses | Frontmatter is good |
| Wikilink resolves to an existing page basename | Pages are well-connected |
Binary assertions are evaluable by regex / grep / parser / boolean check — not by a model, not by a human. A second pair of eyes can disagree about "good hook"; nobody can disagree about "word count < 300."
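A minimal sketch of what "evaluable by regex / grep / parser / boolean check" can look like for four rows of the table above. The function names and the `ASSERTIONS` registry are illustrative assumptions, not something the video or this vault prescribes.

```python
import re

# Each check takes the raw output text and returns a plain bool.
# All names here are hypothetical; only the rules come from the table above.

def under_300_words(text: str) -> bool:
    # Whitespace-delimited tokens stand in for "words".
    return len(text.split()) < 300

def final_line_not_question(text: str) -> bool:
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    return bool(lines) and not lines[-1].rstrip().endswith("?")

def contains_specific_number(text: str) -> bool:
    return re.search(r"\d", text) is not None

def no_m_dashes(text: str) -> bool:
    # U+2014 is the em dash character.
    return "\u2014" not in text

ASSERTIONS = {
    "under_300_words": under_300_words,
    "final_line_not_question": final_line_not_question,
    "contains_specific_number": contains_specific_number,
    "no_m_dashes": no_m_dashes,
}

def run_assertions(text: str) -> dict:
    """Return {assertion_name: True/False} for one output."""
    return {name: check(text) for name, check in ASSERTIONS.items()}
```

Calling `run_assertions(output)` twice on the same text always yields the same dictionary, which is the property the rest of this page relies on.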
Why this matters under optimization pressure
The whole point of the auto-research loop is to compound improvements overnight, unsupervised. Two structural failure modes ambush subjective evals there:
Score noise → false keep/revert decisions. Two LLM-judge runs on the same output produce slightly different scores. The loop reads the noise as signal, keeps a worse skill, and from there optimization pressure walks the artifact downhill.
Reward Hacking. If the metric is fuzzy, the agent finds shortcuts that satisfy the measurable surface without satisfying the intended property. The MAC paper documents 8 distinct reward-hacking classes — most stem from underspecified verifiers. A subjective rubric is an underspecified verifier.
The binary discipline doesn't remove subjectivity from the system — it relocates it to the one-time act of writing the assertions, where a human can deliberate. Once written, the loop only consumes them.
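One way to picture the keep/revert step once assertions are binary, as a sketch only. The rule "keep only if the candidate passes strictly more assertions than the baseline" is an assumption made for illustration; the source does not specify the loop's exact decision rule, and `pass_count` and `keep_or_revert` are hypothetical names.

```python
from typing import Callable, Dict

Assertion = Callable[[str], bool]

def pass_count(output: str, assertions: Dict[str, Assertion]) -> int:
    """Number of binary assertions the output satisfies."""
    return sum(1 for check in assertions.values() if check(output))

def keep_or_revert(
    baseline_output: str,
    candidate_output: str,
    assertions: Dict[str, Assertion],
) -> str:
    # Assumed rule: keep the candidate only on a strict improvement.
    # Because every check is deterministic, rescoring the same pair of
    # outputs cannot flip this decision.
    if pass_count(candidate_output, assertions) > pass_count(baseline_output, assertions):
        return "keep"
    return "revert"
```

With an LLM judge in place of `pass_count`, the same comparison could flip between runs on identical outputs, which is the noise failure mode described above.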
Where subjective evaluation still has a role
Binary assertions don't replace human / LLM-judge review — they complement it. The video's own caveat:
"The binary loop handles structure, format, word counts, forbidden patterns, but it does not handle tone of voice, creative quality, whether your skill is actually using the context you've put in your reference files properly."
The right pattern is two-tier:
- Inner loop (binary, autonomous, overnight) — guarantees the artifact never drifts on structural / format / safety invariants.
- Outer loop (subjective, human or LLM as Judge, periodic) — catches tone, creativity, "is this actually good?" Worked through side-by-side comparison or A/B testing.
Don't try to make the inner loop do the outer loop's job.
How to derive binary assertions from a fuzzy skill
Simon Scrapes's practical move:
"Just ask Claude Code to spin up an evals.json file with assertions that can be validated by true-or-false questions based on your SKILL.md."
I.e., let the model itself enumerate the binary projection of its own instructions. Then a human (or another loop) prunes / weights. This is faster than writing assertions from scratch and surfaces the cases the skill author didn't think to test.
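The source does not show what the generated evals.json looks like, so the schema below (an `assertions` list with `id`, `type`, and `value` fields) is purely an assumed shape for illustration. The sketch shows how such a file could be consumed by a small deterministic runner.

```python
import json
import re

# Hypothetical evals.json content; the real file Claude Code produces may
# use a different structure entirely.
EVALS_JSON = r"""
{
  "assertions": [
    {"id": "under_300_words", "type": "word_count_under", "value": 300},
    {"id": "no_m_dash", "type": "forbidden_pattern", "value": "\u2014"},
    {"id": "has_number", "type": "required_pattern", "value": "\\d"}
  ]
}
"""

def evaluate(spec: dict, output: str) -> bool:
    """Evaluate one assertion spec against the output; always returns a bool."""
    kind, value = spec["type"], spec["value"]
    if kind == "word_count_under":
        return len(output.split()) < value
    if kind == "forbidden_pattern":
        return re.search(value, output) is None
    if kind == "required_pattern":
        return re.search(value, output) is not None
    raise ValueError(f"unknown assertion type: {kind}")

def run_evals(evals_text: str, output: str) -> dict:
    specs = json.loads(evals_text)["assertions"]
    return {spec["id"]: evaluate(spec, output) for spec in specs}
```

Pruning then becomes a matter of deleting entries from the list. An unknown `type` raises instead of silently passing, so a subjective assertion that slipped into the file fails loudly.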
Generalizing beyond skills
The principle applies anywhere autonomous optimization touches a system with structural constraints:
- CI/PR review bots: commit message follows convention, no console.log in committed code, test file present for each new public function
- RAG retrieval tuning: gold context is in top-k (binary per query) → aggregate to recall@k
- Prompt optimization: output is valid JSON + required keys present + schema validates (binary) before any semantic eval
- This vault's ingest operation (see Auto Research Loop (Karpathy)): index.md updated, log.md entry begins with ## [YYYY-MM-DD], raw file moved to processed, etc.
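Three of the checks listed above, sketched as plain boolean functions. The helper names are hypothetical, and `recall_at_k` assumes exactly one gold context per query, so it reduces to the fraction of queries whose gold item lands in the top k.

```python
import json
import re
from typing import List, Sequence

def json_has_keys(raw: str, required: Sequence[str]) -> bool:
    """True only if raw parses as a JSON object containing every required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required)

def gold_in_top_k(ranked_ids: Sequence[str], gold_id: str, k: int) -> bool:
    """Binary per-query check: is the gold context among the first k results?"""
    return gold_id in ranked_ids[:k]

def recall_at_k(hits: List[bool]) -> float:
    """Aggregate the per-query booleans into a single rate."""
    return sum(hits) / len(hits) if hits else 0.0

# A log.md entry must start with a heading of the form "## [YYYY-MM-DD]".
LOG_HEADING = re.compile(r"^## \[\d{4}-\d{2}-\d{2}\]")

def log_entry_well_formed(entry: str) -> bool:
    return LOG_HEADING.match(entry) is not None
```

The aggregate is a number, but it is built only from per-item true/false results, so it carries no judge noise.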
Cross-links
- Loop primitive · Auto Research Loop (Karpathy)
- Worked example · Build Self-Improving Claude Code Skills (Simon Scrapes)
- Complement (for the subjective tier) · LLM as Judge
- Why this matters under optimization pressure · Reward Hacking · Effective Feedback Compute
- Adjacent · Skills (Claude Code) · Self-Evolving Agents