Binary Eval Assertions

The eval-design principle that makes auto-research loops actually converge: assertions must be true-or-false, codable, and free of interpretation. The term is coined here, following Simon Scrapes's framing in Build Self-Improving Claude Code Skills (Simon Scrapes):

"The word binary is everything here, and this is where most people are getting it wrong when they're executing tests on their skills."

What counts as binary

✓ Binary	✗ Subjective
Output is under 300 words	Output is concise
Final line is not a question	Closes with a strong call to action
Contains at least one specific number	Has clear evidence
First line stands alone as its own sentence	Has a strong hook
Does not contain em dashes	Reads naturally
YAML frontmatter parses	Frontmatter is good
Wikilink resolves to an existing page basename	Pages are well-connected

Binary assertions are evaluable by regex / grep / parser / boolean check — not by a model, not by a human. A second pair of eyes can disagree about "good hook"; nobody can disagree about "word count < 300."
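A few rows of the table above can be written as plain boolean functions. This is a minimal sketch, not the video's implementation: the function names and the `CHECKS` registry are hypothetical, and "specific number" is approximated here as "contains at least one digit".

```python
import re
from typing import Callable

def under_300_words(text: str) -> bool:
    # Whitespace-delimited word count, strictly below 300.
    return len(text.split()) < 300

def final_line_not_question(text: str) -> bool:
    # Look at the last non-blank line only.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return bool(lines) and not lines[-1].rstrip().endswith("?")

def contains_specific_number(text: str) -> bool:
    # Assumption: any digit counts as a "specific number".
    return re.search(r"\d", text) is not None

def no_em_dashes(text: str) -> bool:
    # U+2014 is the em dash character.
    return "\u2014" not in text

# Hypothetical registry: assertion name -> boolean check.
CHECKS: dict[str, Callable[[str], bool]] = {
    "under_300_words": under_300_words,
    "final_line_not_question": final_line_not_question,
    "contains_specific_number": contains_specific_number,
    "no_em_dashes": no_em_dashes,
}

def run_checks(text: str) -> dict[str, bool]:
    """Evaluate every registered assertion; each result is strictly True or False."""
    return {name: check(text) for name, check in CHECKS.items()}
```

Calling `run_checks(output)` twice on the same output returns the same dictionary both times, which is the property the subjective column cannot offer.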

Why this matters under optimization pressure

The whole point of the auto-research loop is to compound improvements overnight, unsupervised. Two structural failure modes ambush subjective evals there:

Score noise → false keep/revert decisions. Two LLM-judge runs on the same output produce slightly different scores. The loop reads the noise as signal, keeps a worse skill, and from there optimization pressure walks the artifact downhill.
Reward Hacking. If the metric is fuzzy, the agent finds shortcuts that satisfy the measurable surface without satisfying the intended property. The MAC paper documents 8 distinct reward-hacking classes — most stem from underspecified verifiers. A subjective rubric is an underspecified verifier.

The binary discipline doesn't remove subjectivity from the system — it relocates it to the one-time act of writing the assertions, where a human can deliberate. Once written, the loop only consumes them.
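With binary results, the keep/revert step reduces to a deterministic comparison. The sketch below assumes a hypothetical decision rule that is not taken from the source: keep a candidate only if it breaks nothing the baseline passed and passes strictly more assertions overall.

```python
def decide(baseline: dict[str, bool], candidate: dict[str, bool]) -> str:
    """Return "keep" or "revert" for a candidate skill revision.

    Both arguments map assertion names to pass/fail results.
    The rule itself is an illustrative assumption, not a prescribed policy.
    """
    # Any assertion the baseline passed but the candidate fails is a regression.
    regressions = [
        name for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    ]
    if regressions:
        return "revert"
    # No regressions: keep only if the candidate passes strictly more checks.
    improved = sum(candidate.values()) > sum(baseline.values())
    return "keep" if improved else "revert"
```

Because the inputs are booleans rather than judge scores, rerunning the comparison cannot flip the decision, so there is no noise for the loop to mistake for signal.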

Where subjective evaluation still has a role

Binary assertions don't replace human / LLM-judge review — they complement it. The video's own caveat:

"The binary loop handles structure, format, word counts, forbidden patterns, but it does not handle tone of voice, creative quality, whether your skill is actually using the context you've put in your reference files properly."

The right pattern is two-tier:

Inner loop (binary, autonomous, overnight): guarantees the artifact never drifts on structural / format / safety invariants.
Outer loop (subjective, human or LLM as Judge, periodic): catches tone, creativity, and "is this actually good?" questions, typically through side-by-side comparison or A/B testing.

Don't try to make the inner loop do the outer loop's job.
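One way to picture the split is as a gate followed by a queue. This is a rough sketch under stated assumptions: `triage` and `passes_inner_loop` are hypothetical names, and the outer loop is represented only as a list of outputs waiting for periodic human or LLM-judge review.

```python
from typing import Callable

Assertion = Callable[[str], bool]

def passes_inner_loop(output: str, assertions: list[Assertion]) -> bool:
    # The inner loop is a pure conjunction of binary checks.
    return all(check(output) for check in assertions)

def triage(outputs: list[str], assertions: list[Assertion]) -> tuple[list[str], list[str]]:
    """Split outputs into (review_queue, rejected).

    Only outputs that satisfy every binary assertion reach the
    subjective tier; the rest are rejected without any judging.
    """
    review_queue: list[str] = []
    rejected: list[str] = []
    for output in outputs:
        if passes_inner_loop(output, assertions):
            review_queue.append(output)
        else:
            rejected.append(output)
    return review_queue, rejected
```

The subjective reviewer then spends attention only on outputs that are already structurally sound, and the binary gate never has to form an opinion about tone.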

How to derive binary assertions from a fuzzy skill

Simon Scrapes's practical move:

"Just ask Claude Code to spin up an evals.json file with assertions that can be validated by true-or-false questions based on your SKILL.md."

That is, let the model itself enumerate the binary projection of its own instructions, then have a human (or another loop) prune and weight the result. This is faster than writing assertions from scratch and surfaces cases the skill author didn't think to test.
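The source does not show what the generated evals.json looks like, so the schema below is purely an assumption for illustration: a list of assertions, each with an `id`, a `type`, and a `value`. The three `type` names and the `run_evals` helper are likewise hypothetical.

```python
import json
import re

# Hypothetical evals.json content; the real file's schema is not given in the source.
# Raw string so the JSON escapes (\\? etc.) reach json.loads unchanged.
EVALS_JSON = r"""
{
  "assertions": [
    {"id": "under-300-words", "type": "word_count_below", "value": 300},
    {"id": "no-trailing-question", "type": "forbids_regex", "value": "\\?\\s*$"},
    {"id": "has-number", "type": "requires_regex", "value": "\\d"}
  ]
}
"""

def evaluate(output: str, spec: dict) -> bool:
    """Evaluate one assertion spec against an output; always returns a bool."""
    kind, value = spec["type"], spec["value"]
    if kind == "word_count_below":
        return len(output.split()) < value
    if kind == "requires_regex":
        return re.search(value, output) is not None
    if kind == "forbids_regex":
        return re.search(value, output) is None
    # Unknown types fail loudly instead of silently passing.
    raise ValueError(f"unknown assertion type: {kind}")

def run_evals(output: str, evals_json: str = EVALS_JSON) -> dict[str, bool]:
    """Map each assertion id to its pass/fail result."""
    specs = json.loads(evals_json)["assertions"]
    return {spec["id"]: evaluate(output, spec) for spec in specs}
```

Pruning then amounts to deleting entries from the list, and every entry that survives is checkable without a model in the loop.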

Generalizing beyond skills

The principle applies anywhere autonomous optimization touches a system with structural constraints:

CI/PR review bots — commit message follows convention, no console.log in committed code, test file present for new public function
RAG retrieval tuning — gold context is in top-k (binary per query) → aggregate to recall@k
Prompt optimization — output is valid JSON + required keys present + schema validates (binary) before any semantic eval
This vault's ingest operation — see Auto Research Loop (Karpathy): index.md updated, log.md entry begins with ## [YYYY-MM-DD], raw file moved to processed, etc.
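Two of the cases above, sketched under simplifying assumptions: each retrieval query has exactly one gold context id (so recall@k reduces to the fraction of queries with a hit), and "schema validates" is narrowed to "parses as a JSON object with the required keys". All function names are hypothetical.

```python
import json

def hit_at_k(ranked_ids: list[str], gold_id: str, k: int) -> bool:
    # Binary per query: is the gold context among the first k results?
    return gold_id in ranked_ids[:k]

def recall_at_k(runs: list[tuple[list[str], str]], k: int) -> float:
    """Aggregate per-query binary hits into recall@k.

    Each run is (ranked_ids, gold_id). Assumes one gold id per query.
    """
    if not runs:
        raise ValueError("need at least one query")
    hits = [hit_at_k(ranked, gold, k) for ranked, gold in runs]
    return sum(hits) / len(hits)

def valid_json_with_keys(raw: str, required: set[str]) -> bool:
    """Binary gate to run before any semantic eval of a model output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    # Must be a JSON object that contains every required key.
    return isinstance(data, dict) and required.issubset(data.keys())
```

In both cases the aggregate metric is just a count of booleans, so any change in the score traces back to specific queries or outputs that flipped.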

Cross-links

Loop primitive · Auto Research Loop (Karpathy)
Worked example · Build Self-Improving Claude Code Skills (Simon Scrapes)
Complement (for the subjective tier) · LLM as Judge
Why this matters under optimization pressure · Reward Hacking · Effective Feedback Compute
Adjacent · Skills (Claude Code) · Self-Evolving Agents