Index / Concept · updated Tue Jun 09 2026 08:00:00 GMT+0800 (Philippine Standard Time)

Binary Eval Assertions

evals · binary-assertions · agent-design · self-improvement · eval-design · llm-as-judge

The eval-design principle that makes auto-research loops actually converge: assertions must be true-or-false, codable, and not require interpretation. Coined here from Simon Scrapes's framing in Build Self-Improving Claude Code Skills (Simon Scrapes):

"The word binary is everything here, and this is where most people are getting it wrong when they're executing tests on their skills."

What counts as binary

| ✓ Binary | ✗ Subjective |
| --- | --- |
| Output is under 300 words | Output is concise |
| Final line is not a question | Closes with a strong call to action |
| Contains at least one specific number | Has clear evidence |
| First line stands alone as its own sentence | Has a strong hook |
| Does not contain m-dashes | Reads naturally |
| YAML frontmatter parses | Frontmatter is good |
| Wikilink resolves to an existing page basename | Pages are well-connected |

Binary assertions are evaluable by regex / grep / parser / boolean check — not by a model, not by a human. A second pair of eyes can disagree about "good hook"; nobody can disagree about "word count < 300."
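
A minimal sketch of how a few rows from the binary column might be coded as plain boolean checks. The function names and the `ASSERTIONS` registry are illustrative assumptions, not part of the source video; "contains a specific number" is approximated here as "contains at least one digit."

```python
import re
from typing import Callable, Dict

def under_word_limit(text: str, limit: int = 300) -> bool:
    """True if the whitespace-delimited word count is below the limit."""
    return len(text.split()) < limit

def final_line_not_question(text: str) -> bool:
    """True if the last non-empty line does not end with a question mark."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return bool(lines) and not lines[-1].endswith("?")

def contains_number(text: str) -> bool:
    """True if the text contains at least one digit (a rough proxy, by assumption)."""
    return re.search(r"\d", text) is not None

def no_em_dash(text: str) -> bool:
    """True if the text contains no em dash (U+2014)."""
    return "\u2014" not in text

# Hypothetical registry: assertion name -> check function.
ASSERTIONS: Dict[str, Callable[[str], bool]] = {
    "under_300_words": under_word_limit,
    "final_line_not_question": final_line_not_question,
    "contains_number": contains_number,
    "no_em_dash": no_em_dash,
}

def run_assertions(text: str) -> Dict[str, bool]:
    """Evaluate every registered assertion; each result is strictly True or False."""
    return {name: check(text) for name, check in ASSERTIONS.items()}
```

Running `run_assertions` twice on the same text always yields the same dictionary, which is the property the loop depends on.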

Why this matters under optimization pressure

The whole point of the auto-research loop is to compound improvements overnight, unsupervised. Two structural failure modes ambush subjective evals there:

  1. Score noise → false keep/revert decisions. Two LLM-judge runs on the same output produce slightly different scores. The loop reads the noise as signal, keeps a worse skill, and from there optimization pressure walks the artifact downhill.

  2. Reward Hacking. If the metric is fuzzy, the agent finds shortcuts that satisfy the measurable surface without satisfying the intended property. The MAC paper documents 8 distinct reward-hacking classes — most stem from underspecified verifiers. A subjective rubric is an underspecified verifier.

The binary discipline doesn't remove subjectivity from the system — it relocates it to the one-time act of writing the assertions, where a human can deliberate. Once written, the loop only consumes them.
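
One way the keep/revert decision could look once the assertions are fixed, as a sketch under stated assumptions: the names `pass_count` and `keep_or_revert` are hypothetical, and the rule "keep only on a strict improvement in pass count" is one reasonable choice rather than the method described in the video.

```python
from typing import Callable, Dict

Assertion = Callable[[str], bool]

def pass_count(output: str, assertions: Dict[str, Assertion]) -> int:
    """Number of assertions the output satisfies; deterministic for a given output."""
    return sum(1 for check in assertions.values() if check(output))

def keep_or_revert(
    baseline_output: str,
    candidate_output: str,
    assertions: Dict[str, Assertion],
) -> str:
    """Return 'keep' only if the candidate passes strictly more assertions (assumed rule)."""
    if pass_count(candidate_output, assertions) > pass_count(baseline_output, assertions):
        return "keep"
    return "revert"

# Illustrative assertions only; a real set would come from the skill's own constraints.
EXAMPLE_ASSERTIONS: Dict[str, Assertion] = {
    "under_300_words": lambda text: len(text.split()) < 300,
    "final_line_not_question": lambda text: not text.rstrip().endswith("?"),
}
```

Because every check is a pure function of the output, re-scoring the same pair of outputs cannot flip the decision, which removes the score-noise failure mode by construction.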

Where subjective evaluation still has a role

Binary assertions don't replace human / LLM-judge review — they complement it. The video's own caveat:

"The binary loop handles structure, format, word counts, forbidden patterns, but it does not handle tone of voice, creative quality, whether your skill is actually using the context you've put in your reference files properly."

The right pattern is two-tier:

  1. Inner loop (binary, autonomous, overnight) — guarantees the artifact never drifts on structural / format / safety invariants.
  2. Outer loop (subjective, human or LLM as Judge, periodic) — catches tone, creativity, "is this actually good?" Worked through side-by-side comparison or A/B testing.

Don't try to make the inner loop do the outer loop's job.
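
The separation of tiers can be sketched as a gate in front of a review queue. This is an illustrative assumption about wiring, with hypothetical names (`inner_gate`, `triage`, `review_queue`); the outer loop itself stays a human or LLM-judge activity and is deliberately not coded here.

```python
from typing import Callable, Dict, List

Assertion = Callable[[str], bool]

def inner_gate(output: str, assertions: Dict[str, Assertion]) -> List[str]:
    """Return the names of failed binary assertions; an empty list means the gate passes."""
    return [name for name, check in assertions.items() if not check(output)]

def triage(
    output: str,
    assertions: Dict[str, Assertion],
    review_queue: List[str],
) -> bool:
    """Queue the output for periodic outer-loop review only if the inner gate passes."""
    if inner_gate(output, assertions):
        return False  # structural failure: never reaches subjective review
    review_queue.append(output)
    return True
```

The inner gate never scores tone or quality; it only decides whether an artifact is structurally eligible to be looked at.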

How to derive binary assertions from a fuzzy skill

Simon Scrapes's practical move:

"Just ask Claude Code to spin up an evals.json file with assertions that can be validated by true-or-false questions based on your SKILL.md."

I.e., let the model itself enumerate the binary projection of its own instructions. Then a human (or another loop) prunes / weights. This is faster than writing assertions from scratch and surfaces the cases the skill author didn't think to test.
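
The source does not show what the generated evals.json looks like, so the schema below (an `assertions` list with `id`, `type`, and `value` fields, and the three assertion types) is purely an assumption for illustration. The sketch shows how such a file could be consumed by a small interpreter with no model in the loop.

```python
import json
import re
from typing import Dict

# Hypothetical evals.json content; the real file's schema is not given in the source.
EVALS_JSON = r"""
{
  "assertions": [
    {"id": "under-300-words", "type": "word_count_below", "value": 300},
    {"id": "no-em-dash", "type": "not_contains", "value": "\u2014"},
    {"id": "has-number", "type": "regex", "value": "\\d"}
  ]
}
"""

def evaluate(spec: dict, output: str) -> bool:
    """Evaluate one assertion spec against the output; always returns a bool."""
    kind, value = spec["type"], spec["value"]
    if kind == "word_count_below":
        return len(output.split()) < value
    if kind == "not_contains":
        return value not in output
    if kind == "regex":
        return re.search(value, output) is not None
    raise ValueError(f"unknown assertion type: {kind}")

def run_evals(evals_text: str, output: str) -> Dict[str, bool]:
    """Parse the evals document and map each assertion id to True or False."""
    specs = json.loads(evals_text)["assertions"]
    return {spec["id"]: evaluate(spec, output) for spec in specs}
```

Rejecting unknown assertion types with an error, rather than silently passing them, keeps a model-generated file from smuggling in checks the runner cannot actually verify.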

Generalizing beyond skills

The principle applies anywhere autonomous optimization touches a system with structural constraints:

  • CI/PR review bots: commit message follows convention, no console.log in committed code, test file present for new public function
  • RAG retrieval tuning: gold context is in top-k (binary per query) → aggregate to recall@k
  • Prompt optimization: output is valid JSON + required keys present + schema validates (binary) before any semantic eval
  • This vault's ingest operation — see Auto Research Loop (Karpathy): index.md updated, log.md entry begins with ## [YYYY-MM-DD], raw file moved to processed, etc.
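
Three of the bullets above can be sketched as boolean checks. The function names are hypothetical, and the recall@k here assumes exactly one gold document per query, so it reduces to the fraction of queries whose gold id appears in the top k.

```python
import json
import re
from typing import List, Sequence, Tuple

def gold_in_top_k(retrieved_ids: Sequence[str], gold_id: str, k: int) -> bool:
    """Binary per-query check: is the gold document among the first k results?"""
    return gold_id in retrieved_ids[:k]

def recall_at_k(runs: List[Tuple[Sequence[str], str]], k: int) -> float:
    """Aggregate the per-query booleans (one gold id per query, by assumption)."""
    if not runs:
        raise ValueError("need at least one query")
    hits = [gold_in_top_k(retrieved, gold, k) for retrieved, gold in runs]
    return sum(hits) / len(hits)

def valid_json_with_keys(raw: str, required_keys: Sequence[str]) -> bool:
    """True if raw parses as a JSON object containing every required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)

def log_entry_has_date_heading(line: str) -> bool:
    """True if the line begins with a heading shaped like ## [YYYY-MM-DD]."""
    return re.match(r"## \[\d{4}-\d{2}-\d{2}\]", line) is not None
```

The date check validates only the shape of the heading, not whether the date is a real calendar day; a stricter version could parse it with `datetime.date.fromisoformat`.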

Cross-links