Build Self-Improving Claude Code Skills (Simon Scrapes)
Simon Scrapes applies Andrej Karpathy's auto research loop ("never stop") directly to Claude Code skills — set up an overnight loop where a skill iterates on its own SKILL.md until it passes a fixed set of binary evals. The video is the most concrete practitioner-level instance of Recursive Self-Improvement in this vault.
The premise (Karpathy → Simon)
Karpathy's auto-research idea, summarized: give an AI system something to improve and one clear way to measure if it got better — then let it loop. Try a change, run a test, check the score; if improved, keep the change; if worse, revert. The whole thing is ~10 lines of program.md:
"If the value has improved, advance the branch and keep the commit. If the value is worse, reset to where we started. Never stop. Once the experiment loop has begun, do not pause to ask the human if you should continue."
The headline reframe: the human can sleep. The agent works indefinitely until interrupted or until gains plateau. See Auto Research Loop (Karpathy) for the loop primitive captured cleanly.
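The keep-or-revert skeleton described above can be sketched in a few lines of Python. This is a minimal illustration, not Karpathy's program.md: the names `improve`, `propose`, and `score` are hypothetical, a higher score is assumed to be better (a lower-is-better metric would need the comparison flipped), and `max_rounds` is a bounded stand-in for "never stop".

```python
from typing import Callable


def improve(
    artifact: str,
    propose: Callable[[str], str],   # hypothetical: returns a modified copy of the artifact
    score: Callable[[str], float],   # hypothetical: the single metric, higher is better
    max_rounds: int = 100,           # bounded stand-in for "never stop"
    target: float = 1.0,             # stop early once this score is reached
) -> tuple[str, float]:
    """Greedy keep-or-revert loop: accept a change only if the score improves."""
    best, best_score = artifact, score(artifact)
    for _ in range(max_rounds):
        if best_score >= target:
            break
        candidate = propose(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            # Improved: keep the change (the "advance the branch" step).
            best, best_score = candidate, candidate_score
        # Otherwise the candidate is discarded, which is the revert step.
    return best, best_score
```

Because every proposal starts from the current best version, a rejected change leaves no trace, which is the same effect the quoted instructions get from resetting the branch.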
Two layers of skill self-improvement
Simon distinguishes two distinct problems, with two distinct loops:
| Layer | Problem | Loop | Built where |
|---|---|---|---|
| 1 — Skill activation | Does Claude trigger the skill at all? (community testing: as low as 20% with vague YAML descriptions) | Skill-creator's built-in description improvement loop — feed test queries, measure trigger accuracy, propose better YAML description, retest. | Already shipped in Anthropic's skill-creator skill (Skills 2.0). improve_description.py + run_loop.py. |
| 2 — Skill output quality | Given the skill triggers, does it produce the right output? | Custom Karpathy-style loop over SKILL.md itself — feed prompts, score against binary assertions, edit SKILL.md, retest. | What this video builds. |
Layer 1 is trigger reliability; layer 2 is behavior reliability. Don't conflate.
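To make the layer 1 metric concrete, here is a small sketch of how trigger accuracy over a set of test queries could be computed. It is an illustration only and does not reproduce improve_description.py or run_loop.py; the `TriggerResult` type and `trigger_accuracy` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TriggerResult:
    query: str
    should_trigger: bool  # what the test author expects
    did_trigger: bool     # what was observed when the query was run


def trigger_accuracy(results: list[TriggerResult]) -> float:
    """Fraction of queries where the observed behavior matched the expectation."""
    if not results:
        raise ValueError("need at least one test query")
    correct = sum(r.should_trigger == r.did_trigger for r in results)
    return correct / len(results)
```

Including queries that should not trigger the skill keeps the score honest: a description that fires on everything would otherwise look perfect.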
The structural insight — binary assertions
The make-or-break decision is whether evals are binary (true/false, codable) or subjective (requires human or LLM-judge interpretation). Simon's claim — and the video's most reusable principle:
"Most people are getting it wrong when they're executing tests on their skills. The word binary is everything here."
Binary assertions (✓ automatable):
- Does not contain em dashes.
- Under 300 words.
- Final line is not a question.
- First line stands alone as its own sentence (not part of a paragraph).
- Contains at least one specific number or statistic.
Subjective assertions (✗ break the loop):
- Compelling subject line.
- Persuasive tone.
- Uses curiosity well.
Subjective criteria can be handled with LLM as Judge, but they don't produce a clean revert/keep signal — two judges may disagree, and optimization pressure on a fuzzy metric is exactly what surfaces Reward Hacking. See Binary Eval Assertions for the principle pulled out and generalized.
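The five binary assertions listed above are simple enough to write as plain predicates, which is what makes them automatable. The sketch below is one possible encoding; the function names are hypothetical, and the reading of "stands alone" (a single sentence followed by a blank line or the end of the text) is an assumption, not something the video specifies.

```python
import re

EM_DASH = "\u2014"


def no_em_dash(text: str) -> bool:
    return EM_DASH not in text


def under_300_words(text: str) -> bool:
    return len(text.split()) < 300


def final_line_not_question(text: str) -> bool:
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return bool(lines) and not lines[-1].endswith("?")


def first_line_stands_alone(text: str) -> bool:
    # Assumption: "standalone" = one sentence, then a blank line or end of text.
    lines = text.strip().splitlines()
    if not lines:
        return False
    first = lines[0].strip()
    followed_by_blank = len(lines) == 1 or lines[1].strip() == ""
    sentence_ends = re.findall(r"[.!?](?:\s|$)", first)
    return followed_by_blank and len(sentence_ends) <= 1


def contains_number(text: str) -> bool:
    # Loose proxy for "specific number or statistic": any digit counts.
    return re.search(r"\d", text) is not None
```

Each function returns only True or False, so two runs over the same output always agree. None of the subjective assertions has an equivalent one-line predicate.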
The setup (worked example from the video)
Simon's marketing-copywriting skill, version 5:
skills/marketing-copywriting/
├── SKILL.md
├── persuasion_toolkit.md
├── tone_of_voice.md
└── evals/
    └── evals.json   # 5 tests × ~5 assertions each ≈ 25 binary checks
evals.json shape (paraphrased from the video):
{
  "tests": [
    {
      "prompt": "Write a LinkedIn post about why simple automations beat complex ones",
      "expected": "LinkedIn post following brand structure rules",
      "assertions": [
        "First line appears as standalone sentence (not part of a paragraph)",
        "Contains at least one specific number or statistic",
        "Final line is not a question",
        "Total word count under 300",
        "..."
      ]
    }, /* 4 more tests */
  ]
}
Then the autonomous prompt to the skill-creator skill:
"Run a self-improvement loop on my copywriting skill. Use my evals file. If any assertions fail, make one change to SKILL.md, rerun the test, recalculate. If score improves: keep + commit. If it dropped: reset + try a different change. Log everything. Don't ask for my permissions. Keep looping until I interrupt you or you hit a perfect score."
(SKILL.md is the artifact that gets edited — analogous to tune_train.py in Karpathy's setup. The skill's reference files — persuasion_toolkit.md, tone_of_voice.md — are frozen context, not mutated by the loop.)
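As a rough illustration of the "recalculate" step, the sketch below scores a set of skill outputs against an evals file of the shape shown earlier. It is not what the video runs: the `CHECKS` registry that maps assertion text to a Python predicate is a hypothetical device for making the natural-language assertions codable, and skipping unregistered assertions is an assumption made to keep the example short.

```python
from typing import Callable

Checker = Callable[[str], bool]

# Hypothetical registry: assertion text (as written in evals.json) -> codable check.
CHECKS: dict[str, Checker] = {
    "Final line is not a question":
        lambda out: not out.strip().endswith("?"),
    "Total word count under 300":
        lambda out: len(out.split()) < 300,
    "Contains at least one specific number or statistic":
        lambda out: any(ch.isdigit() for ch in out),
}


def score(evals: dict, outputs: list[str]) -> tuple[int, int]:
    """Return (passed, total) over every assertion that has a registered check.

    `outputs[i]` is assumed to be the skill's response to `evals["tests"][i]`.
    """
    passed = total = 0
    for test, output in zip(evals["tests"], outputs):
        for assertion in test["assertions"]:
            check = CHECKS.get(assertion)
            if check is None:
                continue  # unregistered assertions are skipped, not counted
            total += 1
            passed += check(output)
    return passed, total
```

The `(passed, total)` pair is the only signal the loop needs: compare it with the previous run and either keep the SKILL.md edit or reset it.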
Result in the video
First run: 23/24 (95.8%). The one failure was a post that ended with a question, which broke a rule that existed in tone_of_voice.md but not in SKILL.md. The loop's fix: add the explicit rule to SKILL.md ("LinkedIn post must not end with a question — close with declarative statement, CTA, or a punchy fragment"). Second run: perfect score.
Simon's caveat: this skill was already on its 5th iteration, so the loop converged in 2 runs. "Where you've just created a skill, this will take many runs to actually refine and improve."
Side-by-side with Karpathy's original
| Karpathy tune_train.py | Simon SKILL.md |
|---|---|
| Read the file | Read the file |
| Change a value | Change a value |
| Run a test | Run a test (prompt → output) |
| Check val_bpb | Check pass rate (binary assertions) |
| Keep or revert | Keep or revert |
| Never stop | Never stop |
Same skeleton, different artifact. The novelty is scope: Karpathy's loop is for ML-research code; Simon's is for the instruction layer of a non-coding workflow (marketing copy). This is what makes the pattern accessible to non-coders — the artifact being optimized is a markdown file, not Python.
Limitations Simon flags
The binary loop does not handle:
- Tone of voice (subjective)
- Creative quality (subjective)
- Whether the skill is actually using the reference files properly (verifiable in principle but not via a clean binary check)
For those, he points to the skill-creator's side-by-side qualitative dashboard from his prior video (out of scope here), i.e., A/B-tested human review.
Where this connects in the wiki
- Primary concept · Auto Research Loop (Karpathy) — the loop primitive itself, lifted out as its own page so other implementations (notebook tuning, prompt optimization, retrieval params) can reference it
- Eval design · Binary Eval Assertions — the binary-vs-subjective principle, generalized away from skills
- Skills extension · Skills (Claude Code) — adds a self-improvement layer on top of forward/reverse skill construction
- Research thread · Self-Evolving Agents — practitioner-grade example at the skill layer (vs. MAC / MLEvolve which are research-grade at the agent layer)
- Theory anchor · Recursive Self-Improvement — RSI at the skill scaffolding layer (not weights, not full meta-agents) — the most tractable rung on the ladder for end users today
- Loop family · Agentic Loop — adds the auto-research / overnight variant to the catalog (alongside ReAct, RAG, scheduled /loop)
- Karpathy lineage · Andrej Karpathy — extends the "Vibe Coding / Agentic Engineering / auto research" arc of Karpathy framings
What's worth borrowing for this vault
- Write evals for CLAUDE.md operations. This vault's ingest/query/lint/journal/crm are themselves codified skills (per Skills (Claude Code)). The same loop could optimize the ingest workflow against assertions like "every ingest creates ≥1 entity page AND ≥1 concept page", "index.md updated", "log.md entry begins with ## [YYYY-MM-DD]".
- Don't conflate trigger and behavior. When something underperforms, isolate which layer broke before optimizing.
- Resist subjective evals under optimization pressure. Pushing a fuzzy metric is how Reward Hacking gets surfaced (the MAC paper has the receipts — five distinct exploit classes).
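The ingest assertions suggested above are all binary, so they could be written as checks along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, and the `entities/` and `concepts/` path prefixes are placeholders for however this vault actually distinguishes entity pages from concept pages.

```python
import re

# "log.md entry begins with ## [YYYY-MM-DD]"
LOG_HEADING = re.compile(r"^## \[\d{4}-\d{2}-\d{2}\]")


def log_entry_well_formed(entry: str) -> bool:
    return LOG_HEADING.match(entry) is not None


def creates_entity_and_concept(new_pages: list[str]) -> bool:
    # Assumption: page type is encoded in the path prefix.
    has_entity = any(p.startswith("entities/") for p in new_pages)
    has_concept = any(p.startswith("concepts/") for p in new_pages)
    return has_entity and has_concept


def index_updated(before: str, after: str) -> bool:
    # Weak check: index.md changed at all. It does not verify the change is correct.
    return before != after
```

The date pattern only checks the shape of the heading, not that the date is valid or current; a stricter check would parse it.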
Sponsor disclosure
The video ends with a pitch for Simon's Agentic Operating System — 18 production skills + Telegram interface. Worth flagging because it's the framing inside which he made the video. Not a reason to discount the technical content (which is independently sound), but his "skills are most powerful inside Claude Code" framing aligns commercial incentive with editorial position.
Sources
- source — full transcript
- Channel: Simon Scrapes · URL: https://www.youtube.com/watch?v=wQ0duoTeAAU