Updated Tue Jun 09 2026 08:00:00 GMT+0800 (Philippine Standard Time)

How Loops Are Improving Work — Sunil's Research Brief

query · loops · self-improvement · agentic-engineering · sunil-use-cases · p&g-manila-it
Confidence: 75/100 (Corroborated)
Evidence 3/5 · Triangulation 3/5 · Reasoning 4/5 · Groundedness 3/5
8 sources · 7 independent outlets · updated 24d ago
Judge’s rationale & how this score was produced

The loop taxonomy and mechanics trace cleanly to Simon Scrapes, Karpathy, Boris, and the MAC paper, and the brief correctly separates independence from improvement with reward-hacking caveats. But it contains verifiable overstatements — 'Boris does 150 PRs/day' (a one-time record, not the norm) and '8 distinct exploit classes' (5 flagged trials; 8 is the auditor's taxonomy) — and the non-coding-user claim rests almost entirely on one sponsor-disclosed video whose conflict-of-interest flag the brief drops.

What would raise confidence: An independent non-coding deployment report — a documented enterprise case of auto-research loops improving a real workflow, not a creator demo — would most strengthen the core claim.

Score = 70% LLM judge (four dimensions above, graded by Claude against the cited sources on Thu Jun 11 2026 08:00:00 GMT+0800 (Philippine Standard Time)) + 30% deterministic metrics (source count, outlet diversity, recency). Levels: 85+ High confidence · 70–84 Corroborated · 50–69 Emerging · <50 Exploratory.
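
The blend and the level bands above can be sketched in a few lines. This is a minimal illustration, not the scorer the brain actually runs: it assumes both components are already on a 0-100 scale, and the function names are hypothetical. How the four judge dimensions roll up into one judge score, and how the deterministic metrics are computed, is not stated here, so both are taken as inputs.

```python
def blended_score(judge_score: float, metric_score: float) -> float:
    """Combine the two components with the stated 70/30 weights.

    Assumption: both inputs are already normalized to 0-100.
    """
    return 0.7 * judge_score + 0.3 * metric_score


def confidence_level(score: float) -> str:
    """Map a 0-100 score to the level bands listed above."""
    if score >= 85:
        return "High confidence"
    if score >= 70:
        return "Corroborated"
    if score >= 50:
        return "Emerging"
    return "Exploratory"


if __name__ == "__main__":
    # The brief's own score of 75 falls in the 70-84 band.
    print(confidence_level(75))
```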

"Our research to find out how loops are really improving work for people in coding as well as non-coding users. There was this discussion about loops being used to help LLMs work independently and to improve continuously. I want to really take this topic and research how I can use loop in my own use cases or my own partners here." — Telegram capture #3275, 2026-06-09 08:17

Filed in response to Sunil's voice-note research prompt. Pulls together the loop literature already in the brain (Boris Cherny, MAC, Karpathy, MLEvolve, Sandeep's OODA) with the fresh practitioner instance from Simon Scrapes, then maps each loop type to a concrete Sunil-facing use case.

The taxonomy — five families of loop, with what each is for

| Family | What it does | Canonical artifact | When to reach for it | Source |
| --- | --- | --- | --- | --- |
| ReAct | One agent task: reason → act → observe → repeat until done | The tool-calling agent itself | Everything. This is the substrate. | What is OpenClaw (IBM Technology) · Meta-Agent Challenge (Autonomous Agent Development Benchmark) |
| Scheduled / cron loops (/loop) | Repeat one agent task on a schedule (cron-style trigger wraps the ReAct loop) | A /loop job (e.g. /loop 30m /babysit-prs) | Background ops that should happen without operator presence: PR babysitting, CI watching, news scanning, daily ingest | Boris Cherny on Coding Is Solved (Sequoia AI Ascent) · Claude Code |
| Auto-research loops | Improve a specific artifact overnight: try → measure → keep-or-revert → never stop | The artifact (SKILL.md, prompt, config) | Anything with a clear binary metric you want to compound on. Skills, prompts, retrieval configs, automated dashboards. | Build Self-Improving Claude Code Skills (Simon Scrapes) |
| Self-evolving / meta-agent loops | Agent designs a new agent based on the old one's behavior, repeats | The agent's own scaffolding | Research-grade today; will become the substrate for enterprise harness factories. Not deployable yet. | Meta-Agent Challenge (Autonomous Agent Development Benchmark) · MLEvolve (Self-Evolving ML Algorithm Discovery) |
| OODA Loop (adaptation layer) | Faster-cycling-beats-raw-capability framing layered over any agent loop | Decision tempo, not code | Use this as the bar for whether a loop is winning: speed-of-adaptation, not raw smarts. | You're Not Behind (Yet) Learn AI Agents (theMITmonk) |

These are not alternatives; they stack. A scheduled-cron loop (Boris: "loops are the future") wraps a ReAct loop (Agentic Loop) and can itself wrap an auto-research loop (Simon Scrapes) for any artifact you care about — all while OODA is the metric you measure them by.

The structural insight Sunil's voice-note is reaching for

Loops "help LLMs work independently and improve continuously" — that's two distinct properties that often get conflated:

  1. Independence: the agent runs without you in the chair. Scheduled loops give you this directly. (Boris's 150 PRs in a day from his phone was a one-time record, not his norm, but /loop-style background work is what made it possible.)
  2. Continuous improvement: the agent gets better over runs, not just runs more often. That's the auto-research and self-evolving families. Independence without improvement is just automation; improvement without independence is just analysis.

The bridge is binary evals: independence becomes safe (you can sleep through it) once the system has a binary metric to decide keep/revert against, because a fuzzy metric under optimization pressure invites Reward Hacking (the MAC paper flags 5 trials; the 8 exploit classes are its auditor's taxonomy). Loops compound on metrics, so the metric is the lever.
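
A keep/revert rule driven by binary checks might look like the sketch below. It is an illustration under stated assumptions, not Simon Scrapes's or Karpathy's actual implementation: `propose` stands in for whatever produces a new candidate (in practice an LLM call), the function names are hypothetical, and the loop is bounded by `max_rounds` for safety even though the pattern as described never stops.

```python
from typing import Callable, List, Tuple

# A binary assertion takes the artifact text and answers yes or no.
Assertion = Callable[[str], bool]


def pass_count(artifact: str, assertions: List[Assertion]) -> int:
    """Number of binary assertions the artifact satisfies."""
    return sum(1 for check in assertions if check(artifact))


def auto_research_loop(
    artifact: str,
    propose: Callable[[str], str],  # hypothetical: returns a revised artifact
    assertions: List[Assertion],
    max_rounds: int = 20,  # assumption: bounded here, unlike "never stop"
) -> Tuple[str, int]:
    """Try, measure, then keep the candidate or revert to the current best."""
    best = artifact
    best_score = pass_count(best, assertions)
    for _ in range(max_rounds):
        if best_score == len(assertions):
            break  # every assertion passes, nothing left to gain
        candidate = propose(best)
        score = pass_count(candidate, assertions)
        if score > best_score:
            best, best_score = candidate, score  # keep
        # otherwise revert: the candidate is simply discarded
    return best, best_score
```

Because the keep decision is a strict comparison of pass counts, a candidate that merely looks better cannot displace the current best; only one that passes more checks can.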

Use cases mapped to Sunil's actual surface

Coding-side (engineer audience: P&G IT teams, engineering leads)

  • /loop for PR babysitting: Boris's canonical example. Pattern: one /loop job rebases your PRs against main, runs pre-commit run, fixes lint, and requests review when green. Sunil's lever: bring this into the IT team's workflow so junior engineers learn agentic engineering with the loop already doing the boring parts. Pairs with the DRAG rollout: /loop is how Grunt gets fully delegated.
  • Auto-research loops on internal coding skills: every team has tribal knowledge encoded as "how we write a PR description," "how we deprecate an API," "how we write a postmortem." Each of those is a skill: it can be encoded as a SKILL.md, given Binary Eval Assertions ("postmortem includes timeline, blast radius, action items, RCA"), and left to self-improve overnight. This is the version of Skills (Claude Code) that scales to a 1,000-person org.
  • Loop over the harness itself: MAC is the research-grade version. The practical takeaway today: pre-search warming + minimal ReAct + a verification nudge beat elaborate scaffolding. Sunil's audience doesn't need to run MAC; they need to know that the empirical evidence says don't over-engineer the loop. Effective Feedback Compute is the axis.

Non-coding side (operator audience: IT LT, marketing, P&G enterprise users)

  • Auto-research loops on marketing-style skills: Simon Scrapes's worked example is this case (a sponsor-disclosed creator demo, so treat it as illustrative rather than independent evidence). A LinkedIn-post skill, a press-release skill, a board-deck-bullet skill. Each gets binary assertions ("opens with a number," "under 300 words," "no em-dashes," "first line is a standalone sentence"), and a loop runs overnight until the assertions pass. Operator wakes up to a tighter skill. This is the demonstrable version Sunil can show the IT LT in 5 minutes, for the same audience as the brain demo.
  • Scheduled loops on this very wiki: process-raw already runs every 12h (autonomous-mode ingest of Telegram captures and dropped sources). The same pattern extends to a daily news-scan loop (the daily-watch feed Sunil already runs is exactly this), a weekly CRM stale-contact loop ("flag anyone in crm/ with no contact in 90 days"), and a periodic lint loop (catches contradictions and orphans).
  • Loops + Telegram interface: Simon's commercial product gets this right structurally: a loop runs in the background, and the operator interacts with it via phone. Sunil's openclaw → Telegram → GitHub → process-raw chain is this pattern, just less packaged. Worth showing the IT LT as "background AI you don't sit and watch": the OODA bar made tangible.
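
The four LinkedIn-post assertions quoted above translate almost directly into code. The sketch below is one possible reading, not the checks from Simon Scrapes's video: the function names are hypothetical, and "standalone sentence" is interpreted (as an assumption) to mean the first line ends with terminal punctuation.

```python
import re
from typing import Callable, Dict


def opens_with_number(post: str) -> bool:
    """The post's first non-space character is a digit."""
    return re.match(r"\s*\d", post) is not None


def under_300_words(post: str) -> bool:
    """Whitespace-separated word count is below 300."""
    return len(post.split()) < 300


def no_em_dashes(post: str) -> bool:
    """No em-dash character (U+2014) anywhere in the post."""
    return "\u2014" not in post


def first_line_is_standalone_sentence(post: str) -> bool:
    """Assumed reading: the first line ends with '.', '!' or '?'."""
    lines = post.strip().splitlines()
    return bool(lines) and lines[0].strip().endswith((".", "!", "?"))


ASSERTIONS: Dict[str, Callable[[str], bool]] = {
    "opens with a number": opens_with_number,
    "under 300 words": under_300_words,
    "no em-dashes": no_em_dashes,
    "first line is a standalone sentence": first_line_is_standalone_sentence,
}


def evaluate(post: str) -> Dict[str, bool]:
    """Run every binary assertion and report pass/fail per rule."""
    return {name: check(post) for name, check in ASSERTIONS.items()}
```

Each check returns a plain boolean, so the overnight loop has nothing to interpret: a draft either passes all four rules or it does not.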

Cross-cutting (works either side)

  • Daily-watch / digest loops — Sunil already runs one (it's how MAC, MLEvolve, and now Simon's video reach the brain). The compounding move: have the digest itself be auto-research-loop-improved, with binary assertions like "every item links to ≥1 wiki concept," "every item has a 1-line takeaway," "no item duplicates last week's."
  • Sleep arbitrage — every loop runs while the operator sleeps. For a Manila site, this is also a timezone arbitrage play: a loop running overnight in PHT delivers results by APAC morning that Western teams would normally see only by their next afternoon. Pair this with the Frontier GCC framing — auto-research loops are how a GCC out-cycles its parent on routine work.

What Sunil's partners are likely to ask

(Anticipating the questions an IT LT member or a brand-side partner would ask after Sunil demos a loop:)

  • "How do I trust a loop running overnight?" → Binary assertions are the trust mechanism. The loop can only do what the assertions verify. See Binary Eval Assertions for the discipline; Bike Method for the rollout (start with read-only loops, earn phases).
  • "What happens when the loop drifts?" → Two failure modes: (1) noisy metric → false keeps. Binary evals close this. (2) Reward Hacking → agent satisfies the surface, misses the intent. Mitigated by writing assertions carefully and by a periodic subjective human/LLM-judge review tier on top (see Skills (Claude Code) two-tier pattern). The combination is what makes overnight running safe.
  • "How do I know when to stop?" → Karpathy's framing: when there are no additional gains to be made. In practice: when the assertion-pass-rate plateaus for N runs and the side-by-side human review says "good enough." The loop measures the first; you measure the second.
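
The "plateaus for N runs" half of the stop rule is the part a loop can measure for itself. A minimal sketch, assuming pass rates are recorded once per run as fractions between 0 and 1; the function name and the `min_gain` tolerance are illustrative choices, not something the sources specify.

```python
from typing import Sequence


def has_plateaued(pass_rates: Sequence[float], n: int, min_gain: float = 0.0) -> bool:
    """True when the last n runs failed to beat the earlier best by more than min_gain.

    pass_rates holds one assertion pass rate per run, oldest first.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if len(pass_rates) <= n:
        return False  # not enough history to compare against
    best_before = max(pass_rates[:-n])
    best_recent = max(pass_rates[-n:])
    return best_recent - best_before <= min_gain
```

The human side of the rule, the side-by-side "good enough" review, stays outside the function on purpose.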

The recommendation (autonomous-mode call)

If Sunil wants to show this to his audience in one demo:

  1. Pick one skill he already uses repeatedly — the most useful candidate inside this brain is the ingest skill (codified in CLAUDE.md).
  2. Write the binary assertions — index.md updated, log.md entry well-formed, raw file moved to raw/processed/, every wiki page YAML-valid, every new wikilink resolves to a basename.
  3. Run an auto-research loop on the ingest portion of CLAUDE.md — Karpathy's pattern applied to this vault's own operating instructions. The artifact being improved is the very file you're editing this brain by.
  4. Show the IT LT the diff after one overnight run. "My second brain rewrote its own ingest instructions overnight to catch a class of failures it was producing." This lands harder than any slide.
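
Two of the ingest assertions from step 2 can be sketched with the standard library alone. This is an illustration under assumptions, not the vault's actual checks: it assumes Obsidian-style `[[Page]]`, `[[Page|alias]]`, and `[[Page#Heading]]` links, pages stored as `.md` files, and the function names are hypothetical. YAML validity is left out because it would need a third-party parser.

```python
import re
from pathlib import Path
from typing import List

# Captures the link target, stopping at an alias (|), heading (#), or the closing ]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def unresolved_wikilinks(vault: Path) -> List[str]:
    """List every wikilink whose target is not the basename of some .md page."""
    basenames = {page.stem for page in vault.rglob("*.md")}
    missing = []
    for page in sorted(vault.rglob("*.md")):
        text = page.read_text(encoding="utf-8")
        for target in WIKILINK.findall(text):
            if target.strip() not in basenames:
                missing.append(f"{page.name}: [[{target.strip()}]]")
    return missing


def raw_file_was_moved(vault: Path, filename: str) -> bool:
    """True when the file sits in raw/processed/ and no longer in raw/."""
    processed = vault / "raw" / "processed" / filename
    original = vault / "raw" / filename
    return processed.exists() and not original.exists()
```

An empty list from `unresolved_wikilinks` and a `True` from `raw_file_was_moved` are each a single pass/fail signal the loop can keep or revert against.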

The brand fodder is "loops are how the brain gets better while you sleep." The structural truth underneath: the brain has an artifact, a metric, and a keep/revert rule, so it can compound.

Cross-links