Jagged Intelligence
Tags: llm, capabilities, jagged-intelligence, verifiability
The observation that frontier LLMs simultaneously demonstrate superhuman capability in some domains and fail at trivial tasks in others: their capability profile is spiky, not smooth.
Two independent sources in this wiki use the term, suggesting it's becoming standard vocabulary for this phenomenon.
Two canonical examples
- Karpathy's car wash (Andrej Karpathy on Agentic Engineering (Sequoia AI Ascent)):
"How is it possible that state-of-the-art Opus 4.7 will simultaneously refactor a 100,000-line codebase or find zero-day vulnerabilities and yet tells me to walk to a 50-meter car wash? This is insane."
- Strawberry letter-counting: models miscounting the letter "r" in "strawberry". Older, but the same shape; mostly patched now.
Why it happens (Karpathy's hypothesis)
- Verifiable domains scale via RL. Frontier models are trained in giant RL environments with verification rewards. Math, code, and adjacent verifiable tasks get RL'd hard → capability peaks there.
- Lab focus matters too. Capabilities often follow what labs decided to put in the data distribution, not just what's verifiable. The jump in chess ability from GPT-3.5 to GPT-4 is the example: someone at OpenAI added a lot of chess data, and capability peaked.
- Combined: verifiable + lab cares. If you're in the circuits the labs RL'd, you fly. If you're outside, you struggle.
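To make "verification rewards" concrete, here is a minimal sketch of what a binary, programmatically checked reward could look like. The names `Task`, `verification_reward`, and `check_sum` are hypothetical and chosen for illustration; this is not how any lab's RL environment is actually implemented, only the general shape of "an answer that can be checked by code".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """Hypothetical container: a prompt plus a programmatic answer checker."""
    prompt: str
    check: Callable[[str], bool]


def verification_reward(task: Task, answer: str) -> float:
    """Binary reward: 1.0 if the verifier accepts the answer, else 0.0."""
    try:
        return 1.0 if task.check(answer) else 0.0
    except Exception:
        # A verifier that raises on malformed output counts as a failure.
        return 0.0


def check_sum(answer: str) -> bool:
    # Illustrative verifier: the answer must parse as the integer 17 + 25.
    return int(answer.strip()) == 17 + 25


arithmetic = Task(prompt="What is 17 + 25?", check=check_sum)

candidates = ["42", " 42\n", "41", "forty-two"]
rewards = [verification_reward(arithmetic, a) for a in candidates]
print(rewards)  # [1.0, 1.0, 0.0, 0.0]
```

The point of the sketch is the asymmetry: arithmetic or code has a cheap checker like `check_sum`, while a question such as "should I walk or drive to the car wash?" has no comparably obvious one, which is consistent with the hypothesis that capability peaks where checkers exist.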
Implications
- For founders (Andrej Karpathy on Agentic Engineering (Sequoia AI Ascent)): identify verifiable domains the labs aren't focused on; build RL environments; fine-tune for spiky capability.
- For enterprise (Agentic AI in the Enterprise (Praveen Akkiraju, CXOTalk)): jaggedness is why the Harness (LLM Agents) matters so much. The harness encodes the missing context, guardrails, and observability needed to keep the agent in the circuits where it works.
- For users: stay in the loop. Treat the model as a tool, verify outputs, especially in unfamiliar territory.
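As a rough illustration of "verify outputs" and of a harness acting as a guardrail, the sketch below wraps a model call in a check-and-retry loop and hands off to a human when nothing passes. Everything here is an assumption made for illustration: `run_with_verification`, the stand-in `fake_model`, and the keyword-based `mentions_driving` check are hypothetical, and a real harness would use a real model client and a much stronger verifier.

```python
from typing import Callable, Iterable, Optional


def run_with_verification(
    model: Callable[[str], str],
    prompt: str,
    verify: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first output that passes `verify`, or None to signal human review."""
    for _ in range(max_attempts):
        output = model(prompt)
        if verify(output):
            return output
    return None  # nothing passed: keep the human in the loop


def make_fake_model(outputs: Iterable[str]) -> Callable[[str], str]:
    """Stand-in for a real model call; replays canned outputs in order."""
    it = iter(outputs)
    return lambda prompt: next(it)


def mentions_driving(text: str) -> bool:
    # Toy check for the car wash example: the car has to be driven there.
    return "drive" in text.lower()


fake_model = make_fake_model(["Walk, it is only 50 meters.", "Drive the car there."])
result = run_with_verification(
    fake_model,
    "How do I get my car to the car wash 50 meters away?",
    mentions_driving,
)
print(result)  # Drive the car there.
```

Returning `None` instead of the last failed output is a deliberate choice in this sketch: an unverified answer is never passed along silently, so the caller has to decide what happens next.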
Karpathy's broader framing
The "ghosts not animals" piece: these aren't intelligences with intrinsic motivation; they're statistical simulation circuits. Yelling at them doesn't help; understanding what's in their training distribution does.