
When AI Becomes Its Own Scientist — Inside the Evolution Arena and the Rise of Autoresearch

Imagine an AI that doesn't wait for a human to tune its parameters. It proposes its own experiments, runs them against a live simulation, measures the outcome with a hard number, and decides on its own whether to keep the change. No vibes. No subjective review. Just: did the score go up?

This isn't a thought experiment. It's the core mechanic of Autoresearch, a pattern popularized by Andrej Karpathy in early 2026 and now spreading across dozens of domains — from GPU kernel optimization to Bitcoin price modeling to autonomous ML training. The Evolution Arena project takes this pattern and drops it into a survival game, creating one of the clearest, most tangible demonstrations of what happens when you hand an AI the keys to its own codebase and say: make yourself better.

The Autoresearch Pattern: A Loop That Never Quits

At its heart, Autoresearch is a brutally simple feedback loop. The agent operates in four phases:

1. Propose a change. The AI agent examines the current state of the code — steering logic, configuration values, decision thresholds — and hypothesizes a modification that might improve performance.

2. Run the simulation. The modified code is executed against the environment. In Evolution Arena, this means dropping the creature into the 2D world and letting it play.

3. Evaluate the score. A hard, mechanical metric determines whether the change helped. Not a language model's opinion. Not a human's gut feeling. A number.

4. Keep or revert. If the score improved, the change is committed. If it didn't, the change is rolled back to the last known-good state. Then the loop restarts.

This cycle runs indefinitely — or until you interrupt it. Karpathy's original implementation ran roughly 12 experiments per hour, producing around 100 experiments overnight while the researcher slept.
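The four phases can be sketched as a single Python function. This is an illustrative skeleton, not the project's actual code: `propose`, `run`, and `evaluate` are hypothetical callables standing in for the agent, the simulation, and the scoring step.

```python
import random

def autoresearch_loop(propose, run, evaluate, n_experiments=100):
    """Minimal sketch of the propose → run → evaluate → keep/revert loop.
    `propose(best)` returns a candidate change, `run(candidate)` executes
    the simulation, `evaluate(result)` reduces it to one number."""
    best_score = float("-inf")
    best_state = None
    for _ in range(n_experiments):
        candidate = propose(best_state)      # 1. propose a change
        result = run(candidate)              # 2. run the simulation
        score = evaluate(result)             # 3. mechanical score, no opinions
        if score > best_score:               # 4. keep if strictly better...
            best_score, best_state = score, candidate
        # ...otherwise the candidate is simply discarded (the revert)
    return best_state, best_score
```

With toy callables (random candidates, identity simulation), the loop returns the best candidate it ever saw — the essential property being that `best_score` only moves up.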

[Diagram: the four-phase loop — 1 · PROPOSE: agent reads game.py and config.json, hypothesizes a change → 2 · RUN: creature plays N fixed steps, collects food, takes damage → 3 · EVALUATE: mechanical score, score = food · 10 − damage · 2 → 4 · KEEP / REVERT: if score > best, commit; else revert. Only forward, never back. ~12 experiments / hour, runs unattended overnight.]
The Autoresearch loop. Four phases, one direction of travel, and no human in the inner cycle.

The Ratchet Mechanism: Why "Only Forward" Matters

The most critical architectural decision in any Autoresearch system is the Ratchet Script. This is the component that enforces monotonic improvement — the system can only move forward, never backward.

Why is this so important? Without the ratchet, an autonomous agent can easily oscillate. It improves in one dimension, then breaks something in another, then "fixes" the breakage by reverting the original improvement. You end up with a random walk instead of a climb. The ratchet transforms exploration into accumulation. Every successful experiment becomes the new floor. The agent can try wild, speculative changes — because if they fail, the system snaps back to its best-ever state.

This is what distinguishes Autoresearch from simple hyperparameter sweeps or random search. The agent isn't sampling from a grid. It's building on its own history, reading past results, identifying what worked, and choosing its next experiment based on the trajectory so far. Git acts as the memory. The results log acts as the experiment journal. The ratchet ensures the journal only records breakthroughs.

[Chart: score (0–100) vs. experiment number (0–100) — a rising staircase of the best-so-far score, grey dots marking failed experiments that were reverted, and a labeled jump: "breakthrough: structural refactor". The best-so-far line only moves up.]
Score over experiments. The staircase is the best score seen so far; grey dots are failed experiments that the ratchet reverted. The line never drops.

The Arena: A 2D Survival Testbed

Evolution Arena provides the simulation layer for this loop: a 2D environment where a creature must navigate a world of food and hazards.

The creature exists in a bounded space populated with two types of objects. Food items are scattered across the map — each one collected adds to the creature's score. Hazards are obstacles that deal damage on contact. The creature must develop logic to seek food efficiently while steering clear of danger.
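A toy version of that world fits in a dataclass. The names and defaults here are illustrative, not the repository's actual layout — just enough structure to show the two object types in a bounded space.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Arena:
    """Sketch of the bounded 2D world: food adds to the score on pickup,
    hazards deal damage on contact. Illustrative names, not the repo's."""
    width: int = 100
    height: int = 100
    food: list = field(default_factory=list)     # (x, y) pickups
    hazards: list = field(default_factory=list)  # (x, y) damage sources

    @classmethod
    def random_world(cls, n_food=30, n_hazards=10, seed=0):
        rng = random.Random(seed)  # seeded, so worlds are reproducible
        world = cls()
        world.food = [(rng.randrange(world.width), rng.randrange(world.height))
                      for _ in range(n_food)]
        world.hazards = [(rng.randrange(world.width), rng.randrange(world.height))
                         for _ in range(n_hazards)]
        return world
```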

The Scoring System: Teaching Caution to an Algorithm

The creature's fitness is calculated with a deliberately asymmetric formula:

score = (food_collected * 10) - (damage_taken * 2)

This balance is more subtle than it appears. A pure food-maximizer that ignores hazards will rack up damage penalties that erode its gains. A purely cautious creature that avoids all risk will starve for points. The 5:1 reward-to-penalty ratio pushes the agent toward confident but careful behavior — it should aggressively pursue food, but not at the cost of reckless collisions.

This scoring asymmetry is what makes the arena a meaningful testbed for Autoresearch. The agent isn't just optimizing a single variable; it's learning to balance competing pressures. Every code change it proposes must navigate this tradeoff. A new steering algorithm that doubles food intake but triples damage will be reverted by the ratchet. Only changes that improve the composite score survive.
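The tradeoff is easy to see with concrete numbers. The formula below is the article's own; the example inputs are made up to illustrate one case where doubling food intake while tripling damage lowers the composite score.

```python
def score(food_collected: int, damage_taken: int) -> int:
    """The arena's fitness formula: reward food 10 points, penalize damage 2."""
    return food_collected * 10 - damage_taken * 2

# illustrative inputs: a "twice the food, three times the damage" change
# that the ratchet would revert
baseline = score(food_collected=10, damage_taken=30)  # 100 - 60 = 40
risky    = score(food_collected=20, damage_taken=90)  # 200 - 180 = 20
```

Whether such a change survives depends on where the creature starts: at low damage counts the food bonus dominates, but once collisions are frequent, the 2-point penalty per damage unit eats the gains.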

The Role of the Agent: Editing Its Own DNA

Here is where Evolution Arena becomes genuinely fascinating. The AI agent — such as Claude Code running in a terminal — doesn't play the game in real time. Instead, it modifies the source code and configuration files that govern how the creature behaves.

The agent interacts primarily with two files: game.py, which implements the creature's steering and decision logic, and config.json, which holds tunable parameters such as vision range and movement speed.

The Evaluation Wrapper orchestrates each experiment. It takes the agent's proposed changes, runs the simulation for a fixed number of steps, captures the score, and passes it back to the ratchet for the keep-or-revert decision. The wrapper ensures that every experiment is directly comparable — same number of steps, same initial conditions, same scoring formula.
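A minimal wrapper along those lines might look as follows. The `game_step` callable is a hypothetical stand-in for one tick of the simulation; the real wrapper's interface will differ, but the comparability guarantees — fixed steps, fixed seed, fixed formula — are the ones the text describes.

```python
import random

def evaluate(game_step, steps: int = 500, seed: int = 0) -> int:
    """Run a fixed-budget experiment and return the mechanical score.
    `game_step(rng)` is assumed to advance the world one tick and return
    (food_eaten, damage_taken) for that tick."""
    rng = random.Random(seed)       # same initial conditions every run
    food, damage = 0, 0
    for _ in range(steps):          # same number of steps every run
        ate, hit = game_step(rng)
        food += ate
        damage += hit
    return food * 10 - damage * 2   # same scoring formula every run
```

Because the seed and step budget are pinned, two runs of the same code produce the same score — which is what makes keep-or-revert decisions meaningful.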

What emerges is something that looks remarkably like evolution. Early iterations might produce simple changes: bump the vision range up, increase movement speed slightly. But as the easy gains are exhausted, the agent begins making structural changes — rewriting the pathfinding logic, introducing state machines for different behavioral modes, adding predictive avoidance for hazards that are still several steps away. The ratchet ensures that each of these structural leaps only sticks if it actually produces better outcomes.

Tech Stack: Deliberately Minimal

Python 3.8+ · Claude Code CLI · Bash · Git

Evolution Arena is built to be forked, not admired. The entire stack is intentionally minimal.

Python 3.8+ for the game simulation, scoring, and evaluation wrapper. No heavy frameworks, no distributed training infrastructure. One file runs the game. One file scores it.

Claude Code CLI (or any compatible coding agent) as the researcher. The agent reads the codebase, proposes changes, and interacts with the file system directly through the terminal. It doesn't need a custom API or plugin — it just edits files and runs scripts.

Bash scripts for orchestration. The autoresearch.sh script ties the loop together: invoke the agent, run the simulation, evaluate, commit or revert, log, repeat. Bash is the glue because the loop itself should be trivially inspectable. You can read the orchestration script in under a minute and understand exactly what happens at each step.

This minimalism is intentional. Karpathy's original Autoresearch was a single GPU, one file, one metric. Evolution Arena follows the same philosophy: the simpler the infrastructure, the easier it is to understand what the agent is doing versus what the system is doing.

Future Implications: Beyond the Arena

Evolution Arena is a game, but the pattern it demonstrates is not. The Autoresearch loop — propose, test, score, keep or revert — is already being applied to problems far more consequential than 2D creature survival.

The key insight is that any domain with a mechanical metric — a number you can compute without human judgment — is a candidate for this pattern. The arena's (food · 10) − (damage · 2) is a toy version of what could be p99_latency_ms, validation_bits_per_byte, inference_tokens_per_second, or revenue_per_user. The ratchet doesn't care what the number means. It only cares whether it went up.

Clone the Arena. See How High Your AI Can Score.

The Evolution Arena is open source, minimal, and designed to run on a single machine. If you've ever been curious about what happens when you point a coding agent at its own codebase and tell it to improve — this is the cleanest sandbox to find out.

git clone https://github.com/josephgec/evolution-arena.git
cd evolution-arena
# point your agent at the repo and start the loop
bash autoresearch.sh

Watch the results log. See the score climb. Pay attention to the kinds of changes the agent makes as easy gains dry up and it starts getting creative. That's when it gets interesting.

The arena is small. The pattern is not.

References

The foundational ideas and sibling projects this one builds on.

Autoresearch pattern

Related ideas: self-improving systems & search

Project source