
When AI Becomes Its Own Scientist — Inside the Evolution Arena and the Rise of Autoresearch

Imagine an AI that doesn't wait for a human to tune its parameters. It proposes its own experiments, runs them against a live simulation, measures the outcome with a hard number, and decides on its own whether to keep the change. No vibes. No subjective review. Just: did the score go up?

This isn't a thought experiment. It's the core mechanic of Autoresearch, a pattern popularized by Andrej Karpathy in early 2026 and now spreading across dozens of domains — from GPU kernel optimization to Bitcoin price modeling to autonomous ML training. The Evolution Arena project takes this pattern and drops it into a survival game, creating one of the clearest, most tangible demonstrations of what happens when you hand an AI the keys to its own codebase and say: make yourself better.

The Autoresearch Pattern: A Loop That Never Quits

At its heart, Autoresearch is a brutally simple feedback loop. The agent operates in four phases:

1. Propose a change. The AI agent examines the current state of the code — steering logic, configuration values, decision thresholds — and hypothesizes a modification that might improve performance.

2. Run the simulation. The modified code is executed against the environment. In Evolution Arena, this means dropping the creature into the 2D world and letting it play.

3. Evaluate the score. A hard, mechanical metric determines whether the change helped. Not a language model's opinion. Not a human's gut feeling. A number.

4. Keep or revert. If the score improved, the change is committed. If it didn't, the change is rolled back to the last known-good state. Then the loop restarts.

This cycle runs indefinitely — or until you interrupt it. Karpathy's original implementation ran roughly 12 experiments per hour, producing around 100 experiments overnight while the researcher slept.
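The four phases can be sketched as a single Python function. This is an illustrative skeleton, not the project's actual code: `propose`, `run`, and `evaluate` are hypothetical callables standing in for the agent, the simulation, and the scoring step.

```python
import random

def autoresearch_loop(propose, run, evaluate, n_experiments=100):
    """Minimal sketch of the propose → run → evaluate → keep/revert loop.
    `propose(best)` returns a candidate change, `run(candidate)` executes
    the simulation, `evaluate(result)` reduces it to one number."""
    best_score = float("-inf")
    best_state = None
    for _ in range(n_experiments):
        candidate = propose(best_state)      # 1. propose a change
        result = run(candidate)              # 2. run the simulation
        score = evaluate(result)             # 3. mechanical score, no opinions
        if score > best_score:               # 4. keep if strictly better...
            best_score, best_state = score, candidate
        # ...otherwise the candidate is simply discarded (the revert)
    return best_state, best_score
```

With toy callables (random candidates, identity simulation), the loop returns the best candidate it ever saw — the essential property being that `best_score` only moves up.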

[Diagram: the four-phase loop — 1 · PROPOSE: agent reads game.py and config.json, hypothesizes a change → 2 · RUN: creature plays N fixed steps, collects food, takes damage → 3 · EVALUATE: mechanical score, score = food · 10 − damage · 2 → 4 · KEEP / REVERT: if score > best, commit; else revert. Only forward, never back. ~12 experiments / hour, runs unattended overnight.]
The Autoresearch loop. Four phases, one direction of travel, and no human in the inner cycle.

The Ratchet Mechanism: Why "Only Forward" Matters

The most critical architectural decision in any Autoresearch system is the Ratchet Script. This is the component that enforces monotonic improvement — the system can only move forward, never backward.

Why is this so important? Without the ratchet, an autonomous agent can easily oscillate. It improves in one dimension, then breaks something in another, then "fixes" the breakage by reverting the original improvement. You end up with a random walk instead of a climb. The ratchet transforms exploration into accumulation. Every successful experiment becomes the new floor. The agent can try wild, speculative changes — because if they fail, the system snaps back to its best-ever state.

This is what distinguishes Autoresearch from simple hyperparameter sweeps or random search. The agent isn't sampling from a grid. It's building on its own history, reading past results, identifying what worked, and choosing its next experiment based on the trajectory so far. Git acts as the memory. The results log acts as the experiment journal. The ratchet ensures the journal only records breakthroughs.

[Chart: score (0–100) vs. experiment number (0–100) — a rising staircase of the best-so-far score, grey dots marking failed experiments that were reverted, and a labeled jump: "breakthrough: structural refactor". The best-so-far line only moves up.]
Score over experiments. The staircase is the best score seen so far; grey dots are failed experiments that the ratchet reverted. The line never drops.

The Arena: A 2D Survival Testbed

Evolution Arena provides the simulation layer for this loop: a 2D environment where a creature must navigate a world of food and hazards.

The creature exists in a bounded space populated with two types of objects. Food items are scattered across the map — each one collected adds to the creature's score. Hazards are obstacles that deal damage on contact. The creature must develop logic to seek food efficiently while steering clear of danger.
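A toy version of that world fits in a dataclass. The names and defaults here are illustrative, not the repository's actual layout — just enough structure to show the two object types in a bounded space.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Arena:
    """Sketch of the bounded 2D world: food adds to the score on pickup,
    hazards deal damage on contact. Illustrative names, not the repo's."""
    width: int = 100
    height: int = 100
    food: list = field(default_factory=list)     # (x, y) pickups
    hazards: list = field(default_factory=list)  # (x, y) damage sources

    @classmethod
    def random_world(cls, n_food=30, n_hazards=10, seed=0):
        rng = random.Random(seed)  # seeded, so worlds are reproducible
        world = cls()
        world.food = [(rng.randrange(world.width), rng.randrange(world.height))
                      for _ in range(n_food)]
        world.hazards = [(rng.randrange(world.width), rng.randrange(world.height))
                         for _ in range(n_hazards)]
        return world
```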

The Scoring System: Teaching Caution to an Algorithm

The creature's fitness is calculated with a deliberately asymmetric formula:

score = (food_collected * 10) - (damage_taken * 2)

This balance is more subtle than it appears. A pure food-maximizer that ignores hazards will rack up damage penalties that erode its gains. A purely cautious creature that avoids all risk will starve for points. The 5:1 reward-to-penalty ratio pushes the agent toward confident but careful behavior — it should aggressively pursue food, but not at the cost of reckless collisions.

This scoring asymmetry is what makes the arena a meaningful testbed for Autoresearch. The agent isn't just optimizing a single variable; it's learning to balance competing pressures. Every code change it proposes must navigate this tradeoff. A new steering algorithm that doubles food intake but triples damage will be reverted by the ratchet. Only changes that improve the composite score survive.
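The tradeoff is easy to see with concrete numbers. The formula below is the article's own; the example inputs are made up to illustrate one case where doubling food intake while tripling damage lowers the composite score.

```python
def score(food_collected: int, damage_taken: int) -> int:
    """The arena's fitness formula: reward food 10 points, penalize damage 2."""
    return food_collected * 10 - damage_taken * 2

# illustrative inputs: a "twice the food, three times the damage" change
# that the ratchet would revert
baseline = score(food_collected=10, damage_taken=30)  # 100 - 60 = 40
risky    = score(food_collected=20, damage_taken=90)  # 200 - 180 = 20
```

Whether such a change survives depends on where the creature starts: at low damage counts the food bonus dominates, but once collisions are frequent, the 2-point penalty per damage unit eats the gains.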

The Role of the Agent: Editing Its Own DNA

Here is where Evolution Arena becomes genuinely fascinating. The AI agent — such as Claude Code running in a terminal — doesn't play the game in real time. Instead, it modifies the source code and configuration files that govern how the creature behaves.

The agent interacts primarily with two files: game.py, which implements the creature's steering and decision logic, and config.json, which holds tunable parameters such as vision range and movement speed.

The Evaluation Wrapper orchestrates each experiment. It takes the agent's proposed changes, runs the simulation for a fixed number of steps, captures the score, and passes it back to the ratchet for the keep-or-revert decision. The wrapper ensures that every experiment is directly comparable — same number of steps, same initial conditions, same scoring formula.
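A minimal wrapper along those lines might look as follows. The `game_step` callable is a hypothetical stand-in for one tick of the simulation; the real wrapper's interface will differ, but the comparability guarantees — fixed steps, fixed seed, fixed formula — are the ones the text describes.

```python
import random

def evaluate(game_step, steps: int = 500, seed: int = 0) -> int:
    """Run a fixed-budget experiment and return the mechanical score.
    `game_step(rng)` is assumed to advance the world one tick and return
    (food_eaten, damage_taken) for that tick."""
    rng = random.Random(seed)       # same initial conditions every run
    food, damage = 0, 0
    for _ in range(steps):          # same number of steps every run
        ate, hit = game_step(rng)
        food += ate
        damage += hit
    return food * 10 - damage * 2   # same scoring formula every run
```

Because the seed and step budget are pinned, two runs of the same code produce the same score — which is what makes keep-or-revert decisions meaningful.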

What emerges is something that looks remarkably like evolution. Early iterations might produce simple changes: bump the vision range up, increase movement speed slightly. But as the easy gains are exhausted, the agent begins making structural changes — rewriting the pathfinding logic, introducing state machines for different behavioral modes, adding predictive avoidance for hazards that are still several steps away. The ratchet ensures that each of these structural leaps only sticks if it actually produces better outcomes.

Tech Stack: Deliberately Minimal

Python 3.8+ · Claude Code CLI · Bash · Git

Evolution Arena is built to be forked, not admired. The entire stack is intentionally minimal.

Python 3.8+ for the game simulation, scoring, and evaluation wrapper. No heavy frameworks, no distributed training infrastructure. One file runs the game. One file scores it.

Claude Code CLI (or any compatible coding agent) as the researcher. The agent reads the codebase, proposes changes, and interacts with the file system directly through the terminal. It doesn't need a custom API or plugin — it just edits files and runs scripts.

Bash scripts for orchestration. The autoresearch.sh script ties the loop together: invoke the agent, run the simulation, evaluate, commit or revert, log, repeat. Bash is the glue because the loop itself should be trivially inspectable. You can read the orchestration script in under a minute and understand exactly what happens at each step.

This minimalism is intentional. Karpathy's original Autoresearch was a single GPU, one file, one metric. Evolution Arena follows the same philosophy: the simpler the infrastructure, the easier it is to understand what the agent is doing versus what the system is doing.

Future Implications: Beyond the Arena

Evolution Arena is a game, but the pattern it demonstrates is not. The Autoresearch loop — propose, test, score, keep or revert — is already being applied to problems far more consequential than 2D creature survival.

The key insight is that any domain with a mechanical metric — a number you can compute without human judgment — is a candidate for this pattern. The arena's (food · 10) − (damage · 2) is a toy version of what could be p99_latency_ms, validation_bits_per_byte, inference_tokens_per_second, or revenue_per_user. The ratchet doesn't care what the number means. It only cares whether it went up.

Clone the Arena. See How High Your AI Can Score.

The Evolution Arena is open source, minimal, and designed to run on a single machine. If you've ever been curious about what happens when you point a coding agent at its own codebase and tell it to improve — this is the cleanest sandbox to find out.

git clone https://github.com/josephgec/evolution-arena.git
cd evolution-arena
# point your agent at the repo and start the loop
bash autoresearch.sh

Watch the results log. See the score climb. Pay attention to the kinds of changes the agent makes as easy gains dry up and it starts getting creative. That's when it gets interesting.

The arena is small. The pattern is not.

References

The foundational ideas and sibling projects this one builds on.

Autoresearch pattern

Related ideas: self-improving systems & search

Project source