When AI Becomes Its Own Scientist — Inside the Evolution Arena and the Rise of Autoresearch
Imagine an AI that doesn't wait for a human to tune its parameters. It proposes its own experiments, runs them against a live simulation, measures the outcome with a hard number, and decides on its own whether to keep the change. No vibes. No subjective review. Just: did the score go up?
This isn't a thought experiment. It's the core mechanic of Autoresearch, a pattern popularized by Andrej Karpathy in early 2026 and now spreading across dozens of domains — from GPU kernel optimization to Bitcoin price modeling to autonomous ML training. The Evolution Arena project takes this pattern and drops it into a survival game, creating one of the clearest, most tangible demonstrations of what happens when you hand an AI the keys to its own codebase and say: make yourself better.
The Autoresearch Pattern: A Loop That Never Quits
At its heart, Autoresearch is a brutally simple feedback loop. The agent operates in four phases:
1. Propose a change. The AI agent examines the current state of the code — steering logic, configuration values, decision thresholds — and hypothesizes a modification that might improve performance.
2. Run the simulation. The modified code is executed against the environment. In Evolution Arena, this means dropping the creature into the 2D world and letting it play.
3. Evaluate the score. A hard, mechanical metric determines whether the change helped. Not a language model's opinion. Not a human's gut feeling. A number.
4. Keep or revert. If the score improved, the change is committed. If it didn't, the change is rolled back to the last known-good state. Then the loop restarts.
This cycle runs indefinitely — or until you interrupt it. Karpathy's original implementation ran roughly 12 experiments per hour, producing around 100 experiments overnight while the researcher slept.
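The four phases can be condensed into a few lines of Python. Everything below is a toy stand-in, not the project's actual code: run_simulation scores a single made-up knob, and propose_change plays the role of the coding agent by nudging that knob at random.

```python
import random

def run_simulation(params):
    # Toy fitness: pretend a vision range of 50 is ideal.
    return 100 - abs(params["vision"] - 50)

def propose_change(params):
    # Stand-in for the agent: hypothesize a small tweak.
    candidate = dict(params)
    candidate["vision"] += random.choice([-5, 5])
    return candidate

def autoresearch_loop(params, iterations=200):
    best = run_simulation(params)               # baseline from the current code
    for _ in range(iterations):
        candidate = propose_change(params)      # 1. propose a change
        score = run_simulation(candidate)       # 2. run the simulation
        if score > best:                        # 3. evaluate the score
            params, best = candidate, score     # 4a. keep the improvement
        # 4b. otherwise the candidate is simply discarded (the revert)
    return params, best

random.seed(0)
final_params, final_score = autoresearch_loop({"vision": 20})
```

Even with a random "agent," the strict keep-or-revert test is enough to climb the toy fitness landscape; the real pattern just swaps the random nudge for an LLM's informed hypothesis.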
The Ratchet Mechanism: Why "Only Forward" Matters
The most critical architectural decision in any Autoresearch system is the Ratchet Script. This is the component that enforces monotonic improvement — the system can only move forward, never backward.
Why is this so important? Without the ratchet, an autonomous agent can easily oscillate. It improves in one dimension, then breaks something in another, then "fixes" the breakage by reverting the original improvement. You end up with a random walk instead of a climb. The ratchet transforms exploration into accumulation. Every successful experiment becomes the new floor. The agent can try wild, speculative changes — because if they fail, the system snaps back to its best-ever state.
This is what distinguishes Autoresearch from simple hyperparameter sweeps or random search. The agent isn't sampling from a grid. It's building on its own history, reading past results, identifying what worked, and choosing its next experiment based on the trajectory so far. Git acts as the memory. The results log acts as the experiment journal. The ratchet ensures the journal only records breakthroughs.
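In the real pattern, git is the memory: commit on improvement, check out the last good commit on regression. As a self-contained sketch, the snippet below substitutes an in-memory snapshot of the file for a git commit, but the keep-or-revert logic is the same.

```python
import pathlib
import tempfile

class Ratchet:
    """Monotonic keep-or-revert: every accepted experiment becomes the new floor."""

    def __init__(self, path):
        self.path = pathlib.Path(path)
        self.best_score = float("-inf")
        self.best_state = self.path.read_text()   # last known-good state

    def judge(self, score):
        if score > self.best_score:
            self.best_score = score               # keep: raise the floor
            self.best_state = self.path.read_text()
            return "kept"
        self.path.write_text(self.best_state)     # revert: snap back to best-ever
        return "reverted"

# Usage: good experiments stick; bad ones are undone automatically.
cfg = pathlib.Path(tempfile.mkdtemp()) / "config.json"
cfg.write_text('{"vision": 40}')
ratchet = Ratchet(cfg)
ratchet.judge(120)                   # improvement: kept, floor is now 120
cfg.write_text('{"vision": 999}')    # a speculative experiment edits the file
ratchet.judge(80)                    # score fell: file snaps back to vision 40
```

Because the revert path restores the best-ever state rather than merely undoing the last edit, the agent is free to try wild changes without any risk of drifting downhill.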
The Arena: A 2D Survival Testbed
Evolution Arena provides the simulation layer for this loop: a 2D environment where a creature must navigate a world of food and hazards.
The creature exists in a bounded space populated with two types of objects. Food items are scattered across the map — each one collected adds to the creature's score. Hazards are obstacles that deal damage on contact. The creature must develop logic to seek food efficiently while steering clear of danger.
The Scoring System: Teaching Caution to an Algorithm
The creature's fitness is calculated with a deliberately asymmetric formula:
score = (food_collected * 10) - (damage_taken * 2)
This balance is more subtle than it appears. A pure food-maximizer that ignores hazards will rack up damage penalties that erode its gains. A purely cautious creature that avoids all risk will starve for points. The 5:1 reward-to-penalty ratio pushes the agent toward confident but careful behavior — it should aggressively pursue food, but not at the cost of reckless collisions.
This scoring asymmetry is what makes the arena a meaningful testbed for Autoresearch. The agent isn't just optimizing a single variable; it's learning to balance competing pressures. Every code change it proposes must navigate this tradeoff. A new steering algorithm that gains food but gives back more than it gains in damage penalties will be reverted by the ratchet. Only changes that improve the composite score survive.
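A few concrete numbers make the asymmetry tangible. The two creature profiles below are invented for illustration; only the formula itself comes from the arena.

```python
def arena_score(food_collected, damage_taken):
    # Evolution Arena's fitness formula: food pays 10, damage costs 2.
    return food_collected * 10 - damage_taken * 2

# A cautious creature: modest food intake, almost no collisions.
cautious = arena_score(food_collected=12, damage_taken=2)    # 120 - 4 = 116
# A reckless creature: twice the food, but it plows through hazards.
reckless = arena_score(food_collected=24, damage_taken=70)   # 240 - 140 = 100
```

Despite collecting twice as much food, the reckless profile scores lower, so the ratchet would revert whatever change produced it.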
The Role of the Agent: Editing Its Own DNA
Here is where Evolution Arena becomes genuinely fascinating. The AI agent — such as Claude Code running in a terminal — doesn't play the game in real time. Instead, it modifies the source code and configuration files that govern how the creature behaves.
The agent interacts primarily with two files:
- game.py contains the creature's steering logic, sensor processing, and movement rules. The agent can rewrite decision trees, adjust turn speeds, add wall-avoidance heuristics, or implement entirely new navigation strategies.
- config.json holds tunable parameters: vision range, movement speed, turn rate, how far ahead the creature looks for food, how aggressively it avoids hazards. The agent can tweak these values independently or in combination.
The Evaluation Wrapper orchestrates each experiment. It takes the agent's proposed changes, runs the simulation for a fixed number of steps, captures the score, and passes it back to the ratchet for the keep-or-revert decision. The wrapper ensures that every experiment is directly comparable — same number of steps, same initial conditions, same scoring formula.
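A minimal sketch of such a wrapper, with toy dynamics standing in for game.py's real simulation loop. The point of the fixed seed and fixed step count is that identical inputs always yield identical scores, which is what makes the keep-or-revert decision fair.

```python
import random

def evaluate(config, steps=500, seed=42):
    """Run one fixed-length, fixed-seed episode and return its composite score."""
    rng = random.Random(seed)        # same initial conditions for every experiment
    food = damage = 0
    for _ in range(steps):
        # Toy dynamics: wider vision finds food more often, while higher
        # speed also causes more hazard collisions.
        if rng.random() < min(config["vision"] / 100, 1.0):
            food += 1
        if rng.random() < min(config["speed"] / 50, 1.0):
            damage += 1
    return food * 10 - damage * 2

# Identical configs always produce identical scores, so comparisons are fair.
a = evaluate({"vision": 40, "speed": 5})
b = evaluate({"vision": 40, "speed": 5})
```

Without the fixed seed, a lucky food spawn could get a bad change committed; determinism keeps the ratchet honest.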
What emerges is something that looks remarkably like evolution. Early iterations might produce simple changes: bump the vision range up, increase movement speed slightly. But as the easy gains are exhausted, the agent begins making structural changes — rewriting the pathfinding logic, introducing state machines for different behavioral modes, adding predictive avoidance for hazards that are still several steps away. The ratchet ensures that each of these structural leaps only sticks if it actually produces better outcomes.
Tech Stack: Deliberately Minimal
Evolution Arena is built to be forked, not admired. The entire stack is intentionally minimal.
Python 3.8+ for the game simulation, scoring, and evaluation wrapper. No heavy frameworks, no distributed training infrastructure. One file runs the game. One file scores it.
Claude Code CLI (or any compatible coding agent) as the researcher. The agent reads the codebase, proposes changes, and interacts with the file system directly through the terminal. It doesn't need a custom API or plugin — it just edits files and runs scripts.
Bash scripts for orchestration. The autoresearch.sh script ties the loop together: invoke the agent, run the simulation, evaluate, commit or revert, log, repeat. Bash is the glue because the loop itself should be trivially inspectable. You can read the orchestration script in under a minute and understand exactly what happens at each step.
This minimalism is intentional. Karpathy's original Autoresearch was a single GPU, one file, one metric. Evolution Arena follows the same philosophy: the simpler the infrastructure, the easier it is to understand what the agent is doing versus what the system is doing.
Future Implications: Beyond the Arena
Evolution Arena is a game, but the pattern it demonstrates is not. The Autoresearch loop — propose, test, score, keep or revert — is already being applied to problems far more consequential than 2D creature survival.
- ML training optimization. Karpathy's original repo has agents autonomously improving GPT training scripts, achieving compounding gains in validation loss across hundreds of unattended experiments.
- GPU kernel performance. Projects like AutoKernel use the same loop to find faster CUDA implementations — editing code, benchmarking throughput, and keeping only the changes that improve execution speed without breaking correctness.
- Database query tuning. Any system with a measurable latency metric and a configuration file is a candidate. Imagine an agent that continuously rewrites query plans, tests them against production-like workloads, and only deploys the ones that reduce p99 latency.
- Server and infrastructure configuration. Load balancer weights, cache eviction policies, connection pool sizes — numeric knobs that an Autoresearch loop could turn, test, and ratchet forward.
- Optimizing other AI models. This is the recursive frontier. An AI agent that rewrites the training code for another AI model, measures its benchmark performance, and keeps only the improvements. The Autoresearch pattern applied to Autoresearch itself.
The key insight is that any domain with a mechanical metric — a number you can compute without human judgment — is a candidate for this pattern. The arena's (food · 10) − (damage · 2) is a toy version of what could be p99_latency_ms, validation_bits_per_byte, inference_tokens_per_second, or revenue_per_user. The ratchet doesn't care what the number means. It only cares whether it went up.
Clone the Arena. See How High Your AI Can Score.
The Evolution Arena is open source, minimal, and designed to run on a single machine. If you've ever been curious about what happens when you point a coding agent at its own codebase and tell it to improve — this is the cleanest sandbox to find out.
git clone https://github.com/josephgec/evolution-arena.git
cd evolution-arena
# point your agent at the repo and start the loop
bash autoresearch.sh
Watch the results log. See the score climb. Pay attention to the kinds of changes the agent makes as easy gains dry up and it starts getting creative. That's when it gets interesting.
The arena is small. The pattern is not.
References
The foundational ideas and sibling projects this one builds on.
Autoresearch pattern
- Karpathy. Autoresearch — autonomous research loops for ML training. 2026. github.com/karpathy/autoresearch
- Awesome Autoresearch — curated list of Autoresearch applications across domains. github.com/yibie/awesome-autoresearch
- Anthropic. Claude Code. anthropic.com/claude-code
Related ideas: self-improving systems & search
- Real et al. AutoML-Zero: Evolving Machine Learning Algorithms From Scratch. ICML 2020. arXiv:2003.03384
- Lehman & Stanley. Abandoning Objectives: Evolution through the Search for Novelty Alone. Evolutionary Computation, 2011. mit.edu
- Madaan et al. Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023. arXiv:2303.17651
- Zhou et al. Language Agent Tree Search. ICML 2024. arXiv:2310.04406
Project source
- Evolution Arena on GitHub — github.com/josephgec/evolution-arena