
Two AIs, One Loop — Building a Self-Improving Code Agent

A well-designed two-agent architecture — one LLM planning, another implementing, with git diffs and test results flowing between them — captures most of the value of multi-agent coding systems while avoiding their complexity. Here's what I built, and what the research literature says about why it works.

LLMs are remarkable at taking a bounded request and producing code. But real software engineering isn't bounded. Building a feature means planning, coding, testing, discovering you got it wrong, and trying again. It's iterative by nature.

Tools like Claude Code already iterate internally — gather context, take action, verify, repeat. But they mix planning and coding in a single context window. When one instance does both, there's a strong bias toward jumping to code even when the right move is to stop and think. What if you could split the roles and give each one room to do its job fully?

I wanted to see what would happen if I closed the loop myself. Not a fancy multi-agent framework. Not a task tree. Just two Claude Code instances wired together in a cycle, with an orchestrator forwarding context, diffs, and test results between them until the job is done.

What I found, both in my own experiments and in the surprisingly large body of research that already exists on this question, is that the architecture matters less than I expected — and the engineering decisions around it matter much more.

The Landscape

Before I started, I wanted to understand what others had built. The space turns out to be larger and more contradictory than it looks.

On one end, there are role-based multi-agent frameworks like MetaGPT (5+ roles passing structured artifacts through a software-development pipeline) and ChatDev (7 roles communicating through pairwise "atomic chats"). Both treat software development as a virtual company with a division of labor. Both produce impressive demos and complete projects in under 7 minutes for under a dollar.

On the other end, there are deceptively powerful single-agent systems. SWE-Agent showed that optimizing the Agent-Computer Interface — custom commands, structured search, linting on edit — matters more than agent count. Its mini-swe-agent variant achieves 65% on SWE-bench Verified in roughly 100 lines of Python. Claude Code itself, the tool I built on, scores 80.8% on SWE-bench Verified as a single agent paired with strong tooling.

In the middle sits the architecture I ended up with: two agents, one planning, one executing, with a tight feedback loop between them. The closest production implementation is Aider's Architect/Editor mode, which uses an Architect model (e.g., o1-preview) to plan changes and an Editor model (e.g., Claude 3.5 Sonnet) to execute the specific edits, with git tracking every change. It scored 85% on Aider's code editing benchmark — the highest recorded score on that suite, and a useful sanity check that the pattern is sound.

The industry trend through 2025 has actually moved away from heavy multi-agent orchestration toward "powerful single agents with excellent tooling." Devin, despite the $4B valuation and "first AI software engineer" branding, faced scrutiny when an independent Answer.AI evaluation found a 15% success rate on real tasks. The agents that perform best on benchmarks rely on strong models paired with well-designed interfaces, not on agent count.

So why bother with two agents at all? Two reasons, both grounded in research.

Why Separate Planning from Execution

The case for splitting planning from execution — even just into two roles — is empirically strong. Several independent research lines converge on the same finding:

Anthropic's own "Building Effective Agents" post (December 2024) names two of its five recommended patterns directly: Orchestrator-Workers and Evaluator-Optimizer. A Planner/Implementer loop is exactly the Evaluator-Optimizer pattern with a small twist — the evaluator and the planner are the same agent, holding the long-running plan in its head while reviewing each implementation step. As Anthropic put it: "the most successful implementations weren't using complex frameworks... they were building with simple, composable patterns."

The dual-process framing from cognitive science maps cleanly onto the two roles. The Implementer is System 1 — fast, reactive, focused on execution. The Planner is System 2 — slower, deliberate, focused on what should happen next and why. The split isn't strictly necessary, but it's recognizably useful, and the research above suggests the small extra latency is worth it.

There's also a practical reason. A single Claude instance asked to both plan and code has a strong bias toward writing code, even when the right move is to stop and think. Splitting the roles forces each instance to do its job fully before handing off. I noticed this within an hour of testing — the moment I gave one model both jobs, it stopped reasoning about whether the previous step was actually correct.

Two Roles, One Goal

The architecture is the simplest thing that works: two Claude Code instances, each with a different system prompt, each specialized for one part of the cycle.

The Planner is the senior tech lead. It sees the current state of the repository — file tree, README, recent git log, the last diff produced, the test output — and decides what needs to happen next. It doesn't write code. Its job is to think, explain, and hand off a clear instruction. Its output is a natural-language directive like "add a JWT middleware at src/auth/jwt.ts that validates the access token and sets req.user. Don't touch the refresh flow yet — we'll do that next."

The Implementer is the engineer. It receives the directive from the Planner and executes it. It reads the files it needs, makes the edits, runs the tests, and reports back what it changed. It does not second-guess the plan. Its job is to make the diff happen cleanly.
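The role split is enforced purely through system prompts. A minimal sketch of the two roles — the wording here is illustrative, not the repo's actual prompts:

```python
# Hypothetical system prompts for the two roles. The exact wording is an
# illustration of the division of labor, not the prompts shipped in the repo.

PLANNER_PROMPT = """You are a senior tech lead. You will be shown the current
repository state, the last diff, and the latest test output. Decide the single
next step. Do NOT write code. Reply with a clear natural-language directive
for the engineer. When the goal is fully achieved, reply with exactly:
GOAL COMPLETE"""

IMPLEMENTER_PROMPT = """You are an engineer. Execute the directive you are
given. Read the files you need, make the edits, run the tests, and report
what you changed. Do not second-guess the plan."""
```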

Architecture

The orchestrator is a thin Python shim. It holds the conversation state for each role, captures git diffs and test output between rounds, and forwards a curated context packet to each Planner turn.

[Figure: the loop. User goal → Orchestrator (gathers context, forwards diffs and tests) → context packet → Planner (decides the next step) → directive → Implementer (makes the changes) → feedback (git diff, test results, output) → next iteration, or GOAL COMPLETE → exit.]
The cycle. Each pass produces a diff and a fresh test result, both of which feed the next Planner turn.
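In code, the cycle reduces to a short loop. This is a sketch under a few assumptions: `claude -p` runs a single non-interactive prompt (flags vary by CLI version), the role prompts are inlined as bracketed tags for brevity, and `build_packet` is whatever callable assembles the context packet for the round:

```python
import subprocess

def run_claude(prompt: str) -> str:
    """Invoke a Claude Code instance non-interactively. Assumes the `claude`
    CLI is on PATH; `-p` runs one prompt in print mode (version-dependent)."""
    out = subprocess.run(["claude", "-p", prompt],
                         capture_output=True, text=True)
    return out.stdout

def loop(goal: str, build_packet, run=run_claude, max_iterations: int = 5) -> bool:
    """One Planner/Implementer cycle per iteration. Returns True if the
    Planner declared the goal complete, False if the iteration cap hit."""
    for _ in range(max_iterations):
        packet = build_packet()                     # fresh context each round
        directive = run(f"[PLANNER]\nGoal: {goal}\n\n{packet}")
        if "GOAL COMPLETE" in directive:
            return True                             # Planner declared victory
        run(f"[IMPLEMENTER]\n{directive}")          # execute the directive
    return False                                    # escape hatch: cap reached
```

Injecting `run` as a parameter keeps the loop testable without spawning real Claude processes.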

Context Engineering

Context management turned out to be the single most important engineering decision. Bigger context windows do not automatically mean better results — the "lost in the middle" effect (Liu et al., Stanford, TACL 2024) shows models drop 30%+ accuracy on content in the middle of long contexts, even with 1M-token windows.

The context packet sent to the Planner before each turn includes:

[Figure: the context packet, built fresh each iteration. Static context (changes slowly): file tree (repo structure), git log (last 15 commits), README (project intent). Feedback signal (plan vs. reality): previous diff plus test output: what changed last round, what passed, what failed. All of it flows to the Planner.]
Three slow-changing slices plus one dynamic feedback signal — built fresh every iteration.

What you exclude matters as much as what you include. The packet deliberately omits full file contents, dependency trees, and generated documentation. Recent research on AGENTS.md-style context files (early 2026) found that LLM-generated context actually decreased agent success rates and increased inference costs by over 20%. The shallow packet keeps the Planner focused on the delta — what changed and what to do next — rather than drowning it in static detail.

The test output is the part that took me the longest to get right, and it's also the part I'm most certain about. AlphaCodium (Ridnik et al., 2024) reported an unexpected negative result: "injecting the last git patch diff to the prompt: no improvement was seen." Raw diffs alone are not a useful feedback signal — the Planner has no way to know whether the diff was actually correct. Diffs paired with test results are a different story. The diff says what changed; the test results say whether it worked.

I still want to do better here. Aider's repository map system uses tree-sitter to parse 26+ languages into ASTs and applies personalized PageRank to identify the most relevant symbols relative to the current edit. The result is a scope-aware elided code view that fits in a small token budget. My orchestrator doesn't do this yet — it just sends the full file tree. PageRank-based context selection is the highest-value thing I'd add next.

A counterintuitive finding from JetBrains Research (NeurIPS 2025 DL4C Workshop) is worth flagging: observation masking — hiding verbose tool outputs from older turns while preserving the action and reasoning history — matched or beat LLM summarization in 4 of 5 settings, while being simpler and cheaper. Both approaches cut token costs by 50%+ compared to unmanaged context. The lesson: don't summarize what you can hide.
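Observation masking is simple enough to show directly. A sketch, assuming a hypothetical turn schema where each history entry has a `kind` field distinguishing actions from tool outputs:

```python
def mask_old_observations(history: list[dict], keep_last: int = 2,
                          placeholder: str = "[output elided]") -> list[dict]:
    """Observation masking: keep every action/reasoning turn verbatim, but
    replace tool outputs older than the last `keep_last` with a placeholder.
    The {"kind": ..., "content": ...} schema is a hypothetical example."""
    tool_turns = [i for i, t in enumerate(history) if t["kind"] == "tool_output"]
    visible = set(tool_turns[-keep_last:])     # recent outputs stay in full
    masked = []
    for i, turn in enumerate(history):
        if turn["kind"] == "tool_output" and i not in visible:
            masked.append({**turn, "content": placeholder})
        else:
            masked.append(turn)
    return masked
```

The action and reasoning history survives intact; only stale, verbose outputs are hidden, which is where the bulk of the token cost lives.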

The Feedback Loop

The loop is where the system earns its keep. After the Implementer reports back, the orchestrator runs git diff and the project's test command, then pipes both into the next Planner round.

That's when the interesting things start happening. The Planner notices that the Implementer added a route but forgot to register it in the router. It notices that a new function was added but no tests, and that the existing tests still pass — meaning the change isn't actually exercised. It notices that an edit touched an unrelated file by accident. It writes the next directive with all of that in hand: "good work on the middleware — now wire it into src/app.ts and add a test for the invalid-token case."

This matches the broader research consensus that external feedback signals are essential for iterative improvement to work. Self-Refine (Madaan et al., NeurIPS 2023) showed ~20% average improvement from generate-feedback-refine loops. Self-Debugging (Chen et al., ICLR 2024) gained up to +12% accuracy with execution feedback. Reflexion (Shinn et al., NeurIPS 2023) pushed HumanEval to 91% pass@1 by adding episodic memory of past failures.

But the same literature contains a sharp warning: intrinsic self-correction without external feedback degrades performance. "Large Language Models Cannot Self-Correct Reasoning Yet" (Huang et al., ICLR 2024) showed LLMs sometimes change correct answers to wrong ones when asked to self-evaluate without grounding. A TACL 2024 survey (Kamoi et al.) concluded that "no prior work demonstrates successful self-correction with feedback from prompted LLMs" — only with grounded signals like code execution.

The implication for any Planner/Implementer architecture is direct: the feedback loop is only as good as the external signals you wire into it. Diffs without tests are just opinion. Tests without diffs are missing the cause. You need both.

[Figure: a three-iteration run. Iteration 1: plan (scaffold routes + middleware), implement (+4 files, +180 lines). Iteration 2: plan (fix gaps: wire router + tests), implement (+2 files, +60 lines). Iteration 3: plan (polish: edge case + cleanup), implement (~20 lines). The scope of changes narrows each round, ending in GOAL COMPLETE.]
A typical 3-iteration run. Each round produces a smaller diff than the last as the Planner converges on the goal.

Running It

A full run looks like this:

python3 claude_loop.py ./my-app "Add JWT auth with refresh tokens" \
  --effort max \
  --planner-model opus \
  --impl-model opus

The key options: --planner-model and --impl-model choose the Claude model for each role, --effort tunes how much work each turn spends, and --max-iterations (default 5) caps the number of loop rounds.
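The CLI surface is small enough to sketch with argparse. The defaults here are assumptions, except --max-iterations, whose default of 5 is stated below:

```python
import argparse

def parse_args(argv=None):
    """CLI matching the invocation above. Defaults are illustrative
    assumptions except --max-iterations (default 5, per the text)."""
    p = argparse.ArgumentParser(prog="claude_loop.py")
    p.add_argument("repo", help="path to the target repository")
    p.add_argument("goal", help="natural-language goal for the run")
    p.add_argument("--effort", default="max")
    p.add_argument("--planner-model", default="opus")
    p.add_argument("--impl-model", default="opus")
    p.add_argument("--max-iterations", type=int, default=5)
    return p.parse_args(argv)
```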

Knowing When to Stop

A loop that doesn't know when to stop is worse than no loop at all.

The empirical evidence on iteration is consistent: most improvement happens in the first 1–2 iterations, with rapidly diminishing returns. Self-Refine's own data shows the bulk of gain in iterations 0→2. Self-Debugging plateaus at 2–3 turns. AlphaCodium captures the largest gains on the first test-fixing pass.

There are also failure modes that get worse with more iterations: without grounded feedback, models revise correct answers into wrong ones; stalled loops churn out near-identical diffs; and every extra round adds cost without adding new signal.

Termination in my orchestrator combines four signals, in order of priority:

  1. The Planner declares victory. After reviewing the most recent diff and test output, if the Planner concludes the goal is achieved, it emits a GOAL COMPLETE sentinel and the orchestrator exits cleanly.
  2. External validation gates pass. If all the project's tests pass, the linter is clean, and the diff matches the goal, that's strong evidence to stop.
  3. No-change detection. If two consecutive Implementer turns produce no meaningful diff, the loop has stalled and there's nothing more to learn from continuing.
  4. Iteration cap. Set with --max-iterations, default 5. The escape hatch for runs stuck in a local minimum.

In practice, the first two mechanisms handle roughly 80% of completed runs. The cap catches the pathological cases.
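The four signals compose into one priority-ordered check. A sketch — the "diff matches the goal" half of the validation gate is left to the Planner's judgment here, since it isn't mechanically checkable:

```python
def should_stop(planner_reply: str, tests_pass: bool, lint_clean: bool,
                recent_diffs: list, iteration: int,
                max_iterations: int = 5):
    """Combine the four termination signals in priority order.
    Returns a reason string if the loop should stop, else None."""
    if "GOAL COMPLETE" in planner_reply:
        return "planner declared victory"        # signal 1: sentinel
    if tests_pass and lint_clean:
        return "validation gates passed"         # signal 2: external gates
    if len(recent_diffs) >= 2 and all(not d.strip() for d in recent_diffs[-2:]):
        return "no meaningful diff for two turns"  # signal 3: stalled
    if iteration >= max_iterations:
        return "iteration cap reached"           # signal 4: escape hatch
    return None
```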

The other 20% usually fails in a more subtle way: the Planner emits GOAL COMPLETE because the tests pass, but the feature is quietly incomplete — an edge case unhandled, an error path uncovered, a refactor that left dead code behind. A more principled termination check might dispatch a verification prompt to a third Claude instance, asking it to grade the final state against the original goal — though that adds the kind of complexity I deliberately tried to avoid.

Recent theoretical work on Markovian Generation Chains formalizes the intuition behind early termination — under greedy decoding, iterative LLM output rapidly converges to fixed points or short cycles. That's both an explanation for why three iterations is usually enough and a warning about why long runs don't help.

What You Can Build

I've used the loop on a range of tasks. Some it handles cleanly, others reveal the seams.

Typical runs by the numbers. A well-scoped bug fix completes in 2–3 iterations and costs roughly $0.50–$2.00 in API usage. A full feature like adding JWT auth runs 4–6 iterations at $3–$8. Test coverage campaigns are the most iteration-heavy — sometimes 8–10 rounds — but each round is cheap because the Implementer is mostly writing straightforward test code. Token usage scales roughly linearly with iteration count: about 15K–25K tokens per round for the Planner and 20K–40K for the Implementer.

Works well

Well-scoped bug fixes, bounded features like the JWT example, and test-coverage campaigns: anything where progress shows up directly in diffs and test results.

Works poorly

Tasks whose success isn't captured by the test suite. When the validation gates can't see the goal, the Planner's GOAL COMPLETE call is just a guess, and features come out quietly incomplete.

What I'd Build Next

The architecture works. The interesting questions now are about the things around it: PageRank-style repo-map context selection for the Planner, observation masking to cut token costs on long runs, and a third verification instance to grade GOAL COMPLETE claims against the original goal.

Get Started

The full source is on GitHub: github.com/josephgec/claude-agent.

There's no installation beyond cloning the repo and having claude in your PATH. The orchestrator is a single Python file. Fork it, modify the system prompts, tune it to your taste — the architecture is simple enough that you can reshape it without fighting a framework.

If you try it on something interesting, I'd love to hear what worked and what broke. The most valuable feedback so far has been about the termination criterion — the Planner's GOAL COMPLETE call is fuzzy, and there's a more principled version hiding somewhere in the validation-gate plus no-change-detection combination.

References

A reference list of the research, blog posts, and open-source projects this article draws on.

Multi-agent coding systems

  - MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework (Hong et al., 2023)
  - ChatDev: Communicative Agents for Software Development (Qian et al., 2023)
  - SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (Yang et al., 2024)
  - Answer.AI's independent evaluation of Devin

Iterative self-refinement

  - Self-Refine: Iterative Refinement with Self-Feedback (Madaan et al., NeurIPS 2023)
  - Teaching Large Language Models to Self-Debug (Chen et al., ICLR 2024)
  - Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., NeurIPS 2023)
  - Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering (Ridnik et al., 2024)
  - Large Language Models Cannot Self-Correct Reasoning Yet (Huang et al., ICLR 2024)
  - When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey (Kamoi et al., TACL 2024)

Planning and execution

  - Building Effective Agents (Anthropic, December 2024)
  - Aider's Architect/Editor mode documentation

Context and termination

  - Lost in the Middle: How Language Models Use Long Contexts (Liu et al., TACL 2024)
  - JetBrains Research on observation masking (NeurIPS 2025 DL4C Workshop)
  - The AGENTS.md-style context-file study discussed above (early 2026)
  - Theoretical work on Markovian Generation Chains

Benchmarks & tools

  - SWE-bench Verified
  - Claude Code
  - mini-swe-agent
  - Aider's code editing benchmark
