jthomas.site// notebook · v.4.2026

Two AIs, One Loop — Building a Self-Improving Code Agent

A well-designed two-agent architecture — one LLM planning, another implementing, with git diffs and test results flowing between them — captures most of the value of multi-agent coding systems while avoiding their complexity. Here's what I built, and what the research literature says about why it works.

LLMs are remarkable at taking a bounded request and producing code. But real software engineering isn't bounded. Building a feature means planning, coding, testing, discovering you got it wrong, and trying again. It's iterative by nature.

Tools like Claude Code are already agents — large language models running inside a loop, allowed to read files, run shell commands, and check results before responding. They iterate internally: gather context, take action, verify, repeat. But the same instance does both the planning and the coding inside a single context window. When one model holds both jobs at once, there's a strong bias toward jumping to code even when the right move is to stop and think. What if you split the roles and gave each one room to do its job fully?

I wanted to see what would happen if I closed the loop myself. Not a fancy multi-agent framework with five or seven specialised roles arguing through structured artifacts. Not a task tree. Just two Claude Code instances wired together in a cycle, with an orchestrator forwarding context, git diffs (the textual record of what changed in the code, line by line), and test results between them until the job is done.

What I found, both in my own experiments and in the surprisingly large body of research that already exists on this question, is that the architecture matters less than I expected — and the engineering decisions around it matter much more.

Before this. Earlier coding agents fell into two camps. The first was the single-agent loop: one LLM in a long-running ReAct cycle, gathering context, calling tools, and editing files until it declared done. As the conversation grew, the model's context window filled up with raw tool output, "lost in the middle" effects kicked in, and the agent forgot what it was originally doing. The second camp was the multi-agent swarm — MetaGPT, ChatDev, AutoGen — five to seven role-played agents passing structured artifacts through a virtual software company. They produced impressive demos but spent most of their tokens on inter-agent ceremony, and an Anthropic post-mortem on agent failures (2025) traced most of the wasted compute to coordination overhead, not to the underlying coding work. The two-agent split is the smallest architecture that gets the benefit of separation without paying the swarm tax.

The Landscape

Before I started, I wanted to understand what others had built. The space turns out to be larger and more contradictory than it looks.

On one end, there are role-based multi-agent frameworks like MetaGPT (5+ roles passing structured artifacts through a software-development pipeline) and ChatDev (7 roles communicating through pairwise "atomic chats"). Both treat software development as a virtual company with a division of labor. Both produce impressive demos and complete projects in under 7 minutes for under a dollar.

On the other end, there are deceptively powerful single-agent systems. SWE-Agent showed that optimizing the Agent-Computer Interface — custom commands, structured search, linting on edit — matters more than agent count. Its mini-swe-agent variant achieves 65% on SWE-bench Verified in roughly 100 lines of Python. Claude Code itself, the tool I built on, scores 80.8% on SWE-bench Verified as a single agent paired with strong tooling.

In the middle sits the architecture I ended up with: two agents, one that plans and one that executes, with a tight feedback loop between them. The closest production implementation is Aider's Architect/Editor mode, which uses an Architect model (e.g., o1-preview) to plan changes and an Editor model (e.g., Claude 3.5 Sonnet) to execute the specific edits, with git tracking every change. It scored 85% on Aider's code editing benchmark, the highest score recorded on that suite at the time and a useful sanity check that the pattern is sound.

The industry trend through 2025 has actually moved away from heavy multi-agent orchestration toward "powerful single agents with excellent tooling." Devin, despite the $4B valuation and "first AI software engineer" branding, faced scrutiny when an independent Answer.AI evaluation found a 15% success rate on real tasks. The agents that perform best on benchmarks rely on strong models paired with well-designed interfaces, not on agent count.

So why bother with two agents at all? Two reasons, both grounded in research.

[Figure: a single-agent loop next to the two-agent split. Single-agent loop: one Claude plans, codes, and tests in one context window that fills with the user goal, about 40 tool calls, raw file contents (12k tokens), stack traces, diffs, and stdout until the plan is buried. Failure mode: jumps to code, loses the plot, loops forever. Two-agent split: the Planner context (goal, file tree, README, last diff and test result) is small, stable, and rebuilt each turn; the Implementer context (one directive plus the tool calls for this task only) is deep, narrow, and disposable. Plans flow one way and diffs the other. Result: each model specialises and neither overloads.]
Single-agent loops are remarkable until the context fills up; the two-agent split keeps each window small and on-topic by separating what to do from how to do it.

Why Separate Planning from Execution

The central insight is small but load-bearing. In a single-agent loop, the model is asked to do two qualitatively different things at once — hold the long-range plan in mind and grind through dozens of low-level tool calls (read this file, run this test, fix that line). Tool output is verbose; plans are terse; verbose-and-stale always beats terse-and-fresh in a long conversation. Once the plan is buried under 40 turns of stack traces and file contents, it stops steering the work.

Splitting the roles fixes that asymmetry. The Planner sees a small, freshly built packet — file tree, README, last diff, last test output — and decides what should happen next. It never makes a tool call. The Implementer gets exactly one directive and a clean context window in which to make as many tool calls as it needs. Each model specialises: one carries the project in its head, the other carries only the current task. Neither has to fight for context room.

The case for splitting planning from execution — even just into two roles — is empirically strong. Several independent research lines converge on the same finding:

Anthropic's own "Building Effective Agents" post (December 2024) describes five workflow patterns, and two of them apply directly here: Orchestrator-Workers and Evaluator-Optimizer. A Planner/Implementer loop is essentially the Evaluator-Optimizer pattern with a small twist: the evaluator and the planner are the same agent, holding the long-running plan in its head while reviewing each implementation step. As Anthropic put it: "the most successful implementations weren't using complex frameworks... they were building with simple, composable patterns."

The dual-process framing from cognitive science maps cleanly onto the two roles. The Implementer is System 1 — fast, reactive, focused on execution. The Planner is System 2 — slower, deliberate, focused on what should happen next and why. The split isn't strictly necessary, but it's recognizably useful, and the research above suggests the small extra latency is worth it.

There's also a practical reason. A single Claude instance asked to both plan and code has a strong bias toward writing code, even when the right move is to stop and think. Splitting the roles forces each instance to do its job fully before handing off. I noticed this within an hour of testing — the moment I gave one model both jobs, it stopped reasoning about whether the previous step was actually correct.

Two Roles, One Goal

The architecture is the simplest thing that works: two Claude Code instances, each with a different system prompt, each specialised for one part of the cycle.

The Planner is the senior tech lead. It sees the current state of the repository — file tree, README, recent git log, the last diff produced, the test output — and decides what needs to happen next. It doesn't write code; it doesn't make tool calls at all. Its job is to think, explain, and hand off a clear instruction. Its output is a natural-language directive like "add a JWT middleware at src/auth/jwt.ts that validates the access token and sets req.user. Don't touch the refresh flow yet — we'll do that next."

The Implementer is the engineer. It receives the directive from the Planner and executes it. It reads the files it needs, makes the edits, runs the tests, and reports back what it changed. It does not second-guess the plan. Its job is to make the diff happen cleanly.

That's the whole separation: the Planner has full project context but no hands; the Implementer has hands but only the current task in its head. Neither has to do both jobs at once, which is exactly the failure mode of the single-agent loop.

Architecture

The orchestrator is the thin Python shim that wires everything together. It holds the conversation state for each role, captures git diffs and test output between rounds, and forwards a curated context packet to each Planner turn. It is not an "agent" itself — it doesn't reason, it just routes messages and runs subprocesses.

[Figure: the full architecture. The user goal ("add JWT auth") enters the orchestrator (~200 lines of Python; routes messages, runs subprocesses, no reasoning). The two agents: the Planner decides what to do next, with full project context and no tool calls; the Implementer makes the change, with narrow scope (this task only) and all tool calls. The world under test: the git repository (files, history, diffs), the source of truth for "what changed", and the test runner (pytest, npm test, cargo test), the source of truth for "did it work". Plans flow down; git diffs and test results flow up.]
The full architecture. Plans flow down (oxblood), diffs flow back up (grey). The Implementer is the only side allowed to touch the world; the Planner only ever reads the trail it leaves behind.

That asymmetry — only the Implementer can act, only the Planner can plan — is the whole design. Everything else is logistics.
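
As a rough illustration of the routing described above, the loop can be sketched in a few lines of Python. This is not the code from the repository: `run_loop`, `Round`, and the three callables are hypothetical names, and the callables stand in for the real Planner call, Implementer call, and git/test capture so that the sketch runs without a model behind it.

```python
from dataclasses import dataclass
from typing import Callable, List

SENTINEL = "GOAL COMPLETE"  # sentinel named in the article; substring matching is an assumption


@dataclass
class Round:
    directive: str  # what the Planner asked for
    report: str     # what the Implementer said it did
    feedback: str   # diff + test output captured by the orchestrator


def run_loop(
    goal: str,
    planner: Callable[[str, str], str],  # (goal, feedback) -> directive
    implementer: Callable[[str], str],   # directive -> report
    observe: Callable[[], str],          # () -> diff + test output
    max_iterations: int = 5,
) -> List[Round]:
    """Route messages between the two roles; no reasoning happens here."""
    rounds: List[Round] = []
    feedback = ""  # the first Planner turn has no previous round to review
    for _ in range(max_iterations):
        directive = planner(goal, feedback)
        if SENTINEL in directive:
            break  # the Planner judged the goal achieved
        report = implementer(directive)  # only the Implementer touches the repo
        feedback = observe()             # the Planner only reads this trail
        rounds.append(Round(directive, report, feedback))
    return rounds


# Stand-in Planner, for illustration only.
def fake_planner(goal: str, feedback: str) -> str:
    return SENTINEL if "passed" in feedback else f"Implement: {goal}"


rounds = run_loop(
    "add JWT auth",
    planner=fake_planner,
    implementer=lambda directive: "done, see diff",
    observe=lambda: "diff: +1 file\ntests: 3 passed",
)
print(len(rounds), rounds[0].directive)
```

Passing the roles in as plain callables keeps the orchestrator free of any reasoning of its own, which is the property the design depends on.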

One task, end to end

Here's what a single round looks like, message by message, from the moment the user types a goal to the moment a diff lands in the repo:

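[Figure: sequence diagram with five lanes (User, Planner, Implementer, Tools/Repo, Orchestrator). The user goal "add JWT auth" reaches the Planner, which reads its packet and issues a directive: scaffold middleware at src/auth/jwt.ts. The Implementer makes N tool calls (read, edit, run), receives file contents, stdout, and errors, and reports "done, see diff". The orchestrator runs git diff and pytest, then builds the next packet from the diff and test result. The loop repeats until the Planner emits GOAL COMPLETE.]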
One task, five lanes. Plans flow right; tool calls and diffs flow back. The orchestrator is the only entity that talks to both agents.

Context Engineering

Context management turned out to be the single most important engineering decision. Bigger context windows do not automatically mean better results — the "lost in the middle" effect (Liu et al., Stanford, TACL 2024) shows models drop 30%+ accuracy on content in the middle of long contexts, even with 1M-token windows.

The context packet sent to the Planner before each turn includes:

[Figure: the context packet, built fresh each iteration and sent to the Planner. Static context, which changes slowly: file tree (repo structure), git log (last 15 commits), README (project intent). Feedback signal, plan vs. reality: previous diff plus test output (what changed last round, what passed, what failed).]
Three slow-changing slices plus one dynamic feedback signal — built fresh every iteration.

What you exclude matters as much as what you include. The packet deliberately omits full file contents, dependency trees, and generated documentation. Recent research on AGENTS.md-style context files (early 2026) found that LLM-generated context actually decreased agent success rates and increased inference costs by over 20%. The shallow packet keeps the Planner focused on the delta — what changed and what to do next — rather than drowning it in static detail.
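
A minimal sketch of how such a packet might be assembled, assuming the git log, previous diff, and test output have already been captured as strings. The names `file_tree`, `clip`, and `build_packet`, the ignore list, the `README.md` filename, and the per-section character limit are all illustrative assumptions, not the repository's actual code.

```python
import os
import tempfile
from pathlib import Path

SKIP_DIRS = {".git", "node_modules", "__pycache__"}  # assumed ignore list


def file_tree(root: Path) -> str:
    """List relative file paths only; file contents are deliberately left out."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk does not descend into them.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            paths.append((Path(dirpath) / name).relative_to(root).as_posix())
    return "\n".join(paths)


def clip(text: str, limit: int) -> str:
    """Keep the tail of long text (an assumption that suits test logs, whose summary comes last)."""
    return text if len(text) <= limit else "[...clipped...]\n" + text[-limit:]


def build_packet(root: Path, git_log: str, prev_diff: str, test_output: str,
                 limit: int = 4000) -> str:
    readme = root / "README.md"
    sections = [
        ("FILE TREE", file_tree(root)),
        ("GIT LOG (last 15 commits)", git_log),
        ("README", readme.read_text(encoding="utf-8") if readme.exists() else "(none)"),
        ("PREVIOUS DIFF", prev_diff or "(first round)"),
        ("TEST OUTPUT", test_output or "(not run yet)"),
    ]
    return "\n\n".join(f"## {title}\n{clip(body, limit)}" for title, body in sections)


# Demo on a throwaway directory rather than a real repository.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "README.md").write_text("Demo app", encoding="utf-8")
    (repo / "src").mkdir()
    (repo / "src" / "app.py").write_text("print('hi')", encoding="utf-8")
    packet = build_packet(repo, "abc123 initial commit", "", "")
print(packet)
```

The packet lists `src/app.py` by path but never includes its contents, which mirrors the exclusion rule described above.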

The test output is the part that took me the longest to get right, and it's also the part I'm most certain about. AlphaCodium (Ridnik et al., 2024) reported an unexpected negative result: "injecting the last git patch diff to the prompt: no improvement was seen." Raw diffs alone are not a useful feedback signal — the Planner has no way to know whether the diff was actually correct. Diffs paired with test results are a different story. The diff says what changed; the test results say whether it worked.

I still want to do better here. Aider's repository map system uses tree-sitter to parse 26+ languages into ASTs and applies personalized PageRank to identify the most relevant symbols relative to the current edit. The result is a scope-aware elided code view that fits in a small token budget. My orchestrator doesn't do this yet — it just sends the full file tree. PageRank-based context selection is the highest-value thing I'd add next.
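
To make the idea concrete, here is a small pure-Python sketch of personalized PageRank over a reference graph. It is not Aider's implementation: the graph, the file names, and the function name are hypothetical, and a real repository map would derive the edges from parsed symbols rather than from a hand-written list.

```python
from typing import Dict, Iterable, List, Tuple


def personalized_pagerank(
    edges: Iterable[Tuple[str, str]],  # (referencing file, referenced file)
    seeds: Iterable[str],              # files involved in the current edit
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    edges = list(edges)
    seeds = set(seeds)
    if not seeds:
        raise ValueError("at least one seed is required")
    nodes = sorted({n for edge in edges for n in edge} | seeds)
    out: Dict[str, List[str]] = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)

    # The random walk restarts only at the seeds, which is what makes it "personalized".
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for target in out[n]:
                    nxt[target] += share
            else:
                # Dangling node: return its mass to the seeds so the total stays 1.
                for s in seeds:
                    nxt[s] += damping * rank[n] / len(seeds)
        rank = nxt
    return rank


edges = [
    ("app.py", "jwt.py"),
    ("jwt.py", "tokens.py"),
    ("billing.py", "invoice.py"),
]
ranks = personalized_pagerank(edges, seeds={"jwt.py"})
for name, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {score:.3f}")
```

Files reachable from the seed pick up rank, while the unrelated billing files stay at zero, so a token budget spent on the top-ranked entries would skip them.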

A counterintuitive finding from JetBrains Research (NeurIPS 2025 DL4C Workshop) is worth flagging: observation masking — hiding verbose tool outputs from older turns while preserving the action and reasoning history — matched or beat LLM summarization in 4 of 5 settings, while being simpler and cheaper. Both approaches cut token costs by 50%+ compared to unmanaged context. The lesson: don't summarize what you can hide.
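
Observation masking is simple enough to sketch directly. The version below is one reading of the idea, not the JetBrains implementation: the `Turn` type, the `kind` labels, and the placeholder text are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Turn:
    kind: str  # "reasoning", "action", or "observation" (tool output)
    text: str


def mask_old_observations(
    turns: List[Turn],
    keep_last: int = 2,
    placeholder: str = "[output omitted]",
) -> List[Turn]:
    """Hide all but the most recent tool outputs; reasoning and actions stay intact."""
    obs_indexes = [i for i, t in enumerate(turns) if t.kind == "observation"]
    # Guard keep_last == 0, because a [-0:] slice would keep every observation.
    keep = set(obs_indexes[-keep_last:]) if keep_last > 0 else set()
    return [
        replace(t, text=placeholder) if t.kind == "observation" and i not in keep else t
        for i, t in enumerate(turns)
    ]


history = [
    Turn("action", "ls src/"),
    Turn("observation", "app.py\nauth/\nroutes/"),
    Turn("action", "cat src/app.py"),
    Turn("observation", "...several hundred lines of source..."),
    Turn("reasoning", "The router is created in app.py."),
    Turn("observation", "pytest: 3 passed"),
]
for turn in mask_old_observations(history, keep_last=1):
    print(f"{turn.kind:12s} {turn.text!r}")
```

No model call is needed to shrink the history, which is where the cost advantage over summarization comes from.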

The Feedback Loop

The loop is where the system earns its keep. After the Implementer reports back, the orchestrator runs git diff and the project's test command, then pipes both into the next Planner round.

That's when the interesting things start happening. The Planner notices that the Implementer added a route but forgot to register it in the router. It notices that a new function was added but no tests, and that the existing tests still pass — meaning the change isn't actually exercised. It notices that an edit touched an unrelated file by accident. It writes the next directive with all of that in hand: "good work on the middleware — now wire it into src/app.ts and add a test for the invalid-token case."

This matches the broader research consensus that external feedback signals are essential for iterative improvement to work. Self-Refine (Madaan et al., NeurIPS 2023) showed ~20% average improvement from generate-feedback-refine loops. Self-Debugging (Chen et al., ICLR 2024) gained up to +12% accuracy with execution feedback. Reflexion (Shinn et al., NeurIPS 2023) pushed HumanEval to 91% pass@1 by adding episodic memory of past failures.

But the same literature contains a sharp warning: intrinsic self-correction without external feedback degrades performance. "Large Language Models Cannot Self-Correct Reasoning Yet" (Huang et al., ICLR 2024) showed LLMs sometimes change correct answers to wrong ones when asked to self-evaluate without grounding. A TACL 2024 survey (Kamoi et al.) concluded that "no prior work demonstrates successful self-correction with feedback from prompted LLMs" — only with grounded signals like code execution.

The implication for any Planner/Implementer architecture is direct: the feedback loop is only as good as the external signals you wire into it. Diffs without tests are just opinion. Tests without diffs are missing the cause. You need both.
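
A sketch of how the two signals might be captured together, assuming `git` is on the PATH and the project's test command is supplied by the caller. `run_capture` and `collect_feedback` are hypothetical names, and the section headers in the returned string are an arbitrary choice.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_capture(cmd: List[str], cwd: Optional[str] = None) -> Tuple[int, str]:
    """Run a command and return (exit code, stdout + stderr) without raising on failure."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def collect_feedback(repo: str, test_cmd: List[str]) -> str:
    """Pair the diff (what changed) with the test result (whether it worked)."""
    # `git diff HEAD` covers staged and unstaged edits to tracked files;
    # brand-new untracked files would need separate handling.
    _, diff = run_capture(["git", "diff", "HEAD"], cwd=repo)
    code, tests = run_capture(test_cmd, cwd=repo)
    verdict = "PASSED" if code == 0 else f"FAILED (exit code {code})"
    return f"## DIFF\n{diff or '(no changes)'}\n\n## TESTS: {verdict}\n{tests}"


# Demo of the capture helper with a command that fails on purpose.
code, output = run_capture([sys.executable, "-c", "import sys; print('boom'); sys.exit(3)"])
print(code, output.strip())
```

In a real run the call might look like `collect_feedback("./my-app", ["pytest", "-q"])`. A failing test run is reported rather than raised, since a failure is exactly the signal the Planner needs to see.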

[Figure: a three-iteration run. Iteration 1: the plan scaffolds routes and middleware; the implementation adds 4 files and 180 lines. Iteration 2: the plan fixes gaps (wire the router, add tests); the implementation adds 2 files and 60 lines. Iteration 3: the plan polishes (edge case, cleanup); the implementation changes about 20 lines. The scope of changes narrows each round until GOAL COMPLETE.]
A typical 3-iteration run. Each round produces a smaller diff than the last as the Planner converges on the goal.

Running It

A full run looks like this:

# Arguments: the target repo, then the natural-language goal.
# --effort         thinking budget per turn (low | medium | max)
# --planner-model  model for the Planner (opus = careful)
# --impl-model     model for the Implementer (same here)
python3 claude_loop.py ./my-app "Add JWT auth with refresh tokens" \
  --effort max \
  --planner-model opus \
  --impl-model opus

The key options:

Knowing When to Stop

A loop that doesn't know when to stop is worse than no loop at all.

The empirical evidence on iteration is consistent: most improvement happens in the first 1–2 iterations, with rapidly diminishing returns. Self-Refine's own data shows the bulk of gain in iterations 0→2. Self-Debugging plateaus at 2–3 turns. AlphaCodium captures the largest gains on the first test-fixing pass.

There are also failure modes that get worse with more iterations:

Termination in my orchestrator combines four signals, in order of priority:

  1. The Planner declares victory. After reviewing the most recent diff and test output, if the Planner concludes the goal is achieved, it emits a GOAL COMPLETE sentinel and the orchestrator exits cleanly.
  2. External validation gates pass. If all the project's tests pass, the linter is clean, and the diff matches the goal, that's strong evidence to stop.
  3. No-change detection. If two consecutive Implementer turns produce no meaningful diff, the loop has stalled and there's nothing more to learn from continuing.
  4. Iteration cap. Set with --max-iterations, default 5. The escape hatch for runs stuck in a local minimum.

In practice, the first two mechanisms handle roughly 80% of completed runs. The cap catches the pathological cases.
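
The four signals can be expressed as a single priority-ordered check. This is a sketch under assumptions: `LoopState`, its fields, and the reason strings are hypothetical, and `diff_matches_goal` stands in for a judgment that the caller has to supply, since nothing mechanical decides it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoopState:
    iteration: int            # completed rounds so far
    planner_output: str       # latest Planner message
    tests_passed: bool
    lint_clean: bool
    diff_matches_goal: bool   # assumed to be judged elsewhere
    recent_diffs: List[str]   # one entry per Implementer turn, oldest first


def stop_reason(state: LoopState, max_iterations: int = 5) -> Optional[str]:
    """Return why the loop should stop, or None to keep going."""
    # 1. The Planner declares victory.
    if "GOAL COMPLETE" in state.planner_output:
        return "planner_complete"
    # 2. External validation gates pass.
    if state.tests_passed and state.lint_clean and state.diff_matches_goal:
        return "gates_passed"
    # 3. No-change detection: two consecutive empty diffs.
    last_two = state.recent_diffs[-2:]
    if len(last_two) == 2 and all(not d.strip() for d in last_two):
        return "stalled"
    # 4. Iteration cap.
    if state.iteration >= max_iterations:
        return "iteration_cap"
    return None


stalled = LoopState(
    iteration=3,
    planner_output="Next: wire the middleware into src/app.ts",
    tests_passed=False,
    lint_clean=True,
    diff_matches_goal=False,
    recent_diffs=["+ added jwt.ts", "", "  "],
)
print(stop_reason(stalled))
```

Returning a reason instead of a bare boolean makes it easy to log which mechanism ended each run.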

The other 20% usually fails in a more subtle way: the Planner emits GOAL COMPLETE because the tests pass, but the feature is quietly incomplete — an edge case unhandled, an error path uncovered, a refactor that left dead code behind. A more principled termination check might dispatch a verification prompt to a third Claude instance, asking it to grade the final state against the original goal — though that adds the kind of complexity I deliberately tried to avoid.

Recent theoretical work on Markovian Generation Chains formalizes the intuition behind early termination — under greedy decoding, iterative LLM output rapidly converges to fixed points or short cycles. That's both an explanation for why three iterations is usually enough and a warning about why long runs don't help.

What You Can Build

I've used the loop on a range of tasks. Some it handles cleanly, others reveal the seams.

Typical runs by the numbers. A well-scoped bug fix completes in 2–3 iterations and costs roughly $0.50–$2.00 in API usage. A full feature like adding JWT auth runs 4–6 iterations at $3–$8. Test coverage campaigns are the most iteration-heavy — sometimes 8–10 rounds — but each round is cheap because the Implementer is mostly writing straightforward test code. Token usage scales roughly linearly with iteration count: about 15K–25K tokens per round for the Planner and 20K–40K for the Implementer.
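
The linear scaling makes a back-of-the-envelope token estimate easy. The sketch below only multiplies the per-round ranges quoted above by an iteration count; the function name is hypothetical, and dollar cost is left out because it depends on model pricing.

```python
from typing import Tuple

PLANNER_TOKENS = (15_000, 25_000)      # per round, from the figures above
IMPLEMENTER_TOKENS = (20_000, 40_000)  # per round, from the figures above


def token_range(iterations: int) -> Tuple[int, int]:
    """Low and high token estimates for a run, assuming linear scaling per round."""
    low = iterations * (PLANNER_TOKENS[0] + IMPLEMENTER_TOKENS[0])
    high = iterations * (PLANNER_TOKENS[1] + IMPLEMENTER_TOKENS[1])
    return low, high


# Upper-end iteration counts for the three task types mentioned above.
for label, rounds_needed in [("bug fix", 3), ("feature", 6), ("coverage campaign", 10)]:
    low, high = token_range(rounds_needed)
    print(f"{label}: {rounds_needed} rounds -> {low:,} to {high:,} tokens")
```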

Works well

Works poorly

What I'd Build Next

The architecture works. The interesting questions now are about the things around it.

Get Started

The full source is on GitHub: github.com/josephgec/claude-agent.

There's no installation beyond cloning the repo and having claude in your PATH. The orchestrator is a single Python file. Fork it, modify the system prompts, tune it to your taste — the architecture is simple enough that you can reshape it without fighting a framework.

If you try it on something interesting, I'd love to hear what worked and what broke. The most valuable feedback so far has been about the termination criterion — the Planner's GOAL COMPLETE call is fuzzy, and there's a more principled version hiding somewhere in the validation-gate plus no-change-detection combination.

References

A reference list of the research, blog posts, and open-source projects this article draws on. The literature on coding agents fragments into four loosely connected camps: multi-agent systems, iterative self-refinement, planning & execution, and the supporting work on context management and benchmarks.

[Figure: the two-agent loop (this essay) at the intersection of four research areas. Multi-agent systems (role-based, pipelined): MetaGPT, ChatDev, AutoGen, Magentic-One, SWE-Agent, Aider Architect/Editor. Iterative self-refinement (feedback, debug, reflect): Self-Refine, Self-Debug, Reflexion, LATS, AlphaCodium, Cannot-Self-Correct, RLEF. Planning and execution (decompose, then act): Plan-and-Solve, ADaPT, ReWOO, LLMCompiler, GoalAct, Plan-and-Act. Context, termination, and evals (what to keep, what to cut): Lost in the Middle, JetBrains, SWE-bench, LiveCodeBench, FeatureBench, Failure Taxonomy.]
Fifty-plus papers, four conversations. The two-agent loop sits at the intersection: it borrows the role split from multi-agent work, the iteration discipline from self-refinement, the decomposition habit from planning research, and the don't-bloat-the-context lesson from the long-context literature.

Multi-agent coding systems

Iterative self-refinement

Planning and execution

Context and termination

Benchmarks & tools