Two AIs, One Loop — Building a Self-Improving Code Agent
A well-designed two-agent architecture — one LLM planning, another implementing, with git diffs and test results flowing between them — captures most of the value of multi-agent coding systems while avoiding their complexity. Here's what I built, and what the research literature says about why it works.
LLMs are remarkable at taking a bounded request and producing code. But real software engineering isn't bounded. Building a feature means planning, coding, testing, discovering you got it wrong, and trying again. It's iterative by nature.
Tools like Claude Code already iterate internally — gather context, take action, verify, repeat. But they mix planning and coding in a single context window. When one instance does both, there's a strong bias toward jumping to code even when the right move is to stop and think. What if you could split the roles and give each one room to do its job fully?
I wanted to see what would happen if I closed the loop myself. Not a fancy multi-agent framework. Not a task tree. Just two Claude Code instances wired together in a cycle, with an orchestrator forwarding context, diffs, and test results between them until the job is done.
What I found, both in my own experiments and in the surprisingly large body of research that already exists on this question, is that the architecture matters less than I expected — and the engineering decisions around it matter much more.
The Landscape
Before I started, I wanted to understand what others had built. The space turns out to be larger and more contradictory than it looks.
On one end, there are role-based multi-agent frameworks like MetaGPT (5+ roles passing structured artifacts through a software-development pipeline) and ChatDev (7 roles communicating through pairwise "atomic chats"). Both treat software development as a virtual company with a division of labor, and both produce impressive demos; ChatDev reports completing small projects in under 7 minutes for under a dollar.
On the other end, there are deceptively powerful single-agent systems. SWE-Agent showed that optimizing the Agent-Computer Interface — custom commands, structured search, linting on edit — matters more than agent count. Its mini-swe-agent variant achieves 65% on SWE-bench Verified in roughly 100 lines of Python. Claude Code itself, the tool I built on, scores 80.8% on SWE-bench Verified as a single agent paired with strong tooling.
In the middle sits the architecture I ended up with: two agents, one planning and one executing, with a tight feedback loop between them. The closest production implementation is Aider's Architect/Editor mode, which uses an Architect model (e.g., o1-preview) to plan changes and an Editor model (e.g., Claude 3.5 Sonnet) to execute the specific edits, with git tracking every change. It scored 85% on Aider's code editing benchmark — the highest recorded score on that suite, and a useful sanity check that the pattern is sound.
The industry trend through 2025 has actually moved away from heavy multi-agent orchestration toward "powerful single agents with excellent tooling." Devin, despite the $4B valuation and "first AI software engineer" branding, faced scrutiny when an independent Answer.AI evaluation found a 15% success rate on real tasks. The agents that perform best on benchmarks rely on strong models paired with well-designed interfaces, not on agent count.
So why bother with two agents at all? Two reasons, both grounded in research.
Why Separate Planning from Execution
The case for splitting planning from execution — even just into two roles — is empirically strong. Several independent research lines converge on the same finding:
- Plan-and-Solve Prompting (Wang et al., ACL 2023) showed that simply asking the model to "first understand the problem and devise a plan" consistently outperforms zero-shot Chain-of-Thought.
- ADaPT (Prasad et al., NAACL 2024) introduced recursive decomposition — when the executor fails, the planner decomposes further — achieving +28% on ALFWorld, +27% on WebShop, and +33% on TextCraft over ReAct baselines.
- GoalAct (Chen et al., 2025) quantified the inverse: removing global planning reduced agent performance by 8% on average and 14% on coding tasks specifically.
- Plan-and-Act (Google DeepMind, 2025) demonstrated state-of-the-art on web navigation by separating high-level planning from low-level execution and training each side independently.
Anthropic's own "Building Effective Agents" post (December 2024) names two of its five recommended patterns directly: Orchestrator-Workers and Evaluator-Optimizer. A Planner/Implementer loop is exactly the Evaluator-Optimizer pattern with a small twist — the evaluator and the planner are the same agent, holding the long-running plan in its head while reviewing each implementation step. As Anthropic put it: "the most successful implementations weren't using complex frameworks... they were building with simple, composable patterns."
The dual-process framing from cognitive science maps cleanly onto the two roles. The Implementer is System 1 — fast, reactive, focused on execution. The Planner is System 2 — slower, deliberate, focused on what should happen next and why. The split isn't strictly necessary, but it's recognizably useful, and the research above suggests the small extra latency is worth it.
There's also a practical reason. A single Claude instance asked to both plan and code has a strong bias toward writing code, even when the right move is to stop and think. Splitting the roles forces each instance to do its job fully before handing off. I noticed this within an hour of testing — the moment I gave one model both jobs, it stopped reasoning about whether the previous step was actually correct.
Two Roles, One Goal
The architecture is the simplest thing that works: two Claude Code instances, each with a different system prompt, each specialized for one part of the cycle.
The Planner is the senior tech lead. It sees the current state of the repository — file tree, README, recent git log, the last diff produced, the test output — and decides what needs to happen next. It doesn't write code. Its job is to think, explain, and hand off a clear instruction. Its output is a natural-language directive like "add a JWT middleware at src/auth/jwt.ts that validates the access token and sets req.user. Don't touch the refresh flow yet — we'll do that next."
The Implementer is the engineer. It receives the directive from the Planner and executes it. It reads the files it needs, makes the edits, runs the tests, and reports back what it changed. It does not second-guess the plan. Its job is to make the diff happen cleanly.
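The split is enforced entirely through system prompts. A minimal sketch; the wording below is my own illustration of the two roles as the article describes them, not the exact prompts from the repository:

```python
# Illustrative system prompts for the two roles. The wording is a sketch,
# not the prompts shipped in the repository.
PLANNER_PROMPT = """You are a senior tech lead. You never write code.
Review the repository state, the last diff, and the test output, then emit
one clear natural-language directive for the engineer. When the goal is
fully achieved, reply with the single line: GOAL COMPLETE."""

IMPLEMENTER_PROMPT = """You are an engineer. Execute the directive you are
given: read the files you need, make the edits, run the tests, and report
what you changed. Do not second-guess the plan."""
```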
Architecture
The orchestrator is a thin Python shim. It holds the conversation state for each role, captures git diffs and test output between rounds, and forwards a curated context packet to each Planner turn.
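A minimal sketch of that shim, assuming a non-interactive `claude -p` invocation (check `claude --help` for the real flags). The role calls are injected as plain callables so the cycle itself can be exercised without the CLI:

```python
import subprocess

def call_claude(system_prompt: str, message: str) -> str:
    """One non-interactive Claude Code turn. The flags here are an
    assumption about the CLI surface; check `claude --help`."""
    result = subprocess.run(
        ["claude", "-p", message, "--append-system-prompt", system_prompt],
        capture_output=True, text=True, timeout=600,
    )
    return result.stdout

def run_loop(goal, planner, implementer, build_packet, max_iterations=5):
    """Alternate Implementer and Planner turns until the Planner emits the
    GOAL COMPLETE sentinel or the iteration cap is hit. Returns the number
    of iterations used."""
    directive = goal
    for i in range(1, max_iterations + 1):
        report = implementer(directive)           # engineer produces a diff
        packet = build_packet()                   # file tree, diff, tests
        verdict = planner(f"{packet}\n\nImplementer report:\n{report}")
        if "GOAL COMPLETE" in verdict:            # Planner declares victory
            return i
        directive = verdict                       # verdict is the next directive
    return max_iterations
```

In production the two callables would be `call_claude` partially applied with each role's system prompt; keeping them injectable makes the loop trivially testable.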
Context Engineering
Context management turned out to be the single most important engineering decision. Bigger context windows do not automatically mean better results — the "lost in the middle" effect (Liu et al., Stanford, TACL 2024) shows models drop 30%+ accuracy on content in the middle of long contexts, even with 1M-token windows.
The context packet sent to the Planner before each turn includes:
- File tree — a `find`-style listing of the repository, excluding `node_modules`, `.git`, and build artifacts.
- Git log — the last 15 commits on the current branch, for project memory.
- README — verbatim, so the Planner understands what the project is for.
- Previous diff — the unified diff the Implementer just produced.
- Test output — pass/fail counts and failure traces, captured from running the project's test command after the previous Implementer turn.
What you exclude matters as much as what you include. The packet deliberately omits full file contents, dependency trees, and generated documentation. Recent research on AGENTS.md-style context files (early 2026) found that LLM-generated context actually decreased agent success rates and increased inference costs by over 20%. The shallow packet keeps the Planner focused on the delta — what changed and what to do next — rather than drowning it in static detail.
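Assembling the packet is a handful of subprocess calls. A sketch with illustrative commands (`git ls-files` stands in for the find-style listing, since it skips untracked build artifacts for free):

```python
import subprocess

def sh(cmd: list[str], cwd: str) -> str:
    """Run a command, return stdout; empty string on any failure."""
    try:
        return subprocess.run(cmd, cwd=cwd, capture_output=True,
                              text=True, timeout=120).stdout
    except (OSError, subprocess.TimeoutExpired):
        return ""

def build_context_packet(repo: str, last_diff: str, test_output: str) -> str:
    """Assemble the shallow Planner packet: tree, log, README, diff, tests.
    Commands and section ordering are illustrative."""
    tree = sh(["git", "ls-files"], repo)   # tracked files only: no
                                           # node_modules, .git, or build output
    log = sh(["git", "log", "--oneline", "-15"], repo)
    try:
        with open(f"{repo}/README.md") as f:
            readme = f.read()
    except OSError:
        readme = "(no README)"
    return "\n\n".join([
        f"## File tree\n{tree}",
        f"## Recent commits\n{log}",
        f"## README\n{readme}",
        f"## Previous diff\n{last_diff}",
        f"## Test output\n{test_output}",
    ])
```

Note what is absent: no full file contents, no dependency tree, no generated docs. The packet stays small by construction.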
The test output is the part that took me the longest to get right, and it's also the part I'm most certain about. AlphaCodium (Ridnik et al., 2024) reported an unexpected negative result: "injecting the last git patch diff to the prompt: no improvement was seen." Raw diffs alone are not a useful feedback signal — the Planner has no way to know whether the diff was actually correct. Diffs paired with test results are a different story. The diff says what changed; the test results say whether it worked.
I still want to do better here. Aider's repository map system uses tree-sitter to parse 26+ languages into ASTs and applies personalized PageRank to identify the most relevant symbols relative to the current edit. The result is a scope-aware elided code view that fits in a small token budget. My orchestrator doesn't do this yet — it just sends the full file tree. PageRank-based context selection is the highest-value thing I'd add next.
A counterintuitive finding from JetBrains Research (NeurIPS 2025 DL4C Workshop) is worth flagging: observation masking — hiding verbose tool outputs from older turns while preserving the action and reasoning history — matched or beat LLM summarization in 4 of 5 settings, while being simpler and cheaper. Both approaches cut token costs by 50%+ compared to unmanaged context. The lesson: don't summarize what you can hide.
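Observation masking is a few lines of list surgery. A sketch, assuming a simple role/content message shape for the conversation history:

```python
def mask_observations(history: list[dict], keep_last: int = 2) -> list[dict]:
    """Observation masking in the spirit of the JetBrains result: keep every
    action/reasoning message, but replace verbose tool outputs from older
    turns with a short placeholder. The message shape is an assumption."""
    masked = []
    cutoff = len(history) - keep_last       # messages before this get masked
    for i, msg in enumerate(history):
        if msg["role"] == "tool" and i < cutoff:
            masked.append({
                "role": "tool",
                "content": f"[output elided: {len(msg['content'])} chars]",
            })
        else:
            masked.append(msg)              # actions and recent output survive
    return masked
```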
The Feedback Loop
The loop is where the system earns its keep. After the Implementer reports back, the orchestrator runs git diff and the project's test command, then pipes both into the next Planner round.
That's when the interesting things start happening. The Planner notices that the Implementer added a route but forgot to register it in the router. It notices that a new function was added but no tests, and that the existing tests still pass — meaning the change isn't actually exercised. It notices that an edit touched an unrelated file by accident. It writes the next directive with all of that in hand: "good work on the middleware — now wire it into src/app.ts and add a test for the invalid-token case."
This matches the broader research consensus that external feedback signals are essential for iterative improvement to work. Self-Refine (Madaan et al., NeurIPS 2023) showed ~20% average improvement from generate-feedback-refine loops. Self-Debugging (Chen et al., ICLR 2024) gained up to +12% accuracy with execution feedback. Reflexion (Shinn et al., NeurIPS 2023) pushed HumanEval to 91% pass@1 by adding episodic memory of past failures.
But the same literature contains a sharp warning: intrinsic self-correction without external feedback degrades performance. "Large Language Models Cannot Self-Correct Reasoning Yet" (Huang et al., ICLR 2024) showed LLMs sometimes change correct answers to wrong ones when asked to self-evaluate without grounding. A TACL 2024 survey (Kamoi et al.) concluded that "no prior work demonstrates successful self-correction with feedback from prompted LLMs" — only with grounded signals like code execution.
The implication for any Planner/Implementer architecture is direct: the feedback loop is only as good as the external signals you wire into it. Diffs without tests are just opinion. Tests without diffs are missing the cause. You need both.
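Capturing both signals is deliberately dumb. A sketch; the explicit PASS/FAIL prefix is my own convention, so the Planner never has to infer the exit status from the output text:

```python
import subprocess

def capture_feedback(repo: str, test_cmd: list[str]) -> tuple[str, str]:
    """Collect the two grounded signals after an Implementer turn: the diff
    (what changed) and the test run (whether it worked). `test_cmd` is
    whatever the project uses, e.g. ["pytest", "-q"] or ["npm", "test"]."""
    try:
        diff = subprocess.run(["git", "diff", "HEAD"], cwd=repo,
                              capture_output=True, text=True).stdout
    except OSError:
        diff = ""
    tests = subprocess.run(test_cmd, cwd=repo, capture_output=True, text=True)
    # Make pass/fail explicit instead of letting the Planner guess from prose.
    verdict = "PASS" if tests.returncode == 0 else "FAIL"
    return diff, f"[{verdict}]\n{tests.stdout}\n{tests.stderr}"
```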
Running It
A full run looks like this:
python3 claude_loop.py ./my-app "Add JWT auth with refresh tokens" \
--effort max \
--planner-model opus \
--impl-model opus
The key options:
- `--effort [low | medium | max]` — how aggressively each Claude instance should think before responding. At `max`, the Planner is notably more careful about edge cases.
- `--planner-model` and `--impl-model` — pick the model for each role independently. Opus for both when the goal is tricky; Sonnet for both when speed matters; mixed when the plan is hard but the code is mechanical.
- `--max-iterations` — safety cap so a confused loop doesn't run all night.
- `--timeout` — per-turn timeout so a stuck subprocess doesn't hang the loop.
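The flag surface maps directly onto `argparse`. A sketch mirroring the options above; defaults other than `--max-iterations` (which the text says is 5) are assumptions:

```python
import argparse

def parse_args(argv=None):
    """CLI surface for the loop. Defaults besides --max-iterations are
    illustrative assumptions, not the repository's actual values."""
    p = argparse.ArgumentParser(prog="claude_loop.py")
    p.add_argument("repo", help="path to the target repository")
    p.add_argument("goal", help="natural-language goal for the run")
    p.add_argument("--effort", choices=["low", "medium", "max"],
                   default="medium", help="how hard each instance thinks")
    p.add_argument("--planner-model", default="opus")
    p.add_argument("--impl-model", default="opus")
    p.add_argument("--max-iterations", type=int, default=5,
                   help="safety cap on loop rounds")
    p.add_argument("--timeout", type=int, default=600,
                   help="per-turn timeout in seconds")
    return p.parse_args(argv)
```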
Knowing When to Stop
A loop that doesn't know when to stop is worse than no loop at all.
The empirical evidence on iteration is consistent: most improvement happens in the first 1–2 iterations, with rapidly diminishing returns. Self-Refine's own data shows the bulk of gain in iterations 0→2. Self-Debugging plateaus at 2–3 turns. AlphaCodium captures the largest gains on the first test-fixing pass.
There are also failure modes that get worse with more iterations:
- Self-bias amplification — LLMs systematically overrate their own generations, and the bias compounds monotonically across refinement rounds (Xu et al., 2024).
- Self-conditioning — per-step accuracy degrades as the agent sees more of its own prior outputs.
- Oscillation — agents can ping-pong between two states without converging. Claude Code's own production incident report (GitHub Issue #15909) describes a sub-agent that consumed 27M tokens in an infinite loop over 4.6 hours.
Termination in my orchestrator combines four signals, in order of priority:
- The Planner declares victory. After reviewing the most recent diff and test output, if the Planner concludes the goal is achieved, it emits a `GOAL COMPLETE` sentinel and the orchestrator exits cleanly.
- External validation gates pass. If all the project's tests pass, the linter is clean, and the diff matches the goal, that's strong evidence to stop.
- No-change detection. If two consecutive Implementer turns produce no meaningful diff, the loop has stalled and there's nothing more to learn from continuing.
- Iteration cap. Set with `--max-iterations`, default 5. The escape hatch for runs stuck in a local minimum.
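The priority ordering collapses into a single check. A sketch: the `tests_pass` and `lint_clean` inputs stand in for the validation gates, and the fuzzy "diff matches the goal" judgment is omitted here:

```python
from typing import List, Optional

def should_stop(plan: str, tests_pass: bool, lint_clean: bool,
                last_two_diffs: List[str], iteration: int,
                max_iterations: int = 5) -> Optional[str]:
    """Evaluate the four termination signals in priority order. Returns a
    reason string when the loop should stop, else None."""
    if "GOAL COMPLETE" in plan:                 # 1. Planner declares victory
        return "planner declared victory"
    if tests_pass and lint_clean:               # 2. external gates pass
        return "validation gates passed"
    if len(last_two_diffs) == 2 and all(       # 3. two consecutive no-op turns
            not d.strip() for d in last_two_diffs):
        return "no-change: loop has stalled"
    if iteration >= max_iterations:             # 4. the escape hatch
        return "iteration cap reached"
    return None
```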
In practice, the first two mechanisms handle roughly 80% of completed runs. The cap catches the pathological cases.
The other 20% usually fails in a more subtle way: the Planner emits GOAL COMPLETE because the tests pass, but the feature is quietly incomplete — an edge case unhandled, an error path uncovered, a refactor that left dead code behind. A more principled termination check might dispatch a verification prompt to a third Claude instance, asking it to grade the final state against the original goal — though that adds the kind of complexity I deliberately tried to avoid.
Recent theoretical work on Markovian Generation Chains formalizes the intuition behind early termination — under greedy decoding, iterative LLM output rapidly converges to fixed points or short cycles. That's both an explanation for why three iterations is usually enough and a warning about why long runs don't help.
What You Can Build
I've used the loop on a range of tasks. Some it handles cleanly, others reveal the seams.
Typical runs by the numbers. A well-scoped bug fix completes in 2–3 iterations and costs roughly $0.50–$2.00 in API usage. A full feature like adding JWT auth runs 4–6 iterations at $3–$8. Test coverage campaigns are the most iteration-heavy — sometimes 8–10 rounds — but each round is cheap because the Implementer is mostly writing straightforward test code. Token usage scales roughly linearly with iteration count: about 15K–25K tokens per round for the Planner and 20K–40K for the Implementer.
Works well
- Well-scoped bug fixes where the failing test or error trace points at the right place.
- Full features that are mostly mechanical — adding routes, models, and a minimal test pass.
- Refactors where the goal can be stated precisely ("inline this helper, then delete it").
- Test coverage campaigns — "write tests until this file is at 80%."
Works poorly
- Open-ended architectural decisions. The Planner will commit to a direction and defend it; stepping back to reconsider is hard.
- Cross-repo or cross-service work. The loop only sees one repository.
- UI work that needs human judgment about visual quality.
- Greenfield feature development in unfamiliar codebases. FeatureBench found that even the strongest current frontier models score around 11% on feature development versus 74% on bug-fixing. Building new things is much harder than fixing existing ones, and the loop doesn't close that gap on its own.
What I'd Build Next
The architecture works. The interesting questions now are about the things around it.
- Repository-aware context selection. Borrow Aider's PageRank approach. Today my context is a flat file listing; the next version should weight files by relevance to the current goal.
- Tree-search rather than linear refinement. LATS (Zhou et al., ICML 2024) achieved 94.4% pass@1 on HumanEval by exploring multiple refinement paths via Monte Carlo Tree Search instead of iterating on a single one. For hard problems, branching beats iterating.
- RL-trained iteration. RLEF (Gehring et al., ICML 2025) showed that an RL-trained 8B model outperformed AlphaCodium's 100-sample approach with only 3 LLM responses. The implication: the loop's value depends on whether the model has been trained to use feedback well, not just on whether the loop exists.
- Sub-agent isolation for exploration. Claude Code itself uses sub-agents to delegate heavy file exploration to isolated context windows — 30K tokens of exploration returns as a 2K summary, a 93% context savings. My orchestrator should do the same for the Implementer's read phase.
Get Started
The full source is on GitHub: github.com/josephgec/claude-agent.
There's no installation beyond cloning the repo and having claude in your PATH. The orchestrator is a single Python file. Fork it, modify the system prompts, tune it to your taste — the architecture is simple enough that you can reshape it without fighting a framework.
If you try it on something interesting, I'd love to hear what worked and what broke. The most valuable feedback so far has been about the termination criterion — the Planner's GOAL COMPLETE call is fuzzy, and there's a more principled version hiding somewhere in the validation-gate plus no-change-detection combination.
References
A reference list of the research, blog posts, and open-source projects this article draws on. Most arXiv links open directly to the paper.
Multi-agent coding systems
- Hong et al. MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. ICLR 2024. arXiv:2308.00352
- Qian et al. ChatDev: Communicative Agents for Software Development. ACL 2024. arXiv:2307.07924
- Wu et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. 2023. arXiv:2308.08155
- Fourney et al. Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks. Microsoft Research, November 2024. microsoft.com
- Yang et al. SWE-Agent: Agent-Computer Interfaces Enable Automated Software Engineering. NeurIPS 2024. arXiv:2405.15793
- Aider — Architect/Editor mode. aider.chat
Iterative self-refinement
- Madaan et al. Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023. arXiv:2303.17651
- Chen et al. Teaching Large Language Models to Self-Debug. ICLR 2024. arXiv:2304.05128
- Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366
- Zhou et al. Language Agent Tree Search. ICML 2024. arXiv:2310.04406
- Ridnik et al. Code Generation with AlphaCodium. 2024. arXiv:2401.08500
- Huang et al. Large Language Models Cannot Self-Correct Reasoning Yet. ICLR 2024. arXiv:2310.01798
- Olausson et al. Is Self-Repair a Silver Bullet for Code Generation? ICLR 2024. arXiv:2306.09896
- Kamoi et al. When Can LLMs Actually Correct Their Own Mistakes? TACL 2024. arXiv:2406.01297
- Gehring et al. RLEF: Grounding Code LLMs in Execution Feedback with RL. ICML 2025. arXiv:2410.02089
Planning and execution
- Wang et al. Plan-and-Solve Prompting. ACL 2023. arXiv:2305.04091
- Prasad et al. ADaPT: As-Needed Decomposition and Planning with Language Models. NAACL 2024 Findings. arXiv:2311.05772
- Xu et al. ReWOO: Decoupling Reasoning from Observations. 2023. arXiv:2305.18323
- Kim et al. LLMCompiler: An LLM Compiler for Parallel Function Calling. ICML 2024. arXiv:2312.04511
- Chen et al. GoalAct: Planning Decomposition Improves Coding Agents. 2025. arXiv:2504.16563
- Plan-and-Act. Google DeepMind, 2025. arXiv:2503.09572
- Schluntz & Zhang. Building Effective Agents. Anthropic, December 2024. anthropic.com
Context and termination
- Liu et al. Lost in the Middle: How Language Models Use Long Contexts. TACL 2024. arXiv:2307.03172
- Fraser & Lindenbauer. Context Compression for Coding Agents. JetBrains Research, NeurIPS 2025 DL4C Workshop. arXiv:2508.21433
- Cemri et al. A Taxonomy of Multi-Agent Failures. 2025. arXiv:2503.13657
Benchmarks & tools
- Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024. arXiv:2310.06770
- SWE-bench Pro. Scale AI, 2025. arXiv:2509.16941
- Jain et al. LiveCodeBench. 2024. arXiv:2403.07974
- Anthropic. Claude Code. anthropic.com/claude-code
- Claude Agent SDK. docs.claude.com
- Project source: github.com/josephgec/claude-agent