Two AIs, One Loop — Building a Self-Improving Code Agent
A well-designed two-agent architecture — one LLM planning, another implementing, with git diffs and test results flowing between them — captures most of the value of multi-agent coding systems while avoiding their complexity. Here's what I built, and what the research literature says about why it works.
LLMs are remarkable at taking a bounded request and producing code. But real software engineering isn't bounded. Building a feature means planning, coding, testing, discovering you got it wrong, and trying again. It's iterative by nature.
Tools like Claude Code already iterate internally — gather context, take action, verify, repeat. But they mix planning and coding in a single context window. When one instance does both, there's a strong bias toward jumping to code even when the right move is to stop and think. What if you could split the roles and give each one room to do its job fully?
I wanted to see what would happen if I closed the loop myself. Not a fancy multi-agent framework. Not a task tree. Just two Claude Code instances wired together in a cycle, with an orchestrator forwarding context, diffs, and test results between them until the job is done.
What I found, both in my own experiments and in the surprisingly large body of research that already exists on this question, is that the architecture matters less than I expected — and the engineering decisions around it matter much more.
The Landscape
Before I started, I wanted to understand what others had built. The space turns out to be larger and more contradictory than it looks.
On one end, there are role-based multi-agent frameworks like MetaGPT (5+ roles passing structured artifacts through a software-development pipeline) and ChatDev (7 roles communicating through pairwise "atomic chats"). Both treat software development as a virtual company with a division of labor, and both produce impressive demos; ChatDev reports completing small projects in under 7 minutes for under a dollar.
On the other end, there are deceptively powerful single-agent systems. SWE-Agent showed that optimizing the Agent-Computer Interface — custom commands, structured search, linting on edit — matters more than agent count. Its mini-swe-agent variant achieves 65% on SWE-bench Verified in roughly 100 lines of Python. Claude Code itself, the tool I built on, scores 80.8% on SWE-bench Verified as a single agent paired with strong tooling.
In the middle sits the architecture I ended up with: two agents, one planning and one executing, with a tight feedback loop between them. The closest production implementation is Aider's Architect/Editor mode, which uses an Architect model (e.g., o1-preview) to plan changes and an Editor model (e.g., Claude 3.5 Sonnet) to execute the specific edits, with git tracking every change. It scored 85% on Aider's code editing benchmark — the highest recorded score on that suite, and a useful sanity check that the pattern is sound.
The industry trend through 2025 has actually moved away from heavy multi-agent orchestration toward "powerful single agents with excellent tooling." Devin, despite the $4B valuation and "first AI software engineer" branding, faced scrutiny when an independent Answer.AI evaluation found a 15% success rate on real tasks. The agents that perform best on benchmarks rely on strong models paired with well-designed interfaces, not on agent count.
So why bother with two agents at all? Two reasons, both grounded in research.
Why Separate Planning from Execution
The case for splitting planning from execution — even just into two roles — is empirically strong. Several independent research lines converge on the same finding:
- Plan-and-Solve Prompting (Wang et al., ACL 2023) showed that simply asking the model to "first understand the problem and devise a plan" consistently outperforms zero-shot Chain-of-Thought.
- ADaPT (Prasad et al., NAACL 2024) introduced recursive decomposition — when the executor fails, the planner decomposes further — achieving +28% on ALFWorld, +27% on WebShop, and +33% on TextCraft over ReAct baselines.
- GoalAct (Chen et al., 2025) quantified the inverse: removing global planning reduced agent performance by 8% on average and 14% on coding tasks specifically.
- Plan-and-Act (Google DeepMind, 2025) demonstrated state-of-the-art on web navigation by separating high-level planning from low-level execution and training each side independently.
Anthropic's own "Building Effective Agents" post (December 2024) names two of its five recommended patterns directly: Orchestrator-Workers and Evaluator-Optimizer. A Planner/Implementer loop is exactly the Evaluator-Optimizer pattern with a small twist — the evaluator and the planner are the same agent, holding the long-running plan in its head while reviewing each implementation step. As Anthropic put it: "the most successful implementations weren't using complex frameworks... they were building with simple, composable patterns."
The dual-process framing from cognitive science maps cleanly onto the two roles. The Implementer is System 1 — fast, reactive, focused on execution. The Planner is System 2 — slower, deliberate, focused on what should happen next and why. The split isn't strictly necessary, but it's recognizably useful, and the research above suggests the small extra latency is worth it.
There's also a practical reason. A single Claude instance asked to both plan and code has a strong bias toward writing code, even when the right move is to stop and think. Splitting the roles forces each instance to do its job fully before handing off. I noticed this within an hour of testing — the moment I gave one model both jobs, it stopped reasoning about whether the previous step was actually correct.
Two Roles, One Goal
The architecture is the simplest thing that works: two Claude Code instances, each with a different system prompt, each specialized for one part of the cycle.
The Planner is the senior tech lead. It sees the current state of the repository — file tree, README, recent git log, the last diff produced, the test output — and decides what needs to happen next. It doesn't write code. Its job is to think, explain, and hand off a clear instruction. Its output is a natural-language directive like "add a JWT middleware at src/auth/jwt.ts that validates the access token and sets req.user. Don't touch the refresh flow yet — we'll do that next."
The Implementer is the engineer. It receives the directive from the Planner and executes it. It reads the files it needs, makes the edits, runs the tests, and reports back what it changed. It does not second-guess the plan. Its job is to make the diff happen cleanly.
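The split is enforced entirely through system prompts. A minimal sketch; the wording below is my own illustration of the two roles as the article describes them, not the exact prompts from the repository:

```python
# Illustrative system prompts for the two roles. The wording is a sketch,
# not the prompts shipped in the repository.
PLANNER_PROMPT = """You are a senior tech lead. You never write code.
Review the repository state, the last diff, and the test output, then emit
one clear natural-language directive for the engineer. When the goal is
fully achieved, reply with the single line: GOAL COMPLETE."""

IMPLEMENTER_PROMPT = """You are an engineer. Execute the directive you are
given: read the files you need, make the edits, run the tests, and report
what you changed. Do not second-guess the plan."""
```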
Architecture
The orchestrator is a thin Python shim. It holds the conversation state for each role, captures git diffs and test output between rounds, and forwards a curated context packet to each Planner turn.
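A minimal sketch of that shim, assuming a non-interactive `claude -p` invocation (check `claude --help` for the real flags). The role calls are injected as plain callables so the cycle itself can be exercised without the CLI:

```python
import subprocess

def call_claude(system_prompt: str, message: str) -> str:
    """One non-interactive Claude Code turn. The flags here are an
    assumption about the CLI surface; check `claude --help`."""
    result = subprocess.run(
        ["claude", "-p", message, "--append-system-prompt", system_prompt],
        capture_output=True, text=True, timeout=600,
    )
    return result.stdout

def run_loop(goal, planner, implementer, build_packet, max_iterations=5):
    """Alternate Implementer and Planner turns until the Planner emits the
    GOAL COMPLETE sentinel or the iteration cap is hit. Returns the number
    of iterations used."""
    directive = goal
    for i in range(1, max_iterations + 1):
        report = implementer(directive)           # engineer produces a diff
        packet = build_packet()                   # file tree, diff, tests
        verdict = planner(f"{packet}\n\nImplementer report:\n{report}")
        if "GOAL COMPLETE" in verdict:            # Planner declares victory
            return i
        directive = verdict                       # verdict is the next directive
    return max_iterations
```

In production the two callables would be `call_claude` partially applied with each role's system prompt; keeping them injectable makes the loop trivially testable.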
Context Engineering
Context management turned out to be the single most important engineering decision. Bigger context windows do not automatically mean better results — the "lost in the middle" effect (Liu et al., Stanford, TACL 2024) shows models drop 30%+ accuracy on content in the middle of long contexts, even with 1M-token windows.
The context packet sent to the Planner before each turn includes:
- File tree — a `find`-style listing of the repository, excluding `node_modules`, `.git`, and build artifacts.
- Git log — the last 15 commits on the current branch, for project memory.
- README — verbatim, so the Planner understands what the project is for.
- Previous diff — the unified diff the Implementer just produced.
- Test output — pass/fail counts and failure traces, captured from running the project's test command after the previous Implementer turn.
What you exclude matters as much as what you include. The packet deliberately omits full file contents, dependency trees, and generated documentation. Recent research on AGENTS.md-style context files (early 2026) found that LLM-generated context actually decreased agent success rates and increased inference costs by over 20%. The shallow packet keeps the Planner focused on the delta — what changed and what to do next — rather than drowning it in static detail.
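Assembling the packet is a handful of subprocess calls. A sketch with illustrative commands (`git ls-files` stands in for the find-style listing, since it skips untracked build artifacts for free):

```python
import subprocess

def sh(cmd: list[str], cwd: str) -> str:
    """Run a command, return stdout; empty string on any failure."""
    try:
        return subprocess.run(cmd, cwd=cwd, capture_output=True,
                              text=True, timeout=120).stdout
    except (OSError, subprocess.TimeoutExpired):
        return ""

def build_context_packet(repo: str, last_diff: str, test_output: str) -> str:
    """Assemble the shallow Planner packet: tree, log, README, diff, tests.
    Commands and section ordering are illustrative."""
    tree = sh(["git", "ls-files"], repo)   # tracked files only: no
                                           # node_modules, .git, or build output
    log = sh(["git", "log", "--oneline", "-15"], repo)
    try:
        with open(f"{repo}/README.md") as f:
            readme = f.read()
    except OSError:
        readme = "(no README)"
    return "\n\n".join([
        f"## File tree\n{tree}",
        f"## Recent commits\n{log}",
        f"## README\n{readme}",
        f"## Previous diff\n{last_diff}",
        f"## Test output\n{test_output}",
    ])
```

Note what is absent: no full file contents, no dependency tree, no generated docs. The packet stays small by construction.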
The test output is the part that took me the longest to get right, and it's also the part I'm most certain about. AlphaCodium (Ridnik et al., 2024) reported an unexpected negative result: "injecting the last git patch diff to the prompt: no improvement was seen." Raw diffs alone are not a useful feedback signal — the Planner has no way to know whether the diff was actually correct. Diffs paired with test results are a different story. The diff says what changed; the test results say whether it worked.
I still want to do better here. Aider's repository map system uses tree-sitter to parse 26+ languages into ASTs and applies personalized PageRank to identify the most relevant symbols relative to the current edit. The result is a scope-aware elided code view that fits in a small token budget. My orchestrator doesn't do this yet — it just sends the full file tree. PageRank-based context selection is the highest-value thing I'd add next.
A counterintuitive finding from JetBrains Research (NeurIPS 2025 DL4C Workshop) is worth flagging: observation masking — hiding verbose tool outputs from older turns while preserving the action and reasoning history — matched or beat LLM summarization in 4 of 5 settings, while being simpler and cheaper. Both approaches cut token costs by 50%+ compared to unmanaged context. The lesson: don't summarize what you can hide.
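Observation masking is a few lines of list surgery. A sketch, assuming a simple role/content message shape for the conversation history:

```python
def mask_observations(history: list[dict], keep_last: int = 2) -> list[dict]:
    """Observation masking in the spirit of the JetBrains result: keep every
    action/reasoning message, but replace verbose tool outputs from older
    turns with a short placeholder. The message shape is an assumption."""
    masked = []
    cutoff = len(history) - keep_last       # messages before this get masked
    for i, msg in enumerate(history):
        if msg["role"] == "tool" and i < cutoff:
            masked.append({
                "role": "tool",
                "content": f"[output elided: {len(msg['content'])} chars]",
            })
        else:
            masked.append(msg)              # actions and recent output survive
    return masked
```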
The Feedback Loop
The loop is where the system earns its keep. After the Implementer reports back, the orchestrator runs git diff and the project's test command, then pipes both into the next Planner round.
That's when the interesting things start happening. The Planner notices that the Implementer added a route but forgot to register it in the router. It notices that a new function was added but no tests, and that the existing tests still pass — meaning the change isn't actually exercised. It notices that an edit touched an unrelated file by accident. It writes the next directive with all of that in hand: "good work on the middleware — now wire it into src/app.ts and add a test for the invalid-token case."
This matches the broader research consensus that external feedback signals are essential for iterative improvement to work. Self-Refine (Madaan et al., NeurIPS 2023) showed ~20% average improvement from generate-feedback-refine loops. Self-Debugging (Chen et al., ICLR 2024) gained up to +12% accuracy with execution feedback. Reflexion (Shinn et al., NeurIPS 2023) pushed HumanEval to 91% pass@1 by adding episodic memory of past failures.
But the same literature contains a sharp warning: intrinsic self-correction without external feedback degrades performance. "Large Language Models Cannot Self-Correct Reasoning Yet" (Huang et al., ICLR 2024) showed LLMs sometimes change correct answers to wrong ones when asked to self-evaluate without grounding. A TACL 2024 survey (Kamoi et al.) concluded that "no prior work demonstrates successful self-correction with feedback from prompted LLMs" — only with grounded signals like code execution.
The implication for any Planner/Implementer architecture is direct: the feedback loop is only as good as the external signals you wire into it. Diffs without tests are just opinion. Tests without diffs are missing the cause. You need both.
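Capturing both signals is deliberately dumb. A sketch; the explicit PASS/FAIL prefix is my own convention, so the Planner never has to infer the exit status from the output text:

```python
import subprocess

def capture_feedback(repo: str, test_cmd: list[str]) -> tuple[str, str]:
    """Collect the two grounded signals after an Implementer turn: the diff
    (what changed) and the test run (whether it worked). `test_cmd` is
    whatever the project uses, e.g. ["pytest", "-q"] or ["npm", "test"]."""
    try:
        diff = subprocess.run(["git", "diff", "HEAD"], cwd=repo,
                              capture_output=True, text=True).stdout
    except OSError:
        diff = ""
    tests = subprocess.run(test_cmd, cwd=repo, capture_output=True, text=True)
    # Make pass/fail explicit instead of letting the Planner guess from prose.
    verdict = "PASS" if tests.returncode == 0 else "FAIL"
    return diff, f"[{verdict}]\n{tests.stdout}\n{tests.stderr}"
```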
Running It
A full run looks like this:
python3 claude_loop.py ./my-app "Add JWT auth with refresh tokens" \
--effort max \
--planner-model opus \
--impl-model opus
The key options:
- `--effort [low | medium | max]` — how aggressively each Claude instance should think before responding. At `max`, the Planner is notably more careful about edge cases.
- `--planner-model` and `--impl-model` — pick the model for each role independently. Opus for both when the goal is tricky; Sonnet for both when speed matters; mixed when the plan is hard but the code is mechanical.
- `--max-iterations` — safety cap so a confused loop doesn't run all night.
- `--timeout` — per-turn timeout so a stuck subprocess doesn't hang the loop.
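The flag surface maps directly onto `argparse`. A sketch mirroring the options above; defaults other than `--max-iterations` (which the text says is 5) are assumptions:

```python
import argparse

def parse_args(argv=None):
    """CLI surface for the loop. Defaults besides --max-iterations are
    illustrative assumptions, not the repository's actual values."""
    p = argparse.ArgumentParser(prog="claude_loop.py")
    p.add_argument("repo", help="path to the target repository")
    p.add_argument("goal", help="natural-language goal for the run")
    p.add_argument("--effort", choices=["low", "medium", "max"],
                   default="medium", help="how hard each instance thinks")
    p.add_argument("--planner-model", default="opus")
    p.add_argument("--impl-model", default="opus")
    p.add_argument("--max-iterations", type=int, default=5,
                   help="safety cap on loop rounds")
    p.add_argument("--timeout", type=int, default=600,
                   help="per-turn timeout in seconds")
    return p.parse_args(argv)
```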
Knowing When to Stop
A loop that doesn't know when to stop is worse than no loop at all.
The empirical evidence on iteration is consistent: most improvement happens in the first 1–2 iterations, with rapidly diminishing returns. Self-Refine's own data shows the bulk of gain in iterations 0→2. Self-Debugging plateaus at 2–3 turns. AlphaCodium captures the largest gains on the first test-fixing pass.
There are also failure modes that get worse with more iterations:
- Self-bias amplification — LLMs systematically overrate their own generations, and the bias compounds monotonically across refinement rounds (Xu et al., 2024).
- Self-conditioning — per-step accuracy degrades as the agent sees more of its own prior outputs.
- Oscillation — agents can ping-pong between two states without converging. Claude Code's own production incident report (GitHub Issue #15909) describes a sub-agent that consumed 27M tokens in an infinite loop over 4.6 hours.
Termination in my orchestrator combines four signals, in order of priority:
- The Planner declares victory. After reviewing the most recent diff and test output, if the Planner concludes the goal is achieved, it emits a `GOAL COMPLETE` sentinel and the orchestrator exits cleanly.
- External validation gates pass. If all the project's tests pass, the linter is clean, and the diff matches the goal, that's strong evidence to stop.
- No-change detection. If two consecutive Implementer turns produce no meaningful diff, the loop has stalled and there's nothing more to learn from continuing.
- Iteration cap. Set with `--max-iterations`, default 5. The escape hatch for runs stuck in a local minimum.
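The priority ordering collapses into a single check. A sketch: the `tests_pass` and `lint_clean` inputs stand in for the validation gates, and the fuzzy "diff matches the goal" judgment is omitted here:

```python
from typing import List, Optional

def should_stop(plan: str, tests_pass: bool, lint_clean: bool,
                last_two_diffs: List[str], iteration: int,
                max_iterations: int = 5) -> Optional[str]:
    """Evaluate the four termination signals in priority order. Returns a
    reason string when the loop should stop, else None."""
    if "GOAL COMPLETE" in plan:                 # 1. Planner declares victory
        return "planner declared victory"
    if tests_pass and lint_clean:               # 2. external gates pass
        return "validation gates passed"
    if len(last_two_diffs) == 2 and all(       # 3. two consecutive no-op turns
            not d.strip() for d in last_two_diffs):
        return "no-change: loop has stalled"
    if iteration >= max_iterations:             # 4. the escape hatch
        return "iteration cap reached"
    return None
```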
In practice, the first two mechanisms handle roughly 80% of completed runs. The cap catches the pathological cases.
The other 20% usually fails in a more subtle way: the Planner emits GOAL COMPLETE because the tests pass, but the feature is quietly incomplete — an edge case unhandled, an error path uncovered, a refactor that left dead code behind. A more principled termination check might dispatch a verification prompt to a third Claude instance, asking it to grade the final state against the original goal — though that adds the kind of complexity I deliberately tried to avoid.
Recent theoretical work on Markovian Generation Chains formalizes the intuition behind early termination — under greedy decoding, iterative LLM output rapidly converges to fixed points or short cycles. That's both an explanation for why three iterations is usually enough and a warning about why long runs don't help.
What You Can Build
I've used the loop on a range of tasks. Some it handles cleanly, others reveal the seams.
Typical runs by the numbers. A well-scoped bug fix completes in 2–3 iterations and costs roughly $0.50–$2.00 in API usage. A full feature like adding JWT auth runs 4–6 iterations at $3–$8. Test coverage campaigns are the most iteration-heavy — sometimes 8–10 rounds — but each round is cheap because the Implementer is mostly writing straightforward test code. Token usage scales roughly linearly with iteration count: about 15K–25K tokens per round for the Planner and 20K–40K for the Implementer.
Works well
- Well-scoped bug fixes where the failing test or error trace points at the right place.
- Full features that are mostly mechanical — adding routes, models, and a minimal test pass.
- Refactors where the goal can be stated precisely ("inline this helper, then delete it").
- Test coverage campaigns — "write tests until this file is at 80%."
Works poorly
- Open-ended architectural decisions. The Planner will commit to a direction and defend it; stepping back to reconsider is hard.
- Cross-repo or cross-service work. The loop only sees one repository.
- UI work that needs human judgment about visual quality.
- Greenfield feature development in unfamiliar codebases. FeatureBench found that even the strongest current frontier models score around 11% on feature development versus 74% on bug-fixing. Building new things is much harder than fixing existing ones, and the loop doesn't close that gap on its own.
What I'd Build Next
The architecture works. The interesting questions now are about the things around it.
- Repository-aware context selection. Borrow Aider's PageRank approach. Today my context is a flat file listing; the next version should weight files by relevance to the current goal.
- Tree-search rather than linear refinement. LATS (Zhou et al., ICML 2024) achieved 94.4% pass@1 on HumanEval by exploring multiple refinement paths via Monte Carlo Tree Search instead of iterating on a single one. For hard problems, branching beats iterating.
- RL-trained iteration. RLEF (Gehring et al., ICML 2025) showed that an RL-trained 8B model outperformed AlphaCodium's 100-sample approach with only 3 LLM responses. The implication: the loop's value depends on whether the model has been trained to use feedback well, not just on whether the loop exists.
- Sub-agent isolation for exploration. Claude Code itself uses sub-agents to delegate heavy file exploration to isolated context windows — 30K tokens of exploration returns as a 2K summary, a 93% context savings. My orchestrator should do the same for the Implementer's read phase.
Get Started
The full source is on GitHub: github.com/josephgec/claude-agent.
There's no installation beyond cloning the repo and having claude in your PATH. The orchestrator is a single Python file. Fork it, modify the system prompts, tune it to your taste — the architecture is simple enough that you can reshape it without fighting a framework.
If you try it on something interesting, I'd love to hear what worked and what broke. The most valuable feedback so far has been about the termination criterion — the Planner's GOAL COMPLETE call is fuzzy, and there's a more principled version hiding somewhere in the validation-gate plus no-change-detection combination.
References
A reference list of the research, blog posts, and open-source projects this article draws on. Most arXiv links open directly to the paper.
Multi-agent coding systems
- Hong et al. MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. ICLR 2024. arXiv:2308.00352
- Qian et al. ChatDev: Communicative Agents for Software Development. ACL 2024. arXiv:2307.07924
- Wu et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. 2023. arXiv:2308.08155
- Fourney et al. Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks. Microsoft Research, November 2024. microsoft.com
- Yang et al. SWE-Agent: Agent-Computer Interfaces Enable Automated Software Engineering. NeurIPS 2024. arXiv:2405.15793
- Aider — Architect/Editor mode. aider.chat
Iterative self-refinement
- Madaan et al. Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023. arXiv:2303.17651
- Chen et al. Teaching Large Language Models to Self-Debug. ICLR 2024. arXiv:2304.05128
- Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366
- Zhou et al. Language Agent Tree Search. ICML 2024. arXiv:2310.04406
- Ridnik et al. Code Generation with AlphaCodium. 2024. arXiv:2401.08500
- Huang et al. Large Language Models Cannot Self-Correct Reasoning Yet. ICLR 2024. arXiv:2310.01798
- Olausson et al. Is Self-Repair a Silver Bullet for Code Generation? ICLR 2024. arXiv:2306.09896
- Kamoi et al. When Can LLMs Actually Correct Their Own Mistakes? TACL 2024. arXiv:2406.01297
- Gehring et al. RLEF: Grounding Code LLMs in Execution Feedback with RL. ICML 2025. arXiv:2410.02089
Planning and execution
- Wang et al. Plan-and-Solve Prompting. ACL 2023. arXiv:2305.04091
- Prasad et al. ADaPT: As-Needed Decomposition and Planning with Language Models. NAACL 2024 Findings. arXiv:2311.05772
- Xu et al. ReWOO: Decoupling Reasoning from Observations. 2023. arXiv:2305.18323
- Kim et al. LLMCompiler: An LLM Compiler for Parallel Function Calling. ICML 2024. arXiv:2312.04511
- Chen et al. GoalAct: Planning Decomposition Improves Coding Agents. 2025. arXiv:2504.16563
- Plan-and-Act. Google DeepMind, 2025. arXiv:2503.09572
- Schluntz & Zhang. Building Effective Agents. Anthropic, December 2024. anthropic.com
Context and termination
- Liu et al. Lost in the Middle: How Language Models Use Long Contexts. TACL 2024. arXiv:2307.03172
- Fraser & Lindenbauer. Context Compression for Coding Agents. JetBrains Research, NeurIPS 2025 DL4C Workshop. arXiv:2508.21433
- Cemri et al. A Taxonomy of Multi-Agent Failures. 2025. arXiv:2503.13657
Benchmarks & tools
- Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024. arXiv:2310.06770
- SWE-bench Pro. Scale AI, 2025. arXiv:2509.16941
- Jain et al. LiveCodeBench. 2024. arXiv:2403.07974
- Anthropic. Claude Code. anthropic.com/claude-code
- Claude Agent SDK. docs.claude.com
- Project source: github.com/josephgec/claude-agent