jthomas.site// notebook · v.4.2026
back to writing

When Swarms Write Code — How Particle Swarm Optimization Escapes the Local Minima Trap in ARC-AGI

Standard LLM agents get stuck in repetitive loops on novel reasoning puzzles — structurally equivalent to converging on a local minimum. A swarm of specialized LLM particles governed by Particle Swarm Optimization, with a continuous fitness function that rewards near misses, searches the space of possible programs systematically instead of spinning in place.

TL;DR — Standard LLM agents, when faced with novel reasoning puzzles, tend to get stuck in repetitive generation cycles — a phenomenon that's structurally equivalent to converging on a local minimum in optimization. A new open-source project attacks this problem by replacing the single-agent loop with a swarm of specialized LLM-powered particles governed by Particle Swarm Optimization. The result is a system that searches the space of possible programs systematically rather than randomly, and rewards "near misses" rather than demanding perfection at every step.

Why ARC-AGI Breaks Chain-of-Thought

François Chollet's Abstraction and Reasoning Corpus (ARC-AGI) remains one of the most unforgiving benchmarks in AI. Each task is a handful of small coloured grids organised into input–output pairs. Together the pairs demonstrate some visual transformation rule — rotate, recolor, tile, fill, extract a sub-shape — and the solver has to infer that rule from as few as two or three examples, then apply it to an unseen test input. There are no training distributions to memorize, no statistical shortcuts. The benchmark is a direct test of fluid intelligence: the ability to solve problems you have never seen before.

ANATOMY OF AN ARC-AGI PUZZLE A training pair reveals the rule. A held-out test input must be solved. train · input → output input output test · solve me ? test input predict this INFERRED RULE recolour every cyan cell to oxblood, leave shape and position untouched.
An ARC-AGI puzzle. The solver sees a few input–output examples, must infer the transformation rule, and then apply it to an unseen test input. Grids are tiny (often 3×3 to 30×30) and use a 10-colour palette.

Large Language Models, despite their remarkable breadth, struggle here. The standard approach — wrap a model in a Chain-of-Thought agent loop, generate a candidate program, test it, feed the error back, and try again — works for well-structured tasks but collapses on ARC. The failure mode is predictable: the agent proposes a hypothesis, discovers it's wrong, tweaks it slightly, discovers it's still wrong, tweaks it again along the same axis, and spirals into a repetitive generation cycle. Anyone who has watched an LLM debug its own code recognizes the pattern. In optimization terms, the agent has converged on a local minimum — a point in solution space surrounded by worse-looking neighbours, so every small step looks like a regression even though a much better solution lives just over the next ridge — and lacks the mechanism to escape.

BEFORE THIS Earlier ARC attempts pushed on the single-agent recipe in three ways. Self-Refine-style reflection loops (Madaan et al., 2023) had one model critique its own code and rewrite it; useful for shallow bugs but blind to its own reasoning ruts. Brute sampling at scale — Greenblatt's GPT-4o + ~8k samples per task approach — set a 50% SoTA, but at thousands of dollars per puzzle and effectively no learning signal between samples. Tree search variants (MCTS over edits, AlphaCode-style program enumeration) explored more aggressively, but the binary pass/fail evaluation gave the search no gradient: a program that gets 90% of pixels right and one that gets 5% right look identical. All three failures share a shape: one trajectory at a time, no continuous progress signal.

The ARC-AGI PSO Swarm Solver is built on a single insight: if the problem is convergence to local minima, the solution is an optimizer specifically designed to avoid them. Particle Swarm Optimization is exactly that optimizer.

Swarm vs. Single Agent

Why PSO?

Particle Swarm Optimization (PSO), introduced by Kennedy and Eberhart in 1995, simulates a flock of birds searching for food in an unfamiliar landscape. The vocabulary is worth pinning down once, because everything that follows uses it:

At every iteration, each particle is pulled in two directions simultaneously: toward its own pbest (its memory) and toward the swarm's gbest (the social signal). The inertia term keeps it from oscillating wildly. That dual pull is the entire reason PSO escapes local minima: even if one particle has settled into a flat region, the social tug from a better-performing neighbour can yank it out — the way a bird that smells food calls the rest of the flock over.

Compare this to a single-agent loop. Single agent stuck: one trajectory, no peers, no memory of where else it has been. Each "tweak" is a small step from the current wrong answer; if the local error landscape says every small step is worse, the agent rephrases the same idea and loops. Swarm escapes: six trajectories starting in different places, each remembering its own best, all listening to the swarm's best. When one particle finds a better basin, the social pull gradually bends every other particle toward it without forcing instant collapse — diversity is preserved long enough to keep exploring.

The update equations, as implemented in pso_orchestrator.py, are two lines:

# one step per iteration, per particle i
v_i      =  w · v_i  +  c1 · r1 · (pbest_i - x_i)  +  c2 · r2 · (gbest - x_i)
                ↑              ↑                          ↑
                inertia        cognitive (pull to        social (pull to
                (keep going)   own best memory)          swarm's best)

target_i =  normalize(x_i + v_i)
                
                project onto the 768-d unit sphere — that's where embeddings live

# w  = 0.5         inertia weight
# c1 = c2 = 1.5    cognitive and social coefficients (equal pull)
# r1, r2 ~ U[0,1]  fresh random scalars each step → stochastic jitter

Read in plain English, line one says: "my next direction is some of where I was already going (w · v_i), plus a noisy nudge back toward my own best memory (c1·r1·(pbest_i − x_i)), plus a noisy nudge toward the swarm's best (c2·r2·(gbest − x_i))." Line two adds that direction to the current position and renormalizes — every particle position is a unit vector on the 768-d embedding sphere. The output is a single target point: a direction the particle should now try to occupy.

ONE PSO ITERATION · PER PARTICLE 1 · run execute program on training pairs in sandbox 2 · score continuous fitness dim · color · pixel ∈ [0, 1] 3 · update v PSO equation w·v + c1·(pbest−x) + c2·(gbest−x) 4 · move LLM rewrites code project nearest candidate 5 report pbest gbest repeat until fitness = 1.0 or budget exhausted All six particles run steps 1–5 in parallel each iteration. Step 5 updates the shared gbest that step 3 reads next iteration.
One PSO iteration, broken out per particle. The expensive parts are step 1 (sandboxed code execution) and step 4 (LLM generation). Steps 2 and 3 are cheap arithmetic on embedding vectors.

The Inverse Mapping Problem

There's a catch. PSO operates in continuous space — real-valued vectors. But the "solutions" in ARC are programs: discrete sequences of Python code. You can easily embed code into a continuous vector (that's what embedding models do), but you cannot decode an arbitrary floating-point vector back into valid Python. The mapping is one-directional. This is the Inverse Mapping Problem, and it's the central technical obstacle the project had to solve.

The Generate-and-Project Bridge

The solution is elegant. Rather than trying to invert the embedding, the system uses the LLM as a generative projector:

  1. PSO computes a target vector in continuous embedding space, representing the direction a particle should explore.
  2. The LLM generates K candidate programs (default K=5), each blending logic from the particle's personal best and the swarm's global best.
  3. Each candidate is embedded into the same 768-dimensional space using nomic-embed-text.
  4. The candidate closest to the PSO target (by cosine distance) is selected.

This is the Generate-and-Project bridge — the mechanism that decouples search direction (computed mathematically by PSO) from solution generation (performed semantically by the LLM). The swarm handles exploration strategy; the LLM handles the syntax.

Continuous Embedding Space 768-d · real-valued vectors · cosine geometry gbest PSO target GENERATE-AND-PROJECT BRIDGE 1 PSO target direction in 768-d 2 LLM generates K=5 candidate programs 3 Embed each back into 768-d 4 Pick closest by cosine distance Discrete Code Space Python programs · no valid interpolation def transform(g): g = rotate(g, 90) return recolor( g, 3, 7) candidate 1 def transform(g): o = find_objects( g) return scale(o, 2) candidate 2 def transform(g): g = flip(g, "h") return tile(g, rows=2, cols=2) selected ★ def transform(g): m = mask(g, 0) return flood_fill( g, m, 5) candidate 4 def transform(g): b = bounding_box( g) return crop(g, b) candidate 5
The Generate-and-Project bridge. PSO picks a direction in continuous space; the LLM generates K candidate programs; each candidate is re-embedded; the one closest to the PSO target is selected and becomes the particle's new position.

Specialized Roles: Diversity by Design

A well-known failure mode of PSO is premature convergence — all particles collapsing to the same region before the space has been adequately explored. The standard countermeasure is population diversity, and this project enforces it through specialist roles.

Each of the six particles in the swarm is initialized with a distinct cognitive bias:

RoleFocus Area
Geometric SpecialistRotation, flipping, translation, scaling
Color SpecialistRecoloring, masking, palette operations
Pattern AnalystPeriodic tiling, symmetry detection
Object TrackerConnected-component analysis, bounding boxes
Rule AbstractorMinimal abstract rules, clean generalization
Hybrid SolverHolistic reasoning across all strategies

These roles aren't just labels. Each particle's initialization prompt is tailored to its specialty, meaning the swarm's starting population covers the major reasoning strategies observed across ARC tasks. A geometric specialist will propose rotations and reflections; a color specialist will explore palette remappings. Even if one approach is a dead end, the others keep searching.

This is a departure from typical multi-agent LLM systems, where agents are often differentiated by tone or persona. Here, the differentiation is functional — it directly shapes the region of program space each particle explores during its first iterations.

Advanced Mechanics

Hypothesis-Guided Initialization

Before the PSO loop begins, a Hypothesizer agent (defined in roles.py) examines the training examples and generates three competing natural-language hypotheses about what the transformation rule might be. This isn't idle speculation — it gives the swarm a warm start. The initial code generated for each particle is informed by these hypotheses, meaning the swarm begins its search in a neighborhood that's at least plausibly correct, rather than starting from random programs.

The Multi-Agent Fallback

The PSO solver doesn't exist in isolation. The repository also implements a full multi-agent pipeline — Hypothesizer, Coder, Critic, Decomposer, and Verifier — organized as a state machine in multi_agent.py. The Critic reads spatial diffs (e.g., "object shifted 2 rows down," "bottom-right region: 3 wrong cells, expected blue, got red") and routes the conversation either back to the Coder for an implementation fix, or back to the Hypothesizer for a fundamentally new approach. A Decomposer fires when stagnation is detected (two or more consecutive non-improving cycles), breaking the task into sub-goals to restart the search.

This multi-agent loop is the backbone of the non-PSO strategies, and it also feeds an ensemble pipeline where multiple independent runs are pooled and resolved via pixel-level weighted majority voting.

Crossover and Stagnation Recovery

Within the PSO loop itself, the PSOCoder role (the LLM prompt that generates candidate mutations) is explicitly instructed to blend the logic of pbest_code and gbest_code. This is conceptually identical to crossover in genetic algorithms — combining successful sub-programs from two parents to produce offspring that inherit the strengths of both. If a particle's personal best knows how to get the colors right, and the global best knows how to get the shape right, the LLM is prompted to merge those two insights into a single program.

When the entire swarm stagnates, the velocity update equation naturally increases the social pull toward gbest, causing particles to converge — but because each new iteration generates fresh candidates from the LLM, even converged particles can escape if the LLM introduces a novel variation. The stochastic jitter from r1 and r2 at every step prevents the swarm from locking permanently.

The Continuous Fitness Function: Rewarding Near Misses

Perhaps the most consequential design choice in the entire system is the decision to replace ARC's native binary evaluation (pass/fail) with a continuous fitness function. PSO needs a gradient — not in the calculus sense, but in the sense that it needs to know whether a particle is getting warmer. A binary signal provides no information about how close a wrong answer actually was.

A fitness function, in optimization vocabulary, is just a number you can compute on any candidate that says "how good is this." Bigger is better. The whole job of the swarm is to climb the fitness function. The continuous fitness function used here, implemented in evaluate.py, decomposes correctness into three weighted components:

fitness =  0.20 · dim_score      # is the output the right shape?
          + 0.30 · color_score    # are the right colours present?
          + 0.50 · pixel_score    # are individual cells correct?
          - ast_complexity_penalty   # up to −0.15 for hardcoded lookup tables

Dimension score (20%) measures whether the output grid has the right shape. If the predicted grid is 5×3 but the target is 5×5, this component penalizes proportionally rather than returning zero.

Color score (30%) uses the Jaccard index over color palettes. A program that uses the correct set of colors — even if they're in the wrong positions — gets partial credit. This is a strong signal: if you have the right palette, you're probably on the right track.

Pixel score (50%) counts cell-by-cell matches. This is the dominant term, but crucially, it's not all-or-nothing. A program that gets 80% of cells correct scores 0.4 on this component alone, which is enough to make it a competitive pbest and potentially the new gbest.

An AST complexity penalty is also applied, deducting up to 0.15 from the raw fitness for programs that show signs of memorization — excessive if statements, large hardcoded literals, and comparison chains. This keeps the swarm honest: it must find generalizable rules, not lookup tables.

The result is a fitness landscape with smooth gradients that PSO can actually navigate. A typical progression looks like:

iteration  fitness   what happened
0          0.12      random code · wrong shape · wrong colors
1          0.31      right colors emerging
2          0.54      correct shape · ~50% pixel accuracy
3          0.83      logic mostly right · edge cases wrong
4          1.00      solved

Each step is meaningful. Each step gives the swarm a direction.

1.0 0.75 0.5 0.25 0.0 fitness program space (1-d projection) local max single-agent trap gbest · global optimum single agent · climbs once, oscillates pbest₁ pbest₂ pbest₃ particle (current x) pbest · particle's own best social pull → gbest Single-agent climbs the nearest hill and oscillates. The swarm holds six trajectories, each remembering its pbest, all gently pulled toward gbest.
A 1-D cross-section of the fitness landscape. A single-agent search climbs to the nearest local maximum and oscillates around it. The swarm spreads particles across the landscape; each remembers its personal best, all are pulled toward the swarm-wide global best, and at least one particle finds the true peak.

The Tech Stack

Python NumPy Ollama nomic-embed-text Anthropic (optional)

The project is deliberately lean. Python is the implementation language, with NumPy for grid operations and the DSL primitives. Ollama provides local LLM inference with tiered model selection — the start_ollama.sh script auto-configures for available VRAM, from an 8 GB deepseek-r1:8b + qwen2.5-coder:7b setup up to a 32 GB deepseek-r1:32b configuration. Embeddings are always handled by nomic-embed-text, a 768-dimensional open-weight model running locally with no API costs.

The custom ARC DSL (dsl.py) provides a set of pure transformation primitives — crop, rotate, flip, translate, scale, tile, recolor, mask, overlay, flood_fill, find_objects, bounding_box, crop_to_content — that are the only operations permitted inside generated transform() functions. These are injected into a hardened sandbox (sandbox.py) that executes untrusted code in a separate child process with a 10-second wall-clock timeout and no network or filesystem access.

An optional Anthropic backend is also supported — the same swarm architecture can use Claude models for generation while keeping embeddings local through Ollama. The test suite covers 496 tests at 94% code coverage, with all LLM calls mocked for fully offline execution.

Closing Thoughts: What the Hybrid Tells Us

This project is small in scale — six particles, a handful of local models, 400 training tasks — but the architecture it demonstrates has implications that extend well beyond ARC.

The failure of single-agent loops is structural, not parametric. You can't fix a local-minimum problem by making the model larger or the prompt longer. You fix it by changing the search algorithm. The ARC-AGI PSO Swarm Solver makes this argument concretely: the same LLM that gets stuck in a single-agent loop becomes dramatically more effective when its outputs are governed by a population-based optimizer.

The Generate-and-Project bridge is a general pattern. Any problem where the solution space is discrete but evaluation is continuous — code synthesis, molecular design, circuit layout, drug discovery — could benefit from this architecture. The LLM handles the generative step (producing valid candidates in the discrete space), and the optimizer handles the search step (deciding which direction to explore next in continuous space). Neither could do the other's job.

Continuous fitness functions unlock optimization for symbolic domains. The decision to reward near misses rather than demanding perfection is what makes PSO viable here. Binary evaluation gives one bit of information per candidate. Continuous evaluation gives a real-valued gradient. That difference is the difference between random search and directed search.

Diversity is not a luxury; it is a mechanism. The specialist roles are not a gimmick — they are the primary defense against premature convergence. This echoes a deep insight from evolutionary computation: maintaining population diversity isn't about fairness or coverage for its own sake, but about preserving the swarm's ability to escape local optima.

Whether this particular architecture will scale to the hardest tiers of ARC-AGI remains an open question. But the intellectual contribution is clear: the best use of an LLM in a reasoning system may not be as the reasoner, but as the mutation engine in a larger optimization loop. The swarm provides the strategy. The LLM provides the syntax. Together, they search a space that neither could navigate alone.

2-D PROJECTION OF 768-d EMBEDDING SPACE gbest geom color pattern object rule hybrid SPECIALISTS Geometric Color Pattern Object Rule Hybrid gbest velocity
Snapshot of the swarm in a 2-D projection. Each specialist starts in its own region of the search space and is pulled toward the swarm's best-so-far, but its starting bias keeps exploration diverse.

References

The research and tools this project builds on, for readers who want to go deeper.

Swarm intelligence and PSO

ARC-AGI and reasoning benchmarks

LLM code generation and failure modes

Embeddings and tools

Project source