When Swarms Write Code — How Particle Swarm Optimization Escapes the Local Minima Trap in ARC-AGI

TL;DR — Standard LLM agents, when faced with novel reasoning puzzles, tend to get stuck in repetitive generation cycles — a phenomenon that's structurally equivalent to converging on a local minimum in optimization. A new open-source project attacks this problem by replacing the single-agent loop with a swarm of specialized LLM-powered particles governed by Particle Swarm Optimization. The result is a system that searches the space of possible programs systematically rather than randomly, and rewards "near misses" rather than demanding perfection at every step.

Why ARC-AGI Breaks Chain-of-Thought

François Chollet's Abstraction and Reasoning Corpus remains one of the most unforgiving benchmarks in AI. Each task consists of a handful of input–output grid pairs that demonstrate some visual transformation rule — rotate, recolor, tile, fill, extract — and the solver must infer that rule from as few as two or three examples, then apply it to an unseen test input. There are no training distributions to memorize, no statistical shortcuts. The benchmark is a direct test of fluid intelligence: the ability to solve problems you have never seen before.

Large Language Models, despite their remarkable breadth, struggle here. The standard approach — wrap a model in a Chain-of-Thought agent loop, generate a candidate program, test it, feed the error back, and try again — works for well-structured tasks but collapses on ARC. The failure mode is predictable: the agent proposes a hypothesis, discovers it's wrong, tweaks it slightly, discovers it's still wrong, tweaks it again along the same axis, and spirals into a repetitive generation cycle. Anyone who has watched an LLM debug its own code recognizes the pattern. In optimization terms, the agent has converged on a local minimum — a region of solution space that's locally attractive but globally suboptimal — and lacks the mechanism to escape.

The ARC-AGI PSO Swarm Solver is built on a single insight: if the problem is convergence to local minima, the solution is an optimizer specifically designed to avoid them. Particle Swarm Optimization is exactly that optimizer.

Swarm vs. Single Agent

Why PSO?

PSO, introduced by Kennedy and Eberhart in 1995, simulates a flock of birds searching for food. Each "particle" in the swarm maintains its own position and velocity in a search space. Particles are pulled in two directions simultaneously: toward their own personal best (pbest) — the best solution they have individually discovered — and toward the global best (gbest) — the best solution any particle in the swarm has found. An inertia term keeps particles from oscillating too aggressively. The interplay between personal memory and social information sharing is what gives PSO its ability to escape local minima: even if one particle gets stuck, the social pull from a better-performing neighbor can yank it into a more promising region.

The update equations, as implemented in pso_orchestrator.py, are:

# velocity and target update — one step per iteration, per particle
v_i      =  w · v_i  +  c1 · r1 · (pbest_i - x_i)  +  c2 · r2 · (gbest - x_i)
target_i =  normalize(x_i + v_i)

# w  = 0.5        inertia
# c1 = c2 = 1.5   cognitive and social coefficients
# r1, r2 ~ U[0,1] stochastic jitter

The result is a target point on a 768-dimensional unit sphere — a direction the particle should move toward.
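In NumPy terms, one particle's update might look like the following. This is a sketch, not the project's actual code: the constants mirror the values listed above, but function and variable names are illustrative and pso_orchestrator.py may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

W, C1, C2 = 0.5, 1.5, 1.5  # inertia, cognitive, and social coefficients from above

def normalize(v):
    """Project a vector onto the unit sphere."""
    return v / np.linalg.norm(v)

def pso_step(x, v, pbest, gbest):
    """One velocity + target update for a single particle."""
    r1, r2 = rng.random(), rng.random()  # stochastic jitter ~ U[0,1]
    v_new = W * v + C1 * r1 * (pbest - x) + C2 * r2 * (gbest - x)
    target = normalize(x + v_new)        # a point on the 768-d unit sphere
    return v_new, target

# toy 768-dimensional example
x = normalize(rng.standard_normal(768))
pbest = normalize(rng.standard_normal(768))
gbest = normalize(rng.standard_normal(768))
v, target = pso_step(x, np.zeros(768), pbest, gbest)
```

Note that the output is a direction, not a program: the next section explains how the system turns that direction back into code.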

The Inverse Mapping Problem

There's a catch. PSO operates in continuous space — real-valued vectors. But the "solutions" in ARC are programs: discrete sequences of Python code. You can easily embed code into a continuous vector (that's what embedding models do), but you cannot decode an arbitrary floating-point vector back into valid Python. The mapping is one-directional. This is the Inverse Mapping Problem, and it's the central technical obstacle the project had to solve.

The Generate-and-Project Bridge

The solution is elegant. Rather than trying to invert the embedding, the system uses the LLM as a generative projector:

  1. PSO computes a target vector in continuous embedding space, representing the direction a particle should explore.
  2. The LLM generates K candidate programs (default K=5), each blending logic from the particle's personal best and the swarm's global best.
  3. Each candidate is embedded into the same 768-dimensional space using nomic-embed-text.
  4. The candidate closest to the PSO target (by cosine distance) is selected.

This is the Generate-and-Project bridge — the mechanism that decouples search direction (computed mathematically by PSO) from solution generation (performed semantically by the LLM). The swarm handles exploration strategy; the LLM handles the syntax.
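The projection step reduces to a nearest-neighbor search under cosine distance. A minimal sketch, with a hypothetical embed() callable standing in for the nomic-embed-text model:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def project(target, candidates, embed):
    """Select the candidate program whose embedding lies closest
    to the PSO target direction."""
    distances = [cosine_distance(target, embed(code)) for code in candidates]
    return candidates[int(np.argmin(distances))]
```

The selected candidate becomes the particle's new position, so the swarm's continuous bookkeeping always refers to a real, executable program.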

[Figure: the Generate-and-Project bridge. Left panel: "Continuous Embedding Space" (768-d, real-valued vectors, cosine geometry) containing the gbest and PSO target points. Right panel: "Discrete Code Space" (Python programs, no valid interpolation) containing five candidate transform() programs, one marked selected.]
The Generate-and-Project bridge. PSO picks a direction in continuous space; the LLM generates K candidate programs; each candidate is re-embedded; the one closest to the PSO target is selected and becomes the particle's new position.

Specialized Roles: Diversity by Design

A well-known failure mode of PSO is premature convergence — all particles collapsing to the same region before the space has been adequately explored. The standard countermeasure is population diversity, and this project enforces it through specialist roles.

Each of the six particles in the swarm is initialized with a distinct cognitive bias:

Role                  Focus Area
Geometric Specialist  Rotation, flipping, translation, scaling
Color Specialist      Recoloring, masking, palette operations
Pattern Analyst       Periodic tiling, symmetry detection
Object Tracker        Connected-component analysis, bounding boxes
Rule Abstractor       Minimal abstract rules, clean generalization
Hybrid Solver         Holistic reasoning across all strategies

These roles aren't just labels. Each particle's initialization prompt is tailored to its specialty, meaning the swarm's starting population covers the major reasoning strategies observed across ARC tasks. A geometric specialist will propose rotations and reflections; a color specialist will explore palette remappings. Even if one approach is a dead end, the others keep searching.
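Conceptually, the role-specific initialization could be wired up as below. The prompt text and structure here are hypothetical; the project's actual prompts live in roles.py.

```python
# Hypothetical sketch of role-biased initialization; the real prompt
# wording in the project's roles.py will differ.
ROLE_BIASES = {
    "Geometric Specialist": "Prefer rotation, flipping, translation, and scaling.",
    "Color Specialist": "Prefer recoloring, masking, and palette operations.",
    "Pattern Analyst": "Look for periodic tiling and symmetry.",
    "Object Tracker": "Use connected components and bounding boxes.",
    "Rule Abstractor": "Seek the minimal abstract rule that generalizes.",
    "Hybrid Solver": "Reason holistically across all strategies.",
}

def init_prompt(role, examples):
    """Build a particle's first generation prompt from its specialty."""
    return (
        f"You are a {role}. {ROLE_BIASES[role]}\n"
        f"Infer the transformation rule from these examples:\n{examples}\n"
        "Write a Python transform(grid) using only the DSL primitives."
    )
```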

This is a departure from typical multi-agent LLM systems, where agents are often differentiated by tone or persona. Here, the differentiation is functional — it directly shapes the region of program space each particle explores during its first iterations.

Advanced Mechanics

Hypothesis-Guided Initialization

Before the PSO loop begins, a Hypothesizer agent (defined in roles.py) examines the training examples and generates three competing natural-language hypotheses about what the transformation rule might be. This isn't idle speculation — it gives the swarm a warm start. The initial code generated for each particle is informed by these hypotheses, meaning the swarm begins its search in a neighborhood that's at least plausibly correct, rather than starting from random programs.

The Multi-Agent Fallback

The PSO solver doesn't exist in isolation. The repository also implements a full multi-agent pipeline — Hypothesizer, Coder, Critic, Decomposer, and Verifier — organized as a state machine in multi_agent.py. The Critic reads spatial diffs (e.g., "object shifted 2 rows down," "bottom-right region: 3 wrong cells, expected blue, got red") and routes the conversation either back to the Coder for an implementation fix, or back to the Hypothesizer for a fundamentally new approach. A Decomposer fires when stagnation is detected (two or more consecutive non-improving cycles), breaking the task into sub-goals to restart the search.

This multi-agent loop is the backbone of the non-PSO strategies, and it also feeds an ensemble pipeline where multiple independent runs are pooled and resolved via pixel-level weighted majority voting.
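The ensemble's resolution step can be sketched as pixel-level weighted voting over same-shape candidate grids. This is an illustration of the idea only; the repository's actual weighting and tie-breaking may differ.

```python
import numpy as np
from collections import Counter

def weighted_majority_vote(grids, weights):
    """Resolve several same-shape candidate grids cell by cell, picking
    the color with the highest total weight at each cell."""
    grids = [np.asarray(g) for g in grids]
    h, w = grids[0].shape
    out = np.zeros((h, w), dtype=int)
    for i in range(h):
        for j in range(w):
            tally = Counter()
            for grid, weight in zip(grids, weights):
                tally[int(grid[i, j])] += weight
            out[i, j] = tally.most_common(1)[0][0]
    return out.tolist()
```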

Crossover and Stagnation Recovery

Within the PSO loop itself, the PSOCoder role (the LLM prompt that generates candidate mutations) is explicitly instructed to blend the logic of pbest_code and gbest_code. This is conceptually identical to crossover in genetic algorithms — combining successful sub-programs from two parents to produce offspring that inherit the strengths of both. If a particle's personal best knows how to get the colors right, and the global best knows how to get the shape right, the LLM is prompted to merge those two insights into a single program.

When the entire swarm stagnates, the velocity update equation naturally increases the social pull toward gbest, causing particles to converge — but because each new iteration generates fresh candidates from the LLM, even converged particles can escape if the LLM introduces a novel variation. The stochastic jitter from r1 and r2 at every step prevents the swarm from locking permanently.

The Continuous Fitness Function: Rewarding Near Misses

Perhaps the most consequential design choice in the entire system is the decision to replace ARC's native binary evaluation (pass/fail) with a continuous fitness function. PSO needs a gradient — not in the calculus sense, but in the sense that it needs to know whether a particle is getting warmer. A binary signal provides no information about how close a wrong answer actually was.

The continuous fitness function, implemented in evaluate.py, decomposes correctness into three weighted components:

fitness =  0.20 · dim_score
          + 0.30 · color_score
          + 0.50 · pixel_score

Dimension score (20%) measures whether the output grid has the right shape. If the predicted grid is 5×3 but the target is 5×5, this component penalizes proportionally rather than returning zero.

Color score (30%) uses the Jaccard index over color palettes. A program that uses the correct set of colors — even if they're in the wrong positions — gets partial credit. This is a strong signal: if you have the right palette, you're probably on the right track.

Pixel score (50%) counts cell-by-cell matches. This is the dominant term, but crucially, it's not all-or-nothing. A program that gets 80% of cells correct scores 0.4 on this component alone, which is enough to make it a competitive pbest and potentially the new gbest.
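Putting the three components together, a simplified version of the scoring might read as follows. This is a sketch of the idea behind evaluate.py, without the AST complexity penalty, and the exact partial-credit formulas in the repository may differ.

```python
import numpy as np

def fitness(pred, target):
    """Continuous fitness = 0.2*dim + 0.3*color + 0.5*pixel (simplified)."""
    pred, target = np.asarray(pred), np.asarray(target)
    # dimension score: proportional penalty per axis for a wrong shape
    dim = (min(pred.shape[0], target.shape[0]) / max(pred.shape[0], target.shape[0])
           * min(pred.shape[1], target.shape[1]) / max(pred.shape[1], target.shape[1]))
    # color score: Jaccard index over the two palettes
    p, t = set(pred.ravel().tolist()), set(target.ravel().tolist())
    color = len(p & t) / len(p | t)
    # pixel score: cell matches on the overlapping region, relative to target size
    h = min(pred.shape[0], target.shape[0])
    w = min(pred.shape[1], target.shape[1])
    pixel = float((pred[:h, :w] == target[:h, :w]).sum()) / target.size
    return 0.20 * dim + 0.30 * color + 0.50 * pixel
```

A perfect match scores 1.0; a grid with the right shape and palette but scrambled pixels still earns well above zero, which is exactly the "getting warmer" signal PSO needs.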

An AST complexity penalty is also applied, deducting up to 0.15 from the raw fitness for programs that show signs of memorization — excessive if statements, large hardcoded literals, and comparison chains. This keeps the swarm honest: it must find generalizable rules, not lookup tables.
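The penalty can be sketched with Python's standard ast module. The thresholds and weights below are illustrative, not the repository's exact rules:

```python
import ast

def complexity_penalty(code, cap=0.15):
    """Deduct up to `cap` for memorization signals: many if-statements,
    large hardcoded integer literals, long comparison chains."""
    tree = ast.parse(code)
    n_ifs = sum(isinstance(node, ast.If) for node in ast.walk(tree))
    n_big_literals = sum(
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and abs(node.value) > 9
        for node in ast.walk(tree)
    )
    n_chains = sum(
        isinstance(node, ast.Compare) and len(node.ops) > 1
        for node in ast.walk(tree)
    )
    return min(cap, 0.02 * (n_ifs + n_big_literals + n_chains))
```

A clean one-rule program pays nothing; a ten-branch lookup table hits the cap.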

The result is a fitness landscape with smooth gradients that PSO can actually navigate. A typical progression looks like:

iteration  fitness   what happened
0          0.12      random code · wrong shape · wrong colors
1          0.31      right colors emerging
2          0.54      correct shape · ~50% pixel accuracy
3          0.83      logic mostly right · edge cases wrong
4          1.00      solved

Each step is meaningful. Each step gives the swarm a direction.

[Figure: fitness (0.0 to 1.0) plotted over a 1-D projection of program space, showing a local maximum labeled "single-agent trap" and a higher peak labeled "global optimum"; the single agent gets stuck while the swarm spreads and finds the global peak.]
A 1-D cross-section of the fitness landscape. A single-agent, greedy-ascent search climbs to the nearest local maximum and stops. The swarm spreads across the landscape and some particles eventually find the global optimum.

The Tech Stack

Python NumPy Ollama nomic-embed-text Anthropic (optional)

The project is deliberately lean. Python is the implementation language, with NumPy for grid operations and the DSL primitives. Ollama provides local LLM inference with tiered model selection — the start_ollama.sh script auto-configures for available VRAM, from an 8 GB deepseek-r1:8b + qwen2.5-coder:7b setup up to a 32 GB deepseek-r1:32b configuration. Embeddings are always handled by nomic-embed-text, a 768-dimensional open-weight model running locally with no API costs.

The custom ARC DSL (dsl.py) provides a set of pure transformation primitives — crop, rotate, flip, translate, scale, tile, recolor, mask, overlay, flood_fill, find_objects, bounding_box, crop_to_content — that are the only operations permitted inside generated transform() functions. These are injected into a hardened sandbox (sandbox.py) that executes untrusted code in a separate child process with a 10-second wall-clock timeout and no network or filesystem access.
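The child-process pattern can be sketched with subprocess. This is a simplified illustration of the idea; the actual sandbox.py additionally strips network and filesystem access, which this sketch does not attempt.

```python
import json
import subprocess
import sys

def run_sandboxed(code, grid, timeout=10):
    """Execute an untrusted transform() in a separate child process with a
    wall-clock timeout. Simplified sketch: isolation beyond the process
    boundary and timeout is omitted here."""
    script = (
        code
        + "\nimport json, sys\n"
        + "print(json.dumps(transform(json.loads(sys.argv[1]))))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", script, json.dumps(grid)],
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired on a hang
    )
    return json.loads(result.stdout)
```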

An optional Anthropic backend is also supported — the same swarm architecture can use Claude models for generation while keeping embeddings local through Ollama. The test suite covers 496 tests at 94% code coverage, with all LLM calls mocked for fully offline execution.

Closing Thoughts: What the Hybrid Tells Us

This project is small in scale — six particles, a handful of local models, 400 training tasks — but the architecture it demonstrates has implications that extend well beyond ARC.

The failure of single-agent loops is structural, not parametric. You can't fix a local-minimum problem by making the model larger or the prompt longer. You fix it by changing the search algorithm. The ARC-AGI PSO Swarm Solver makes this argument concretely: the same LLM that gets stuck in a single-agent loop becomes dramatically more effective when its outputs are governed by a population-based optimizer.

The Generate-and-Project bridge is a general pattern. Any problem where the solution space is discrete but evaluation is continuous — code synthesis, molecular design, circuit layout, drug discovery — could benefit from this architecture. The LLM handles the generative step (producing valid candidates in the discrete space), and the optimizer handles the search step (deciding which direction to explore next in continuous space). Neither could do the other's job.

Continuous fitness functions unlock optimization for symbolic domains. The decision to reward near misses rather than demanding perfection is what makes PSO viable here. Binary evaluation gives one bit of information per candidate. Continuous evaluation gives a real-valued gradient. That difference is the difference between random search and directed search.

Diversity is not a luxury; it is a mechanism. The specialist roles are not a gimmick — they are the primary defense against premature convergence. This echoes a deep insight from evolutionary computation: maintaining population diversity isn't about fairness or coverage for its own sake, but about preserving the swarm's ability to escape local optima.

Whether this particular architecture will scale to the hardest tiers of ARC-AGI remains an open question. But the intellectual contribution is clear: the best use of an LLM in a reasoning system may not be as the reasoner, but as the mutation engine in a larger optimization loop. The swarm provides the strategy. The LLM provides the syntax. Together, they search a space that neither could navigate alone.

[Figure: 2-D projection of the 768-d embedding space, showing the six specialist particles (geometric, color, pattern, object, rule, hybrid) in distinct regions, each with a velocity arrow pulled toward gbest.]
Snapshot of the swarm in a 2-D projection. Each specialist starts in its own region of the search space and is pulled toward the swarm's best-so-far, but its starting bias keeps exploration diverse.
