SigmaLoop
A self-improving agent loop that autonomously optimizes coding agents through continuous iteration, failure analysis, and gated evaluation — no model upgrades required.
Engineering is shifting. The job is no longer writing software — it’s maintaining systems that can observe their own failures, evolve their own quality layer, and improve their own operating harness over time.
Code generation is cheap. Modern coding systems can produce thousands of lines of working code in minutes, faster than any team can review, test, or fully understand. The bottleneck has moved. It is no longer writing code. It is everything that comes after: validating behavior, catching regressions, debugging failures, and maintaining reliability as systems evolve and user behaviors drift.
Unlike traditional software, where failures are deterministic and localized, agent systems fail in ways that are stochastic, distribution-dependent, and difficult to reproduce. Small changes in prompts, tool schemas, or context construction can lead to qualitatively different behaviors with compounding downstream consequences. Improvements are reactive. Complexity compounds. Over time, the system becomes harder to maintain.
What SigmaLoop Does
SigmaLoop addresses this by transforming raw failure signals into a structured improvement pipeline. You give it a benchmark and a single file to edit. It reads failure traces, clusters them by root cause, tightens the system prompt and agent architecture, gates every change against a self-maintained regression suite, and repeats — overnight, unattended, no human in the loop.
Each failure is analyzed to produce a representation of what went wrong, along with a hypothesis about its root cause. Failures are converted into reusable evaluation cases. The system proposes targeted changes to the agent harness, applies them, and validates the outcome against an evolving evaluation set. Every failure contributes to a persistent improvement rather than being resolved as a one-off fix.
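The failure-to-eval-case conversion described above can be sketched roughly as follows. This is a minimal illustration, not SigmaLoop's actual schema — the `FailureRecord` fields and the `to_eval_case` helper are hypothetical names chosen for clarity:

```python
from dataclasses import dataclass

@dataclass
class FailureRecord:
    """Structured record mined from one failed execution trace (illustrative schema)."""
    task_id: str
    root_cause: str          # mechanism behind the failure
    hypothesis: str          # what the agent should have done differently
    trace_excerpt: str = ""  # minimal slice of the trace showing the issue

def to_eval_case(record: FailureRecord) -> dict:
    """Convert a mined failure into a reusable evaluation case."""
    return {
        "task_id": record.task_id,
        "tags": [record.root_cause],
        "expected_behavior": record.hypothesis,
    }

record = FailureRecord(
    "task-042",
    "wrong order identification",
    "confirm the order ID with the user before acting",
)
case = to_eval_case(record)
```

The key property is that each record carries a root-cause label, which is what makes the clustering in Phase B possible.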
The Self-Improvement Loop
How Each Phase Works
Phase A — Failure Mining. After each batch, the system scans execution traces from failed tasks and extracts structured failure records. It answers the central questions: what is the root cause of each failure? What failure patterns keep recurring? What should the agent have done differently? No manual labeling is required.
Phase B — Clustering & Prioritization. Failed tasks are grouped by shared root-cause mechanism into clusters. High failure count and low resolution rate identify the most systemic and unaddressed failure modes. Rather than treating failures independently, the system tracks and prioritizes them at the level of underlying patterns — enabling more efficient coverage of the error space.
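The prioritization heuristic — high failure count, low resolution rate — might look like the following sketch. The exact scoring SigmaLoop uses is not specified in this document; the input shape and the `score` function here are assumptions:

```python
from collections import defaultdict

def prioritize(failures):
    """Group failures by root cause and rank clusters so that systemic,
    unaddressed failure modes come first. `failures` is a list of dicts
    with 'root_cause' and 'resolved' keys (assumed shape)."""
    clusters = defaultdict(list)
    for f in failures:
        clusters[f["root_cause"]].append(f)

    def score(items):
        count = len(items)
        resolution_rate = sum(f["resolved"] for f in items) / count
        # Many occurrences, few resolved -> high priority.
        return count * (1.0 - resolution_rate)

    return sorted(clusters.items(), key=lambda kv: score(kv[1]), reverse=True)

failures = [
    {"root_cause": "tool dispatch errors", "resolved": False},
    {"root_cause": "tool dispatch errors", "resolved": False},
    {"root_cause": "product variant mismatch", "resolved": True},
]
ranked = prioritize(failures)
```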
Phase C — Optimization Loop. Within a fixed iteration budget, the system proposes targeted harness changes addressing root-cause clusters. Changes span the full stack: prompt design, few-shot examples, tool interfaces, context construction, and workflow architecture. Each proposed change runs through a three-step gate:
- Regression suite — Previously fixed tasks must keep passing at a ≥ 80% pass rate. This is the system’s memory.
- Full benchmark — Mean reward on the held-out test split must meet or exceed the best score on record.
- Suite promotion — If both gates pass, newly-passing tasks are promoted into the regression suite.
Nothing is committed without clearing all three steps. If any step fails, the change is reverted and the system tries a different approach.
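The three-step gate can be summarized in code. This is a sketch under stated assumptions: `run_suite` and `run_benchmark` are placeholder callables (the real runner interfaces aren't shown here), and rewards are taken to be in [0, 1] with 1.0 meaning a task passed:

```python
def gate(candidate, regression_suite, run_suite, run_benchmark, best_score,
         threshold=0.80):
    """Three-step acceptance gate (illustrative).

    1. Regression suite: previously fixed tasks must keep passing at >= threshold.
    2. Full benchmark: held-out mean reward must meet or exceed the best on record.
    3. Promotion: newly passing tasks join the regression suite.
    Returns (accepted, updated_suite, updated_best).
    """
    suite_results = run_suite(candidate, regression_suite)  # {task_id: passed}
    if regression_suite:
        pass_rate = sum(suite_results.values()) / len(regression_suite)
        if pass_rate < threshold:
            return False, regression_suite, best_score      # step 1 failed -> revert

    rewards = run_benchmark(candidate)                      # {task_id: reward}
    mean_reward = sum(rewards.values()) / len(rewards)
    if mean_reward < best_score:
        return False, regression_suite, best_score          # step 2 failed -> revert

    # Step 3: promote newly passing tasks into the suite.
    promoted = regression_suite | {t for t, r in rewards.items() if r >= 1.0}
    return True, promoted, mean_reward
```

The revert-on-failure behavior falls out naturally: a rejected candidate returns the suite and best score unchanged, so the caller simply discards the change.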
Phase D — Regression Suite Maintenance. The regression set is not a static benchmark — it’s a living collection of cases that evolves with the agent. Resolved failures are permanently encoded into the suite. Each improvement cycle makes it harder to accidentally regress, forcing each subsequent improvement to be genuinely additive.
Results
SigmaLoop ran completely autonomously for 18 batches, executing 96 harness experiments against a fixed GPT-4 model — no fine-tuning, no model upgrade. The underlying model is intentionally held constant so that all gains can be attributed to agent harness improvements alone.
Agent Performance
Agent performance on the validation set improves from 0.56 → 0.78 over 96 iterations of harness optimization. At each iteration, the system explores multiple candidate updates, retaining only those that both improve validation performance and satisfy the regression gate (≥ 80%). In later stages, most candidate changes are rejected as the regression gate prevents any update that reintroduces previously fixed failure modes. As experiments progress, the optimization problem becomes harder, forcing each improvement to be additive — shifting reliability from a manual debugging loop to an automated improvement process.
Failure Cluster Discovery
| CLUSTER | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | STATUS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Wrong order identification | +1 | +1 | 50% | 2 | 2 | 2 | 2 | | | | | | | | | | | | ✓ Fully resolved |
| Product variant mismatch | +1 | 1 | 1 | 1 | 1 | | | | | | | | | | | | | | ✓ Fully resolved |
| Roaming/data limit handling | +1 | 1 | 1 | 1 | | | | | | | | | | | | | | | ✓ Fully resolved |
| Cabin downgrade payment confusion | +1 | | | | | | | | | | | | | | | | | | ✓ Fully resolved |
| Cheapest flight not selected | +1 | | | | | | | | | | | | | | | | | | ✓ Fully resolved |
| Tool dispatch errors | +1 | 1 | 1 | 1 | 1 | | | | | | | | | | | | | | ✓ Fully resolved |
| Insurance scope misapplied | +1 | 1 | 1 | 1 | 1 | 1 | | | | | | | | | | | | | Active |
| Device reboot sequencing | +1 | 1 | 1 | 1 | 1 | | | | | | | | | | | | | | Active |
| State tracking gaps | +1 | 1 | 1 | 1 | 1 | | | | | | | | | | | | | | Active |
| Multi-order context confusion | +1 | 1 | 1 | 1 | 1 | | | | | | | | | | | | | | ✓ Fully resolved |
SigmaLoop automatically discovered 29+ distinct failure clusters from execution traces, without any manual labeling. Failures are treated as recurring patterns rather than isolated incidents. As clusters are resolved, they are incorporated into the regression set, preventing recurrence. High-impact failure modes are systematically identified, prioritized, and driven toward resolution.
Regression Set Evolution
The regression suite grows from 0 to 17 test cases across 18 batches, with each resolved failure cluster contributing new cases. The ≥ 80% gate is enforced throughout, rejecting any iteration that regresses on known failures. The evaluation set is not static — it evolves with the system. Each fix becomes a permanent constraint, making future improvements harder but more reliable, and ensuring progress compounds without backsliding.
Design Principles
One file to edit, everything else locked. The coding agent only touches agent/agent.py. The benchmark runner, gate logic, and recording infrastructure are immutable from the agent’s perspective. This separation gives the loop a stable contract: change the agent, measure the change, gate the change.
The regression suite is a memory of every fix. Each committed improvement promotes newly-fixed tasks into suite.json. Future changes must keep passing them — or get rejected. The suite grows tighter over time, functioning as a living proxy validation set.
Learnings survive session boundaries. After every iteration, the agent writes to workspace/learnings.md: what it tried, what the failure pattern was, what worked. Reading this at session start restores full context without re-running diagnostics.
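Writing such a log entry is straightforward; a minimal sketch is below. The entry format is illustrative — only the `workspace/learnings.md` path and the three kinds of content (what was tried, the failure pattern, what worked) come from the description above:

```python
from datetime import date
from pathlib import Path

def record_learning(path, tried, pattern, outcome):
    """Append a structured entry to the cross-session learnings log
    (workspace/learnings.md in SigmaLoop; entry format here is illustrative)."""
    entry = (
        f"## {date.today().isoformat()}\n"
        f"- Tried: {tried}\n"
        f"- Failure pattern: {pattern}\n"
        f"- Outcome: {outcome}\n\n"
    )
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "a") as f:
        f.write(entry)
```

Because entries are appended, the file doubles as a chronological record of the search: rejected experiments stay visible, which keeps the agent from retrying dead ends in later sessions.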
High rejection rate is the goal. The gate’s job is to catch changes that overfit or regress. If the acceptance rate is high, the gate is too loose. In this experiment, 85% of changes were rejected — and every accepted commit was a real, generalizing improvement.
Benchmark-agnostic architecture. While the reference implementation uses tau2, the loop, gate, and recording layer are entirely decoupled. Subclass BenchmarkRunner, return {task_id: reward}, and the rest works unchanged.
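That contract can be sketched as follows. The `BenchmarkRunner` name comes from the text above, but its exact interface is not shown in this document, so the method signature and the toy subclass here are assumptions:

```python
class BenchmarkRunner:
    """Benchmark-agnostic contract: the loop only consumes {task_id: reward}."""
    def run(self, agent_path: str) -> dict:
        raise NotImplementedError

class ToyRunner(BenchmarkRunner):
    """Hypothetical runner that returns a canned score table instead of
    executing tau2 — enough to exercise the loop and gate end to end."""
    def __init__(self, rewards: dict):
        self._rewards = rewards

    def run(self, agent_path: str) -> dict:
        # A real subclass would execute the agent at agent_path on each task
        # and score the resulting trajectories.
        return dict(self._rewards)

runner = ToyRunner({"task-1": 1.0, "task-2": 0.0})
rewards = runner.run("agent/agent.py")
```

Keeping the return type to a flat `{task_id: reward}` mapping is what lets the gate and regression logic stay ignorant of any particular benchmark's task format.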
Agent improvement is a search problem with many rejected paths. SigmaLoop provides the infrastructure — the loop, the gate, and the memory — so the agent can navigate that space autonomously.