SigmaLoop

A self-improving agent loop that autonomously optimizes coding agents through continuous iteration, failure analysis, and gated evaluation — no model upgrades required.

Engineering is shifting. The job is no longer writing software — it’s maintaining systems that can observe their own failures, evolve their own quality layer, and improve their own operating harness over time.

Code generation is cheap. Modern coding systems can produce thousands of lines of working code in minutes, faster than any team can review, test, or fully understand. The bottleneck has moved. It is no longer writing code. It is everything that comes after: validating behavior, catching regressions, debugging failures, and maintaining reliability as systems evolve and user behaviors drift.

Unlike traditional software, where failures are deterministic and localized, agent systems fail in ways that are stochastic, distribution-dependent, and difficult to reproduce. Small changes in prompts, tool schemas, or context construction can lead to qualitatively different behaviors with compounding downstream consequences. Improvements are reactive. Complexity compounds. Over time, the system becomes harder to maintain.

What SigmaLoop Does

SigmaLoop addresses this by transforming raw failure signals into a structured improvement pipeline. You give it a benchmark and a single file to edit. It reads failure traces, clusters them by root cause, tightens the system prompt and agent architecture, gates every change against a self-maintained regression suite, and repeats — overnight, unattended, no human in the loop.

Each failure is analyzed to produce a representation of what went wrong, along with a hypothesis about its root cause. Failures are converted into reusable evaluation cases. The system proposes targeted changes to the agent harness, applies them, and validates the outcome against an evolving evaluation set. Every failure contributes to a persistent improvement rather than being resolved as a one-off fix.

The Self-Improvement Loop

Each cycle begins by simulating a batch of production traffic, drawn from the benchmark's real-world request distribution, then moves through four phases:

1. Phase A — Failure Mining. Scan execution traces from failed tasks and extract structured failure records: what failed, why, and what the agent should have done differently. Classify root causes and surface dominant patterns.
2. Phase B — Eval Candidates & Clustering. Track failures, group them by shared root-cause mechanism, and rerank clusters by recurrence and severity. Clusters with high failure counts and low resolution rates come first; optimization happens at the cluster level.
3. Phase C — Optimization Loop (fixed iteration budget). Analyze failure patterns, then propose and implement a targeted harness change addressing a root-cause cluster. The change must clear the regression gate (regression ≥ 80% and val_score ≥ best_seen): on PASS the failures are resolved, on FAIL the loop retries with a different approach until the budget is exhausted, then exits anyway.
4. Phase D — Regression Set Maintenance. Promote resolved failures into the regression suite, where each fix becomes a permanent constraint that future changes must keep passing. The outcome is recorded and the next batch begins.
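Read as control flow, one cycle might look like the following sketch. Every function name here is a hypothetical stand-in; this document does not expose SigmaLoop's actual internals.

```python
# Hypothetical sketch of one batch; none of these names are SigmaLoop's
# real API, they simply mirror the four phases described above.
from typing import Callable

REGRESSION_THRESHOLD = 0.80  # the gate's regression bar

def run_batch(
    mine_failures: Callable[[], list],      # Phase A: traces -> failure records
    cluster: Callable[[list], list],        # Phase B: group and prioritize by root cause
    propose_fix: Callable[[list], None],    # Phase C: edit the harness in place
    revert_fix: Callable[[], None],
    regression_score: Callable[[], float],  # pass rate on the regression suite
    val_score: Callable[[], float],         # mean reward on the validation split
    promote: Callable[[list], None],        # Phase D: resolved failures -> suite
    best_seen: float,
    budget: int = 8,                        # fixed iteration budget (assumed value)
) -> float:
    clusters = cluster(mine_failures())
    for _ in range(budget):
        propose_fix(clusters)
        candidate = val_score()
        if regression_score() >= REGRESSION_THRESHOLD and candidate >= best_seen:
            promote(clusters)               # outcome recorded, batch advances
            return candidate                # PASS: commit the change
        revert_fix()                        # FAIL: try a different approach
    return best_seen                        # budget exhausted: exit anyway
```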

How Each Phase Works

Phase A — Failure Mining. After each batch, the system scans execution traces from failed tasks and extracts structured failure records. It answers the central questions: what is the root cause of each failure? What failure patterns keep recurring? What should the agent have done differently? No manual labeling is required.
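As an illustration, a structured failure record could take a shape like the sketch below; the field names are assumptions based on the questions Phase A answers, not a published schema.

```python
# Hypothetical shape of a structured failure record; the fields are an
# assumption, chosen to answer what failed, why, and what the agent
# should have done differently.
from dataclasses import dataclass

@dataclass
class FailureRecord:
    task_id: str
    root_cause: str          # e.g. "insurance scope misapplied"
    observed: str            # what the agent actually did in the trace
    expected: str            # what the agent should have done differently
    trace_excerpt: str       # minimal slice of the execution trace as evidence
    resolved: bool = False   # flipped once a committed fix clears the gate
```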

Phase B — Clustering & Prioritization. Failed tasks are grouped by shared root-cause mechanism into clusters. High failure count and low resolution rate identify the most systemic and unaddressed failure modes. Rather than treating failures independently, the system tracks and prioritizes them at the level of underlying patterns — enabling more efficient coverage of the error space.
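One plausible reading of "high failure count and low resolution rate" as a ranking, reusing the hypothetical FailureRecord sketch above (SigmaLoop's actual scoring function is not specified):

```python
# Illustrative cluster ranking; the scoring formula is an assumption.
from collections import defaultdict

def prioritize(failures: list[FailureRecord]) -> list[str]:
    clusters: dict[str, list[FailureRecord]] = defaultdict(list)
    for f in failures:
        clusters[f.root_cause].append(f)  # group by shared root-cause mechanism

    def score(records: list[FailureRecord]) -> float:
        resolution_rate = sum(r.resolved for r in records) / len(records)
        return len(records) * (1.0 - resolution_rate)  # many failures, few resolved

    return sorted(clusters, key=lambda c: score(clusters[c]), reverse=True)
```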

Phase C — Optimization Loop. Within a fixed iteration budget, the system proposes targeted harness changes addressing root-cause clusters. Changes span the full stack: prompt design, few-shot examples, tool interfaces, context construction, and workflow architecture. Each proposed change runs through a three-step gate:

  1. Regression suite — Previously fixed tasks must keep passing at a rate of at least 80%. This is the system’s memory.
  2. Full benchmark — Mean reward on the held-out test split must meet or exceed the best score on record.
  3. Suite promotion — If both gates pass, newly-passing tasks are promoted into the regression suite.

Nothing is committed without clearing all three steps. If any step fails, the change is reverted and the system tries a different approach.
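A sketch of the gate as a single check: the 80% threshold and best-score comparison come from the steps above, while the callables are hypothetical stand-ins for the benchmark runner.

```python
# Sketch of the three-step gate. Thresholds come from the text; the
# callables are hypothetical stand-ins for the real runner.
from typing import Callable, Dict

Rewards = Dict[str, float]  # task_id -> reward, as in the BenchmarkRunner contract

def gate(
    run_regression: Callable[[], Rewards],  # candidate on the regression suite
    run_benchmark: Callable[[], Rewards],   # candidate on the held-out test split
    promote: Callable[[Rewards], None],     # add newly-passing tasks to the suite
    best_seen: float,
) -> bool:
    # Step 1: previously fixed tasks must keep passing at >= 80%.
    reg = run_regression()
    if sum(reg.values()) / len(reg) < 0.80:
        return False  # revert the change

    # Step 2: mean reward must meet or exceed the best score on record.
    val = run_benchmark()
    if sum(val.values()) / len(val) < best_seen:
        return False  # revert the change

    # Step 3: suite promotion, only reached once both gates pass.
    promote(val)
    return True
```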

Phase D — Regression Suite Maintenance. The regression set is not a static benchmark — it’s a living collection of cases that evolves with the agent. Resolved failures are permanently encoded into the suite. Each improvement cycle makes it harder to accidentally regress, forcing each subsequent improvement to be genuinely additive.
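The promotion step itself can be as small as appending task IDs to suite.json; the schema below is an assumption, since only the file's role is described.

```python
# Hypothetical promotion step; the suite.json schema is an assumption,
# the source only says resolved failures are promoted into the suite.
import json
from pathlib import Path

def promote_resolved(resolved_task_ids: list[str],
                     suite_path: str = "suite.json") -> None:
    path = Path(suite_path)
    suite = json.loads(path.read_text()) if path.exists() else {"tasks": []}
    for task_id in resolved_task_ids:
        if task_id not in suite["tasks"]:
            suite["tasks"].append(task_id)  # each fix becomes a permanent constraint
    path.write_text(json.dumps(suite, indent=2))
```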

Results

SigmaLoop ran completely autonomously for 18 batches, executing 96 harness experiments against a fixed GPT-4 model. No fine-tuning, no model upgrade: the underlying model is held constant throughout so that all gains can be attributed purely to agent harness improvements.

Baseline val_score: 0.560
Final val_score: 0.780
Improvement: +39.3%
Rejection rate: 85%

Agent Performance

[Figure: validation score across iterations, marking changes kept vs. discarded (val score not improved, or regression gate failed)]
Agent performance on the validation set improves from 0.56 → 0.78 over 96 iterations of harness optimization. At each iteration, the system explores multiple candidate updates, retaining only those that both improve validation performance and satisfy the regression gate (≥ 80%). In later stages, most candidate changes are rejected as the regression gate prevents any update that reintroduces previously fixed failure modes. As experiments progress, the optimization problem becomes harder, forcing each improvement to be additive — shifting reliability from a manual debugging loop to an automated improvement process.

Failure Cluster Discovery

Cluster Resolution Timeline

[Figure: per-cluster failure counts across batches 1–18, from first detection (+1 new failure) through partial (e.g. 50%) and full resolution]

Discovered clusters and their end-of-run status:

Wrong order identification: Fully resolved
Product variant mismatch: Fully resolved
Roaming/data limit handling: Fully resolved
Cabin downgrade payment confusion: Fully resolved
Cheapest flight not selected: Fully resolved
Tool dispatch errors: Fully resolved
Insurance scope misapplied: Active
Device reboot sequencing: Active
State tracking gaps: Active
Multi-order context confusion: Fully resolved

SigmaLoop automatically discovered 29+ distinct failure clusters from execution traces, without any manual labeling. Failures are treated as recurring patterns rather than isolated incidents. As clusters are resolved, they are incorporated into the regression set, preventing recurrence. High-impact failure modes are systematically identified, prioritized, and driven toward resolution.

Regression Set Evolution

[Figure: regression suite size (test cases) across 18 batches, with regression-gate rejections marked for discarded iterations]

The regression suite grows from 0 to 17 test cases across 18 batches, with each resolved failure cluster contributing new cases. The ≥ 80% gate is enforced throughout, rejecting any iteration that regresses on known failures. The evaluation set is not static — it evolves with the system. Each fix becomes a permanent constraint, making future improvements harder but more reliable, and ensuring progress compounds without backsliding.

Design Principles

One file to edit, everything else locked. The coding agent only touches agent/agent.py. The benchmark runner, gate logic, and recording infrastructure are immutable from the agent’s perspective. This separation gives the loop a stable contract: change the agent, measure the change, gate the change.

The regression suite is a memory of every fix. Each committed improvement promotes newly-fixed tasks into suite.json. Future changes must keep passing them — or get rejected. The suite grows tighter over time, functioning as a living proxy validation set.

Learnings survive session boundaries. After every iteration, the agent writes to workspace/learnings.md: what it tried, what the failure pattern was, what worked. Reading this at session start restores full context without re-running diagnostics.
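A sketch of what that write could look like; the entry format is an assumption, only the file path comes from the text.

```python
# Minimal sketch of persisting learnings across sessions; the entry
# format is an assumption, only workspace/learnings.md is named by the source.
from datetime import datetime, timezone

def record_learning(tried: str, pattern: str, outcome: str,
                    path: str = "workspace/learnings.md") -> None:
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(path, "a") as f:
        f.write(f"\n## {stamp}\n"
                f"- Tried: {tried}\n"
                f"- Failure pattern: {pattern}\n"
                f"- Outcome: {outcome}\n")  # read back at session start
```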

High rejection rate is the goal. The gate’s job is to catch changes that overfit or regress. If the acceptance rate is high, the gate is too loose. In this experiment, 85% of changes were rejected — and every accepted commit was a real, generalizing improvement.

Benchmark-agnostic architecture. While the reference implementation uses tau2, the loop, gate, and recording layer are entirely decoupled. Subclass BenchmarkRunner, return {task_id: reward}, and the rest works unchanged.
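A custom runner might look like the sketch below; BenchmarkRunner and the {task_id: reward} contract come from the text, while the import path, method name, and helper functions are assumptions.

```python
# Sketch of a custom benchmark adapter. BenchmarkRunner and the
# {task_id: reward} return contract are from the text; the import path,
# run(), load_tasks(), execute(), and score() are all hypothetical.
from typing import Dict

from sigmaloop import BenchmarkRunner  # assumed import path

class MyBenchmarkRunner(BenchmarkRunner):
    def run(self, split: str) -> Dict[str, float]:
        rewards: Dict[str, float] = {}
        for task in load_tasks(split):         # your benchmark's task loader
            outcome = execute(task)            # run the agent harness on one task
            rewards[task.id] = score(outcome)  # reward in [0, 1]
        return rewards                         # all the loop needs back
```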

Agent improvement is a search problem with many rejected paths. SigmaLoop provides the infrastructure — the loop, the gate, and the memory — so the agent can navigate that space autonomously.