The cutting edge of AI practice is converging on a deceptively simple idea: put an agent in a loop.
Karpathy's autoresearch runs an agent overnight in a while loop over a training script. Modify, train, evaluate, repeat. Wake up to a hundred experiments and a better model. Geoffrey Huntley's Ralph Wiggum pattern does something similar for code generation: a bash loop feeds a prompt into Claude Code, the agent picks a task, implements it, runs tests, commits, and exits. The loop restarts with a fresh context window. OpenClaw is adding heartbeat and cron capabilities. Scheduled agent execution. Persistent loops with checkpoints.
The pattern is trivially simple. In some cases it is literally `while :; do cat PROMPT.md | claude; done`. But the findings about what actually happens when you run these loops against real production systems are not simple at all. And almost nobody is publishing on the mechanics yet.
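As a runnable sketch of that pattern (the `claude` CLI is stubbed with a shell function here so the loop terminates and runs anywhere; in production the stub would be the real agent invocation):

```shell
#!/bin/sh
# Ralph Wiggum-style loop, bounded to 3 cycles so the sketch terminates.
# In production this is unbounded: while :; do cat PROMPT.md | claude; done
agent() { while read -r line; do echo "agent saw: $line"; done; }  # stub for the claude CLI

printf 'pick a task, implement it, run tests, commit\n' > PROMPT.md
for cycle in 1 2 3; do
  # each iteration pipes the prompt into a fresh agent process (fresh context window)
  cat PROMPT.md | agent
done
rm PROMPT.md
```

The important property is that state lives in the repository and the prompt file, not in the agent's context, which resets every iteration.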
This article is a contribution to that gap. I have been running automated improvement loops against a multi-chain assessment system for compliance scoring. The system uses multiple LLM assessment chains, each evaluating a different compliance domain independently. Each chain runs through an orchestration workflow using Anthropic Claude models. The benchmark is a validation set of 27 real-world cases scored by human reviewers across 400 fields.
Before the loop work started, manual iterative optimisation had taken accuracy from 47% to 80%. That took weeks. The last several points came slowly. At 80%, manual approaches had effectively stalled.
The question was whether an automated loop could break through.
What We Set Out to Test
Two hypotheses.
H1: An automated improvement harness, where an agent analyses its own failures against a human-reviewed benchmark, hypothesises changes, implements them, and evaluates results, can push accuracy past the 80% plateau toward 90%.
H2: Agent-generated hypotheses will produce more targeted improvements than manual analysis, because the agent can systematically evaluate a larger hypothesis space per cycle.
The acceptance criterion was concrete: accuracy measured as field-level agreement with human reviewers across the full 27-case, 400-field validation set.
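That metric is simple to state precisely. A minimal sketch (exact-match comparison is an assumption; the article does not specify how partial matches were handled):

```python
def field_agreement(ai_scores: dict, human_scores: dict) -> float:
    """Field-level agreement: the fraction of benchmark fields where the
    AI assessment exactly matches the human reviewer's score.
    Exact-match comparison is an illustrative assumption."""
    fields = list(human_scores)
    matches = sum(1 for f in fields if ai_scores.get(f) == human_scores[f])
    return matches / len(fields)
```

At 400 fields, a single field flipping is worth 0.25 percentage points, which is why individual regressions show up clearly in the cycle-level numbers.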
What We Did
We ran 15 automated improvement cycles over two days. Each cycle followed the same structure: the agent analysed mismatches between its assessments and the human benchmark, hypothesised what was causing specific errors, implemented targeted changes to prompts and assessment configuration, then re-evaluated against the full validation set.
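The cycle structure can be sketched as a keep-if-better loop (the interfaces here are hypothetical; the real harness also involved manual reverts and the phase changes described below):

```python
def run_improvement_loop(evaluate, propose, apply_change, config, max_cycles=15):
    """Skeleton of one automated improvement loop (hypothetical interface).
    Each cycle: analyse failures, hypothesise a change, implement it,
    re-evaluate on the full validation set, and keep only net improvements."""
    best = evaluate(config)                      # baseline accuracy
    for _ in range(max_cycles):
        change = propose(config)                 # agent hypothesises a fix
        candidate = apply_change(config, change)
        score = evaluate(candidate)              # full-benchmark re-evaluation
        if score > best:                         # keep net improvements only
            config, best = candidate, score
    return config, best
```

The expensive step is `evaluate`: every candidate change is scored against the whole benchmark, never a subset, because the cascading-regression findings below only show up at full scale.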
This produced roughly 500 agent invocations (each processing 25,000 to 100,000 characters of source material), 207 simulation outputs, and 18 bespoke scoring scripts.
The work fell into four phases, though we did not plan it that way. The phases emerged from what the data forced us to do.
We started with targeted calibration: narrow fixes addressing one to three specific cases per cycle, additive changes only. When we scaled from three chains to all seven, broad changes that tightened scoring criteria caused cascading regressions in unrelated fields. One cycle produced 13 regressions. We had to revert and rethink.
That forced a shift to pattern-level hypotheses, constraining the agent to detect specific failure modes rather than tighten general thresholds. This unlocked the session's largest improvement. When incremental changes stopped producing net gains, we shifted to structural experiments: a 2x2 matrix crossing two models with two assessment formats, to isolate whether our remaining problems were model limitations or prompt architecture limitations.
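The structural experiment in that last phase amounts to a small grid search. A sketch (names are placeholders; `evaluate` stands in for a full benchmark run):

```python
from itertools import product

def run_structural_matrix(evaluate, models, formats):
    """2x2 structural experiment: cross each model with each assessment
    format and score every combination on the same validation set, to
    separate model-capability effects from prompt-architecture effects."""
    return {(m, f): evaluate(m, f) for m, f in product(models, formats)}
```

If accuracy tracks the format axis more than the model axis, the bottleneck is prompt architecture rather than model capability.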
What We Found
1. The loop works, but not all the way
Overall accuracy moved from 80.2% to 85.5%. One assessment category went from 52% to 95%. The 90% target was not reached.
The early gains came fast. The first four cycles added 5+ percentage points on the original 14-case control set. By the final cycles, the regression-to-fix ratio had converged to roughly 1:1. Every fix created a new regression somewhere else. The loop had found the ceiling for this architecture.
2. Pattern-specific beats threshold-tightening. Every time.
This was the most consistent finding across all 15 cycles.
Adding detection for a specific failure pattern (for example, checking whether a particular type of claim was substantiated) produced reliable improvements with zero regressions. We could add these checks at any scale, to any chain, and they held.
Making the model generally more strict (raising thresholds, adding broad validation requirements) always caused cascading regressions across unrelated fields. Always. We tested this five times across different chains and different types of tightening. The pattern held without exception.
The practical implication: when designing agent improvement loops, constrain the hypothesis space to additive, pattern-specific changes. Do not allow the agent to tighten general thresholds. This is counterintuitive because threshold-tightening feels like the obvious lever when accuracy is too low.
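One cheap way to enforce that constraint is a guard in the harness that rejects any non-additive edit. A crude sketch (substring containment as the additivity test is an assumption; a real harness might diff structured config instead):

```python
def is_additive(old_prompt: str, new_prompt: str) -> bool:
    """Accept a proposed prompt change only if the old prompt survives
    verbatim inside the new one, i.e. the agent appended pattern-specific
    checks rather than rewriting or tightening existing criteria."""
    return old_prompt in new_prompt and len(new_prompt) > len(old_prompt)
```

Rejected proposals go back to the agent with the constraint restated, rather than being applied and reverted after a benchmark run.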
3. Prompt structure affects cross-field interference independently of content
This is the finding I did not expect.
The system exhibited a persistent interference pattern: adding assessment criteria to one field caused regressions in completely separate fields on the same chain. Adding 579 characters of factual verification to one field caused six regressions on an unrelated field. We confirmed this five times.
Our first hypothesis was that this was a model capability limitation. A more capable model would handle the additional complexity without interference. We tested this by swapping to a larger model. It reduced the interference from six regressions to three, but did not eliminate it.
Then we tried a structural change. Instead of open-ended assessment instructions ("evaluate whether the excess was disclosed correctly"), we used a structured Y/N checklist ("Q1: Was the amount stated? Q2: Was the amount correct? Q3: Was the frequency correct?"). Same assessment criteria. Different format.
The interference dropped from six regressions to one.
The format of the instructions affects cross-field interference independently of the content. A checklist confines the model's reasoning to mechanical steps, preventing what we started calling "assessment posture bleed." The effect is almost like the model developing a mood: when you make it stricter in one area, it shifts its overall disposition and starts scoring more harshly across everything, even fields that have nothing to do with the change you made. A structured checklist breaks that generalised stance by forcing the model to evaluate each question mechanically rather than holistically.
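A mitigation that follows from this is to render criteria into the checklist format mechanically, rather than hand-writing open-ended instructions per field. A sketch (the exact wording is illustrative):

```python
def to_checklist(criteria):
    """Render assessment criteria as a structured Y/N checklist, the
    format that reduced cross-field interference in these experiments."""
    lines = ["Answer each question with Y or N only."]
    lines += [f"Q{i}: {q}" for i, q in enumerate(criteria, start=1)]
    return "\n".join(lines)
```

Generating the checklist from a list of criteria also keeps the agent's improvement loop additive: a new failure-mode check is one appended question, not a rewrite of the field's instructions.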
This finding generalises. Any multi-field LLM assessment system is likely susceptible to this interference pattern, and the mitigation is structural, not about model capability.
4. "Just use a better model" is not a general solution
We tested two models across two chains where the smaller model exhibited persistent interference effects. The larger model reduced but did not eliminate the problem. In one case it made no net difference at all.
This matters because "upgrade the model" is the default advice when accuracy stalls. Our data shows that some accuracy limitations are architectural, not capability-based. A better model operating on a flawed prompt structure hits the same structural ceiling, just slightly higher.
5. Emphatic formatting does not improve instruction adherence
We tried CRITICAL and DO NOT SKIP directives with bold formatting in three separate prompt additions. Zero measurable effect on model behaviour. The model did not execute the instructions any more reliably than identically-worded instructions without emphasis.
This is a small finding, but a useful one. Prompt engineers frequently reach for formatting emphasis when instructions are not followed. The data suggests this does not work, at least for the models we tested.
When to Stop: Convergence Detection in Automated Loops
Nobody is publishing on this yet, and it matters.
An automated improvement loop needs a stopping criterion. Without one, it will churn indefinitely, making changes that produce equal regressions, wasting compute and potentially degrading the system.
We found that the regression-to-fix ratio is a reliable convergence signal. In our first four cycles, the ratio was 9:0. Fixes only, no regressions. By the middle cycles, ratios were 7:6. By the final cycles, 10:9. When the ratio approaches 1:1, the loop has converged. Further iteration is not producing net improvement.
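As a stopping rule, the ratio is easy to operationalise. A sketch (the 1.25 cutoff is an assumed tuning parameter, not a value from the experiments):

```python
def loop_has_converged(fixes: int, regressions: int, cutoff: float = 1.25) -> bool:
    """Convergence signal: stop when fixes per regression approaches 1:1.
    Early cycles look like 9:0 (keep going); late cycles like 10:9 (stop).
    The cutoff is a hypothetical tuning parameter."""
    if regressions == 0:
        return False          # fixes only: the loop is still in its fast phase
    return fixes / regressions < cutoff
```

In practice you would want the signal to persist for two or three consecutive cycles before halting, since a single noisy evaluation run can produce a spurious 1:1 ratio.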
We also identified four categories of residual error that mark the structural floor:
| Category | Description |
|---|---|
| Stochastic variance | Fields that fluctuate between runs due to model non-determinism. Not addressable via prompts. |
| Cross-field interference | Changes to one field cause regressions in another. Partially addressable via structural changes (checklists). |
| Data gaps | The assessment system lacks information needed to make certain judgements. An architectural limitation, not an accuracy one. |
| Benchmark misalignment | The human reviewers and the AI system categorise certain edge cases differently. Neither is wrong. The scoring comparison is structurally misaligned. |
Once remaining errors fall predominantly into these categories, the loop has reached its practical limit for the current architecture. Further improvement requires architectural changes, not more iterations.
Limitations
This is one system, one domain, two models, 27 validation cases. The findings are consistent across 15 cycles and 500 invocations, but a broader study across different domains and architectures would strengthen the conclusions.
The validation set is benchmarked against human reviewers, who have their own inconsistencies. Some of the "errors" in the residual set may reflect legitimate differences in interpretation rather than AI failures.
We did not test other loop architectures (Ralph Wiggum's task-picking approach, autoresearch's keep/discard binary). Our loop was specifically designed for accuracy improvement against a fixed benchmark. Different loop designs may surface different findings.
What Comes Next
The obvious next question is whether the structural findings (checklist format reducing interference, pattern-specific over threshold-tightening) hold across other multi-field assessment domains. If the assessment posture bleed problem generalises, the mitigation likely does too.
There is also an open question about loop composition. Ralph Wiggum uses a planning loop followed by a building loop. What happens when you chain an accuracy improvement loop after a planning and building loop? Could you automate the full cycle: build, assess, improve, reassess?
And the convergence detection methodology needs testing in other contexts. The regression-to-fix ratio worked for accuracy improvement. Does it work for code quality loops? For training optimisation? The metric is simple enough that it should transfer, but that is a hypothesis, not a finding.
The loop pattern is arriving everywhere because it works. What's missing is the practitioner knowledge about how it behaves under pressure: when it accelerates, when it stalls, and what to do when it hits the ceiling. That's what this work is about.
