AI Agent Loop Engineering: Why the Smartest Developers Stopped Writing Prompts and Started Writing Control Systems

Boris Cherny, head of Claude Code at Anthropic, described the shift in one sentence at the 2025 AI Engineer World’s Fair: “I don’t prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops.”
That same week, Docker founder Solomon Hykes stood on the same stage and offered the cleanest definition of an AI agent yet: “An agent is an LLM wrecking its environment in a loop.”
Both statements point to the same reality. The leverage point has moved. Crafting the perfect prompt matters less than designing the control system that decides when to prompt, what to feed in, and when to stop. The unit of work is no longer the conversation turn. It is the loop.
This article covers the loop patterns that actually work in production, the failure modes that kill autonomous agents, and the architectural decisions that separate reliable loop systems from expensive toys.
What an agent loop actually is
An agent loop is a control structure that wraps an LLM call in a cycle of reasoning, action, and observation:
while not done:
    context = assemble_state(memory, tools, goal)
    action = llm.generate(context)
    result = execute(action)
    observation = interpret(result)
    memory = update(memory, observation)
    done = check_completion(goal, observation)
The LLM does not control the loop. The loop controls the LLM. That distinction separates a chatbot that sometimes gets stuck from an autonomous system that runs until the work is done.
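To make the cycle concrete, here is a minimal Python sketch of the same structure. It is an illustration under assumptions, not any framework's API: `LoopState`, `run_loop`, and the `fake_*` stand-ins are hypothetical names, the model call is a stub, and an iteration cap replaces the open-ended `while`.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopState:
    goal: str
    memory: list[str] = field(default_factory=list)


def run_loop(
    state: LoopState,
    generate: Callable[[str], str],       # stand-in for the LLM call
    execute: Callable[[str], str],        # runs the proposed action
    is_done: Callable[[str, str], bool],  # completion check owned by the loop
    max_iterations: int = 10,
) -> int:
    """Run the cycle until is_done() passes or the iteration cap is hit."""
    for iteration in range(1, max_iterations + 1):
        context = f"goal: {state.goal}\nmemory: {state.memory}"
        action = generate(context)
        observation = execute(action)
        state.memory.append(observation)
        if is_done(state.goal, observation):
            return iteration
    raise RuntimeError("iteration cap reached before the goal was verified")


# Toy stand-ins: the "model" always proposes incrementing a counter.
counter = {"value": 0}


def fake_generate(context: str) -> str:
    return "increment"


def fake_execute(action: str) -> str:
    counter["value"] += 1
    return f"counter={counter['value']}"


iterations = run_loop(
    LoopState(goal="counter=3"),
    fake_generate,
    fake_execute,
    is_done=lambda goal, observation: observation == goal,
)
```

The model never sees `is_done`. The loop owns the exit condition, which is the whole point.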
Simon Willison tracks agent definitions on his blog. He points out that the “tools in a loop” pattern, which Anthropic formalized in their agent architecture, merges with a much older definition from academic AI: “An agent is something that acts in an environment; it does something. Agents include worms, dogs, thermostats, airplanes, robots, humans, companies, and countries.”
The worm does not decide when to stop being a worm. Its loop runs until external conditions say otherwise. Production agent loops follow the same principle.
Five loop patterns, ranked by autonomy
The pattern you pick determines how much trust you place in the LLM versus the control system.
Pattern 1: ReAct (Reasoning + Acting)
The default. The agent reasons about the current state, picks a tool to call, observes the result, and reasons again. Most major agent frameworks ship some version of it.
Think: "I need the current weather to answer this."
Act: call_weather_api("New York")
Observe: {"temp": 72, "condition": "clear"}
Think: "72 degrees is mild. I can answer now."
Act: respond("It is 72 and clear in New York.")
ReAct works well for interactive tasks with clear endpoints. It fails when the agent needs to sustain effort across many steps without drifting. The self-assessment problem kicks in around step 8 or 9: the LLM decides it is “done enough” and exits, regardless of whether the task is actually complete.
A 2026 paper from Alibaba’s research team analyzed this failure mode directly: “The self-assessment mechanism of LLMs is unreliable. It exits when it subjectively thinks it is ‘complete’ rather than when it meets objectively verifiable standards.”
Pattern 2: Reflection loop
The agent generates output, critiques its own output, then revises. Repeat until the critique step finds nothing worth changing.
This works for bounded creative tasks: writing, code review, document editing. It fails catastrophically for tasks that require external verification. An LLM cannot reliably judge whether its own SQL query returns correct results. It can only judge whether the query looks right.
The 2026 paper “A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents” (arXiv: 2605.20173) introduces the concept of the Stochastic-Deterministic Boundary, or SDB. It is a four-part contract governing how LLM output becomes a system action. Every loop must define a proposer (who suggests the action), a verifier (who checks it), a commit step (who executes it), and a reject signal (what happens on failure). Reflection loops collapse proposer and verifier into the same model. That is the failure point.
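One way to picture the four-part contract is as a small Python type. This is a sketch of the idea as described here, not code from the paper: `Boundary`, its field names, and the read-only SQL check are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

A = TypeVar("A")


@dataclass
class Boundary(Generic[A]):
    propose: Callable[[], A]      # stochastic side: the LLM suggests an action
    verify: Callable[[A], bool]   # deterministic check, independent of the proposer
    commit: Callable[[A], None]   # the only place side effects happen
    reject: Callable[[A], None]   # what happens when verification fails

    def step(self) -> bool:
        action = self.propose()
        if self.verify(action):
            self.commit(action)
            return True
        self.reject(action)
        return False


committed: list[str] = []
rejected: list[str] = []
proposals = iter(["DROP TABLE users", "SELECT id FROM users"])

boundary = Boundary(
    propose=lambda: next(proposals),
    # Hypothetical rule: only read-only statements may cross the boundary.
    verify=lambda sql: sql.lstrip().upper().startswith("SELECT"),
    commit=committed.append,
    reject=rejected.append,
)
results = [boundary.step(), boundary.step()]
```

A reflection loop would pass the same model in as both `propose` and `verify`. Keeping them as separate slots makes that collapse visible in the code.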
Pattern 3: Ralph Loop (external verification)
Named after the “Ralph Wiggum” technique from the Whilly orchestrator project, the Ralph Loop solves the self-assessment problem by moving verification outside the LLM entirely.
while true:
    spawn fresh agent turn (clean context, no history bloat)
    agent reads: PRD, previous learnings, prior state (from filesystem)
    agent works: one discrete task per iteration
    agent writes: results + learnings to filesystem
    external gate: run tests / check completion markers / verify outputs
    if all items verified complete → break
    else → loop continues with fresh context
Each iteration gets a clean context window. No accumulated conversation history. No token bloat. The filesystem is the memory.
| Dimension | ReAct | Ralph Loop |
|---|---|---|
| Who decides to stop | LLM self-assessment | External verification gate |
| Context per iteration | Growing conversation history | Clean per iteration |
| Memory mechanism | In-context (prompt) | Files on disk (PRD, learnings, state) |
| Best for | Interactive chat, Q&A | Batch work, overnight automation, CI/CD |
| Failure mode | Premature exit | Infinite loop (mitigated by iteration cap) |
The Ralph Loop is what powers most autonomous coding agents that run overnight. Claude Code’s autonomous mode, Whilly, and Ouro Loop are all Ralph Loop variants under the hood.
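The Ralph Loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Whilly or any other tool implements it: the agent turn is a stub, `state.json` is a hypothetical file layout, and the gate is a plain function where a real system would run tests.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def ralph_loop(
    workdir: Path,
    run_agent_turn: Callable[[Path], None],  # fresh turn: sees only files in workdir
    gate: Callable[[Path], bool],            # external check: tests, markers, outputs
    max_iterations: int = 20,
) -> int:
    for iteration in range(1, max_iterations + 1):
        run_agent_turn(workdir)
        if gate(workdir):
            return iteration
    raise RuntimeError("iteration cap reached; the gate never passed")


def fake_agent_turn(workdir: Path) -> None:
    """Stub agent: reads state from disk, does one task, writes state back."""
    state_file = workdir / "state.json"
    state = json.loads(state_file.read_text())
    pending = [task for task in state["tasks"] if task not in state["done"]]
    if pending:
        state["done"].append(pending[0])  # one discrete task per iteration
        state_file.write_text(json.dumps(state))


def all_tasks_done(workdir: Path) -> bool:
    state = json.loads((workdir / "state.json").read_text())
    return set(state["done"]) == set(state["tasks"])


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "state.json").write_text(
        json.dumps({"tasks": ["parse", "build", "test"], "done": []})
    )
    iterations = ralph_loop(workdir, fake_agent_turn, all_tasks_done)
    final_state = json.loads((workdir / "state.json").read_text())
```

Nothing carries over between turns except what is on disk, and the iteration cap is what keeps the infinite-loop failure mode bounded.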
Pattern 4: Structured Graph Harness (SGH)
A 2026 paper, “From Agent Loops to Structured Graphs” (arXiv: 2604.11378), identifies three structural weaknesses in implicit-loop designs and proposes replacing them with explicit static DAGs.
The first is implicit dependencies. Step B depends on step A, but the loop does not know that. It discovers dependencies through trial and error. The second is unbounded recovery loops. When step 7 fails, the agent retries steps 4 through 7, then 3 through 7. There is no structured rollback. The third is mutable execution history. The agent rewrites its own memory of what it did, and traceability is gone.
SGH separates control flow into three layers. The planning layer generates an immutable DAG of tasks with explicit dependencies. The execution layer walks the DAG, running each node exactly once. The recovery layer, on failure, follows a strict escalation protocol: retry, then degrade, then escalate to human.
The DAG is static within a plan version. To change the plan, you create a new version. Every execution is reproducible and auditable.
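A rough Python sketch of the three layers, using the standard library's `graphlib`. It illustrates the idea rather than the paper's implementation: `PLAN_V1`, `run_node`, and `execute_plan` are hypothetical names, and the retry count is an arbitrary choice.

```python
from graphlib import TopologicalSorter
from types import MappingProxyType
from typing import Callable, Mapping

# Planning layer: plan version 1, mapping each node to the nodes it depends on.
# MappingProxyType gives a read-only view, so the plan cannot be edited in place.
PLAN_V1: Mapping[str, frozenset[str]] = MappingProxyType({
    "fetch": frozenset(),
    "transform": frozenset({"fetch"}),
    "report": frozenset({"transform"}),
})


def run_node(
    name: str,
    primary: Callable[[str], None],
    fallback: Callable[[str], None],
    max_retries: int = 2,
) -> str:
    """Recovery layer: retry the primary path, then degrade, then escalate."""
    for _ in range(max_retries):
        try:
            primary(name)
            return "ok"
        except Exception:
            continue
    try:
        fallback(name)
        return "degraded"
    except Exception:
        return "escalated"  # hand off to a human


def execute_plan(
    plan: Mapping[str, frozenset[str]],
    primary: Callable[[str], None],
    fallback: Callable[[str], None],
) -> list[tuple[str, str]]:
    """Execution layer: visit each node once, in dependency order."""
    log: list[tuple[str, str]] = []  # append-only execution record
    for name in TopologicalSorter(plan).static_order():
        outcome = run_node(name, primary, fallback)
        log.append((name, outcome))
        if outcome == "escalated":
            break  # downstream nodes depend on this one
    return log


def flaky_primary(name: str) -> None:
    if name == "transform":
        raise RuntimeError("primary path failed")


def working_fallback(name: str) -> None:
    pass


log = execute_plan(PLAN_V1, flaky_primary, working_fallback)
```

Changing the plan means building `PLAN_V2` and running that, which keeps every log tied to exactly one plan version.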
Pattern 5: Multi-agent orchestrated loops
When one loop is not enough, you compose them. The orchestrator pattern runs multiple specialized agents, each with their own loop, coordinated by a central scheduler.
The KDnuggets 2026 taxonomy identifies four production-ready configurations:
| Configuration | Structure | Use case |
|---|---|---|
| Sequential pipeline | Agent A output → Agent B input | Data processing, document generation |
| Parallel gather | N agents run concurrently, results merged | Research synthesis, code review |
| Manager-controller | Central state graph, agents as workers with checkpointing | Complex software projects |
| Reviewer-critic | Generator + independent critic in a feedback loop | Content quality, security audit |
The manager-controller pattern is the one that matters for long-running work. It adds state checkpointing: persistent snapshots of the full agent state at decision boundaries. A crash mid-run does not lose everything. You resume from the last checkpoint. This is how production systems survive runs that take hours or days.
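Checkpoint-and-resume is simple enough to sketch in Python. The JSON layout, the function names, and the simulated crash are all assumptions made for illustration; a real system would snapshot far more state than a step counter.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves half a snapshot."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def load_checkpoint(path: Path) -> dict:
    if path.exists():
        return json.loads(path.read_text())
    return {"next_step": 0, "results": []}


def run(path: Path, steps: list[str], crash_at: Optional[int] = None) -> dict:
    state = load_checkpoint(path)
    for i in range(state["next_step"], len(steps)):
        if i == crash_at:
            raise RuntimeError("simulated crash")
        state["results"].append(f"{steps[i]}:done")
        state["next_step"] = i + 1
        save_checkpoint(path, state)  # snapshot after each completed step
    return state


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = Path(tmp_dir) / "checkpoint.json"
    steps = ["plan", "implement", "verify"]
    try:
        run(checkpoint, steps, crash_at=2)  # dies before "verify"
    except RuntimeError:
        pass
    resumed = run(checkpoint, steps)        # picks up at "verify"
```

The second call does not redo "plan" or "implement". It reads `next_step` from disk and continues from there.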
The six loop engineering primitives
Cobus Greyling’s loop-engineering reference distills production loop systems into six composable building blocks.
Automations and scheduling handle discovery and triage on a cadence. A cron trigger fires, an agent loop picks up the work, processes it, and reports. Claude Code’s autonomous mode works this way: it polls for tasks, works through them, and surfaces results.
Worktrees enable safe parallel execution. Git worktrees isolate filesystem mutations so multiple agent iterations can run concurrently without stepping on each other. Without isolation, parallel loops produce merge conflicts the agent cannot resolve. This is not optional.
Skills are persistent project knowledge stored as markdown files. They load at loop start, get updated during execution, and persist across sessions. They are the loop’s long-term memory, externalized to the filesystem.
Plugins and connectors provide standardized tool access through MCP, the Model Context Protocol. The loop calls the MCP server instead of knowing how to call a database directly. Tool implementation and loop design stay decoupled.
Sub-agents enforce a maker-checker split. One agent proposes changes; a separate agent verifies them. The verifier has no stake in the proposal being correct, which prevents the self-assessment problem at the architectural level.
Memory and state form the durable spine outside any single conversation. There are three tiers: core memory, always loaded, holding roughly 1,300 tokens of critical facts; searchable memory, backed by SQLite with an FTS5 full-text index, for past decisions and outcomes; and vector memory for semantic retrieval by conceptual similarity. The loop decides when to retrieve, not the human. Relevant context surfaces automatically at decision points.
Memory architecture: what persists between iterations
Nobody talks enough about memory in loop systems. Every iteration starts with a prompt, and what goes into that prompt sets the ceiling on what the agent can do. Load the full conversation history and you hit context limits by iteration 20. Load nothing and the agent has amnesia.
Production systems split into three tiers:
| Tier | Storage | Contents | Loaded when |
|---|---|---|---|
| Core | Inline in system prompt | Goals, constraints, user preferences, current task | Every iteration |
| Episodic | SQLite + FTS5 | Past decisions, outcomes, error patterns | Keyword-matched on demand |
| Semantic | Vector store | Conceptual knowledge, documentation, API specs | Similarity search at decision points |
The 2026 AICL paper (Artificial Intelligence Control Loop) formalizes this idea as “stability budgets.” Each memory tier gets a token budget, and the loop allocates tokens based on what the current state needs. When the budget is tight, recent failures get priority over ancient successes.
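A budget like this can be approximated with a greedy selection. The Python sketch below is one reading of the idea, not the AICL paper's algorithm: it assumes one token per whitespace-separated word, ranks all failures ahead of successes, and breaks ties by recency. `MemoryEntry` and `select_within_budget` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEntry:
    text: str
    age_days: int
    was_failure: bool


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def select_within_budget(entries: list[MemoryEntry], budget: int) -> list[MemoryEntry]:
    """Greedy fill: failures first, then most recent first, skipping what does not fit."""
    ranked = sorted(entries, key=lambda e: (not e.was_failure, e.age_days))
    chosen: list[MemoryEntry] = []
    used = 0
    for entry in ranked:
        cost = estimate_tokens(entry.text)
        if used + cost <= budget:
            chosen.append(entry)
            used += cost
    return chosen


entries = [
    MemoryEntry("deploy succeeded on first try", age_days=90, was_failure=False),
    MemoryEntry("migration failed: missing index on orders", age_days=2, was_failure=True),
    MemoryEntry("lint passed", age_days=1, was_failure=False),
]
selected = [entry.text for entry in select_within_budget(entries, budget=8)]
```

With a budget of eight tokens, the recent failure and the newest success fit, and the ninety-day-old success is dropped.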
Hermes Agent ships a concrete implementation of this. Its core tier holds MEMORY.md and USER.md, roughly 1,300 tokens combined, always in context. The searchable tier indexes past conversations by keyword. The external tier, pluggable with Honcho or Mem0, handles long-term user modeling. At decision points the system nudges itself: “You made a similar architectural decision three weeks ago. Here is the context.”
The failure modes that kill production loops
Here are the failure modes I see in production, in rough order of frequency.
The first is premature exit, and it is everywhere. The LLM declares the task complete because the output subjectively looks good. This is the default failure mode for ReAct loops. Never let the LLM decide when to stop. Tests, file checks, and diff validators make that call.
Then there is infinite oscillation. The agent repeats the same tool call, or bounces between two tools, without moving forward. Call weather API, call it again, call it again. A dumb counter catches the simplest case: same tool fired more than N times in a row, force a reasoning step. It does not need to be clever.
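The counter really can be this dumb. A minimal Python sketch, with a hypothetical function name and an arbitrary default of three repeats:

```python
def should_force_reasoning(tool_calls: list[str], max_repeats: int = 3) -> bool:
    """True when the most recent tool has fired more than max_repeats times in a row."""
    if not tool_calls:
        return False
    last = tool_calls[-1]
    streak = 0
    for name in reversed(tool_calls):
        if name != last:
            break
        streak += 1
    return streak > max_repeats


stuck = should_force_reasoning(["search", "weather", "weather", "weather", "weather"])
healthy = should_force_reasoning(["weather", "search", "weather"])
```

This only catches a single tool repeating. An A-B-A-B bounce needs a check over a window of recent calls, which is the same idea with a slightly longer pattern.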
Context collapse creeps up on you. The conversation history grows, the model loses coherence around iteration 20 or so, and suddenly the agent is responding to prompts from an hour ago as if they just happened. Fresh context per iteration fixes this. The filesystem is memory.
Replay divergence is the subtle one. You replay a deterministic execution log through a different model version, and the LLM produces different outputs. Goodbye reproducibility. The SGH pattern handles this with immutable execution plans and versioned DAGs, but most teams do not discover they need this until they get burned.
Verification blindness happens because the verifier and proposer share a model. Same blind spots. Use a different model for verification, or at minimum a different temperature.
And then cost runaway. With no budget cap, a loop burns tokens until a platform limit kicks in or the task completes, whichever comes first. Usually the limit. The AgentBudget project on GitHub tracks real-time dollar spend per agent loop. If you are running loops in production, you need caps per loop and per task. Not optional.
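A per-loop cap does not need a library to get started. The Python sketch below is a hand-rolled illustration, not the AgentBudget API: `SpendCap`, `BudgetExceeded`, and the per-1k-token prices are all made up for the example.

```python
class BudgetExceeded(RuntimeError):
    pass


class SpendCap:
    """Tracks dollar spend for one loop and refuses calls that would exceed the cap."""

    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(
        self,
        input_tokens: int,
        output_tokens: int,
        usd_per_1k_input: float,
        usd_per_1k_output: float,
    ) -> float:
        cost = (
            (input_tokens / 1000) * usd_per_1k_input
            + (output_tokens / 1000) * usd_per_1k_output
        )
        if self.spent_usd + cost > self.limit_usd:
            raise BudgetExceeded(
                f"call would bring spend to {self.spent_usd + cost:.2f} USD, "
                f"cap is {self.limit_usd:.2f} USD"
            )
        self.spent_usd += cost
        return cost


cap = SpendCap(limit_usd=1.00)
completed_calls = 0
try:
    while True:  # a loop with no other exit, on purpose
        # Hypothetical prices: 0.01 USD per 1k input tokens, 0.03 per 1k output.
        cap.charge(10_000, 10_000, usd_per_1k_input=0.01, usd_per_1k_output=0.03)
        completed_calls += 1
except BudgetExceeded:
    pass
```

Each call costs about 0.40 USD at these prices, so the third call is refused before it runs. Checking before the call, not after, is the design choice that matters.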
The phased rollout: how to trust a loop
You do not deploy an autonomous loop on day one. The loop-engineering reference defines three levels of trust, and you move through them slowly.
Level 1 is report-only. The loop runs, observes state, and says what it would do. A human reviews the report and takes action. Zero risk. Over time, this builds real confidence in the loop’s judgment, not the fake kind you get from staring at a demo for five minutes.
Level 2 is assisted fixes. The loop proposes changes, a human approves each one. The loop does the work; the human is the gate. This is Claude Code’s default mode: agent suggests, you accept or reject.
Level 3 is unattended. The loop runs autonomously within boundaries you set: file paths, spending caps, time limits. It works until stopped. Claude Code’s autonomous mode and the Ralph Loop both operate here.
The progression from Level 1 to Level 3 should take weeks, not hours. Trust is earned per task, per environment, per loop. Rush this and you will ship something expensive.
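The three levels map naturally onto a single gate in front of every change. The Python sketch below is one hypothetical way to wire it; `TrustLevel`, `handle_change`, and the `src/` boundary are illustrative assumptions, not part of the loop-engineering reference.

```python
from enum import IntEnum
from typing import Callable


class TrustLevel(IntEnum):
    REPORT_ONLY = 1
    ASSISTED = 2
    UNATTENDED = 3


def handle_change(
    level: TrustLevel,
    change: str,
    apply: Callable[[str], None],
    human_approves: Callable[[str], bool],
    within_bounds: Callable[[str], bool],
) -> str:
    if level is TrustLevel.REPORT_ONLY:
        return "reported"            # say what would happen, touch nothing
    if level is TrustLevel.ASSISTED:
        if not human_approves(change):
            return "rejected"
        apply(change)
        return "applied"
    if not within_bounds(change):    # unattended: boundaries replace the human
        return "blocked"
    apply(change)
    return "applied"


applied: list[str] = []


def approve_everything(change: str) -> bool:
    return True


def in_src(change: str) -> bool:
    # Hypothetical boundary: only files under src/ may be touched.
    return change.startswith("src/")


outcomes = [
    handle_change(TrustLevel.REPORT_ONLY, "src/app.py", applied.append, approve_everything, in_src),
    handle_change(TrustLevel.ASSISTED, "src/app.py", applied.append, approve_everything, in_src),
    handle_change(TrustLevel.UNATTENDED, "deploy/prod.yaml", applied.append, approve_everything, in_src),
    handle_change(TrustLevel.UNATTENDED, "src/util.py", applied.append, approve_everything, in_src),
]
```

Moving a loop up a level is then a one-line config change, which is exactly why the decision to make it should be slow.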
When to use which loop pattern
| Your situation | Use this pattern | Why |
|---|---|---|
| Interactive chat, single-session tasks | ReAct | Lowest overhead, good enough for bounded work |
| Content creation, code review | Reflection | Self-critique works for tasks with internal quality signals |
| Overnight batch work, CI/CD agents | Ralph Loop | External verification prevents premature exit |
| Multi-step projects with complex dependencies | SGH (Structured Graph) | Explicit DAG prevents dependency chaos |
| Large projects with specialized sub-tasks | Multi-agent orchestrated | Parallelism + specialization |
| Production systems touching money or data | Manager-controller with checkpointing | Fault tolerance + audit trail |
Do not pick one pattern and apply it everywhere. Production systems compose them. A CI agent uses a Ralph Loop for autonomous work, spawns sub-agents for parallel checks, and escalates to a human when the verification gate fails three times consecutively.
What nobody tells you about agent loops
You still have to define what done means. Every loop system needs a human to set the verification criteria, the budget caps, and the completion conditions. The loop automates execution. It does not automate judgment.
The filesystem is better memory than the context window. Files do not drift. They do not bloat. They survive model switches and session restarts. Treat the LLM as a stateless function that reads from and writes to durable storage.
If your loop relies on the LLM to decide when to stop, it will stop early. Every time. The fix is not a better prompt. The fix is a test suite.
Start at Level 1. Run the loop in report-only mode for a week. Watch what it would do. When you trust its judgment on 95% of decisions, move to Level 2. When Level 2 goes two weeks without a human override, consider Level 3. Skip this and you ship a bug that costs more than the loop saves.
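Those thresholds are easy to encode so that promotion is a recorded decision and not a mood. A small Python sketch, assuming you already log agreement rates and overrides; the function and parameter names are hypothetical.

```python
def next_trust_level(
    level: int,
    days_at_level: int,
    agreement_rate: float,       # share of reported decisions a human agreed with
    days_without_override: int,
) -> int:
    """Return the level the loop qualifies for; never skips a level."""
    if level == 1 and days_at_level >= 7 and agreement_rate >= 0.95:
        return 2
    if level == 2 and days_without_override >= 14:
        return 3  # "consider" in the text: treat this as a prompt for human review
    return level


after_first_week = next_trust_level(1, days_at_level=7, agreement_rate=0.96, days_without_override=0)
```

The function only says what the loop qualifies for. A person still signs off on the move.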
The best agent loop knows when to hand off. Autonomy means reducing the human’s involvement to the decisions that actually require judgment. It does not mean removing the human.