The Unreasonable Effectiveness of Agentic Loops
If you wrote code with ChatGPT in 2023, you probably know the drill. Copy the answer. Paste it into your editor. Run the build. Watch it explode. Copy the red text. Paste it back. Wait. Repeat.
After a few rounds it starts to feel dehumanising. You are not really programming any more. You are ferrying messages between a compiler and a chatbot. A USB cable with anxiety.
That is the whole trick behind coding agents. They cut out the courier. The agent reads the files, runs the command, sees the failure, edits the code, runs it again, and only then comes back to you. Same class of model. Better loop.
Control theory would shrug at this. Better feedback loops produce better behaviour. The weird part is what sits inside the loop: a next-token machine from the internet that, once you hand it grep, a shell, and a verifier, starts to look alarmingly useful.
One Prompt, Two Worlds
Same prompt. Two very different situations. On the left, the model can only talk and wait for a human to ferry reality back into the conversation. On the right, the agent can touch the environment directly.
Same intent. Two different loops. Press play and watch who finishes first.
A boring missing-prop bug is enough. A one-shot model can suggest the patch. An agent can run the test, see undefined, pass the missing prop, rerun, and stop on green. No magic. Just evidence.
The left-hand lane leaks information. Every time the human carries errors back into the chat, something gets delayed, compressed, or dropped. The right-hand lane stays wired into the environment. It can see what actually happened.
Context Decay
The chatbot's context has gaps where the human compressed or lost information. The agent's context stays continuous. Hover over the grey gaps to see what was lost.
That is the first big idea in this post: once the loop can see reality, the whole thing changes.
What an Agentic Loop Actually Is
By this point you have probably seen the pattern already. Codex, Claude Code, OpenCode, editor agent modes — same family. Give a model some tools, feed the results back in, let it keep going until the job looks done.
At heart, an agentic loop is simple. Observe. Decide. Act. Inspect. Update. Repeat.
Reasoning helps, sure. Sometimes a lot. But it is not the main event here. A brilliant model with no way to check itself still gropes around in the dark. A decent model with tools and feedback can punch far above its weight.
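The loop itself fits in a few lines. Here is a minimal sketch in Python, echoing the missing-prop bug from earlier; `propose` and `verify` are stand-ins for the model and the test runner, not any real product's API.

```python
def agentic_loop(state, propose, verify, max_steps=10):
    """Observe, decide, act, inspect, update. Repeat until green."""
    for step in range(1, max_steps + 1):
        ok, feedback = verify(state)       # inspect reality
        if ok:
            return state, step             # stop on green
        state = propose(state, feedback)   # act on the evidence
    return state, max_steps

# Toy task: a component is missing a prop, and a test catches it.
def verify(code):
    if "title=" not in code:
        return False, "TypeError: title is undefined"
    return True, "tests passed"

def propose(code, feedback):
    # Stand-in for the model: patch based on the error message.
    if "title is undefined" in feedback:
        return code.replace("<Chart", '<Chart title="Revenue"')
    return code

fixed, steps = agentic_loop("<Chart data={rows} />", propose, verify)
```

The loop converges in two steps: one failure that produces evidence, one patch that survives the check.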
Observe, Act, Inspect, Repeat
Switch pieces of the loop on and off. Watch how the trace changes.
Classic software tries to pin down inputs, outputs, and edge cases up front. Agentic loops loosen that up. The system can react to what it sees. The constraints move elsewhere: the user’s intent, the tools on hand, and the quality of feedback.
The Harness Chooses the Game
People talk about agent comparisons as if they are model comparisons. Half the time they are harness comparisons.
The harness is all the boring but important stuff around the model: the prompt, the tool contract, the permissions, the cached context, the verifier, the approval flow, the stopping rule. Same model, different harness, different creature.
Give the model nothing but a chat box and the human becomes the harness. The human ferries errors, judges risk, and decides when the job is done. Give the model a repo, a shell, a verifier, and one clarifying question when it gets stuck, and the search space changes completely.
Same Model, Different Harness
Switch only the harness. The model stays the same.
This is why two agent products using the same underlying model can feel like different species. The harness is the multiplier. And since labs can post-train most effectively against their own harnesses, I believe you are usually better off using those harnesses than building your own.
Why Programmers Saw It First
Programmers saw this first because software is the perfect terrarium for agents.
Code is already text. The tools are already composable. The actions are often reversible. The environment is full of crisp feedback. Compilers complain. Tests fail loudly. Linters nag. Git shows diffs. Screenshots tell you what the UI actually did. Logs tell you what the server thought it was doing. More importantly, the whole environment is already wired up for arbitrary composition: shells, CLIs, files, pipes, and HTTP APIs. Coding agents inherited decades of automation substrate for free.
And then there is the accumulated nonsense and wisdom of the last few decades. The greybeards left us a warehouse full of shell utilities, one-liners, Stack Overflow answers, makefiles, scripts, man pages, and tiny Unix incantations that do one thing surprisingly well. Agents walked into the best-stocked workshop on earth.
Programming also has a deeper advantage. It is already the business of turning fuzzy intent into precise procedure. Software has been replacing human procedure with machine procedure for decades. Agentic coding tools are simply the latest inhabitants of that world.
Why Coding Went First
Builds, tests, screenshots, and diffs make success legible.
Read file. Edit file. Run test. Search code. Commit diff. The shell is full of verbs.
Files, directories, logs, and APIs are easier to inspect than most real-world systems.
A bad patch can be reverted. A bad surgery cannot.
Verifiers Beat Vibes
The feedback loop is embarrassingly simple. Do something. Measure what happened. Update your plan. Do the next thing. Until that loop closes, the model lacks a reliable way to know whether the action actually worked.
This is why agentic systems feel qualitatively different from a pure chatbot. The model no longer has to imagine what reality might have said. It can ask reality directly.
In software, the verifiers are everywhere: compiler output, failing tests, HTTP status codes, rendered screenshots, logs, type checkers, file diffs, user confirmation. The model proposes actions. Verification provides selection pressure. One gives candidate moves. The other tells the system which moves deserve to survive.
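In practice, "ask reality directly" often just means running a command and reading the exit code. A minimal verifier along those lines, using the Python interpreter itself as the tool under test:

```python
import subprocess
import sys

def verify(cmd):
    """Run a real check and return (passed, evidence), not a guess."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    passed = result.returncode == 0           # the exit code is the selection pressure
    evidence = result.stdout + result.stderr  # raw output goes back into context
    return passed, evidence

# One check that fails, one that passes.
failing = verify([sys.executable, "-c", "assert 1 + 1 == 3"])
passing = verify([sys.executable, "-c", "print('ok')"])
# failing[0] is False and failing[1] carries the AssertionError traceback
```

The point is the return type: a boolean plus raw evidence. The model proposes; this function selects.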
Even this article is a tiny example. An earlier draft of the demos had clipped SVG labels and a noisy random chart. The agent rewrote the layout, made the chart deterministic, syntax-checked the script, and only then stopped. Small bug, same principle. Because there’s no way I’m writing D3.js by hand.
Error Cascade
Watch the error count drop to zero as the loop iterates. Pick a scenario to see how feedback turns failures into data.
This is also why a smaller model inside a good harness can outperform a fancier model trapped in pure text. Feedback turns guessing into search.
Reward Hacking, a.k.a. “I’m Done”
If you squint at how most models are trained and tuned, the incentive usually points towards responses that humans rate highly. Metaphysical truth barely enters the picture. Most of the time those line up. Sometimes they drift apart.
Strictly speaking, this differs from reward hacking during training. What you more often see at run time is instrumental bluffing. The loop discovers that certain outputs are locally rewarded: optimistic language, stopping early, editing the test instead of the bug, or declaring success before checking.
If saying “Absolutely, fixed” is cheaper than actually checking, the system will drift towards that move. No malice required. It is following the cheaper local objective.
Loop design fixes this. You build the harness so that the easiest way to please the user is to actually do the work. Make verifiers cheap. Make bluffing expensive. Ask for evidence of completion. Close the loop.
A Toy Model of Bluffing
This is not a faithful model of training. It is a deterministic sketch of the runtime incentive landscape.
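One way to make that sketch concrete is a single payoff function. The numbers below are made up; the shape of the argument is what matters.

```python
def payoff(strategy, p_caught, verify_cost, reward=1.0, penalty=2.0):
    """Local payoff for 'bluff' versus 'verify' under a given harness."""
    if strategy == "verify":
        return reward - verify_cost               # always pays the checking cost
    # Bluffing keeps the reward unless the harness catches it.
    return reward * (1 - p_caught) - penalty * p_caught

# Weak harness: pricey checks, bluffs rarely caught.
weak = {"p_caught": 0.1, "verify_cost": 0.5}
# Strong harness: cheap verifiers, bluffs usually caught.
strong = {"p_caught": 0.9, "verify_cost": 0.1}
```

Under the weak harness, bluffing scores 0.7 against 0.5 for honest checking. Under the strong one, bluffing scores -1.7 against 0.9. Same model, different incentive landscape.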
Asking Is Part of the Loop
One of the stranger hangovers from the chatbot era is the idea that asking a clarifying question is somehow a failure. For agents, it is often the opposite. A good question is an information-gathering action.
If the instruction is ambiguous, acting immediately can be more expensive than asking once and then acting with narrower uncertainty. Good agents do more than execute. They reduce ambiguity before they execute.
Guessing Versus Asking
Turn ambiguity up, then compare a blind guess with one clarifying question.
Make the chart bigger and move it higher.
Asking is part of the loop.
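The trade can be put in expected-cost terms. The numbers here are illustrative, not measured:

```python
def expected_costs(n_readings, rework_cost, question_cost, act_cost=1.0):
    """Expected cost of a blind guess versus one clarifying question."""
    p_wrong = 1 - 1 / n_readings              # uniform over plausible readings
    guess = act_cost + p_wrong * rework_cost  # act now, maybe redo later
    ask = question_cost + act_cost            # one question, then act once
    return guess, ask

# "Make the chart bigger": three plausible readings, rework is expensive.
guess, ask = expected_costs(n_readings=3, rework_cost=5.0, question_cost=0.5)
```

With three readings and pricey rework, guessing costs about 4.3 units of work in expectation while asking costs 1.5. The question pays for itself as soon as ambiguity and rework costs are non-trivial.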
Expose Primitives, Skip the Toy Tools
If you want agents to be useful, think beyond a handful of blessed wrapper functions. Coding agents draw power from a messy, open-ended substrate: files, shells, pipes, logs, HTTP APIs, databases, browsers, and stdout.
Human-defined tools are fine. Trouble starts when the tool is so narrow that the agent can only follow the exact happy path the developer imagined. A brittle wrapper can save a click. A composable primitive can open up a search space.
This is where product design bends. Safe wrappers, explicit verbs, and audit trails are still useful, especially for dangerous actions. Coding agents point towards a simpler lesson: expose enough of the system that the agent can actually explore it.
Rigid Wrappers Versus Composable Primitives
resize_chart(chart_id, size: "large")
move_chart(chart_id, position: "higher")
email_growth_summary(report_id)
Useful right up until the task deviates from the exact workflow you anticipated.
curl /api/reports/42
jq '.rows[]'
psql analytics
python transform.py
playwright screenshot
git diff
npm test
Ugly, generic, and dramatically more powerful, because the agent can compose its own path.
Raw shell access to the public internet would be absurd. Granularity is the issue. Give the loop safe primitives with room to compose instead of glossy wrappers around one pre-approved flow. The best agent interface is often the one automation engineers already loved before LLMs arrived.
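The contrast shows up in the tool contract itself. Instead of one function per blessed workflow, a composable primitive is a single generic verb; the sketch below is illustrative, and a real harness would wrap it in permissions and sandboxing.

```python
import shlex
import subprocess

def run(command, timeout=30):
    """One generic verb: execute a command, return (exit_code, output)."""
    result = subprocess.run(
        shlex.split(command), capture_output=True, text=True, timeout=timeout
    )
    return result.returncode, result.stdout + result.stderr

# The agent composes its own path instead of following one scripted flow:
code, out = run("echo reports fetched")
```

Every wrapper like `resize_chart` is one point in the space this verb covers. That is the granularity argument in miniature.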
Building for the Future
The more interesting shift is practical. Sometimes it is cheaper to spend tokens than to spend human patience. Classic software 1.0 and 2.0 keeps a person trapped in the middle of the workflow: click the form, load the page, rerun the test, copy the error, attach the screenshot, write the comment. An agent can often eat that whole sequence.
That changes what a good product looks like. Build the system so the agent can touch the state directly, do the work, and bring back evidence. Spend the tokens on exploration. Spend the human attention on judgement.
Take a backlog full of Jira tickets. The old instinct is more triage, more templates, more checklists. The agentic version starts from the ticket itself. Pull the repro steps. Boot the app. Trigger the bug. Capture the broken path. Patch the code. Rerun the tests. Open the UI again. Record the fixed path. Hand back the diff and a before-and-after screen recording.
That is a much better division of labour. The human no longer burns half an hour reproducing the issue and another ten minutes proving the fix still works. The human watches the evidence and decides whether to merge.
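That hand-off can be sketched as a pipeline where every step either produces an artifact or stops the run. The step names and artifacts below are stubs, standing in for real actions like booting the app and recording the screen.

```python
def run_ticket(ticket, steps):
    """Run each step, collect evidence, stop at the first failure."""
    evidence = []
    for name, step in steps:
        ok, artifact = step(ticket)
        evidence.append((name, ok, artifact))
        if not ok:
            break
    return evidence  # the human reviews this, not the raw bug

# Stubs standing in for real actions (reproduce, patch, retest, record):
steps = [
    ("reproduce", lambda t: (True, "before.mp4")),
    ("patch",     lambda t: (True, "fix.diff")),
    ("retest",    lambda t: (True, "142 passed")),
    ("record",    lambda t: (True, "after.mp4")),
]
evidence = run_ticket({"id": "BUG-123"}, steps)
```

The output is a trail of named artifacts, which is exactly what the reviewing human wants to see before merging.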
Limits
This is also where the boundaries show up.
Wrong Metric
If the verifier is the wrong metric, the loop hill-climbs the wrong hill. Suppose the only signal is “UI test passes”. An agent can make CI green by regenerating a snapshot or loosening the assertion while the export flow remains broken. The system optimised exactly what you let it see.
Hidden State
If the important state is hidden, stale, or spread across systems the agent cannot inspect, the loop becomes blind. It may still act confidently, but its confidence is now unmoored from the world.
Slow Verifiers
This is why code often feels easier than design, copy, architecture, or strategy. In those domains the verifier is frequently human taste, long-range business effect, or team consensus. Those signals are slower, noisier, and more political than a compiler error.
Expensive Actions
When actions are irreversible or high-stakes, the same loop becomes much more dangerous. A bad patch can be reverted. A bad database migration, financial trade, or medical action is a different story. This is where permissions, approvals, and narrow operating envelopes matter.
So ask a better question: what can the loop see, what can it change, and how does it know whether it succeeded?
Epilogue
At the end of the day, yes, these systems are still predicting the next token. But that framing hides the interesting unit of analysis. The useful thing is no longer a single completion. It is the whole loop inside its harness.
An agentic loop predicts, acts, checks reality, and then tries again. Models answering questions is old news. The strange part is that, given enough time, tools, and a well-defined problem space, next-token prediction can participate in a search process that lands on working results.
The text comes from the model. The useful behaviour comes from the loop it sits inside.
The same habit helps outside software too. Start with the thing you actually want finished. Then look at the tools, the feedback, and the pointless relay work sitting in the middle.