Your AI Agent Doesn't Know When to Stop

AI agents now work in loops: generate an answer, check it, revise, repeat. Most stop the moment a check passes. New research shows that rule is statistically broken, and builds a wrapper that actually knows when an agent should ship.

Published May 20, 2026 · 9 min read

Based on Cho & Sun, 2026 — arXiv:2605.12947.

The setup

AI is moving from one-shot answers to workflows. An agent does not just respond once. It produces a draft, runs a check on it, reads the feedback, revises, and checks again. Coding agents, research agents, and tool-using agents all share this shape.

Every loop creates a decision that rarely gets discussed: when does the agent stop and ship the current answer? Each extra round costs time, tokens, and money. Stop too early and you ship something wrong. Stop too late and you burn budget on a task that was never going to work.

Two researchers at Purdue took this stopping decision seriously and treated it as a statistics problem. Their finding is uncomfortable for a lot of agent designs shipping today.

“The deployment problem is a stopping problem: the system must decide when the evidence accumulated along an adaptive trajectory is strong enough to justify releasing the current output.”
Cho & Sun, 2026

Why “stop when it passes” is broken

The obvious stopping rule is the one most teams use: keep looping, and release as soon as a check comes back clean. It feels safe. It is not.

The check is imperfect. It might be a unit test on a sample of cases, a model grading another model, or an execution trace. Sometimes a wrong answer passes anyway. Now run that imperfect check over and over. A wrong answer gets repeated chances to score well by luck. Wait long enough and it will.

Statisticians have a name for this. If you keep re-running a test and stop the moment you like the result, you will eventually get a result you like, even from pure noise.

“This is the workflow analogue of p-hacking. A rule that simply waits for a high score can be misled by repeated monitoring.”
Cho & Sun, 2026

The paper proves this formally: a rule that releases on the first clean check is, given enough rounds, almost guaranteed to release on a task it should have refused.

The cost of checking again

After 10 rounds of checking, a task this agent can never actually solve still has a 55% chance of producing at least one convincing “pass” to stop on. Every extra look is another roll of the dice. This is why “stop when a check passes” fails: the rule rewards persistence, not correctness.
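A rough way to see where a number like this comes from is to assume each check independently passes a wrong answer with the same small probability. That independence assumption and the 0.077 rate below are illustrative choices made here, not figures from the paper. The rate was picked only because it lands near 55% at 10 rounds.

```python
def prob_misleading_pass(false_pass_rate: float, rounds: int) -> float:
    """Chance that at least one of `rounds` independent checks passes a wrong answer."""
    if not 0.0 <= false_pass_rate <= 1.0:
        raise ValueError("false_pass_rate must be between 0 and 1")
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    # Probability that every check correctly fails, subtracted from 1.
    return 1.0 - (1.0 - false_pass_rate) ** rounds


# Hypothetical per-round false pass rate, chosen for illustration only.
RATE = 0.077

for rounds in (1, 5, 10, 20, 50):
    print(f"{rounds:>2} rounds: {prob_misleading_pass(RATE, rounds):.0%}")
```

Under these assumptions the risk only moves in one direction: each added round multiplies the chance of a clean record by a number below one.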

Two reasons an answer is still wrong

To fix the stopping rule, the paper draws a line that most agent systems never make explicit. When a candidate answer is wrong, it is wrong for one of two very different reasons.

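Feasible: the task is within reach of this generator and verifier. The current answer is wrong, but further rounds can still reach a correct one. The right response is patience.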

Infeasible: the task is beyond what this generator and verifier can do. It will never produce a reliably correct answer. But it can still rack up high-looking scores by overfitting the visible checks. Every release here is an error. The whole job of a stopping rule is to hold back on these.

The catch: from the score alone, the agent cannot tell which world it is in. A wrong answer on a feasible task and a wrong answer on an infeasible task can look identical.

A good stopping rule should be patient on feasible tasks and refuse on infeasible ones. The raw score alone cannot tell it which is which.

The fix: stop trusting the score, start counting evidence

The paper’s answer is a wrapper. It does not retrain the agent or the checker. It does not need to understand how they work. It sits on top of an existing pipeline as a decision layer and changes one thing: what counts as enough evidence to stop.

A pool of convincing failures. Offline, before anything ships, the team collects answers that were wrong but still scored well. These hard negatives become the benchmark for a high score that means nothing. You cannot judge a score without knowing how impressive a known liar can look.

The reference pool is the clever part. A raw score of 0.9 means nothing on its own. Is 0.9 impressive, or do wrong answers hit 0.9 all the time on this kind of task? You cannot know without a benchmark of convincing failures: answers that were confirmed wrong but still scored high.

Calibrate against that pool and a score finally becomes meaningful. A 0.9 that still beats every known liar is real evidence. A 0.9 that known liars also reach is worth almost nothing. Same number, opposite conclusion.
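One common way to make this comparison concrete is a rank-based p-value: the share of known-wrong answers that score at least as high as the candidate. The sketch below uses that standard construction as a stand-in. The article does not spell out the paper's exact calibration, and the function name and both pools are hypothetical.

```python
from typing import Sequence


def calibrated_p_value(score: float, hard_negative_scores: Sequence[float]) -> float:
    """Share of known-wrong answers scoring at least as high as `score`.

    The +1 in numerator and denominator counts the candidate itself,
    so the result is never exactly zero.
    """
    if not hard_negative_scores:
        raise ValueError("the hard-negative pool must not be empty")
    at_least_as_high = sum(1 for s in hard_negative_scores if s >= score)
    return (1 + at_least_as_high) / (1 + len(hard_negative_scores))


# Hypothetical pools of confirmed-wrong answers and their visible scores.
strict_pool = [0.41, 0.55, 0.62, 0.70, 0.74, 0.78, 0.81, 0.83, 0.85]
lenient_pool = [0.70, 0.85, 0.90, 0.92, 0.93, 0.95, 0.96, 0.97, 0.97]

print(calibrated_p_value(0.9, strict_pool))   # no liar reaches 0.9: small p-value
print(calibrated_p_value(0.9, lenient_pool))  # most liars reach 0.9: large p-value
```

A small value means the score is hard for a known liar to match. A large value means the same 0.9 is routine for wrong answers on this kind of task.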

“The right question is not whether a candidate has a large raw score at a single step, but whether the evidence accumulated along the trajectory is strong enough to support release.”
Cho & Sun, 2026

Does it actually work?

The researchers tested this on a coding benchmark. An agent gets ten attempts per task. It sees a small set of visible tests as its checker. Real correctness is judged later by a larger hidden test suite it never sees. The numbers below compare the calibrated wrapper against the heuristics teams normally reach for.

Hopeless tasks the calibrated wrapper wrongly ships (α = 0.10): 0%

Hopeless tasks the confidence heuristic wrongly ships: 77%

Hopeless tasks the score-stability heuristic wrongly ships: 23%

Solvable tasks the wrapper still ships, on a correct answer: 77%

Zero is the number that matters. On tasks the agent could never solve, the wrapper released nothing. The confidence heuristic shipped more than three quarters of them. And the wrapper did not buy that safety by refusing everything: it still shipped 77% of the genuinely solvable tasks, every one of them on a correct answer.

Watch it decide

Averages hide the mechanism. Here are two real task runs from the study, step by step. One is a trap. One is a genuine solution that arrives slowly.

Figure: evidence accumulated at each step, shown against the release bar. Readout for the displayed step of the trap run: visible score 0.967, evidence 1.19.

The wrapper never releases on this task. Evidence tops out below the bar.

Every answer scores 0.967 on the visible checks. Every answer is actually wrong. Confidence-based and stability-based rules both release early, in error. The wrapper sees a score that is not extreme against known liars, so evidence creeps up but never reaches the bar. It abstains. Correctly.

The contrast is the whole idea. A persistently high score is not enough to stop, because liars score high too. A correct answer that shows up again and again is enough, because that pattern is hard to fake. The wrapper waits for the second thing.
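Accumulating evidence can be sketched with a standard sequential-testing device. Each calibrated p-value is turned into a betting score, the scores are multiplied, and the agent releases only when the running product clears a bar of 1/α. This is one textbook construction, offered here as an assumption about the general shape of such a rule, not as the paper's actual procedure. The calibrator, the `kappa` parameter, the function names, and the example p-values are all hypothetical.

```python
from typing import Iterable, Optional


def p_to_e(p: float, kappa: float = 0.5) -> float:
    """Turn a p-value into a betting score with the calibrator kappa * p**(kappa - 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must be in (0, 1)")
    return kappa * p ** (kappa - 1.0)


def first_release_step(p_values: Iterable[float], alpha: float = 0.10) -> Optional[int]:
    """Return the 1-based step where evidence reaches 1/alpha, or None to abstain."""
    bar = 1.0 / alpha
    evidence = 1.0
    for step, p in enumerate(p_values, start=1):
        evidence *= p_to_e(p)  # unremarkable scores shrink evidence, extreme ones grow it
        if evidence >= bar:
            return step
    return None


# Hypothetical trajectories of calibrated p-values over ten rounds.
trap_run = [0.8] * 10      # high raw score, but known liars match it easily
genuine_run = [0.01] * 10  # score that known liars almost never reach

print(first_release_step(trap_run))
print(first_release_step(genuine_run))
```

In this sketch a single strong check is not enough to clear the bar at α = 0.10, and a long run of unremarkable checks never gets there however many times it repeats.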

The shortcuts people actually ship

Most agent systems in production today use one of the first three rules below. The study put each one head to head with the wrapper.

Heuristic: Stop when it sounds confident

Release once the model's confidence crosses a threshold. In the study it shipped 77% of hopeless tasks. Confidence is not correctness, and a stuck-wrong answer can be stated with total conviction.

Heuristic: Stop when the score settles

Release once the score stops moving and sits high. In the study it shipped 23% of hopeless tasks. A score that is stuck high is very often just a wrong answer the verifier keeps rubber-stamping.

Half measure: Stop on one calibrated check

Use the calibrated score, but decide from a single step. Tuned loose it fires constantly, including on bad tasks. Tuned strict it can never fire at all. One check has no resolution.

The fix: Stop when evidence adds up

Calibrate every check against known failures, then accumulate. Zero releases on hopeless tasks, while still shipping 77% of solvable ones. Survives unlimited re-checking by design.
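One plausible reading of why a single calibrated check "has no resolution" is a granularity limit. This is an interpretation offered here, not a claim taken from the paper. If the calibrated score is a rank-based p-value of the form (1 + k) / (n + 1) against a pool of n known failures, it can never fall below 1 / (n + 1). A one-step rule with a stricter threshold than that floor cannot fire. The function names and pool sizes below are hypothetical.

```python
def smallest_p_value(pool_size: int) -> float:
    """Lowest value a rank-based p-value can take against `pool_size` known failures."""
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return 1.0 / (pool_size + 1)


def single_check_can_fire(pool_size: int, alpha: float) -> bool:
    """True if a one-step rule 'release when p <= alpha' is able to trigger at all."""
    return smallest_p_value(pool_size) <= alpha


# Hypothetical pool sizes and thresholds.
for pool_size, alpha in [(9, 0.10), (9, 0.01), (99, 0.01)]:
    verdict = "can fire" if single_check_can_fire(pool_size, alpha) else "can never fire"
    print(f"pool of {pool_size}, alpha {alpha}: {verdict}")
```

Accumulating across steps sidesteps this floor, because several moderately strong checks can add up to what no single check is able to show.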

What this means for your business

The benchmark is about code, but the structure is everywhere. Any agent that loops generate, check, and revise faces this exact decision, whether it is drafting contracts, resolving tickets, researching, or writing software. If the stopping rule is “ship when a check passes,” the risk described here is already in your product.

If you are building or buying agentic systems, three concrete moves:

Treat a passing check as one data point, not a verdict. The more times an agent is allowed to re-check, the less a single pass should count. Repetition should raise the bar, not lower it.
Calibrate against your own failures. Keep the answers that looked good but turned out wrong. They are the only honest benchmark for what a meaningless high score looks like on your tasks.
Give your agent permission to abstain. Some tasks are beyond the current system. A stopping rule that can say “not good enough, do not ship” is worth more than one that always produces an answer.

The shift is small but it changes the question. Not “did a check pass,” but “has enough real evidence built up.” One is easy to fake. The other is what actually makes an agent safe to ship.
