AI Passed the CAPTCHA, but 0.88 AUC Caught the Process

Roundtable and an arXiv paper argue that AI agents can match CAPTCHA answers while still revealing themselves through click order and behavior.

AI summary
  • What happened: Roundtable published a research note on May 28, 2026 comparing how AI agents and humans solve CAPTCHA-like tasks.
    • The linked arXiv paper reports an average AUC of 0.88 for a process-feature classifier on CogCAPTCHA30.
  • Why it matters: Click order, direction changes, and overselection can be stronger bot-defense signals than final answer accuracy.
  • Context: The study says frontier agents such as Claude, GPT, and Gemini were not always closer to human process traces than smaller models.
  • Watch: Public behavior signals can become training targets, and the paper names limits around cross-task generalization and sample scope.

Roundtable Research published "CAPTCHAs can still detect AI agents" on May 28, 2026. The post points to the primary paper, "Process Matters more than Output for Distinguishing Humans from Machines," submitted to arXiv on May 7 and revised on May 9. Milena Rmus, Mathew D. Hardy, Thomas L. Griffiths, and Mayank Agrawal argue that a CAPTCHA answer is no longer the most useful signal. The stronger signal may be the order, hesitation, correction, and error pattern that appear while the task is being solved.

That changes the question from "Can an AI identify traffic lights?" to "Can a browser agent leave a human-looking interaction trace while creating accounts, buying tickets, filling surveys, or submitting payments?" Roundtable separates the final answer as output from the click sequence and related behavior as process. In a 30-task battery called CogCAPTCHA30, the paper reports an average process-feature classifier AUC of 0.88.

[Figure: Roundtable CAPTCHA results chart]

Getting the answer is not the same as solving like a human

Roundtable starts from a claim most security teams already know: modern vision-language models can identify chimneys, hydrants, and traffic lights. Image classification CAPTCHAs stopped being cleanly "AI-hard" after deep learning made object recognition cheap and broadly available. The paper also cites earlier work showing that current AI systems can solve several CAPTCHA variants at near-human accuracy.

The authors add a distinction between output equivalence and process equivalence. Two systems can choose the same tiles without using the same cognitive or interaction process. In Classic CAPTCHA and Cross-Tile CAPTCHA tasks, the paper says human and frontier-agent performance scores were not statistically distinguishable. Process-level features such as click direction, side bias, and row or column exploration still separated humans from agents.

CAPTCHA is the entry point, not the whole experiment. The researchers built CogCAPTCHA30 by combining one CAPTCHA task with 29 cognitive tasks covering decision-making, memory, perception, planning, and reasoning. Each task records not only the final score but also the path to that score: search pattern, sensitivity to outcomes, trial-by-trial adaptation, repeated choices, and responses after mistakes.

  • 30: CogCAPTCHA30 tasks
  • 0.88: average process-feature AUC
  • 150: frontier-agent full-battery runs

The target was a browser-using agent

The experimental setup matters because it resembles the way current agents operate on the web. Human data came from 100 Prolific participants, with 97 retained after excluding runs with platform issues. On the AI side, the study compared three frontier vision-language agents: GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5. Each model completed 50 full-battery runs, creating 150 agent runs.

Agents used the same browser-based interface as humans. On each trial, an agent received a screenshot and returned a structured JSON action such as target coordinates, a tile index, or a categorical choice. A Playwright execution layer translated those actions into DOM clicks, key presses, and input events. This is different from a text-only benchmark. It is closer to a product agent looking at a page, deciding what to do, and sending browser events.
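The action loop described above can be sketched as follows. This is a minimal illustration, not the paper's harness: the JSON field names (`kind`, `x`, `y`, `tile`, `choice`), the 3-column grid, and the 100-pixel tile size are all assumptions made for the example. The function returns a description of the browser calls instead of driving a real browser; in a Playwright-based layer, the two call names would correspond to `page.mouse.click(x, y)` and `page.keyboard.press(key)`.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentAction:
    """One structured action returned by the agent (hypothetical schema)."""
    kind: str                      # "click_xy", "click_tile", or "choose"
    x: Optional[float] = None
    y: Optional[float] = None
    tile: Optional[int] = None
    choice: Optional[str] = None


def parse_action(raw: str) -> AgentAction:
    """Validate the model's JSON reply and turn it into a typed action."""
    data = json.loads(raw)
    kind = data.get("kind")
    if kind == "click_xy":
        return AgentAction(kind, x=float(data["x"]), y=float(data["y"]))
    if kind == "click_tile":
        return AgentAction(kind, tile=int(data["tile"]))
    if kind == "choose":
        return AgentAction(kind, choice=str(data["choice"]))
    raise ValueError(f"unsupported action kind: {kind!r}")


def tile_center(tile: int, cols: int, tile_px: float) -> tuple:
    """Pixel center of a tile in a row-major grid whose origin is (0, 0)."""
    row, col = divmod(tile, cols)
    return ((col + 0.5) * tile_px, (row + 0.5) * tile_px)


def to_browser_calls(action: AgentAction, cols: int = 3, tile_px: float = 100.0) -> list:
    """Translate an action into (call_name, *args) tuples for an execution layer."""
    if action.kind == "click_xy":
        return [("mouse.click", action.x, action.y)]
    if action.kind == "click_tile":
        x, y = tile_center(action.tile, cols, tile_px)
        return [("mouse.click", x, y)]
    # Categorical choices are modeled here as a single key press.
    return [("keyboard.press", action.choice)]
```

Note what this sketch makes visible: a tile click always lands on the exact tile center, with no pointer path leading up to it. That kind of regularity is one example of what a process-level detector could pick up.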

That is the developer-facing reason this paper is worth reading beyond CAPTCHA. In 2026, browser agents can fill forms, operate checkout flows, use support portals, and touch internal admin tools. Once screenshots, JSON actions, and Playwright events are connected, a user-agent string or IP reputation score is not enough to separate a human from an automated actor inside the same Chrome session. Roundtable's question is whether click order and behavior after errors provide additional evidence.

CogCAPTCHA30 measures the trace, not just the score

The paper's abstract says process-level features were stronger discriminators than task performance alone. Task performance means output measures such as accuracy or earned points. Process features include choice order, repeated actions, response after an error, exploration style, and adaptation across trials. The claim is not that agents cannot reach the same score. It is that they often leave different traces on the way there.
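For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen agent session receives a higher "looks automated" score than a randomly chosen human session. The sketch below computes it by direct pairwise comparison. The scores in the usage note are made up for illustration and are not data from the paper.

```python
def auc(agent_scores, human_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (agent, human) pairs in which the agent session
    scores higher; ties count as half a win. 0.5 is chance, 1.0 is perfect.
    """
    if not agent_scores or not human_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for a in agent_scores:
        for h in human_scores:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(agent_scores) * len(human_scores))
```

With invented scores `auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])`, eight of the nine pairs are ordered correctly, so the result is 8/9. An AUC of 0.88 means roughly that share of agent-human pairs is ranked correctly by the classifier.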

In the CAPTCHA comparison shown in Figure 1, the paper discusses sequential click patterns, direction changes, and overselection. A person scanning a grid may revisit uncertain tiles, hesitate around a partial object boundary, or move across rows and columns in an uneven order. An agent receives a screenshot and prompt, calculates coordinates, and sends clicks through Playwright. The final selected tiles can match while the interaction trace remains different.
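A minimal sketch of how such trace features might be computed from a click log is shown below. The feature definitions are assumptions chosen for illustration, not the paper's exact definitions: tiles are numbered row by row, a "direction change" is any change in the sign of the row or column step between consecutive moves, and re-clicking a tile is counted as a revisit instead of a deselection.

```python
def _sign(v: int) -> int:
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def trace_features(clicks, targets, cols):
    """Summarize one grid-CAPTCHA click sequence.

    clicks  -- tile indices in the order they were clicked (repeats allowed)
    targets -- set of tile indices that actually contain the object
    cols    -- number of columns in the row-major grid
    """
    # Direction of each move as (row step sign, column step sign).
    steps = []
    for a, b in zip(clicks, clicks[1:]):
        (ra, ca), (rb, cb) = divmod(a, cols), divmod(b, cols)
        steps.append((_sign(rb - ra), _sign(cb - ca)))

    selected = set(clicks)
    return {
        # How often the movement direction differs from the previous move.
        "direction_changes": sum(1 for s, t in zip(steps, steps[1:]) if s != t),
        # Tiles clicked that do not contain the target object.
        "overselection": len(selected - set(targets)),
        # Clicks that returned to an already-clicked tile.
        "revisits": len(clicks) - len(selected),
    }
```

For example, `trace_features([0, 1, 2, 5, 4, 1], {0, 1, 2, 4}, cols=3)` yields three direction changes, one overselected tile (tile 5), and one revisit (tile 1). Two sessions can end with the same selected tiles and still differ on all three numbers.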

Roundtable calls this a "Process Turing Test." The phrase is a product-friendly label, but it points at a practical security question. Bot defense does not need to decide whether a system has humanlike intelligence in a philosophical sense. It needs repeatable signals that distinguish automation from people in a specific workflow, especially when that workflow changes state or creates economic risk.

Bigger models were not automatically more humanlike

The model comparison is less intuitive than a simple capability ranking. Roundtable summarizes the result by saying state-of-the-art frontier models such as Claude, GPT, and Gemini were less similar to human process features than smaller models including Qwen and Centaur. The arXiv paper also says Claude Sonnet 4.5 and GPT-5 sat far from the human process axis, while the smallest evaluated off-the-shelf model, Base Qwen 1.5B, was closer on average.

That does not mean smaller models are better agents. The measurement here is distance from human behavioral process, not task success. A highly capable model may find answers quickly and consistently, which can reduce the mistakes, hesitation, limited working-memory effects, and trial-by-trial adjustments that humans show. The detector is comparing the session against the behavior distribution of the 97 retained Prolific participants, not ranking raw intelligence.
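The idea of "distance from human behavioral process" can be illustrated with a simple standardized distance. This is a stand-in for the paper's analysis, not a reproduction of it: the feature names are hypothetical, and a root-mean-square z-score is assumed here only because it is easy to read.

```python
import math
from statistics import mean, pstdev


def human_reference(sessions):
    """Per-feature (mean, population std) over a list of human feature dicts."""
    keys = sessions[0].keys()
    return {
        k: (mean([s[k] for s in sessions]), pstdev([s[k] for s in sessions]))
        for k in keys
    }


def process_distance(session, reference):
    """Root-mean-square z-score of one session against the human reference.

    0.0 means the session sits at the human average on every feature;
    larger values mean it is further from typical human behavior.
    """
    zs = []
    for key, (mu, sd) in reference.items():
        # A feature with no human variance carries no distance information here.
        zs.append((session[key] - mu) / sd if sd > 0 else 0.0)
    return math.sqrt(sum(z * z for z in zs) / len(zs))
```

Under this measure, a session with zero hesitation and zero revisits scores as distant from the human reference even if its answers are perfect, which is the same intuition behind capable models landing far from the human process axis.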

Centaur is a useful comparison point. The paper describes it as a 70B model fine-tuned on 10.7 million human decisions from more than 160 cognitive experiments. Roundtable says Centaur aligned better with human process-feature space than several other models. That result is not surprising: a model trained on human choice data should look more human in cognitive tasks. The same result warns defenders that attackers with enough process data can train toward the same signals.

Fine-tuning worked, then weakened outside its target

The researchers also fine-tuned Qwen2.5-1.5B-Instruct in two ways. A-SFT, or action-level supervised fine-tuning, imitates individual human actions. P-SFT, or process-level fine-tuning, directly optimizes task-level behavioral process features. When the evaluation target matched the supervised feature, the paper reports that P-SFT moved closer to human behavior than A-SFT and Centaur.

The gain had limits. Figure 2 in the paper says P-SFT lost much of its advantage when some process features were hidden from the optimization objective and used only for evaluation. In cross-task evaluation, where the system had to generalize from one task's process representation to another, P-SFT's advantage shrank or fell back toward A-SFT. Roundtable's post describes the same pattern: hide part of the feature space or require cross-task generalization, and the gap reappears.

For security teams, that is the operational lesson. If a site publicly relies on click speed, direction changes, and tile-reclick ratio, an attacker can fine-tune against those features or add rule-based jitter. If verification tasks rotate, feature definitions stay private, and the system crosses multiple cognitive tasks, imitation becomes more expensive. The paper does not prove a permanent CAPTCHA defense. It shows that current agents still leak measurable process differences under this setup.

Web services now need an agent policy, not just bot blocking

Blocking all automation will age badly as agent traffic becomes normal. Travel booking, grocery ordering, internal workflow automation, customer support, and developer portals are all moving into agent workflows. The policy question is no longer only "Is this AI?" It is "Which agent, acting for which user, can perform which action under which limit?"

Read-only actions such as price lookup or shipment status checks can be exposed to low-risk agents. Account creation, payments, bulk ticket purchases, refunds, and permission changes need stronger human verification and rate limits. CAPTCHA and behavior verification sit at that boundary. Passing a single puzzle is less informative than whether the full session resembles the constraints and mistakes of a human user.
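One way to express that boundary is an explicit action-to-tier table. The sketch below is an assumption-laden example: the action names come from the paragraph above, but the tier names, the rate limits, and the rule that unknown actions default to the strictest tier are illustrative choices, not recommendations from Roundtable or the paper.

```python
from enum import Enum


class Tier(Enum):
    READ_ONLY = "read_only"
    HIGH_RISK = "high_risk"


# Hypothetical mapping from action name to risk tier.
ACTION_TIERS = {
    "price_lookup": Tier.READ_ONLY,
    "shipment_status": Tier.READ_ONLY,
    "account_creation": Tier.HIGH_RISK,
    "payment": Tier.HIGH_RISK,
    "bulk_ticket_purchase": Tier.HIGH_RISK,
    "refund": Tier.HIGH_RISK,
    "permission_change": Tier.HIGH_RISK,
}

# Placeholder requirements per tier; the numbers are not tuned values.
TIER_RULES = {
    Tier.READ_ONLY: {"human_verification": False, "max_per_minute": 60},
    Tier.HIGH_RISK: {"human_verification": True, "max_per_minute": 5},
}


def requirements(action: str) -> dict:
    """Look up the requirements for an action; unknown actions get the strictest tier."""
    tier = ACTION_TIERS.get(action, Tier.HIGH_RISK)
    return {"tier": tier, **TIER_RULES[tier]}
```

Defaulting unknown actions to the high-risk tier means a newly added endpoint is verified until someone classifies it, which fails safe at the cost of some friction.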

Development teams also need a logging design. Click coordinates, pointer movement, focus changes, dwell time, form edit count, error recovery, challenge retries, and device signals cannot be stored forever in raw form without privacy and compliance costs. Roundtable's process-feature framing is best read as a product design problem: decide which signals are necessary for human-machine discrimination, minimize the raw data footprint, and audit false positives.
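As a rough sketch of data minimization, raw interaction events can be reduced to a handful of aggregates at the end of a session so that coordinates and per-event timestamps never need long-term storage. The event schema (`t` in seconds plus a `type` string) and the chosen aggregates are hypothetical.

```python
def summarize_session(events):
    """Reduce raw events to aggregate counts and a duration.

    events -- list of dicts with "t" (seconds since page load) and "type",
              where type is one of "click", "form_edit", "focus_change",
              "challenge_retry" (hypothetical names). Any other fields,
              such as pointer coordinates, are deliberately ignored.
    """
    def count(kind):
        return sum(1 for e in events if e["type"] == kind)

    times = [e["t"] for e in events]
    return {
        "duration_s": (max(times) - min(times)) if times else 0.0,
        "clicks": count("click"),
        "form_edits": count("form_edit"),
        "focus_changes": count("focus_change"),
        "challenge_retries": count("challenge_retry"),
    }
```

Only the returned summary would be persisted in this design; the raw event list can be discarded once the summary and any detector score have been computed. Whether these particular aggregates retain enough signal is an empirical question that the sketch does not answer.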

reCAPTCHA and Turnstile pressure moves toward behavior

The May 28 release also fits Roundtable's product positioning. Roundtable is building Proof of Human, an invisible authentication system. Publishing CAPTCHA research helps the company argue against a puzzle-first model and toward frictionless behavior signals. The competitive set includes Google reCAPTCHA, hCaptcha, Cloudflare Turnstile, and FingerprintJS.

The research result should not be treated as a direct production benchmark. The paper compares specific agents on a controlled task battery. Real traffic includes mobile browsers, accessibility tools, VPNs, shared devices, keyboard-only users, screen readers, low-end hardware, and network jitter. Behavior-based verification can be a strong signal, but it can also penalize users whose input methods do not resemble the dominant training sample.

Operators should connect behavior scores to risk scoring, step-up verification, human appeal paths, allowlists, and bot identity verification instead of a single hard block. Services that want to allow legitimate agents also need to separate "an agent pretending to be human" from "a signed agent acting inside delegated authority." CAPTCHA research is mostly about the first category. The second category needs identity, authorization, and explicit user delegation.
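The difference between a hard block and a graded response can be sketched as a small decision function. Everything here is illustrative: the 0.8 and 0.4 thresholds are placeholders, `human_likeness` is assumed to be a score in [0, 1] from some behavior model, and `delegated_scopes` stands in for whatever a verified agent identity system would actually provide.

```python
def decide(action, human_likeness, high_risk, delegated_scopes=frozenset()):
    """Choose a response for one request.

    action           -- name of the requested action
    human_likeness   -- behavior score in [0, 1]; higher means more human-like
    high_risk        -- True for actions such as payments or permission changes
    delegated_scopes -- actions a verified, signed agent is authorized to perform
    """
    # A signed agent inside its delegated scope does not need to look human.
    if action in delegated_scopes:
        return "allow_as_agent"
    # Clearly human-like sessions pass without extra friction.
    if human_likeness >= 0.8:
        return "allow"
    # Ambiguous sessions pass only for low-risk actions.
    if human_likeness >= 0.4 and not high_risk:
        return "allow"
    # Everything else gets step-up verification, never a silent hard block,
    # so that users with atypical input methods have a path forward.
    return "step_up"
```

For instance, `decide("payment", 0.5, high_risk=True)` returns `"step_up"`, while the same score on a low-risk lookup returns `"allow"`, and a signed agent with `"payment"` in its delegated scopes returns `"allow_as_agent"` regardless of its behavior score.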

Community reaction focused on the moving target

On May 29, 2026, the Roundtable post reached Hacker News under the title "CAPTCHAs can still detect AI agents." It was also shared to Reddit's r/Futurology the same day. The early reaction was less triumphalist than practical. Some commenters read modern CAPTCHA detection as already depending on surrounding behavior rather than the puzzle itself. Others argued that once behavior signals are public, models can be optimized to mimic them, turning detection into a continuous back-and-forth.

That skepticism matches the paper's own limitations. The authors write that process-level supervision can improve human-behavior imitation, but that it depends on having the right task-specific process representation. They also name constraints: a 1.5B model in the fine-tuning study, structured sequential decision-making tasks, discrete low-cardinality action spaces, and a single human participant pool. The claim is narrower than "CAPTCHA is solved again." It is that present-day agents still show measurable process differences.

Security teams should split the problem into three tests. First, does the signal catch off-the-shelf agents? Second, does it survive an attacker who knows the signal and trains against it? Third, can it do that without creating unacceptable accessibility, privacy, or legal risk? Roundtable's paper mostly works between the first and second questions. The third only appears in real deployment.

What AI product teams should take from it

For teams building AI agents, this paper is a signal from the other side of the interface. A demo that solves CAPTCHA or manipulates a form is not enough. Services can add process-level detection and flag clicks that are too precise, too consistent, or too far from human error patterns. If an agent is acting with a legitimate user's authority, it is safer to prove identity and delegated scope than to imitate a human session.

For security and platform teams, the next task is acceptance policy. Decide which endpoints allow agent traffic, which requests require human-in-the-loop confirmation, and which actions trigger stronger verification. A read-only API and a payment, permission-change, or bulk-purchase API should not share one generic bot policy. A single shared policy blocks useful agents and still misses harmful ones.

For data teams, the open problem is drift. A behavior detector trained against GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5 can degrade when model releases, browser automation stacks, or attacker fine-tuning strategies change. Store more than AUC. Track false positives, false negatives, appeal rates, accessibility impact, replay attempts, and performance on newly released agents.
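A minimal sketch of that kind of tracking is to break error rates out by traffic source instead of reporting one pooled number. The record layout `(source, is_agent, bot_score)`, the source labels, and the fixed threshold are assumptions for the example.

```python
from collections import defaultdict


def error_rates(records, threshold):
    """False-positive and false-negative rates per traffic source.

    records   -- iterable of (source, is_agent, bot_score) tuples with
                 ground-truth labels, e.g. from labeled evaluation runs
    threshold -- sessions with bot_score >= threshold are flagged as automated
    """
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "humans": 0, "agents": 0})
    for source, is_agent, score in records:
        c = counts[source]
        flagged = score >= threshold
        if is_agent:
            c["agents"] += 1
            c["fn"] += int(not flagged)   # agent that slipped through
        else:
            c["humans"] += 1
            c["fp"] += int(flagged)       # human wrongly flagged
    return {
        source: {
            "fpr": c["fp"] / c["humans"] if c["humans"] else None,
            "fnr": c["fn"] / c["agents"] if c["agents"] else None,
        }
        for source, c in counts.items()
    }
```

Splitting by source makes it visible when a new agent release has a much higher miss rate than older ones, or when one group of human users (for example, keyboard-only users, if labeled as their own source) is flagged more often than the rest.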

Roundtable's message is not that CAPTCHA has returned as a silver bullet. The web is starting to ask not only what an AI agent can do, but how it did it. An agent can choose the right tiles and still leave a click sequence or error pattern that differs from human behavior. How long those traces stay useful depends on the next round of experiments between attackers and defenders.