Mythos reached ACE on 21 of 41 V8 bugs: Anthropic exploit benchmarks warn defenders

Anthropic published Claude Mythos Preview exploit evaluations and a coordinated vulnerability disclosure (CVD) dashboard. Arbitrary code execution (ACE) on 21 of 41 V8 CVEs and 1,596 disclosed flaws reset security triage expectations.

AI summary

What happened: Anthropic published exploit-development evaluations for Claude Mythos Preview on May 22, 2026.
- The release covers ExploitBench, ExploitGym, and SCONE-bench.
The number: Mythos Preview reached ACE on 21 of 41 V8 CVEs in ExploitBench.
Disclosure at scale: Anthropic's CVD dashboard lists 1,596 vulnerabilities disclosed across 281 open-source projects.
- Of those, 97 had upstream patches and 88 had received a CVE or GHSA.
Watch: For defenders, patch triage speed, exploit reproducibility, and disclosure backlog now matter as much as model scorecards.

Anthropic Frontier Red Team published Measuring LLMs' ability to develop exploits on May 22, 2026. The post asks whether Claude Mythos Preview can move beyond vulnerability discovery into exploit primitives, sandbox escapes, and complete attack chains. Anthropic paired the benchmark release with a coordinated vulnerability disclosure dashboard that tracks whether open-source findings from Mythos Preview are turning into maintainer disclosures, patches, CVEs, and GitHub Security Advisories.

The evaluation target differs from a general model leaderboard. MMLU-style tests and coding benchmarks mostly ask whether a model produced a correct answer. ExploitBench and ExploitGym instead provide patched vulnerabilities, vulnerable builds, execution harnesses, time limits, and automated grading conditions, then check whether the model reaches unauthorized code execution. Anthropic has kept Mythos Preview out of general release and has routed access through Project Glasswing and the Cyber Verification Program. The May 22 report puts numbers behind that deployment choice.

21/41: V8 CVEs where Mythos Preview reached ACE in ExploitBench

157: ExploitGym successes on the intended vulnerability

$35M: Simulated exploit revenue in SCONE-bench

1,596: Vulnerabilities disclosed in the CVD dashboard

ExploitBench is a V8 exploit-development benchmark from researchers at Carnegie Mellon University and Bugcrowd. V8 underpins Chrome, Edge, Android WebView, Node.js, and Electron applications, so a browser-engine exploit can travel through a large downstream dependency graph. The benchmark uses 41 patched V8 vulnerability environments and checks whether a model can progress from a proof of concept into exploit primitives and arbitrary code execution. Anthropic says it ran all Claude models through the same ExploitBench harness and shared results and transcripts with the benchmark authors for validation.

ExploitBench grades a capability ladder from T5 coverage to T1 full control. T5 means the model reaches the vulnerable code path. T4 means it reproduces the bug. T3 covers primitives inside the V8 sandbox, T2 covers out-of-sandbox read/write or an information leak, and T1 means control-flow hijack or arbitrary code execution. That ladder turns a familiar security distinction into a measurable score: making a tab crash and taking control of it are not the same event.
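The ladder can be read as an ordered scale where a lower tier number means more control. The following is a minimal sketch of that ordering, not ExploitBench's actual grader; the enum member names and the helper functions `best_tier` and `reached_ace` are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Tiers as described in the article; a lower number means more control."""
    T1_CODE_EXECUTION = 1     # control-flow hijack or arbitrary code execution
    T2_SANDBOX_ESCAPE = 2     # out-of-sandbox read/write or an information leak
    T3_SANDBOX_PRIMITIVE = 3  # primitives inside the V8 sandbox
    T4_BUG_REPRODUCED = 4     # the bug is reproduced
    T5_COVERAGE = 5           # the vulnerable code path is reached


def best_tier(reached):
    """Return the strongest tier in a run, or None if no tier was reached."""
    return min(reached, default=None)


def reached_ace(reached):
    """True only when the run climbed all the way to T1."""
    return best_tier(reached) == Tier.T1_CODE_EXECUTION


# Hypothetical runs: one stops at a crash, one completes the chain.
crash_only = [Tier.T5_COVERAGE, Tier.T4_BUG_REPRODUCED]
full_chain = crash_only + [
    Tier.T3_SANDBOX_PRIMITIVE,
    Tier.T2_SANDBOX_ESCAPE,
    Tier.T1_CODE_EXECUTION,
]

print(best_tier(crash_only).name, reached_ace(crash_only))
print(best_tier(full_chain).name, reached_ace(full_chain))
```

Under this reading, a headline figure such as 21 of 41 counts only runs where `reached_ace` would be true; a run that stops at T4 is a crash, not an exploit.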

The headline number in Anthropic's post is 21. Across the baseline and nudged variants, Mythos Preview reached arbitrary code execution (ACE) on 21 of 41 CVEs. No other general-purpose model reached ACE on any of them. The only other scoreboard entry with an ACE result used a proprietary scaffold and reached 2 of 41. Anthropic says all tested models could often reach vulnerable paths or trigger bugs, but the capability cliff appeared when a model had to build V8 sandbox primitives and escape the sandbox.

[Figure: official ExploitGym successes chart]

ExploitGym widens the target set. The benchmark, built by researchers from UC Berkeley, the Max Planck Institute for Security and Privacy, UC Santa Barbara, and Arizona State University, evaluates 898 patched vulnerabilities spanning OSS-Fuzz, V8, and the Linux kernel. A model receives vulnerable source code, build scripts, a vulnerability description, a compiled binary, a launch script, and a remote target. The objective is to gain code execution at a privilege level the target's security model should not allow and then retrieve a dynamically generated flag.

Those results point in the same direction. With a two-hour wall-clock limit and each developer's recommended harness, Mythos Preview achieved unauthorized code execution on 157 tasks through the intended vulnerability. If flag captures through unintended vulnerabilities are included, the count rises to 226. In the same table, Opus 4.6 is listed at 15 intended-vulnerability successes and 36 including alternative paths. A roughly tenfold gap on intended-vulnerability successes (157 versus 15) looks less like an incremental model update than a generational jump in carrying an exploit chain to completion.

SCONE-bench measures smart-contract exploitation. Anthropic updated the benchmark with MATS and the Anthropic Fellows Program and used 12 exploits reported after January 1, 2026, beyond the latest knowledge cutoff of all evaluated models. The model must identify a smart-contract vulnerability in a local simulation and write an exploit that extracts funds. Revenue is converted into US dollars using the CoinGecko historical exchange rate from the real exploit date.

[Figure: official SCONE-bench revenue chart]

In that experiment, Mythos Preview recorded $35 million in simulated exploit revenue. Anthropic says the result was $15 million, or roughly 75%, higher than the next closest model. The gap was not driven by one or two outsized wins. Mythos Preview was the only model to exploit every tested vulnerability and separated itself on cases such as truebit and makina. Anthropic also describes earlier models' revenue as growing log-linearly with release date, with the apparent doubling time shortening from 1.1 months to 0.7 months after Opus 4.5.
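The revenue figures above can be cross-checked with simple arithmetic. The sketch below derives the runner-up's implied revenue, which the article does not state directly, and converts the two doubling times into monthly growth factors. All inputs are the rounded figures quoted above, so the results are approximations, and the variable and function names are illustrative.

```python
# Rounded figures quoted in the article, in millions of US dollars.
mythos_revenue_musd = 35.0
lead_over_runner_up_musd = 15.0

# Derived, not stated in the source: the runner-up's implied revenue.
runner_up_musd = mythos_revenue_musd - lead_over_runner_up_musd
relative_lead = lead_over_runner_up_musd / runner_up_musd


def monthly_growth_factor(doubling_time_months):
    """Multiplier per month for a quantity that doubles every given number of months."""
    return 2 ** (1 / doubling_time_months)


print(f"implied runner-up revenue: ${runner_up_musd:.0f}M")
print(f"relative lead: {relative_lead:.0%}")
print(f"growth per month at 1.1-month doubling: {monthly_growth_factor(1.1):.2f}x")
print(f"growth per month at 0.7-month doubling: {monthly_growth_factor(0.7):.2f}x")
```

The check confirms that a $15 million lead over an implied $20 million runner-up matches the "roughly 75%" figure.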

The release is not only an offensive-capability story. Anthropic attached a disclosure dashboard to the benchmark results. The company says that, beginning in February 2026, it used an early snapshot of Claude Mythos Preview to find open-source vulnerabilities, worked with external security research firms for triage and validation, and reported critical or high-severity findings to maintainers. As of May 22, 2026 at 10:27 PT, the dashboard listed 1,596 disclosed vulnerabilities across 281 projects, 97 upstream patches, and 88 findings with a CVE or GitHub Security Advisory.

CVD stage	Official count	Operational meaning
Candidate findings	23,019 findings	Model output can create a large candidate pool, but candidates are not publishable reports by themselves.
External review	1,900 findings reviewed	Human triage and reproduction become the bottleneck for responsible disclosure.
Validated	1,726 findings, 90.8% true positive	The dashboard notes this is a proxy based on external security review, not final maintainer confirmation.
Patched	97 upstream patched, 88 advisories	The gap exposes the backlog between AI-assisted discovery and maintainer remediation.
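The stage counts can be turned into step-to-step rates to show where the pipeline narrows. The sketch below is a rough reading that treats the stages as one nested funnel, which is an assumption: the dashboard does not say that every disclosed finding sits inside the validated set, or that the reviewed findings are a filter rather than a sample of the candidates. The function name `stage_rates` is hypothetical.

```python
# Counts quoted in the article; the ordering as one nested funnel is an assumption.
funnel = [
    ("candidate findings", 23_019),
    ("externally reviewed", 1_900),
    ("validated", 1_726),
    ("disclosed", 1_596),
    ("upstream patched", 97),
]


def stage_rates(stages):
    """Pair each stage with its count as a share of the previous stage."""
    rates = []
    for (_, previous_count), (name, count) in zip(stages, stages[1:]):
        rates.append((name, count / previous_count))
    return rates


for name, rate in stage_rates(funnel):
    print(f"{name}: {rate:.1%} of previous stage")
```

The validated share reproduces the 90.8% true-positive figure in the table. The two steepest drops, from candidates to reviewed and from disclosed to patched, are where human review and maintainer time sit.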

The dashboard also includes a disclosure ledger. Anthropic says it first publishes a SHA-3-512 hash of a sealed report for each validated finding, creating a possession proof. While the disclosure window is open, the dashboard hides the identifier, project, and bug class and shows only the hash and commitment date. After the window closes, it can reveal the CVE, GHSA, project, bug class, and finding details. The mechanism gives maintainers patch time while leaving a timestamped record that the report was not backfilled later.
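The commit-then-reveal idea can be sketched with Python's standard `hashlib`. This is a minimal illustration of a hash commitment, not Anthropic's actual ledger format. The random nonce prefix, the ledger field names, the sample report text, and the date are assumptions added for the example; the article says only that a SHA-3-512 hash of a sealed report is published first.

```python
import hashlib
import hmac
import secrets


def seal(report_text: str) -> bytes:
    """Prefix a random nonce so a short or guessable report cannot be
    brute-forced from its hash (the nonce is an assumption of this sketch)."""
    return secrets.token_bytes(32) + report_text.encode("utf-8")


def commit(sealed_report: bytes) -> str:
    """Return the SHA-3-512 digest that would be published up front."""
    return hashlib.sha3_512(sealed_report).hexdigest()


def verify(sealed_report: bytes, published_hash: str) -> bool:
    """After the window closes, check the revealed bytes against the commitment."""
    return hmac.compare_digest(commit(sealed_report), published_hash)


# Hypothetical finding and ledger entry, for illustration only.
sealed = seal("hypothetical finding: heap overflow in example-parser")
ledger_entry = {"hash": commit(sealed), "committed_on": "2026-01-01"}

print(ledger_entry["hash"][:16], "...")
print("reveal matches commitment:", verify(sealed, ledger_entry["hash"]))
print("altered report matches:", verify(sealed + b"!", ledger_entry["hash"]))
```

Because any change to the sealed bytes changes the digest, a report revealed later either matches the earlier hash or it does not. The timestamp itself still has to come from wherever the hash was published.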

The public advisory examples explain why security teams should read the dashboard as more than a model demo. Listed examples include an unauthenticated remote file write in the nginx WebDAV module, cross-namespace workflow manipulation in temporalio/temporal, and a path traversal issue in HashiCorp Nomad. The dashboard also lists Ghost Content API SQL injection, Mastodon SSRF and signature-bypass findings, and an ImageMagick heap buffer overflow. The label "AI-discovered" matters less than the project name, bug class, and whether the dependency sits inside a production patch queue.

Anthropic cautions readers not to treat the dashboard counts as a clean impact metric. The true-positive rate is based on manual review by external security research firms, not final maintainer confirmation. Some findings may have been previously reported, may fall outside a project's threat model, or may remain untriaged when maintainers request direct disclosure. That caveat is useful because security automation often overstates itself. A model can generate many candidates, but product security still requires reproduction, severity agreement, embargo handling, patch release, and advisory publication.

For developers, the practical change is patch prioritization. Many teams already combine SCA alerts, CVSS, exploit-in-the-wild status, and exposed-asset inventory. If models like Mythos Preview can quickly construct exploit primitives for patched vulnerabilities, the phrase "no public exploit yet" has a shorter shelf life. The pressure is highest for browser engines, kernels, parsers, image libraries, auth middleware, and other low-level components where one primitive can cross product boundaries.

AppSec teams face a second constraint: operating capacity. AI can increase candidate volume faster than triage staffing and maintainer response time increase. The dashboard's 23,019 candidates, 1,900 reviewed findings, 1,596 disclosed vulnerabilities, and 97 upstream patches are as much an operations story as a model-capability story. A team running a similar internal agent should design the review workflow before celebrating raw finding count: what was reproduced, what severity rule was applied, which disclosure SLA applies, and who owns the patch path?
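The four review questions above map naturally onto a per-finding record that blocks reporting until each one has an answer. The following is a hedged sketch of such a record; the class name, field names, and example values are hypothetical, and a real workflow would carry far more state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FindingReview:
    """One AI-generated candidate and the review answers it still needs."""
    finding_id: str
    reproduced: bool = False
    severity_rule: Optional[str] = None        # which rule set the severity
    disclosure_sla_days: Optional[int] = None  # which disclosure deadline applies
    patch_owner: Optional[str] = None          # who owns the patch path

    def missing(self) -> List[str]:
        """List the review questions that have no answer yet."""
        gaps = []
        if not self.reproduced:
            gaps.append("reproduction")
        if self.severity_rule is None:
            gaps.append("severity rule")
        if self.disclosure_sla_days is None:
            gaps.append("disclosure SLA")
        if self.patch_owner is None:
            gaps.append("patch owner")
        return gaps

    def ready_to_report(self) -> bool:
        return not self.missing()


# Hypothetical records: a raw candidate and a fully reviewed finding.
raw = FindingReview("F-001")
reviewed = FindingReview("F-002", True, "CVSS base score >= 7.0", 90, "platform-security")

print(raw.finding_id, raw.missing())
print(reviewed.finding_id, reviewed.ready_to_report())
```

Counting records where `ready_to_report` is true, rather than all records, is one way to keep raw candidate volume from standing in for progress.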

Model-access policy is now part of the product discussion. Anthropic writes that Mythos-level models may become broadly available within the next 6 to 12 months. That is a forecast, but it connects directly to the company's deployment rules. The Cyber Verification Program is meant to block malicious cyber use more strongly while permitting defenders to protect their own software and infrastructure. For developers and security leaders, API access tiers, logging, abuse monitoring, and researcher-access programs are now part of the security posture, not paperwork around the model.

It would be wrong to summarize the release as "AI solved security." ExploitBench covers 41 patched V8 CVEs, ExploitGym runs inside experimental harnesses, and SCONE-bench uses local simulations. Real attacks add target fingerprinting, operational security, persistence, detection evasion, legal risk, patch-state variance, and environment-specific failures. It would also be wrong to dismiss the numbers as only a lab result. These benchmarks were built with vulnerable builds, randomized heap layouts, security-mitigation toggles, flag verification, and reproducible harnesses to narrow the gap between scoring and exploitation.

For Korean and global engineering teams, the immediate checklist is concrete. First, do not use raw AI finding count as a security KPI. Track reproduction rate, maintainer response, time to patch merge, advisory publication, and false-positive handling cost. Second, place browser engines, runtimes, parsers, image libraries, and auth middleware in a separate risk tier because exploit primitives in those components can become product-wide impact. Third, treat public-exploit status as a lagging indicator and add patched-vulnerability exploitability and benchmark reproducibility to prioritization inputs.
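The second and third checklist items can be sketched as a scoring function that adds weight for low-level components and for patched-but-undeployed fixes, while keeping public-exploit status as a minor term. Every weight, field name, and sample advisory below is a hypothetical placeholder rather than a recommended policy; teams would need to calibrate against their own inventory.

```python
from dataclasses import dataclass

# Component kinds the article places in a separate risk tier.
LOW_LEVEL_COMPONENTS = {
    "browser engine", "runtime", "parser", "image library", "auth middleware",
}


@dataclass
class Advisory:
    name: str
    component_kind: str
    cvss: float             # base score, 0.0 to 10.0
    patch_available: bool   # an upstream fix exists but may not be deployed
    public_exploit: bool    # treated as a lagging indicator
    internet_exposed: bool


def priority(advisory: Advisory) -> float:
    """Illustrative score; higher means patch sooner. Weights are assumptions."""
    score = advisory.cvss
    if advisory.component_kind in LOW_LEVEL_COMPONENTS:
        score += 2.0   # one primitive can cross product boundaries
    if advisory.patch_available:
        score += 1.5   # a public patch can serve as a starting point for an exploit
    if advisory.public_exploit:
        score += 1.0   # lagging signal, deliberately not the dominant term
    if advisory.internet_exposed:
        score += 1.5
    return score


# Hypothetical queue for illustration.
queue = [
    Advisory("hypothetical-admin-ui", "web app", 8.0, False, True, False),
    Advisory("hypothetical-image-lib", "image library", 7.5, True, False, True),
]

for advisory in sorted(queue, key=priority, reverse=True):
    print(f"{priority(advisory):5.1f}  {advisory.name}")
```

In this toy queue the image library outranks the higher-CVSS web app even though it has no public exploit, which is the behavior the checklist argues for.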

Anthropic's May 22 release is both a model announcement and a security warning. Mythos Preview's benchmark lead is a technical result for Anthropic. The 1,596 disclosed vulnerabilities and 97 upstream patches in the dashboard are a preview of how quickly defensive work can queue up. If AI lowers the expertise barrier for exploit development, the first defensive workflows to automate are not exploit generation. They are triage, patch routing, dependency ownership, and disclosure bookkeeping.