Arm open-sources Metis, a security agent aimed at SAST false positives

Arm has open-sourced Metis, an agentic AI security framework for code review, SARIF triage, evidence chains, and lower false-positive cost.

AI summary

What happened: Arm open-sourced Metis on May 28, 2026 as an agentic AI security framework for code review and SAST triage.
- Arm says Metis is already used across more than 130 internal software projects and is targeted for Arm-wide software adoption by the end of 2026.
The claim: Arm's internal benchmark reported a true positive rate up to 10x higher, and roughly 50% fewer false positives, than leading SAST tools.
Why builders should care: The practical center of Metis is not just vulnerability discovery. It is SARIF triage, evidence collection, and deterministic gates.
- AppSec teams should evaluate whether Metis can explain valid, invalid, and inconclusive findings with reproducible file and line evidence.

Arm open-sourced Metis on May 28, 2026. Arm calls it an "agentic AI security framework," but the product shape is closer to code review and SAST triage than a general-purpose security chatbot. Metis reads source code, build files, and documentation to build a project-specific knowledge base, then uses LLM reasoning to look for design flaws and logic vulnerabilities.

Three numbers define the announcement. Arm says Metis is already running across more than 130 internal software projects. The company is aiming for Arm-wide software adoption by the end of 2026. In internal benchmarks, Arm says Metis reached up to 10 times the true positive rate of leading static analysis tools while producing about 50% fewer false positives. Arm also says the benchmark was not used for AI training.

Metis should not be read as "AI replaces security review." The expensive part of AppSec work often begins after a candidate finding appears. A team has to decide whether the issue is real, which code path proves it, whether a third-party SAST finding can be closed, and whether the evidence is strong enough to send to a maintainer. Metis tries to relieve that bottleneck with RAG, Tree-sitter analysis, SARIF triage, and deterministic adjudication.

Figure: Official Arm Metis workflow

AI is targeting SAST's weak spot

Traditional SAST tools are strong at finding known patterns quickly. Buffer overflows, hardcoded secrets, injection sinks, unsafe API calls, and other rule-shaped problems can produce useful automated alerts. The operational cost comes from filtering those alerts. The same function call can be safe or dangerous depending on guard conditions, sanitizers, macro expansion, caller contracts, build configuration, and deployment context.
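A minimal Python sketch of that last point, using hypothetical function names that do not come from Metis or any scanner. Both functions end in the same file-read sink, so a rule keyed only on the sink would flag both, yet only the first lets a caller-supplied name escape the base directory.

```python
from pathlib import Path


def read_unguarded(base: Path, name: str) -> bytes:
    # Sink: the joined path is read with no containment check,
    # so a name such as "../secret.txt" escapes the base directory.
    return (base / name).read_bytes()


def read_guarded(base: Path, name: str) -> bytes:
    # Same sink, but a guard resolves the path first and rejects
    # anything that lands outside the base directory.
    root = base.resolve()
    target = (root / name).resolve()
    if not target.is_relative_to(root):  # Python 3.9+
        raise ValueError(f"path escapes base directory: {name}")
    return target.read_bytes()
```

Deciding that the second call is safe requires reading the guard, not just the sink, which is the kind of context a fixed pattern rule often lacks.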

Arm says Metis looks beyond fixed rules by reading more repository context. The tool indexes a repository, retrieves related source and documentation, and lets an LLM reason about whether a vulnerability candidate is plausible. A 2025 Arm developer community post described Metis as learning intended behavior from "source files and documentation." The 2026 Newsroom post reframed the same structure as a RAG architecture.

The vulnerabilities that need this treatment usually do not fit on one line. An authorization check may live in a different layer. A parser may enter an unsafe path only in one state. A C or C++ macro may expand into a risky allocation route. A Rust and C FFI boundary may break an ownership assumption. A SAST rule can produce a candidate, but building the evidence chain and removing counterexamples is separate work.

Arm's claims about a 10x true positive rate and 50% fewer false positives apply to this triage problem. One caveat is important: these are Arm internal benchmark numbers. They are not from a public benchmark suite or a third-party reproduction. The safer interpretation is that Arm designed Metis to reduce the false-positive handling cost inside its own product security workflow, and says internal deployment has shown that value.
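Because both figures are relative, their practical effect depends on the baseline. The arithmetic below is a worked illustration with invented numbers, not Arm's data: the baseline detection rate, the count of real vulnerabilities, and the false-alarm count are all assumptions chosen only to show how the two ratios combine.

```python
def summarize(real_vulns: int, true_positive_rate: float, false_positives: int) -> dict:
    """Return queue size and precision for one hypothetical tool run."""
    true_positives = round(real_vulns * true_positive_rate)
    reported = true_positives + false_positives
    return {
        "true_positives": true_positives,
        "reported": reported,
        "precision": true_positives / reported if reported else 0.0,
    }


# Invented baseline: 100 real vulnerabilities, 5% detected, 200 false alarms.
baseline = summarize(real_vulns=100, true_positive_rate=0.05, false_positives=200)

# Same codebase at the ceiling of the claim: 10x the true positive rate
# and half the false positives.
improved = summarize(real_vulns=100, true_positive_rate=0.50, false_positives=100)

print(baseline)
print(improved)
```

Under these assumptions the review queue shrinks from 205 findings to 150 while precision rises from 5/205 (about 2%) to 50/150 (about 33%). Since Arm's figure is "up to" 10x, this is a best case, and a different baseline would give different results.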

Metis looks more like a triage pipeline than a finding generator

The Metis GitHub README describes the project as "AI-Powered Security Code Review." Its CLI includes index, review_code, review_file, review_patch, triage, ask, and update. A team can review a repository, a single file, or a patch diff. It can also feed existing SARIF findings from other scanners into Metis for triage.

The README's feature list points in the same direction: deep reasoning, context-aware reviews, plugin-friendly extensibility, issue validation, and provider flexibility. The issue validation feature is especially relevant. Metis is designed to validate both its own findings and third-party SAST findings, gather evidence, and reduce false positives.

Metis's triage-flow documentation is more explicit. It defines triage as "deterministic orchestration, not an agent loop." Metis reads a SARIF finding, retrieves repository context from a vector index, and builds an evidence pack with Tree-sitter, line-local sed and grep, plus C and C++ macro and include resolution. It then asks an LLM for a structured decision, and deterministic gates plus adjudication rules assign the final status.

That design is a meaningful safety control for AI security tooling. An LLM saying "valid" is not enough to confirm a finding. If evidence obligations are missing, the result can be downgraded to inconclusive. If contradiction signals appear, it can be forced to invalid. Invariant checks block a finding from being upgraded directly to valid or invalid. For a security team, the model's confidence is less important than the scope of evidence collection and the rules that turn that evidence into a status.
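The gating idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Metis's implementation: the `EvidencePack` fields, the `adjudicate` function, and the order in which the two rules fire are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Status(str, Enum):
    VALID = "valid"
    INVALID = "invalid"
    INCONCLUSIVE = "inconclusive"


@dataclass(frozen=True)
class EvidencePack:
    # Hypothetical fields; the real evidence schema may differ.
    obligations_met: bool      # every required piece of evidence was collected
    contradiction_found: bool  # e.g. a guard or sanitizer refutes the finding
    unresolved_hops: int       # call or macro hops that could not be resolved


def adjudicate(llm_decision: Status, evidence: EvidencePack) -> Status:
    """Apply deterministic rules after the model's structured decision."""
    # Assumed rule 1: a contradiction signal forces invalid,
    # whatever the model said.
    if evidence.contradiction_found:
        return Status.INVALID
    # Assumed rule 2: missing evidence or unresolved hops downgrade
    # the result to inconclusive.
    if not evidence.obligations_met or evidence.unresolved_hops > 0:
        return Status.INCONCLUSIVE
    # Only when no rule fires does the model's decision stand.
    return llm_decision
```

The useful property is that the function is pure: the same decision and the same evidence always yield the same status, which makes the outcome reproducible in an audit.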

Stage	What Metis does	What AppSec teams should measure
Index	Builds a knowledge base from source, build files, and documentation	Coverage, stale indexes, and private-code handling
Review	Runs security review at repository, file, or patch level	Valid-finding rate, reproducible evidence, and developer acceptance
Triage	Annotates SARIF findings as valid, invalid, or inconclusive	False-positive reduction, inconclusive backlog, and SLA compression
Adjudication	Applies deterministic rules after the LLM decision	Overconfidence control, unresolved hops, and contradiction handling

The supported scope matches security code review

The Metis README lists C, C++, Python, Rust, TypeScript, Terraform, Go, Solidity, TableGen, and Verilog as supported languages. C and C++ get Tree-sitter and flow analysis. Python, Rust, TypeScript, Go, Solidity, and Verilog use structural analysis and tool-based triage. The pyproject.toml entry points also show plugins for JavaScript, PHP, SystemVerilog, and AArch64 assembly.

The language list reflects Arm's likely priorities. C and C++ sit deep in runtimes, compilers, firmware, drivers, and libraries across the Arm ecosystem. Verilog and SystemVerilog point toward hardware verification. Solidity support suggests smart-contract security review. Terraform can help triage infrastructure-as-code policy and misconfiguration findings.

Arm's Newsroom post says Metis recently added Verilog support and that Arm is exploring hardware vulnerability verification. That is more than a minor language plugin addition. AI security review is expanding from application code toward hardware descriptions, firmware, system software, and cloud configuration.

Provider flexibility is another practical part of the design. The README starts with an OpenAI configuration example, but also mentions vLLM, Ollama, and LiteLLM. Organizations that cannot send proprietary code and security findings to external providers need local models or self-hosted inference. At the same time, Arm says its internal deployment uses OpenAI Daybreak's GPT-5.5-Cyber for defensive security workflows.

Figure: Official Arm Metis internal benchmark screenshot

Open source creates pressure to verify the claims

Metis is published under the Apache-2.0 license. Around May 29, 2026 at 19:00 UTC, a GitHub API snapshot showed 563 stars, 89 forks, and 15 open issues. The repository is mostly Python, and the README describes uv pip install ., Docker builds, an optional PostgreSQL dependency, and ChromaDB as the default backend.

Open source creates two kinds of pressure. First, security teams can inspect how Metis collects evidence and constructs prompts. Many closed SOC copilots expose only the result screen. Metis exposes plugins, prompt templates, triage flow, and vector-backend configuration. Second, attackers can inspect the tool's assumptions and weak spots. Open-sourcing a security tool increases transparency and attack surface at the same time.

This is where the deterministic triage document matters. If an agent loop roamed a repository and reached conclusions autonomously, reproducibility and audit would be difficult. Metis instead connects a SARIF finding, a bounded evidence pack, a structured decision, and deterministic rules. When an organization needs an audit trail, that structure should make it easier to answer why a finding was treated as valid, invalid, or inconclusive.

Metis still does not look like a replacement for SAST. Its README explicitly includes third-party SAST finding validation. A more realistic deployment is CodeQL, Semgrep, Snyk, Sonar, or another scanner producing candidates quickly, with Metis reducing the queue through repository context and evidence obligations before humans spend time on it. AI is not erasing the rule engine. It is trying to sort the queue the rule engine creates.

How this differs from recent AI security stories

Recent AI security news has split into two visible tracks. One track, represented by reports around Anthropic Mythos, focuses on exploit capability: how far a model can go on patched V8 CVEs or smart-contract vulnerabilities. Another track, represented by products such as Claroty Claire, focuses on operational security agents for cyber-physical systems, where asset identity, approval lines, and audit logs are essential.

Metis sits in a different place. It does not showcase exploit generation, and it does not lead with cyber-physical operations automation. It goes into the code review and SAST triage queue that software product security teams handle every day. The useful promise is not more findings. It is fewer bad findings, clearer inconclusive findings, and faster patch-diff review.

That difference affects developer experience. AI code review frustrates developers when it leaves confident but wrong comments. In security review, the cost is higher. If false positives are frequent, developers mute the tool and AppSec teams risk missing real vulnerabilities. Metis's claim of 50% fewer false positives is therefore both a productivity number and a trust number.

False negatives are the sharper risk. If Metis marks a real vulnerability as invalid, the result can become a security incident. The triage-flow document's forced inconclusive gates exist for that reason. A useful AI security tool does not simply automate more final decisions. It treats uncertainty as a product state and stops when the evidence is insufficient.

Questions before adoption

The first question is private-code handling. Metis puts source code, build files, and documentation into a knowledge base. If an external LLM provider is used, teams need to know which code fragments and finding details are sent, how logging and retention work, and how provider-specific settings are isolated. If local models are used, model quality and inference cost can become the new bottleneck.

The second question is how Metis overlaps with existing SAST. A team that already uses CodeQL or Semgrep should be careful about adding Metis as just another scanner. That can increase alert volume. A stronger starting point is to feed existing SARIF into Metis triage and connect valid, invalid, and inconclusive annotations to the review queue or ticket system. That lets the tool be measured as false-positive cost reduction rather than "one more AI security product."
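A minimal sketch of that integration, assuming triage results arrive as SARIF 2.1.0 with a status stored in each result's property bag. The `triage_status` key and the helper names are hypothetical; the annotation format Metis actually writes is not described here and may differ.

```python
import json
from collections import defaultdict
from typing import Iterator

# Hypothetical property-bag key, used for illustration only.
TRIAGE_KEY = "triage_status"


def iter_findings(sarif: dict) -> Iterator[dict]:
    """Flatten SARIF 2.1.0 runs and results into simple records."""
    for run in sarif.get("runs", []):
        tool = run.get("tool", {}).get("driver", {}).get("name", "unknown")
        for result in run.get("results", []):
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            yield {
                "tool": tool,
                "rule_id": result.get("ruleId"),
                "file": physical.get("artifactLocation", {}).get("uri"),
                "line": physical.get("region", {}).get("startLine"),
                "status": result.get("properties", {}).get(TRIAGE_KEY, "untriaged"),
            }


def build_queue(sarif: dict) -> dict[str, list[dict]]:
    """Group findings by triage status so each group can be routed separately."""
    queue: dict[str, list[dict]] = defaultdict(list)
    for finding in iter_findings(sarif):
        queue[finding["status"]].append(finding)
    return dict(queue)


SAMPLE = json.loads("""
{
  "version": "2.1.0",
  "runs": [{
    "tool": {"driver": {"name": "example-scanner"}},
    "results": [
      {"ruleId": "EX001", "message": {"text": "possible overflow"},
       "locations": [{"physicalLocation": {
         "artifactLocation": {"uri": "src/parse.c"},
         "region": {"startLine": 42}}}],
       "properties": {"triage_status": "invalid"}},
      {"ruleId": "EX002", "message": {"text": "unchecked length"},
       "locations": [{"physicalLocation": {
         "artifactLocation": {"uri": "src/io.c"},
         "region": {"startLine": 7}}}]}
    ]
  }]
}
""")

queue = build_queue(SAMPLE)
```

With a grouping like this, invalid findings can be closed with their evidence attached, inconclusive ones routed to a human, and untriaged ones counted separately, so the tool's effect on the queue is measurable.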

The third question is evidence quality. Teams should inspect which file and line Metis cites, whether macro and include resolution failed, whether unresolved hops remain, and whether a model decision was downgraded by deterministic rules. A summary alone will not convince a maintainer. A security finding usually needs a reproducible path, affected asset, exploitability condition, and mitigation diff before it can be closed.
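One cheap check along these lines is confirming that a cited file and line actually exist in the checkout and still contain the quoted code. The helper below is a hypothetical sketch, not part of Metis; the function name and return labels are invented for illustration.

```python
from pathlib import Path


def check_citation(repo_root: Path, rel_path: str, line: int, snippet: str) -> str:
    """Verify that a cited file and line exist and contain the quoted snippet."""
    root = repo_root.resolve()
    target = (root / rel_path).resolve()
    # Reject citations that point outside the repository or at nothing.
    if not target.is_relative_to(root) or not target.is_file():
        return "missing-file"
    lines = target.read_text(encoding="utf-8", errors="replace").splitlines()
    # Line numbers in findings are 1-based.
    if line < 1 or line > len(lines):
        return "line-out-of-range"
    if snippet.strip() not in lines[line - 1]:
        return "snippet-mismatch"
    return "ok"
```

A citation that fails this check is a sign of a stale index or a fabricated reference, and is a reasonable trigger for treating the finding as inconclusive until it is re-verified.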

The fourth question is language-specific expectation. The flow analysis available for C and C++ goes deeper than the structural analysis applied to Python or TypeScript. Verilog and Terraform also have vulnerability definitions and review conventions that differ from those of application code. Metis's plugin system creates extension points, but each organization will likely need to tune prompt templates, rules, and evidence obligations.

The fifth question is measurement. The KPI should not be finding count. Better metrics are false-positive handling time, inconclusive backlog, time to patch merge, developer dispute rate, and reopened findings. Arm emphasized false positives and true positives because the actual cost of security automation often lives in triage.
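A few of these metrics can be computed from ordinary ticket data. The sketch below assumes a hypothetical `Finding` record with the fields shown; the field names and the choice of median are illustrative and do not reflect a Metis output format.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Finding:
    # Hypothetical ticket fields, for illustration only.
    status: str                       # "valid", "invalid", or "inconclusive"
    opened: datetime
    closed: Optional[datetime] = None
    disputed: bool = False            # a developer contested the finding
    reopened: bool = False            # closed once, then opened again


def triage_kpis(findings: list[Finding]) -> dict:
    """Compute triage-cost metrics instead of a raw finding count."""
    # Hours spent before each false positive was closed.
    fp_hours = [
        (f.closed - f.opened).total_seconds() / 3600
        for f in findings
        if f.status == "invalid" and f.closed is not None
    ]
    total = len(findings)
    return {
        "median_false_positive_hours": median(fp_hours) if fp_hours else None,
        "inconclusive_backlog": sum(
            1 for f in findings if f.status == "inconclusive" and f.closed is None
        ),
        "dispute_rate": sum(f.disputed for f in findings) / total if total else 0.0,
        "reopen_rate": sum(f.reopened for f in findings) / total if total else 0.0,
    }
```

Tracking these numbers before and after introducing a triage tool gives a direct comparison of handling cost, which a finding count cannot provide.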

The competition is moving toward evidence

Arm's Metis release shows AI security tooling moving from "a chatbot answers" toward "a system collects evidence and assigns a status." Security teams cannot close findings from natural-language summaries alone. They need to know which function is the source, which guard is a sanitizer, where a macro expands, why a caller contract fails, and why a SAST rule fired incorrectly.

The strongest design choice in Metis is that the LLM decision is not the final decision. Deterministic evidence collection and adjudication sit around it. That is close to a minimum bar for AI security review. Even if a model reasons well, a security workflow needs reproducible evidence and auditable state transitions.

The unresolved questions remain significant. Arm's internal benchmark numbers have not been externally reproduced. The performance difference between a deployment using GPT-5.5-Cyber and one using a local model is not public. Arm has not broken down the 130 internal projects by language, codebase size, or bug class. Those details will determine how portable the results are.

Still, Metis sends a practical signal to developers and AppSec teams. As AI creates more vulnerability candidates, the bottleneck moves to triage. Tools that reduce the finding queue should be judged by evidence chains, provider boundaries, SARIF annotations, and deterministic gates more than model branding. The competition among security agents will not be won by answer speed alone. It will be won by proving why a finding should be trusted.