Microsoft MDASH preview ties 100 agents to Defender validation

Microsoft expanded the MDASH preview at Build 2026, citing 100+ agents, Defender integration, and vulnerability validation results.

AI summary
  • What happened: Microsoft used Build 2026 to expand the MDASH preview and connect it with Microsoft Defender.
    • MDASH combines 100+ specialized agents and multiple models across discovery, validation, deduplication, proof, and remediation workflows.
  • The numbers: Microsoft cites a 96.55% CyberGym score and retrospective MSRC recall of 96% on clfs.sys and 100% on tcpip.sys.
  • Builder impact: AI security tools are moving from model demos toward the quality of the validate, dedup, prove, and patch workflow.
    • The Defender and GitHub Code Security integrations bring runtime context, internet exposure, and data sensitivity into remediation priority.
  • Watch: The expanded preview is limited to eligible organizations, and benchmark scores do not replace production false-positive and patch-throughput data.

Microsoft Security said during its Build 2026 security announcements that codename MDASH is moving into an expanded preview for eligible organizations. The announcement landed on June 2, 2026. MDASH is Microsoft's multi-model agentic scanning harness for security work. The new part is not simply that Microsoft has another "AI vulnerability scanner." Microsoft is broadening access to MDASH and connecting the results to Microsoft Defender so development and security teams can view the same candidate findings, prioritize by exploitability and production signals, then route fixes toward GitHub Code Security and Copilot Autofix.

In the Build post, MDASH appears beside Microsoft's broader agent security stack: Agent 365, ASSERT, and the Agent Control Specification. The sharper context comes from Microsoft's May 12 technical article on MDASH. In that post, Microsoft said MDASH helped identify 16 new vulnerabilities in the Windows networking and authentication stack, including four Critical remote code execution issues. Those vulnerabilities were handled as part of the May 2026 Patch Tuesday cohort.

Microsoft MDASH vulnerability discovery pipeline

MDASH is structured very differently from asking one model to read a repository and "find bugs." Microsoft describes a pipeline with prepare, scan, validate, dedup, and prove stages. The prepare phase gathers the source target and builds language-aware indexes, attack-surface maps, and threat models. In the scan phase, specialized auditor agents search candidate paths and produce hypotheses with evidence. In the validate phase, another cohort of agents argues over reachability and exploitability. Deduplication groups semantically equivalent findings. The prove phase creates triggering inputs when the bug class and harness allow proof.
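To make the stage order concrete, here is a minimal Python sketch of a prepare, scan, validate, dedup, prove chain. It is not Microsoft's implementation: every function, field name, and sample finding is hypothetical, and each stage is a stub standing in for what the article describes as cohorts of specialized agents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    """A candidate finding. All field names are hypothetical."""
    path: str                    # code path an auditor flagged
    bug_class: str               # e.g. "use-after-free"
    evidence: str                # auditor's supporting note
    reachable: bool = False
    proof: Optional[str] = None  # triggering input, if one could be built


def prepare(target: str) -> dict:
    # Stand-in for index building, attack-surface mapping, and threat modeling.
    return {"target": target, "entry_points": ["recv_packet", "parse_header"]}


def scan(context: dict) -> list:
    # Stand-in for auditor agents: emit hypotheses with evidence.
    return [
        Finding("recv_packet", "use-after-free", "pointer used after release"),
        Finding("recv_packet", "use-after-free", "stale reference on later read"),
        Finding("internal_helper", "integer overflow", "unchecked length"),
    ]


def validate(findings: list, context: dict) -> list:
    # Stand-in for debater agents: keep findings reachable from an entry point.
    for f in findings:
        f.reachable = f.path in context["entry_points"]
    return [f for f in findings if f.reachable]


def dedup(findings: list) -> list:
    # Group by (path, bug_class). Real semantic grouping would be richer.
    unique = {}
    for f in findings:
        unique.setdefault((f.path, f.bug_class), f)
    return list(unique.values())


def prove(findings: list) -> list:
    # Stand-in for provers: attach a triggering input where one can be built.
    for f in findings:
        if f.bug_class == "use-after-free":
            f.proof = "crafted-input-001"
    return findings


def run_pipeline(target: str) -> list:
    context = prepare(target)
    return prove(dedup(validate(scan(context), context)))


results = run_pipeline("sample-driver")
print(len(results), results[0].proof)
```

In this toy run, three raw hypotheses collapse to one proved finding: one is dropped as unreachable and two are merged as duplicates. That shrinkage is the point of placing validation and deduplication between scanning and proof.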

Microsoft's most important framing is that "the model is one input, the system is the product." In MDASH, the product is not a single frontier model prompt. Microsoft says the system uses more than 100 specialized agents. Auditors, debaters, and provers have different jobs, tools, prompt regimes, and stopping criteria. The company gives one configuration example where state-of-the-art models act as heavy reasoners, distilled models handle high-volume debate, and another SOTA model provides an independent counterpoint.

  • 100+: specialized agents
  • 16: May 2026 Patch Tuesday CVEs
  • 96.55%: CyberGym score cited on June 2

The May technical article shows what Microsoft wants readers to take from those design choices. In a private device-driver test called StorageDrive, MDASH found all 21 deliberately injected vulnerabilities and Microsoft said the run produced zero false positives. StorageDrive is a sample driver used in offensive security researcher interviews and had not been public, which lowers the chance that a model had seen the answers in training data. The planted issues included kernel use-after-free cases, integer handling flaws, IOCTL validation gaps, and locking errors.

Microsoft also presented retrospective benchmark results. Against pre-patch snapshots of five years of confirmed MSRC cases, MDASH achieved 96% recall for 28 clfs.sys cases and 100% recall for seven tcpip.sys cases. That does not mean MDASH will find the next set of bugs at the same rate. Microsoft explicitly notes that retrospective recall on a finite internal benchmark should not be treated as a prediction for the next 28 CLFS bugs. The number still matters because the ground truth was not a toy dataset. These were MSRC-confirmed cases that tied back to real Patch Tuesday and incident-response work.
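The recall arithmetic is worth spelling out, because the sample sizes are small. The sketch below assumes that the 96% figure corresponds to 27 of 28 clfs.sys cases. That count is an inference from rounding, not a number stated in the source.

```python
def recall(found: int, total: int) -> float:
    """Recall = confirmed cases rediscovered / confirmed cases in the benchmark."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= found <= total:
        raise ValueError("found must be between 0 and total")
    return found / total


# Assumption: 27 of 28 is the count that rounds to the reported 96%.
clfs = recall(27, 28)
tcpip = recall(7, 7)
one_miss = 1 / 28  # how far a single missed case moves clfs.sys recall

print(f"clfs.sys: {clfs:.1%}")               # clfs.sys: 96.4%
print(f"tcpip.sys: {tcpip:.1%}")             # tcpip.sys: 100.0%
print(f"one miss costs: {one_miss:.1%}")     # one miss costs: 3.6%
```

With 28 cases, a single additional miss moves recall by about 3.6 percentage points, and with seven cases by about 14 points. That sensitivity is one reason a retrospective score on a finite benchmark is a weak predictor of future discovery rates.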

The public benchmark story centers on CyberGym. The May 12 article describes CyberGym as 1,507 real-world vulnerability reproduction tasks from 188 OSS-Fuzz projects, and says MDASH scored 88.45% in its level 1 default configuration. In the June 2 Build security article, Microsoft said MDASH had improved by roughly 10% in less than three weeks to reach a 96.55% CyberGym industry benchmark score. The same post says multiple models, hundreds of agents, and more than 100 trillion daily signals help separate exploitable risk from theoretical noise.
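The "roughly 10%" improvement can be read two ways, and a quick calculation shows how far apart they are. The implied task counts below assume the score is a simple pass rate over all 1,507 tasks, which the source does not state.

```python
before, after = 88.45, 96.55  # CyberGym scores cited on May 12 and June 2
tasks = 1507                  # task count cited in the May 12 article

point_gain = after - before                # absolute gain in percentage points
relative_gain = point_gain / before * 100  # gain relative to the earlier score

# Assumption: score = solved tasks / total tasks.
solved_before = round(tasks * before / 100)
solved_after = round(tasks * after / 100)

print(f"{point_gain:.2f} points, {relative_gain:.1f}% relative")
# 8.10 points, 9.2% relative
print(solved_before, solved_after)  # 1333 1455
```

Either way the claim holds loosely: the gain is 8.1 percentage points, or about 9.2% relative to the earlier score.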

The more operational evidence is the list of 16 CVEs. The May Patch Tuesday cohort included issues in tcpip.sys, ikeext.dll, telnet.exe, http.sys, netlogon.dll, and dnsapi.dll. Microsoft said 10 were kernel-mode vulnerabilities and six were user-mode vulnerabilities, with several reachable from a network position without credentials. The Critical examples included CVE-2026-33827, a tcpip.sys SSRR IPv4 packet use-after-free; CVE-2026-33824, an ikeext.dll IKEv2 SA_INIT double-free; CVE-2026-41089, a netlogon.dll CLDAP stack overflow; and CVE-2026-41096, a dnsapi.dll crafted UDP DNS response heap out-of-bounds issue.

Microsoft's long technical examples are there to support a narrower point: these were not obvious one-function bugs that a simple static prompt would reliably catch. In CVE-2026-33827, the problem sits in the Windows IPv4 receive path and the lifetime of a Path object. A reference is dropped, then the same pointer can be used later during SSRR processing. External subsystems such as the path cache scavenger, flush routines, and interface-state-driven garbage collection can drop the final reference. An attacker can influence the path through crafted IPv4 packet metadata, but the issue does not look like a straightforward release-then-use pattern inside one small function.

CVE-2026-33824 is easier to summarize but still crosses file and lifecycle boundaries. Microsoft describes an IKEEXT service double-free involving IKEv2 SA_INIT and fragmentation. It can be triggered with two UDP packets, with no race and no special timing. The root cause is a shallow copy: a receive context is cloned with a flat memcpy, but heap allocation ownership is not cloned correctly. Two contexts believe they own the same pointer and each frees it during teardown. Microsoft says the bug spans six source files and becomes visible when the faulty ownership lifecycle is compared with the correct pattern elsewhere in the same codebase.
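The ownership mistake can be illustrated with a toy model. This Python sketch is not the IKEEXT code and carries no exploit detail. `ToyHeap`, `ReceiveContext`, and their methods are hypothetical names that mimic a context copied field by field while both copies still believe they own the same allocation.

```python
import copy


class ToyHeap:
    """Tiny stand-in for a heap that detects double frees."""

    def __init__(self) -> None:
        self._live = set()
        self._next = 1

    def alloc(self) -> int:
        handle = self._next
        self._next += 1
        self._live.add(handle)
        return handle

    def free(self, handle: int) -> None:
        if handle not in self._live:
            raise RuntimeError(f"double free of handle {handle}")
        self._live.remove(handle)


class ReceiveContext:
    def __init__(self, heap: ToyHeap) -> None:
        self.heap = heap
        self.buffer = heap.alloc()  # this context owns the buffer

    def shallow_clone(self) -> "ReceiveContext":
        # Analogous to a flat memcpy: the handle is copied, ownership is not.
        return copy.copy(self)

    def owning_clone(self) -> "ReceiveContext":
        # Correct pattern: the clone gets its own allocation.
        clone = copy.copy(self)
        clone.buffer = self.heap.alloc()
        return clone

    def teardown(self) -> None:
        self.heap.free(self.buffer)


def teardown_both(clone_method: str) -> str:
    """Clone a context with the named method, then tear down both copies."""
    heap = ToyHeap()
    original = ReceiveContext(heap)
    clone = getattr(original, clone_method)()
    original.teardown()
    try:
        clone.teardown()
    except RuntimeError as err:
        return str(err)
    return "clean teardown"


print(teardown_both("shallow_clone"))  # double free of handle 1
print(teardown_both("owning_clone"))   # clean teardown
```

The bug is invisible in `teardown`, which looks correct on its own. It only appears when the clone path and the teardown path are read together, which matches the article's point about findings that cross file and lifecycle boundaries.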

Those examples explain why MDASH separates preparation, validation, and proof. Finding candidate bugs can easily become another backlog generator. Security teams need different evidence: whether a candidate is reachable from external input, whether it duplicates a known finding, whether a triggering input or ASan reproduction exists, and whether a patch prevents regression. Microsoft also describes a CLFS proving plugin. If a candidate finding exists but the system cannot create a triggering log file, it remains in triage. Domain plugins inject Microsoft-specific invariants such as on-disk container layout, block validation sequence, and in-memory state machines into the harness.

The Defender integration announced at Build 2026 moves this from research infrastructure into development workflow. Microsoft says MDASH-identified exploitable risks can flow into Microsoft Defender and GitHub Code Security integrations. The goal is to incorporate runtime context, internet exposure, and data sensitivity into vulnerability priority. From there, GitHub Copilot Autofix and the GitHub Copilot cloud agent can generate, assign, and validate AI-assisted fixes. The scanner output is not supposed to end as a security dashboard row. Microsoft wants it to become a pull request, issue, or fix workflow with production context attached.

| Stage | MDASH role | Development-team question |
|---|---|---|
| Discover | Auditor agents produce candidate findings and evidence. | Which code path and input source support this report? |
| Validate | Debater agents challenge and confirm reachability and exploitability. | Is there an independent review loop that reduces false positives? |
| Prove | Plugins and harnesses test triggering inputs or proof conditions. | Does this create reproducible evidence, or only more tickets? |
| Remediate | Defender, GitHub Code Security, and Copilot Autofix connect the fix path. | Do runtime exposure and data sensitivity affect the priority queue? |

MDASH also sits next to Anthropic's Project Glasswing story. Anthropic opened Claude Mythos Preview through gated access with a message aimed at critical infrastructure defenders and open-source maintainers. Microsoft is emphasizing a different layer. Rather than centering a gated frontier model, the MDASH announcement foregrounds the harness, Defender, GitHub Code Security, and Patch Tuesday ownership. Both stories start from the premise that AI can accelerate vulnerability discovery. Microsoft's version spends more space on whether the finding is proved and routed into a remediation process.

Community reaction around MDASH has not yet become a large public debate. The Korean research note found little sustained discussion on Hacker News or GeekNews directly around MDASH. Reddit posts in communities such as r/AZURE, r/SecOpsDaily, and r/accelerate mostly summarized the headline claims: more than 100 specialized agents and 16 Windows flaws. The related discussions are broader: how AI changes vulnerability disclosure, runtime security for agent frameworks, and responsibility for AI-era code review. MDASH is a concrete enterprise-preview version of that debate inside Microsoft's own security process.

For development teams, the first item to inspect is the evidence format. If an AI scanner report lacks affected version, reachable source, sink, exploit precondition, duplicate key, proof artifact, suggested patch, and regression test, speed turns into backlog growth. Microsoft's separation of debaters and provers addresses that exact failure mode. "This looks vulnerable" consumes maintainer time. "Two UDP packets reproduce this IKEEXT double-free, this host policy leaves it reachable, and the correct ownership pattern exists in another file" becomes a verifiable unit of work.
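A team could turn that checklist into an intake gate. The sketch below is a hypothetical schema, not any vendor's report format. The field names are taken from the list above, and a report counts as actionable only when every field is filled in.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class FindingReport:
    """Evidence fields a scanner report should carry (hypothetical schema)."""
    affected_version: Optional[str] = None
    reachable_source: Optional[str] = None
    sink: Optional[str] = None
    exploit_precondition: Optional[str] = None
    duplicate_key: Optional[str] = None
    proof_artifact: Optional[str] = None
    suggested_patch: Optional[str] = None
    regression_test: Optional[str] = None


def missing_evidence(report: FindingReport) -> list:
    # Return the names of fields that are empty or unset, in schema order.
    return [f.name for f in fields(report) if not getattr(report, f.name)]


def is_actionable(report: FindingReport) -> bool:
    # Accept a report into the fix queue only when nothing is missing.
    return not missing_evidence(report)


vague = FindingReport(sink="free(ctx->buffer)")
print(is_actionable(vague))            # False
print(missing_evidence(vague)[:2])     # ['affected_version', 'reachable_source']
```

Rejecting incomplete reports at intake keeps "this looks vulnerable" out of the maintainer queue until the scanner, or a human, supplies the missing evidence.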

The second item is production signal. Vulnerability severity is not enough to decide remediation order. Two bugs in the same class have different urgency depending on whether the service is internet-facing, whether it touches sensitive data, and whether the exploit precondition exists in default configuration. Microsoft calls out internet exposure and data sensitivity in the Defender and GitHub Code Security integration because AI-generated findings will increase queue volume. Runtime context has to move into remediation triage, not remain in a separate operations system.
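One way to picture runtime context entering triage is a score that adjusts a base severity. The weights and names below are invented for illustration. The source does not describe how Defender or GitHub Code Security actually compute priority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeContext:
    """Production signals for one deployed service (hypothetical fields)."""
    internet_facing: bool
    touches_sensitive_data: bool
    precondition_in_default_config: bool


def priority(base_severity: float, ctx: RuntimeContext) -> float:
    # Hypothetical weights: raise priority for exposure and sensitive data,
    # lower it when the exploit precondition needs a non-default setup.
    score = base_severity
    if ctx.internet_facing:
        score *= 1.5
    if ctx.touches_sensitive_data:
        score *= 1.3
    if not ctx.precondition_in_default_config:
        score *= 0.5
    return round(score, 2)


edge_service = RuntimeContext(True, True, True)
internal_tool = RuntimeContext(False, False, False)

print(priority(8.0, edge_service))   # 15.6
print(priority(8.0, internal_tool))  # 4.0
```

Both findings start at the same severity, yet the internet-facing service ends up well ahead in the queue. Whatever the real weighting, the inputs have to be available at triage time for this kind of ordering to work.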

The third item is model portability. Microsoft says MDASH's targeting, validation, deduplication, and proof stages are model-agnostic enough that new models can enter the panel through configuration flips and A/B tests. Teams buying or building AI security tools should ask less about whether the current model is GPT-family or Claude-family and more about whether scope files, plugins, calibration, and historical CVE knowledge survive model rotation. In a market where model names change every few months, losing project-specific security knowledge every time the model changes raises the total cost of ownership.
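A configuration-level sketch shows what surviving model rotation could look like. The roles loosely follow the heavy reasoner, debater, and counterpoint example Microsoft gives. The class, the model identifiers, and the file names are placeholders, not MDASH's real configuration.

```python
from dataclasses import dataclass, replace

MODEL_ROLES = {"heavy_reasoner", "debater", "counterpoint"}


@dataclass(frozen=True)
class PanelConfig:
    """Hypothetical scanner configuration: model slots plus project knowledge."""
    heavy_reasoner: str
    debater: str
    counterpoint: str
    scope_files: tuple  # project-specific targeting that should outlive models
    plugins: tuple      # domain plugins that should outlive models


def rotate_model(config: PanelConfig, role: str, new_model: str) -> PanelConfig:
    # Swap a single model slot and leave project knowledge untouched.
    if role not in MODEL_ROLES:
        raise ValueError(f"unknown model role: {role}")
    return replace(config, **{role: new_model})


current = PanelConfig(
    heavy_reasoner="model-a",
    debater="model-b-distilled",
    counterpoint="model-c",
    scope_files=("scope/network.yaml",),
    plugins=("log-format-prover",),
)
candidate = rotate_model(current, "heavy_reasoner", "model-d")

print(candidate.heavy_reasoner)                      # model-d
print(candidate.scope_files == current.scope_files)  # True
```

Keeping both configurations around is what makes an A/B comparison possible: the same scope files and plugins run against two panels, so any difference in findings can be attributed to the model change.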

The fourth item is ownership. The 16 CVEs Microsoft discussed became publishable security outcomes because they connected to internal code owners, Microsoft Offensive Research and Security Engineering, WARP, and the Patch Tuesday process. A company can deploy a similar agentic scanner and still fail if maintainer, AppSec, product owner, release manager, and customer communication roles are not defined. The success metric is not scan throughput alone. It is triage SLA, embargo policy, fix validation, downstream release process, and the ability to explain the change after deployment.

The fifth item is misuse and disclosure scope. Microsoft's technical post deliberately withholds some exploitation detail. In the IKEEXT double-free section, for example, it discusses the corruption primitive but says further exploit detail is not provided. As AI systems become better at reconstructing vulnerability paths, the boundary between technical transparency and exploit reproduction gets more sensitive. Development organizations need separate access control for internal tickets, prompt logs, proof-of-concept artifacts, and crash dumps. Output from a security model can be more sensitive than a normal code-review comment.

The practical conclusion is that the evaluation criteria for vulnerability discovery products are changing. In 2024 and 2025, the front question was often which model understood code best. In 2026, announcements like MDASH and Glasswing push a different question: after a vulnerability is found, who validates it and gets it fixed? Microsoft highlights 100+ agents and a 96.55% CyberGym score, but the more durable message is the workflow that connects Defender, GitHub Code Security, Copilot Autofix, and Patch Tuesday-style ownership.

The next metrics to watch are specific. First, how far MDASH's expanded preview moves beyond the initial eligible organizations. Second, whether CyberGym scores correlate with enterprise false-positive rate, duplicate rate, and patch acceptance rate on proprietary codebases. Third, whether the Defender and GitHub Code Security integrations work at the pull-request level with low enough friction for developers to keep using them. AI finding speed is now the starting point. Security and development teams need to know whether those findings become reproducible evidence, priority decisions, fix PRs, and deployment records.