Microsoft ASSERT and ACS split agent policy into tests and runtime blocks

Microsoft introduced ASSERT and ACS for agent governance, turning natural-language policy into evaluation artifacts and runtime enforcement points.

AI 요약

What happened: Microsoft introduced ASSERT and ACS during its Build 2026 announcement cycle.
- ASSERT turns natural-language requirements into evaluation data and scorecards, while ACS enforces policies at eight points in an agent loop.
Developer impact: Agent safety is moving from system prompts and SDK callbacks into manifests, traces, policy artifacts, and audit records.
Why it matters: Teams can test agent behavior before release and block risky tool calls or outputs during runtime with a separate contract.
- The model can still reason probabilistically, but the host runtime gets deterministic places to apply allow, warn, deny, or escalate.
Watch: Microsoft also flags LLM-judge variance, synthetic-test limits, and possible breaking changes before general availability.

Microsoft used the Build 2026 announcement window on June 2, 2026 to publish two pieces of agent-governance infrastructure. The first is ASSERT, an open-source framework that turns natural-language behavior requirements into executable evaluations. The second is Agent Control Specification, or ACS, a manifest and SDK contract for checking policies as an agent receives input, calls a model, invokes tools, and returns output.

The release is more than another guardrail product because Microsoft placed evaluation and enforcement in the same operating loop. The Microsoft Foundry Build recap groups ASSERT, ACS, Guided Guardrail Setup, Rubric, tracing, Agent Optimizer, and Agent ROI under trust, evaluation, and observability. The message is that a production agent needs both pre-release evidence that it behaves inside the intended boundary and runtime checkpoints that can stop an action before it reaches an API, file, inbox, or database.

Many teams still handle agent policy through a familiar stack: write prohibitions in a system prompt, add custom checks in framework callbacks, run classifiers on input and output, then inspect traces when something fails. Microsoft’s ACS post argues that this pattern fragments across frameworks and leaves policy engines without a standard view of where the agent is in its loop. Traditional IAM can answer whether a credential may call the Slack API. It does not answer whether this agent, in this conversation, after reading a confidential document, should post a summary into a channel that includes external members.

ASSERT evaluation framework diagram

ASSERT handles the evaluation side of that problem. Microsoft’s Responsible AI team describes it as a framework for converting “natural language behavior requirements” into executable evaluations. The inputs are documents teams already write: product requirements, policy documents, system prompts, launch checklists, safety reviews, and notes from domain experts. The outputs are local artifacts such as taxonomies, test sets, inference traces, scores, and metrics.

The pipeline has four stages. ASSERT first turns broad behavior expectations into concept specifications, then breaks those concepts into a taxonomy of allowed and disallowed behaviors. It then generates stratified test cases across dimensions such as task type, persona, available tools, request class, and environment configuration. The third stage runs those test cases against the target model, agent, or application workflow. The final stage scores each trace against the taxonomy and policy stance, attaching verdicts, rationales, and policy citations.

Microsoft’s example is a travel-planning agent with five tools: search_flights, search_hotels, check_weather, check_travel_advisories, and validate_budget. The agent must build an itinerary inside a six-turn budget. The evaluation looks for missing or incorrect tool use, fabricated flights, fabricated hotels, fake prices, budget violations, stereotyping, tool-output prompt injection, and sycophantic agreement with unsafe travel plans. That example explains why final-answer scoring is not enough for agents. A polished itinerary can hide a wrong tool argument, ignored advisory, or manipulated tool result.

The ASSERT announcement includes two internal validation signals. In a coverage study across five behavior categories and three target models, Microsoft says ASSERT covered about 1.2 times more of the intended behavior space than a comparable in-house baseline. It surfaced about 1.5 times more inspect-worthy cases, separated stronger and weaker systems by more than 4 times, and reduced saturated cases by roughly half. Microsoft also reported about twice as many failure patterns, while warning that failure-type labeling was less stable than coverage or model separation and should be read directionally.

The second validation track concerns judging. Microsoft says it ran LLM-judge scoring across more than ten behavior concepts, then sampled by risk level for human and independent review. LLM-judge agreement with human annotators typically landed in the 80-90% range, while human inter-annotator agreement was around 90%. Those numbers do not turn ASSERT into a compliance-certification system. Microsoft explicitly notes that judge strictness, boundary sensitivity, and near-neighbor behavior distinctions can vary.

ACS takes the other side of the release process: runtime enforcement. Microsoft describes ACS as an open, vendor-neutral standard that is independent of a particular framework, runtime, or policy engine. Its central artifact is a portable manifest. The manifest defines which policy runs at which intervention point, what snapshot the host runtime sends, how evidence providers such as classifiers, DLP systems, and LLM judges attach their findings, and how the host applies verdicts such as allow, warn, deny, escalate, or redaction.

ACS defines eight interception points. agent_startup checks configuration and environment before the agent starts. Input inspects user input before it enters model context. pre_model_call examines the full context before the model is called, and post_model_call checks the model response before the runtime executes it. pre_tool_call evaluates tool names and parameters before execution, while post_tool_call inspects tool output before it re-enters model context. output checks the final response before the user sees it, and agent_shutdown covers session termination and audit handling.

The difference from many existing guardrails is the shape of policy input. A canonical ACS input can include intervention_point, policy_target, snapshot, annotations, and tool. The snapshot can carry actor identity, roles, conversation state, prior tool calls, data sensitivity, and approval status. The annotations field can carry evidence from a classifier, DLP system, LLM judge, or external service. A policy engine receives that structured input, returns a normalized verdict, and the host runtime applies the result.

Microsoft’s email-agent example connects a Rego policy to pre_tool_call in a manifest. When the agent tries to call send_email, the policy examines the arguments and denies the call if the recipient is external. Microsoft says the same manifest and snapshot can be read by Python and Node hosts to produce the same verdict. The Agent Governance Toolkit also includes .NET and Rust conformance fixtures for the same snapshot-verdict behavior.

The word “standard” needs careful reading here. As of June 2026, ACS is an open specification and reference implementation published by Microsoft, not a settled industry standard ratified by a consortium. Still, the practical signal is clear. Microsoft names CrewAI, LangChain, AutoGen, Semantic Kernel, OpenAI Agents SDK, and Anthropic tool-use callbacks as part of the same problem space. Agent frameworks may change, but enterprise teams want one policy bundle to follow the agent across runtimes.

The Agent Governance Toolkit repository shows where Microsoft expects ACS to run. The README labels the toolkit as Public Preview and starts with pip install agent-governance-toolkit[full] for Python 3.10 or later. It lists examples for Python, TypeScript, .NET, Rust, and Go. The repository’s framing is blunt: OAuth scopes and IAM roles control which services an agent can reach, but not what the agent does after it is connected. It calls prompt-level safety a polite request to a stochastic system, not an enforceable control surface.

That sentence marks the practical split in agent security. Prompt-injection defenses become fragile when user input, retrieved content, tool output, and attacker-controlled text all share the same token stream. ACS tries to move the stop point out of that stream. Whether the model was persuaded or confused, the host runtime can block delete_file, drop_table, an external email, or confidential-data forwarding before the action leaves application code.

For development teams, ASSERT and ACS together can change the release process. ASSERT can turn requirements such as “refunds under this threshold may be auto-approved,” “restricted findings must not be quoted in public summaries,” or “change-control agents may not cross an approval boundary” into test sets and scorecards. ACS can carry the related organization policy into pre_tool_call or output checks that deny, warn, or escalate during runtime. One piece finds failures before deployment. The other blocks behavior after deployment.

The security-team workflow changes as well. In the older pattern, an agent owner writes custom logic inside an SDK callback and a reviewer reads that logic in a pull request. In the ACS pattern, the security team can version a policy bundle and manifest, the host runtime can provide canonical inputs and evidence, and the audit log can record which policy allowed or denied a tool call. For regulated or internally audited workflows, the sharper question is not “did the model decide this was safe?” It is “which policy ran at which point, with which evidence, and what verdict did the host enforce?”

Microsoft Security’s Build 2026 post puts the same theme into the development lifecycle. It mentions agentic risk detection, policy control, Microsoft Purview controls, Data Security Posture Management, and risk detection for coding agents such as Claude Code, GitHub Copilot, OpenAI Codex, and OpenClaw. Windows adds OS-level containment through the MXC SDK. Foundry adds hosted agents, toolboxes, memory, tracing, Agent Optimizer, and ROI dashboards. ASSERT and ACS sit inside that larger package as the artifacts that make policy visible as files, traces, and verdicts.

The competitive field is already active. OpenAI Agents SDK has guardrails. Anthropic’s tool-use flows can apply callbacks and policies. LangChain callback handlers, Semantic Kernel filters, and direct OPA/Rego integrations are already used in production-like systems. ACS does not erase those extension points. It argues that the market lacks a standard hook shape, input structure, and enforcement contract that can travel above individual frameworks. Adoption will depend on whether adapters stay maintained and whether policy authors can use the same manifest outside Microsoft’s stack.

ASSERT also competes with existing evaluation infrastructure: OpenAI Evals, LangSmith, Arize Phoenix, OpenInference-based tracing, and internal benchmarks. Its more specific claim is that behavior specification should become a first-class input. Instead of only measuring generic helpfulness, relevance, groundedness, or toxicity, ASSERT starts from product requirements and policy boundaries, then generates the test taxonomy around those requirements. The repository says it can use LiteLLM for more than 100 model endpoints and OpenInference auto-instrumentation to capture traces from more than 33 frameworks with two lines of code.

There are real caveats. ASSERT creates synthetic interactions, so failures that appear only in production traffic can still be missed. Ambiguous policy documents produce ambiguous taxonomies and test cases. LLM judges vary in strictness and boundary decisions. ACS depends on host runtimes passing accurate snapshots, evidence providers responding reliably, and product teams deciding what happens when a provider fails or arrives late. The toolkit is in Public Preview, so breaking changes before general availability remain part of the adoption cost.

The durable lesson for AI builders is not that Microsoft has already won the agent-governance layer. It is that agent safety requirements are becoming executable artifacts. Before release, requirements need to become datasets, traces, scores, and reviewable failure cases. After release, the same organization policy needs to run at checkpoints that can return allow, warn, deny, or escalate. Microsoft put ASSERT and ACS on the same Build 2026 stage because enterprise adoption is increasingly blocked by that artifact layer, not only by model capability.

Community discussion is still early. The Korean research pass did not find large Hacker News or GeekNews threads focused specifically on ASSERT or ACS. TechCrunch framed ACS as Microsoft’s attempt to let developers, compliance teams, and security teams define agent policies and inspect behavior at granular interception points. A Reddit r/AI_Agents thread grouped Microsoft ACS with Noma, Netskope, Immuta, and Outreach announcements and asked whether agent governance is becoming as important as agent capability. The volume is small, but the question matches where production agents are heading: when agents call APIs and change records, buyers ask who approved the action and what evidence remains.

For teams already building agents, three checks are useful now. First, can agent requirements be converted from PRD or policy prose into evaluation artifacts? Second, does the runtime have hooks immediately before tool calls and final output? Third, do audit records keep traces, policy citations, verdict rationales, and evidence for sensitive actions? ASSERT and ACS may or may not become the exact standard the industry adopts, but those three questions belong in a 2026 agent release checklist.