Foundry Hosted Agents near GA with per-session sandboxes

Microsoft Foundry expanded its agent operations layer with hosted agents, Toolboxes, procedural memory, and Agent Optimizer.

AI summary

What happened: Microsoft used Build 2026 to outline the next phase of Foundry Agent Service, including hosted agents moving toward general availability.
- The Foundry blog says GA within 30 days, while the Build recap points to early July 2026, with per-session sandboxes and dedicated filesystems as the headline runtime details.
Operating layer: Agent Framework v1.0, Toolboxes, procedural memory, tracing, evaluations, and Agent Optimizer form the post-prototype surface.
Builder impact: The evaluation question shifts from model choice to runtime isolation, tool governance, memory policy, trace quality, and cost telemetry.
- Several capabilities remain in preview, and Microsoft 365 Copilot or Teams publishing still depends on tenant identity flows and policy configuration.
Watch: Microsoft cites sub-100 ms cold starts and zero idle cost, but GA pricing, regions, and real production logs need customer validation.

Microsoft published a Foundry Agent Service update on June 2, 2026, during Build 2026. The center of the announcement was not a new frontier model or another chat surface. Microsoft described the runtime and operations stack needed after an agent prototype starts handling production work. The product story is organized around three layers: build, deploy, and operate. Inside those layers, Microsoft places Agent Framework, Toolboxes, hosted agents, memory, publishing into Teams and Microsoft 365 Copilot, tracing and evaluations, and Agent Optimizer.

This is a different Microsoft agent story from the recent Work IQ APIs, MAI model launches, or Windows isolation updates. Work IQ focused on Microsoft 365 context and credits. The Windows 365 agent story focused on local or desktop-style containment. This Foundry update asks where an agent runs, which tool endpoint it calls, how session state is isolated, and how failed runs return as evaluation data and improvement candidates. For development teams, the practical question is less "which model should we use?" and more "who can inspect and fix an agent that failed overnight?"

Official Microsoft Foundry Toolboxes demo GIF

The Foundry blog names the current bottleneck directly. Local prototypes have become easier because coding agents and orchestration frameworks can help create the first version quickly. Enterprise workflows are harder because every tool and data source brings its own authentication model, protocol, lifecycle, and permission boundary. Production agents also need session isolation, durable state, runtime capacity, traces, evaluations, and a route from failure data back to improvement. Microsoft compares the situation to the discovery, isolation, observability, and deployment problems microservices faced roughly a decade ago.

Agent Framework becomes the harness

The first component is Microsoft Agent Framework. The Foundry Build Edition recap describes Agent Framework as a stable release for Python and .NET, with an agent harness, skills, memory, and middleware. The same recap lists GitHub Copilot SDK and Claude Agent SDK integrations as stable, and includes the Magentic-One multi-agent orchestration pattern in the stable-release set.

That framing is important because Microsoft is not trying to make every team rewrite an agent into a single new framework before it can run on Foundry. The message is closer to runtime compatibility. A team that has already invested in LangGraph, Copilot SDK, Claude Agent SDK, or Microsoft Agent Framework should be able to connect that work to the Foundry deployment and operations surface. The commercial opportunity for Microsoft sits in the control plane around agents as much as in the framework itself.

The framework also gives Microsoft a cleaner place to define skills, middleware, memory, and multi-agent coordination without tying every capability to one chat product. That matters for teams building agents that are not simply assistants. A code-review agent, a report-generation agent, or a support-triage agent needs execution rules, tool boundaries, memory policy, and failure handling. The framework is the harness where those rules become reusable instead of being repeated inside prompts.

Hosted agents move from prototype runtime to product runtime

The second component is hosted agents. The Foundry blog says hosted agents in Foundry Agent Service will reach general availability within 30 days, while the Build Edition recap says general availability is expected in early July 2026. The technical promise is specific: each session runs in its own sandbox with dedicated compute, memory, and filesystem access. The runtime is framework-agnostic, and Microsoft says agents built with Microsoft Agent Framework, GitHub Copilot SDK, LangGraph, or other SDKs can be deployed without a rewrite.

Microsoft Build Live attaches stronger operational claims to the same runtime. It says hosted agents execute untrusted code in per-session sandboxes, with sub-100 ms cold starts and zero idle cost. Those two phrases matter for teams calculating agent operations costs. Some agents need low-latency user interaction, such as voice and chat flows. Others wake on a schedule, triage issues, generate reports, or monitor repositories between long idle periods. Zero idle cost is a meaningful promise for the second category, but billing details, telemetry, and customer invoices will need to confirm how it works after GA.
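A rough cost comparison shows why zero idle cost matters most for agents that sleep between runs. The sketch below is illustrative only: every rate is a hypothetical placeholder, not Microsoft pricing, and billing per active second is an assumption about how zero idle cost might be metered.

```python
def always_on_cost(hourly_rate: float, hours: float) -> float:
    """Cost of a runtime billed for every hour, whether busy or idle."""
    return hourly_rate * hours


def per_session_cost(rate_per_active_second: float,
                     sessions: int,
                     seconds_per_session: float) -> float:
    """Cost when only active session time is billed (assumed billing model)."""
    return rate_per_active_second * sessions * seconds_per_session


# Hypothetical numbers for illustration only; not Foundry pricing.
HOURLY_RATE = 0.10                    # always-on instance, per hour
ACTIVE_RATE = HOURLY_RATE / 3600 * 2  # per active second, assumed 2x premium
HOURS_IN_MONTH = 24 * 30

# A scheduled agent that wakes once a night for ten minutes.
nightly = per_session_cost(ACTIVE_RATE, sessions=30, seconds_per_session=600)
baseline = always_on_cost(HOURLY_RATE, HOURS_IN_MONTH)
print(f"always-on: {baseline:.2f}  per-session: {nightly:.2f}")
```

Under these assumed rates the nightly agent is billed for 5 active hours a month instead of 720, so even a 2x per-second premium leaves it far cheaper. A chat agent that is busy most of the day would see a much smaller gap, which is why the real billing granularity needs to be checked after GA.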

The hosted-agent protocol story is split into two paths. Foundry supports a Responses API for OpenAI-compatible stateful interaction. It also supports an Invocations protocol for schema-free pass-through when the developer wants more control over request and response format. That split says Microsoft is not defining agent hosting only through OpenAI-style API compatibility. Teams that already have an orchestration stack can pass their payload through the Invocations path, while teams that want stateful interaction can use the Responses API shape.
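The difference between the two paths can be sketched as two request builders. This is a minimal illustration, not Foundry's client API: the `model`, `input`, and `previous_response_id` fields follow the public OpenAI Responses request shape, while the agent name, the response id, and the pass-through payload are hypothetical, and how Foundry actually frames Invocations requests is not specified in the announcement.

```python
import json
from typing import Any, Optional


def build_responses_body(model: str,
                         user_input: str,
                         previous_response_id: Optional[str] = None) -> dict[str, Any]:
    """Build a request body in the OpenAI Responses API shape.

    Passing previous_response_id chains this turn onto server-side state.
    """
    body: dict[str, Any] = {"model": model, "input": user_input}
    if previous_response_id is not None:
        body["previous_response_id"] = previous_response_id
    return body


def build_invocation_body(payload: Any) -> bytes:
    """Serialize an application-defined payload for schema-free pass-through.

    No schema is imposed here; the agent code on the other side must parse it.
    """
    return json.dumps(payload).encode("utf-8")


# Hypothetical agent name and response id, for illustration only.
first_turn = build_responses_body("my-hosted-agent", "Summarize open issues")
follow_up = build_responses_body("my-hosted-agent", "Only the P1s",
                                 previous_response_id="resp_123")

# An orchestration stack can send whatever structure it already uses.
custom = build_invocation_body({"graph_state": {"node": "triage"},
                                "issue_ids": [17, 42]})
```

The design trade-off is where the contract lives: in the first path the platform understands the conversation state, while in the second the team owns the request format and its parsing.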

Long-running agents and routines are part of the same deployment story. Microsoft says hosted agents support long-running autonomous agents such as OpenClaw and Hermes, durable state, filesystem access, and routines in public preview. The example is an agent that watches a GitHub repository overnight, triages new issues, and posts a summary to Teams before standup. That example is useful because the hard parts are scheduling, identity, filesystem access, notifications, and failure recovery, not only language-model quality.

Toolboxes compress the tool surface

Toolboxes in Foundry are the next major piece. The Build Edition recap says Toolboxes are in public preview and expose tools, skills, MCP clients, and governance as a managed endpoint. A developer configures tools once, then points an MCP-compatible client at one URL. Foundry handles authentication, lifecycle, and governance around that endpoint.

The reason is practical. An agent prompt with an ever-growing list of tools does not scale well. As the tool count rises, the model spends tokens and probability mass deciding which tool to call, and failures become harder to diagnose. Toolboxes move the system toward a managed tool endpoint where enterprise data, MCP clients, Work IQ, Fabric IQ, and Foundry IQ can sit behind the same access surface. Microsoft describes this as reducing custom plumbing. For operations teams, it is also a control point: which identity called which tool version, which tool was exposed to which agent, and what audit trail was produced.

Skills and tool search fit into that design. Microsoft says skills are project-scoped, versioned resources in preview, and tool search can select tools by task. Those features sound ordinary until a production agent has dozens of possible connectors. A support agent, for example, may need CRM, billing, documentation, incident status, identity, and internal ticketing tools. If the wrong connector is exposed or selected, the agent can fail even when the model answer looks plausible. Tool search quality and tool-version governance become measurable parts of the deployment.
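To make "select tools by task" concrete, here is a deliberately naive sketch that ranks tools by word overlap between the task and each tool description. It is not Foundry's tool search algorithm, which the announcement does not describe; the `Tool` record, the catalog entries, and the scoring rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    version: str
    description: str


def search_tools(task: str, catalog: list[Tool], limit: int = 2) -> list[Tool]:
    """Return up to `limit` tools whose descriptions share words with the task."""
    task_words = set(task.lower().split())
    scored = []
    for tool in catalog:
        overlap = len(task_words & set(tool.description.lower().split()))
        if overlap:  # never expose a tool with no textual match at all
            scored.append((overlap, tool))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:limit]]


# Hypothetical connector catalog for a support agent.
CATALOG = [
    Tool("crm", "1.2.0", "look up customer account records"),
    Tool("billing", "3.0.1", "retrieve invoice and refund status for a customer"),
    Tool("incidents", "0.9.4", "check current incident status for a service"),
]

selected = search_tools("refund status for invoice 1042", CATALOG)
print([f"{t.name}@{t.version}" for t in selected])
```

Even this toy version shows why search quality is measurable: common words such as "for" and "status" pull in the incident tool as a second match, which is exactly the kind of wrong-connector exposure a team would want to catch in evaluation. Carrying the version on each record keeps the selection auditable.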

Area	Build 2026 status	What teams should verify
Hosted agents	GA within 30 days or early July 2026	Sandboxing, filesystem, cold start, idle cost, regions, pricing
Agent Framework	Stable release for Python and .NET	Integration with LangGraph, Copilot SDK, and Claude Agent SDK
Toolboxes	Public preview	MCP endpoint setup, tool versions, auth, audit, tool search quality
Memory	Procedural, user, and session memory in public preview	Storage scope, deletion policy, repeated-task success rate, token cost
Operate loop	Tracing and evals in preview; Agent Optimizer soon in preview	Trace-to-eval linkage, rollback, approval workflow, ROI metrics

Procedural memory is the most concrete memory claim

The memory update is more interesting when reduced to the exact categories Microsoft names. Memory in Foundry Agent Service is in public preview and includes procedural, user, and session memory. Procedural memory is not just remembering what a user said. It is remembering how the agent should perform a task across runs.

Microsoft's example is a pull-request review agent. A developer can coach the agent once to check test coverage first, flag new dependencies, and look for breaking API changes. Weeks later, the agent should apply that procedure to another pull request without the team attaching the same instructions to every prompt. If this works reliably, the agent's operating procedure becomes a stored resource rather than prompt boilerplate.
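The idea of a procedure as a stored resource can be sketched with a small in-memory store. This is an assumption-laden illustration, not the Foundry memory API: the `ProceduralMemory` class, its method names, and the `pr_review` task key are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProceduralMemory:
    """Maps a task type to the ordered steps the agent was coached to follow."""
    procedures: dict[str, list[str]] = field(default_factory=dict)

    def coach(self, task_type: str, steps: list[str]) -> None:
        """Store (or replace) the procedure for a task type."""
        self.procedures[task_type] = list(steps)

    def instructions_for(self, task_type: str, request: str) -> str:
        """Prepend the stored procedure, if any, to a new request."""
        steps = self.procedures.get(task_type, [])
        if not steps:
            return request
        numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
        return f"Follow this procedure:\n{numbered}\n\nTask: {request}"


memory = ProceduralMemory()

# Coached once by a developer.
memory.coach("pr_review", [
    "Check test coverage first",
    "Flag new dependencies",
    "Look for breaking API changes",
])

# Weeks later, a different pull request reuses the same procedure.
prompt = memory.instructions_for("pr_review", "Review the latest pull request")
print(prompt)
```

The point of the sketch is the separation: the caller supplies only the new request, and the procedure is looked up by task type rather than pasted into every prompt.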

Microsoft cites early Tau-bench results showing absolute success-rate gains of 7 to 14 percentage points from procedural memory at near-baseline cost. That number should be read carefully. The announcement does not provide enough independent detail about benchmark setup, task distribution, baseline cost, memory write policy, or memory retrieval policy to treat it as a general performance guarantee. It is still a useful signal about product direction. Microsoft wants repetitive workflow knowledge to move out of long prompts and into memory that the operations layer can inspect and tune.
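One reason to read the figure carefully is that an absolute gain means different things at different baselines. The short calculation below makes that explicit. The baselines are hypothetical, since the announcement does not state one.

```python
def relative_gain(baseline: float, absolute_gain: float) -> float:
    """Relative improvement implied by an absolute gain in success rate."""
    if not 0 < baseline <= 1:
        raise ValueError("baseline must be a success rate in (0, 1]")
    return absolute_gain / baseline


# Hypothetical baselines; the announcement does not state one.
for baseline in (0.30, 0.50, 0.80):
    low = relative_gain(baseline, 0.07)
    high = relative_gain(baseline, 0.14)
    print(f"baseline {baseline:.0%}: relative gain {low:.1%} to {high:.1%}")
```

At an assumed 50% baseline, a 7-point gain is a 14% relative improvement. At a lower baseline the same absolute gain is a larger relative jump, so the baseline is one of the first details to ask for.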

Memory policy will be one of the hard production questions. User memory, session memory, and procedural memory carry different privacy and governance implications. A session memory item may be discarded after a task. A procedural memory item may intentionally persist because it changes how the agent works. A user memory item may need explicit deletion, scope control, or compliance review. Teams evaluating Foundry memory should ask what is stored, who can inspect it, how it is deleted, whether it appears in traces, and how memory affects cost.
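The different lifetimes can be expressed as a small retention policy. This is a sketch under stated assumptions, not Foundry behavior: the `Scope` values mirror the three memory types Microsoft names, but the `MemoryItem` record and both purge functions are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    SESSION = "session"
    USER = "user"
    PROCEDURAL = "procedural"


@dataclass(frozen=True)
class MemoryItem:
    scope: Scope
    owner: str    # session id, user id, or project id, depending on scope
    content: str


def end_session(items: list[MemoryItem], session_id: str) -> list[MemoryItem]:
    """Drop session-scoped items for a finished session; keep everything else."""
    return [i for i in items
            if not (i.scope is Scope.SESSION and i.owner == session_id)]


def forget_user(items: list[MemoryItem], user_id: str) -> list[MemoryItem]:
    """Honor an explicit deletion request for one user's memory."""
    return [i for i in items
            if not (i.scope is Scope.USER and i.owner == user_id)]


store = [
    MemoryItem(Scope.SESSION, "sess-1", "draft summary of ticket 88"),
    MemoryItem(Scope.USER, "alice", "prefers weekly digests"),
    MemoryItem(Scope.PROCEDURAL, "proj-a", "check test coverage first"),
]

store = end_session(store, "sess-1")
store = forget_user(store, "alice")
print([item.scope.value for item in store])
```

Procedural items survive both purges here by design, which is why they need their own review and change history rather than inheriting user or session deletion rules.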

The operate loop tries to connect traces, evals, and fixes

The operate layer is the part of the announcement closest to production operations. The Build Edition recap says tracing and evaluations for any agent framework are in public preview, including LangChain, Semantic Kernel, and custom agent frameworks. Agent Optimizer is described as coming soon to public preview. The workflow is that Foundry AI Operations Service runs an evaluation suite, sends results into Foundry Optimizer, and produces ranked improvement candidates across prompts, tools, skills, and context. Agent ROI is in private preview for task completion rate, time saved, and cost efficiency.

This loop only works if traces and evaluations point to the same incident. Agent failure is rarely a single "wrong answer" event. The agent may have selected the wrong tool, lacked permission, reused stale memory, retrieved from the wrong source, hit a runtime policy block, or passed malformed arguments into a connector. If production logs and evaluation results live in separate systems, the team still has to reconstruct the failure manually. Microsoft's product direction is to make the failed run produce a ranked set of changes rather than another dashboard.
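Linking the two data sources comes down to a join on a shared run identifier. The sketch below assumes hypothetical record shapes (`run_id`, `step`, `status`, `passed`); Foundry's actual trace and evaluation schemas are not described in the announcement.

```python
from collections import defaultdict


def link_failures(traces: list[dict], evals: list[dict]) -> dict[str, list[str]]:
    """Group failed evaluation runs by the first failing trace step."""
    first_error: dict[str, str] = {}
    for span in traces:  # spans are assumed to be in time order
        if span["status"] == "error" and span["run_id"] not in first_error:
            first_error[span["run_id"]] = span["step"]

    grouped: dict[str, list[str]] = defaultdict(list)
    for result in evals:
        if not result["passed"]:
            cause = first_error.get(result["run_id"], "no_trace_error")
            grouped[cause].append(result["run_id"])
    return dict(grouped)


# Hypothetical production spans and evaluation results.
traces = [
    {"run_id": "r1", "step": "tool_selection", "status": "ok"},
    {"run_id": "r1", "step": "connector_call", "status": "error"},
    {"run_id": "r2", "step": "tool_selection", "status": "error"},
    {"run_id": "r3", "step": "tool_selection", "status": "ok"},
]
evals = [
    {"run_id": "r1", "passed": False},
    {"run_id": "r2", "passed": False},
    {"run_id": "r3", "passed": False},
    {"run_id": "r4", "passed": True},
]

report = link_failures(traces, evals)
print(report)
```

The `no_trace_error` bucket is the interesting one: a run that failed evaluation with a clean trace points to a quality problem such as stale memory or a wrong source, not a runtime fault.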

The tricky part is approval. An optimizer that proposes prompt, tool, skill, or context changes still needs a release path. Some changes are low risk, such as rewriting instructions for a report format. Others can alter which tools the agent selects or what data it retrieves. In enterprise environments, the operate loop needs rollback, review, change history, and policy enforcement. The announcement points in that direction, but actual deployments will decide whether Agent Optimizer becomes a trusted improvement loop or another recommendation panel.
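One way to picture a release path is a gate that routes ranked candidates by change type. This is a hypothetical policy sketch, not how Agent Optimizer works: the `Candidate` record, the risk tiers, and the queue names are assumptions.

```python
from dataclasses import dataclass

# Assumed risk tier; a real deployment would define its own policy.
LOW_RISK_KINDS = {"prompt"}


@dataclass(frozen=True)
class Candidate:
    kind: str       # "prompt", "tool", "skill", or "context"
    summary: str
    score: float    # ranking score from the optimizer, higher is better


def route(candidates: list[Candidate]) -> dict[str, list[Candidate]]:
    """Split candidates into auto-stage and needs-review queues, best first."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    queues: dict[str, list[Candidate]] = {"auto_stage": [], "needs_review": []}
    for candidate in ranked:
        # Anything not explicitly low risk fails closed into human review.
        if candidate.kind in LOW_RISK_KINDS:
            queues["auto_stage"].append(candidate)
        else:
            queues["needs_review"].append(candidate)
    return queues


queues = route([
    Candidate("prompt", "Tighten report format instructions", 0.62),
    Candidate("tool", "Expose billing connector to triage agent", 0.91),
    Candidate("context", "Retrieve from archived tickets", 0.40),
])
print({name: [c.summary for c in items] for name, items in queues.items()})
```

Failing closed matters more than the specific tiers: a change type the policy has never seen should land in review, not in staging.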

Governance and distribution enter the same announcement

Governance appears through Agent Control Specification and Agent 365. Build Live says Agent Control Specification is in preview and is meant to define and enforce what production agents can do across Foundry, Microsoft Agent Framework, and LangChain. Microsoft also describes Agent 365, Entra, Purview, and Defender as a way to catalog an agent estate, show who deployed agents, identify what data and tools they can access, and monitor cost and behavior.

That matters once an organization moves from a handful of prototypes to dozens or hundreds of agents. A model registry does not answer which user sponsored an autonomous agent, which data sources it can reach, which MCP servers it attached, or why it called a tool last night. Microsoft is assembling an enterprise control plane around those questions using the products many large customers already have in place.

Distribution is another Build 2026 axis. The Foundry blog says publishing Foundry agents into Microsoft Teams and Microsoft 365 Copilot is planned for general availability in June 2026. The claim is that identity, permissions, and policy flow automatically. Microsoft also adds autopilot agents in public preview, alongside assistive and autonomous agents. Autopilot agents can act independently and have Entra Agent ID, an email address, Microsoft Teams presence, and a position in the organization chart.

Those details make the agent feel less like an app integration and more like a managed organizational actor. That is powerful, but it is also where deployment risk concentrates. Teams and Microsoft 365 Copilot publishing can look simple in a demo while depending on tenant login, Entra policy, app consent, Copilot exposure, Teams presence, and audit-log configuration. The Korean source noted a Reddit r/AZURE case where a user trying to expose a Foundry agent through Microsoft 365 Copilot, Teams, or Web365 was stuck waiting for Foundry login completion. It is only one community report, but it shows how a distribution path can become an identity-flow problem in practice.

The competitive frame is agent operations

The competitive context is now larger than model APIs. AWS is moving with Bedrock AgentCore and Bedrock Managed Agents around runtime, gateway, identity, and observability. Google is pushing Gemini API managed agents and Antigravity-style developer workflows with sandboxed execution. OpenAI has Agents SDK and Codex products that expand the agent harness around code and tool use. Vercel targets agent execution inside web applications through Sandbox and AI Gateway.

Microsoft's advantage is the enterprise control plane. Foundry can attach to Microsoft 365, Teams, Entra, Purview, Defender, Azure model catalogs, and existing tenant policy. That does not automatically make it the best runtime for every agent. A lightweight code-execution agent may fit another sandbox. A web-app product team may prefer a platform-native runtime. But for enterprises already operating inside Microsoft identity, compliance, and collaboration surfaces, Foundry can offer a shorter path from agent prototype to governed deployment.

The evaluation checklist should reflect that shift. For hosted agents, teams should test session isolation, filesystem persistence, routine scheduling, cold starts, supported regions, private networking, log retention, and the actual billing behavior behind zero idle cost. For Toolboxes, teams should test MCP endpoint configuration, credential rotation, tool-version history, tool search errors, and audit events. For memory, they should test storage scope, deletion controls, procedural-memory drift, and whether repeated-task success improves without increasing hidden prompt cost.

The strongest sentence in the announcement is the diagnosis that agent prototypes are easy and production agents are hard. Microsoft then breaks that difficulty into runtime, tools, memory, distribution, observability, and governance. If Foundry Hosted Agents reach GA on the announced timeline, the comparison point will not be only the model list. It will be the per-session sandbox, tool endpoint, trace-to-eval loop, tenant policy, and cost telemetry around the agent. Build 2026's Foundry update is not just about building agents. It is about keeping them running inside enterprises after the demo works.