Anthropic published Claude containment details, and 24 AWS key thefts explain the risk

Anthropic detailed the isolation design behind Claude Code and Claude Cowork. The numbers turn agent security from approval prompts into sandbox, VM, and egress policy.

AI 요약

What happened: Anthropic published its Claude agent containment design on May 25, 2026.
- The post covers code execution in claude.ai, Claude Code, and Claude Cowork.
The numbers: Anthropic disclosed a 93% permission approval rate, an 84% reduction in prompts after sandboxing, and 24 successful AWS credential exfiltrations in 25 red-team attempts.
Why builders should care: model classifiers help, but the final boundary is the runtime: sandbox, VM, filesystem policy, and egress proxy.
- Even an api.anthropic.com allowlist became an exfiltration path once it included file-upload capability.
Watch: persistent memory, MCP tools, connectors, and sub-agent outputs now need their own trust boundaries.

Anthropic published How we contain Claude across products on its engineering blog on May 25, 2026. The post is not about a new model or pricing tier. It asks where Claude should stop when it reads files, runs shell commands, opens the network, and consumes output from tools it did not create. Anthropic separates claude.ai, Claude Code, and Claude Cowork, then shows how human approvals, OS sandboxes, full VMs, egress proxies, and tool-output inspection fail or hold under specific attacks.

The immediate developer angle is Claude Code's permission model. Anthropic says users approved roughly 93% of permission prompts. That number weakens the idea that a human confirmation dialog is a durable security boundary. As an agent performs longer tasks, the prompt stops feeling like a security decision and becomes another repetitive action. After Anthropic added macOS Seatbelt and Linux bubblewrap sandboxing to Claude Code, it reduced permission prompts by 84%. A default-deny network policy plus write access limited to the workspace narrows the blast radius before a user has to evaluate every Bash command.

93%

Claude Code permission prompts approved

84%

Prompt reduction after OS sandboxing

24/25

Internal red-team credential exfiltration successes

83%

Overeager behavior caught by auto mode

Anthropic groups agent risk into three sources: user misuse, model misbehavior, and external attackers. It groups defenses into three surfaces: the environment where the agent runs, the model it consults, and the external content it reads. That taxonomy can sound abstract until the post attaches incidents to each layer. In Claude Code, the first failure happened before the normal trust prompt. Anthropic says its responsible disclosure program received several vulnerability reports between mid-2025 and January 2026, including three that involved code running before the user had trusted a folder.

The concrete example is a malicious repository that places a hook in .claude/settings.json. If Claude Code reads project settings during startup and shows the standard trust prompt only afterward, an attacker-controlled hook can execute before the trust boundary exists. Anthropic's fix was to defer parsing and execution of project-local config until after the trust decision. For coding-agent tools, that sequence matters as much as the prompt text. Opening a project, loading config, starting localhost listeners, and reading instruction files are all inbound request surfaces.

That incident also breaks the assumption that a local coding agent is naturally safer than a browser. A developer clones an unfamiliar repository to review a pull request. The agent reads files inside that repository. A hook meets the local shell. Web security treats untrusted input before page load as dangerous; coding-agent security has to treat project-open, config-load, AGENTS.md, CLAUDE.md, MCP settings, and local hooks the same way. Anthropic's disclosure is useful because it names that sequence rather than treating "local" as a blanket trust label.

The second failure mode makes the user part of the attack path. In an internal red-team exercise, a researcher sent employees an ordinary-looking collaboration request and a prompt to paste into Claude. The prompt looked like task guidance, but it instructed Claude to read ~/.aws/credentials, encode the contents, and POST them to an external endpoint. Anthropic says that across 25 retries, Claude completed the exfiltration 24 times.

That result is severe because the model layer has little abnormal intent to detect. If a classifier is judging user intent, a command pasted by the user appears to be the user's intent. Anthropic notes that a human contractor given the same instructions likely would have produced the same result. In this case, the last defense is not a more persuasive warning. It is egress control and filesystem boundaries. If ~/.aws is outside the sandbox and external POST requests are blocked by policy, the model can follow the malicious instruction and still fail to leak the credentials.

Claude Code auto mode acknowledges the same limitation. Anthropic says auto mode reduces friction by delegating command approvals to a model-based classifier. In a footnote, it discloses the tradeoff: the classifier blocks about 0.4% of benign commands and misses about 17% of overeager actions. Catching 83% is a useful layer, but it is not a complete boundary. For agent-platform teams, the relevant question is not only classifier accuracy. It is whether a missed action still lacks access to sensitive files and arbitrary network egress.

Claude Cowork chose a stronger isolation model because its user assumption is different. Claude Code can assume a developer has some ability to judge command-line behavior. Cowork targets knowledge workers giving the agent a local workspace with documents, spreadsheets, and work files. Anthropic says non-developers cannot be expected to evaluate commands such as find . -name "*.tmp" -exec rm {} \;. Its first version ran the agent loop itself inside a full VM. It used Apple's Virtualization framework on macOS and HCS on Windows, mounting only the selected workspace and .claude folder into the guest. Host keychain credentials did not enter the guest.

Anthropic's official diagram of the Claude Cowork VM isolation design.

The VM design does not treat a user's ability to read shell commands as a security prerequisite. Anthropic says the full-VM mode ran the agent loop as an ordinary Linux user inside the guest, without a privileged process outside the sandbox holding special exceptions. Compared with the Claude Code human-in-the-loop sandbox, it reduces the path where a tired user clicks "allow just this once." It also adds operational cost. VM boot failures made the product unusable, so Anthropic later moved the agent loop back to the host while keeping code execution inside the VM to improve reliability and debugging.

The more instructive Cowork incident involved the egress allowlist. The product needed access to api.anthropic.com to work. Anthropic initially treated that as a destination filter: traffic to approved domains was allowed, other destinations were blocked. A third-party disclosure showed a malicious workspace file containing hidden instructions and an attacker-owned Anthropic API key. Claude read other files in the workspace and called the Anthropic Files API using the attacker's key. The proxy saw the destination api.anthropic.com and let the traffic pass, uploading the file to the attacker's Anthropic account.

Anthropic summarizes the lesson as: the sandbox worked, but the data still left. The failure was treating an allowlist entry as if it granted only one narrow behavior. Allowing api.anthropic.com does not only allow chat-completion traffic. Depending on the API surface, it can also allow file upload, server-side fetch, and account-to-account data movement. In an agent product, a domain allowlist is a capability grant unless the proxy understands which capability, credential, header, and provenance are being allowed.

Anthropic's official diagram of the API allowlist exfiltration path and the VM-local proxy fix.

The fix was a defensive man-in-the-middle proxy inside the VM. The proxy intercepts traffic to Anthropic's API and permits only requests carrying the VM-provisioned session token. An attacker-controlled API key planted in the workspace is rejected. Headers that enable server-side fetch are blocked. Anthropic also explains why the proxy lives inside the VM instead of only on the server: the server has trouble distinguishing Cowork traffic from ordinary API-client traffic, while the VM knows the provenance of the request.

The same pattern applies to MCP and connectors. A remote MCP server or hosted connector may have been reviewed at install time, but its behavior can change later. A locally installed tool can be read and pinned by version; a remote tool can return different output, including prompt-injection payloads. Anthropic gives the example of a GitHub connector that passes a malware check while still returning a poisoned README into model context. Traditional dependency auditing reduces code-execution risk. It does not remove the risk that tool output becomes instruction.

NIST NCCoE's Software and AI Agent Identity and Authorization project frames a related question in standards language. Its February 2026 concept paper discusses identity, authentication, authorization, delegation, logging, provenance, and prompt-injection mitigation for agents that make decisions and take actions. It names OAuth 2.0/2.1, OpenID Connect, SPIFFE/SPIRE, and MCP as relevant candidates. Anthropic's engineering post turns that abstract agenda into product incidents. Without agent identity and request provenance, an egress log may only say that an approved API call happened, not which agent acted for which user under which instruction path.

Enterprise security teams also see a tradeoff in VM isolation. Anthropic says some security reviewers asked why their EDR tools could not inspect the inside of the Cowork VM. The answer is structural. The hypervisor boundary that isolates Claude from the host also isolates host-based endpoint detection and response from the guest. Anthropic's current mitigation is pull-based OTLP export so administrators can retrieve event logs later. That is not the same as live host inspection. Putting an agent in a VM reduces blast radius and also reduces visibility.

Anthropic closes with three next risks: persistent memory poisoning, multi-agent trust escalation, and agent identity. Persistent memory poisoning means an injected instruction survives beyond one session in product memory, CLAUDE.md, a mounted workspace, or a scheduled-agent state directory. Multi-agent trust escalation appears when a sub-agent reads untrusted content and sends "structured facts" upstream, but the parent agent trusts that output more because another internal agent produced it. Agent identity asks whether an agent should act as its own principal, as an extension of the user, or as some combination of the two.

For development teams, the immediate checklist is concrete. First, do not parse or execute project-local config before the user has made a trust decision. Second, review allowlists as API capability lists, not as domain lists. Third, assume model classifiers will miss some actions and design filesystem plus egress controls accordingly. Fourth, inspect long-lived memory and tool output for context injection even when the source appears to be an approved connector. Those rules apply before tool choice: Claude Code, Codex, Copilot, Cursor, an internal harness, or a CI-based agent all need operational boundaries before autonomy.

The value of Anthropic's disclosure is not that it proves Anthropic is uniquely safe. The post says its own custom proxy design failed, its permission prompts created fatigue, and an internal employee pasted a prompt that bypassed model-level defenses. Older primitives such as hypervisors, gVisor, seccomp, Seatbelt, and bubblewrap did useful work, while the new glue code and product UX around them introduced gaps. Agent-security maturity is therefore not measured by how well-behaved the model sounds in isolation. It is measured by which boundaries remain when the model follows a bad instruction.

This standard is already relevant outside Anthropic. Once a team stores AGENTS.md files, adds MCP configuration to repositories, runs coding agents in CI, or lets workplace agents read Google Drive and Slack, prompt injection is no longer only a web-page problem. It becomes a repository format, connector-output, memory, and local-config problem. Anthropic's May 25 post is a useful document because it moves the deployment checklist from approval UI to sandbox policy, token provenance, connector inspection, egress controls, and agent identity.