Claude’s five-hour error window asks how agents recover from model outages

Claude’s June 2 model error incident shows why AI agent products need retry, checkpoint, fallback, and human handoff design.

AI summary
  • What happened: Claude Status logged an Opus 4.6 elevated errors incident on June 2, 2026, from 06:04 to 11:49 UTC.
    • The public updates moved through investigating, identified, monitoring, and resolved states across a 5 hour 45 minute incident window.
  • Developer impact: Teams using the Claude API, Claude.ai, or Claude Code need to treat model errors as workflow failures, not only chat interruptions.
  • Operational lesson: Agent runners need retry policy, checkpoints, idempotency keys, fallback rules, and human handoff before outages happen.
    • A coding agent that loses state can leave a partial diff, duplicate an external write, or resume with the wrong model permissions.
  • Watch: The official status page did not publish a root-cause breakdown, so this article separates the verified timeline from operational design implications.

Anthropic's Claude Status page recorded an elevated errors incident for Claude Opus 4.6 on June 2, 2026. The public timeline started at 06:04 UTC with an investigation, moved to identified at 06:39, continued with a fix in progress at 09:33, entered monitoring at 10:42, and was marked resolved at 11:49. The visible incident window was 5 hours and 45 minutes.
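
The interval arithmetic behind that figure can be checked in a few lines. This is a minimal sketch that uses only the timestamps quoted above; the dictionary keys and the helper name are illustrative, not part of any status-page API.

```python
from datetime import datetime

# Timestamps (UTC) as quoted from the public status timeline for June 2, 2026.
PHASES = {
    "investigating": "06:04",
    "identified": "06:39",
    "fix_in_progress": "09:33",
    "monitoring": "10:42",
    "resolved": "11:49",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


total = minutes_between(PHASES["investigating"], PHASES["resolved"])
print(divmod(total, 60))  # prints (5, 45)
```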

That record is easy to file under ordinary service availability, but the impact is different when Claude is running inside a developer workflow. A chat session that hangs is annoying. A coding agent that hits a model error while editing a branch can leave a half-written diff, an incomplete transcript, an unreviewed shell command, or a task queue that retries the same external action. For agent products, model availability is part of workflow recovery.

Claude Status incident timeline for June 2, 2026

Source: Claude Status.

Automated Reddit status alerts in r/ClaudeAI also preserved model-level signals from the same day. One June 2 alert at 06:39 UTC referenced elevated errors across Claude Opus 4.7, Opus 4.6, and Sonnet 4.6. Another at 10:42 UTC focused on Opus 4.6 elevated errors. Those posts are closer to status-page mirrors than community analysis, but they show that model-specific availability is now tracked by developers as a practical operating event.

TechRadar covered the same Claude Status updates and described user-facing symptoms such as Claude.ai feeling slow or remaining in a "still working on it" state. Where the failure lands matters. In a chat UI, the user sees a stalled answer. In an agent runner, the failure can appear as an HTTP error, timeout, retry loop, incomplete tool transcript, or lost continuation state.

The public status page did not explain the underlying cause. This article does not infer GPU capacity, deployment bugs, routing behavior, abuse filtering, or model-serving changes from the status record. The verified facts are narrower: a June 2 incident from 06:04 to 11:49 UTC, the public status phases, and external status alerts that mentioned Opus 4.6, Opus 4.7, and Sonnet 4.6 during the same period.

The first design question for agent teams is retry. LLM retries are not the same as retries for a small idempotent REST call. Sending the same prompt again can spend the same tokens again, trigger another tool call, and produce a materially different plan. If an issue-triage agent has already created a Jira ticket when the model call times out, a blind retry may create a duplicate ticket. If a deployment assistant reruns a migration command, the second attempt may not be harmless.
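
One way to make a retry safe for external writes is to record each completed tool call under a stable key and replay the stored result instead of calling the tool again. The sketch below assumes an in-memory dictionary in place of a durable store, and every name in it (`run_tool_once`, `idempotency_key`, `create_ticket`) is hypothetical.

```python
import hashlib
import json
from typing import Any, Callable, Dict

# Assumption: a dict stands in for a durable store of completed tool calls.
_completed: Dict[str, Any] = {}


def idempotency_key(task_id: str, tool: str, args: Dict[str, Any]) -> str:
    """Derive a stable key from the task, the tool name, and its arguments."""
    payload = json.dumps({"task": task_id, "tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_tool_once(task_id: str, tool: str, args: Dict[str, Any],
                  call: Callable[[Dict[str, Any]], Any]) -> Any:
    """Run an external write at most once per key; replay the stored result on retry."""
    key = idempotency_key(task_id, tool, args)
    if key in _completed:
        return _completed[key]
    result = call(args)
    _completed[key] = result
    return result


created = []


def create_ticket(args: Dict[str, Any]) -> Dict[str, int]:
    """Hypothetical external write; appends to a list instead of calling a tracker."""
    created.append(args["title"])
    return {"ticket_number": len(created)}


first = run_tool_once("triage-17", "create_ticket", {"title": "Login fails"}, create_ticket)
# Simulated blind retry after a model timeout: same task, same tool, same arguments.
second = run_tool_once("triage-17", "create_ticket", {"title": "Login fails"}, create_ticket)
print(len(created))  # prints 1
```

A local record like this only helps if the result was stored before the failure. Where the external system accepts an idempotency key of its own, passing the same key through is the stronger guarantee.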

The second question is checkpointing. Long-running coding agents need to persist the plan, current branch, applied diff, test results, failed command output, and approval state at each meaningful step. When the model comes back 20 minutes later, the product should resume from the last verified state instead of starting from the original prompt. Teams that use Claude Code, Codex, Gemini CLI, and internal runners should avoid binding checkpoints to one vendor transcript format if the work needs cross-tool recovery.
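
A vendor-neutral checkpoint can be as plain as a JSON file holding the fields listed above. The following is a minimal sketch under that assumption; the `Checkpoint` fields, file layout, and function names are illustrative and do not reflect any tool's actual transcript format.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Checkpoint:
    task_id: str
    step: int                      # index of the last verified step
    plan: List[str]
    branch: str
    applied_diff: str
    test_results: str = ""
    failed_command_output: str = ""
    approval_state: str = "not_required"


def save_checkpoint(cp: Checkpoint, directory: str) -> str:
    """Write the checkpoint to a temp file, then rename it so readers never see a partial file."""
    path = os.path.join(directory, f"{cp.task_id}.json")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(asdict(cp), handle)
    os.replace(tmp_path, path)
    return path


def load_checkpoint(task_id: str, directory: str) -> Optional[Checkpoint]:
    """Return the saved checkpoint, or None if the task has never been checkpointed."""
    path = os.path.join(directory, f"{task_id}.json")
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as handle:
        return Checkpoint(**json.load(handle))


with tempfile.TemporaryDirectory() as workdir:
    saved = Checkpoint(
        task_id="patch-42",
        step=3,
        plan=["read failing test", "edit app.py", "run tests"],
        branch="agent/patch-42",
        applied_diff="--- a/app.py\n+++ b/app.py\n",
    )
    save_checkpoint(saved, workdir)
    restored = load_checkpoint("patch-42", workdir)
    missing = load_checkpoint("never-started", workdir)
```

Because the file carries no model-specific transcript, any runner that can read JSON can pick up from `step` after re-checking the worktree.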

Fallback is the third question, and it is more restrictive than "send the request to another model." Moving from Opus 4.6 to Sonnet 4.6 or to another provider can change context limits, tool-use behavior, safety policy, reasoning quality, latency, and cost. A safe fallback policy should be keyed by task class. Read-only analysis can often move across models with a note about drift. File edits should resume only after checking the last diff. External writes need idempotency keys. Pending approvals need a human-facing status summary before another model continues.

| Task state | Outage risk | Recovery rule |
| --- | --- | --- |
| Read-only analysis | The conclusion can drift across models or retries. | Retry with model, cost, and status metadata attached to the trace. |
| File edits in progress | The agent may leave partial changes or conflict with a resumed run. | Checkpoint the diff, re-read the worktree, then resume from the last verified step. |
| External API write | A retry can duplicate a ticket, alert, payment, or database mutation. | Require an idempotency key and store the previous tool result before retrying. |
| Approval pending | The user can return after the model context, permission boundary, or queue state changed. | Show the saved state, requested permission, and next action before continuing. |
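
The four task states can be turned into a fallback gate keyed by task class. This is a rough sketch rather than a complete policy: the enum members mirror the task states in the table, while the boolean fields and the function name are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class TaskClass(Enum):
    READ_ONLY = "read_only"
    FILE_EDIT = "file_edit"
    EXTERNAL_WRITE = "external_write"
    APPROVAL_PENDING = "approval_pending"


@dataclass(frozen=True)
class TaskState:
    task_class: TaskClass
    diff_verified: bool = False        # last diff re-read against the worktree
    has_idempotency_key: bool = False  # external write can be retried safely
    human_acknowledged: bool = False   # user has seen the saved state summary


def may_fall_back(state: TaskState) -> bool:
    """Decide whether another model may continue this task."""
    if state.task_class is TaskClass.READ_ONLY:
        return True
    if state.task_class is TaskClass.FILE_EDIT:
        return state.diff_verified
    if state.task_class is TaskClass.EXTERNAL_WRITE:
        return state.has_idempotency_key
    # APPROVAL_PENDING: a human must see the status summary first.
    return state.human_acknowledged


print(may_fall_back(TaskState(TaskClass.READ_ONLY)))       # prints True
print(may_fall_back(TaskState(TaskClass.EXTERNAL_WRITE)))  # prints False
```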

Status-page integration therefore becomes a product requirement, not just an operations dashboard. A user should be able to see the provider status, model name, request ID, retry count, last successful step, and preserved artifacts before asking why an agent stopped. After an incident clears, the product needs a policy for whether it automatically resumes, requests a review, retries with the same model, or rebuilds the plan with a fallback model.
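
The user-visible fields named above fit in a small record that the product renders whenever an agent stops. The sketch below is only one possible shape; the `StopReport` type, the wording of each line, and the example values (including the request ID) are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StopReport:
    provider_status: str
    model: str
    request_id: str
    retry_count: int
    max_retries: int
    last_successful_step: str
    preserved_artifacts: List[str] = field(default_factory=list)


def render_stop_report(report: StopReport) -> str:
    """Format the report as plain text for a task detail view."""
    artifacts = ", ".join(report.preserved_artifacts) or "none"
    lines = [
        f"Provider status: {report.provider_status}",
        f"Model: {report.model}",
        f"Request ID: {report.request_id}",
        f"Retries used: {report.retry_count} of {report.max_retries}",
        f"Last successful step: {report.last_successful_step}",
        f"Preserved artifacts: {artifacts}",
    ]
    return "\n".join(lines)


summary = render_stop_report(StopReport(
    provider_status="elevated errors",
    model="Claude Opus 4.6",
    request_id="req_example_123",  # placeholder, not a real request ID
    retry_count=2,
    max_retries=5,
    last_successful_step="tests passed on commit 3 of 5",
    preserved_artifacts=["draft diff", "test log"],
))
print(summary)
```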

The SLO for an agent product cannot be copied directly from the provider's uptime number. User experience is shaped by model availability, queue capacity, rate limits, token budgets, sandbox health, repository locks, and external tool state. Claude's incident was visible for 5 hours and 45 minutes, but that does not mean every agent task stopped for exactly that long. Some tasks may have recovered quickly. Others may have remained blocked after the status page turned green because queues, retries, or review backlogs still had to drain.

The phases on the Claude Status page give agent teams a useful state machine. During investigating, a product can pause new long-running jobs and keep short read-only tasks separate. During identified, it can continue already-started jobs only when checkpoints and idempotency controls are present. During monitoring, it can restart low-risk work before resuming tasks that modify files, call external systems, or require human approval. That mapping is more useful than a generic banner saying the AI provider is degraded.
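
That mapping can be written as a small admission function. The sketch below follows the three phases as described and fills the gaps with labeled assumptions: the text does not say what happens to already-started jobs during investigating or to short read-only tasks during identified, so the choices marked in the comments are guesses. All type and field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"


@dataclass(frozen=True)
class Job:
    started: bool
    long_running: bool
    read_only: bool
    has_checkpoint: bool = False
    has_idempotency_controls: bool = False


def may_run(phase: Phase, job: Job) -> bool:
    """Decide whether a job may run while the provider is in the given phase."""
    if phase is Phase.RESOLVED:
        return True
    if phase is Phase.INVESTIGATING:
        # Pause new long-running jobs. Assumption: started jobs keep going.
        return job.started or not job.long_running
    if phase is Phase.IDENTIFIED:
        if job.started:
            return job.has_checkpoint and job.has_idempotency_controls
        # Assumption: only short read-only work is admitted as new work.
        return job.read_only and not job.long_running
    # MONITORING: restart low-risk (read-only) work first; writes wait for resolved.
    return job.read_only


new_batch_job = Job(started=False, long_running=True, read_only=False)
print(may_run(Phase.INVESTIGATING, new_batch_job))  # prints False
```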

Queue design needs the same level of detail. If a user asks an agent platform to patch 200 repositories and a provider outage starts midway, FIFO execution is too blunt. A read-only scan, a pull request draft, a production config edit, and an external ticket write carry different side effects. Queue metadata should include task class, write scope, external system, idempotency key, approval requirement, branch name, and fallback eligibility. Without that metadata, an outage turns into manual triage.
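
The metadata fields listed above are enough to replace FIFO draining with a risk-aware split during an outage. This is an illustrative sketch: the `QueueItem` fields come from the list in the text, but the task class names, the risk ranking, and the hold rule are assumptions a real platform would set for itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed ranking, lowest risk first.
RISK_RANK = {"read_only_scan": 0, "pr_draft": 1, "external_ticket_write": 2, "prod_config_edit": 3}


@dataclass(frozen=True)
class QueueItem:
    task_id: str
    task_class: str
    write_scope: str                 # e.g. "none", "branch", "production"
    external_system: Optional[str]
    idempotency_key: Optional[str]
    approval_required: bool
    branch_name: Optional[str]
    fallback_eligible: bool


def outage_plan(queue: List[QueueItem]) -> Dict[str, List[QueueItem]]:
    """Split a FIFO queue into items to run now (lowest risk first) and items to hold."""
    run_now: List[QueueItem] = []
    hold: List[QueueItem] = []
    for item in queue:
        unsafe_write = item.external_system is not None and item.idempotency_key is None
        if item.approval_required or unsafe_write:
            hold.append(item)
        else:
            run_now.append(item)
    # list.sort is stable, so FIFO order is kept within each task class.
    run_now.sort(key=lambda item: RISK_RANK[item.task_class])
    return {"run_now": run_now, "hold": hold}


queue = [
    QueueItem("t1", "prod_config_edit", "production", None, None, True, "cfg/t1", False),
    QueueItem("t2", "pr_draft", "branch", None, None, False, "agent/t2", True),
    QueueItem("t3", "external_ticket_write", "none", "jira", None, False, None, False),
    QueueItem("t4", "read_only_scan", "none", None, None, False, None, True),
]
plan = outage_plan(queue)
print([item.task_id for item in plan["run_now"]])  # prints ['t4', 't2']
print([item.task_id for item in plan["hold"]])     # prints ['t1', 't3']
```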

Observability also has to connect model calls to workflow artifacts. A useful trace should tie together prompt ID, model name, provider status snapshot, tool-call ID, branch, commit hash, sandbox ID, retry attempt, and fallback decision. After an incident like June 2, reviewers should be able to answer whether a pull request was created during degraded service, whether a fallback model edited the diff, and whether the final tests ran after the recovery. Logs that only say "model request failed" will not reconstruct those facts later.
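
A trace record that carries those identifiers makes the three reviewer questions answerable with a simple scan. The sketch below assumes events arrive in time order and that the provider status snapshot is a plain string; the `TraceEvent` type, the `kind` values, and the model labels are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TraceEvent:
    kind: str                  # "edit", "test_run", or "pr_created"
    prompt_id: str
    model: str
    provider_status: str       # snapshot at call time, e.g. "operational" or "degraded"
    tool_call_id: str
    branch: str
    commit_hash: Optional[str]
    sandbox_id: str
    retry_attempt: int
    fallback_used: bool


def review_summary(events: List[TraceEvent], branch: str) -> Dict[str, bool]:
    """Answer the reviewer questions for one branch; events must be in time order."""
    on_branch = [e for e in events if e.branch == branch]
    tests = [e for e in on_branch if e.kind == "test_run"]
    return {
        "pr_created_while_degraded": any(
            e.kind == "pr_created" and e.provider_status != "operational" for e in on_branch
        ),
        "fallback_edited_diff": any(e.kind == "edit" and e.fallback_used for e in on_branch),
        "final_tests_after_recovery": bool(tests) and tests[-1].provider_status == "operational",
    }


events = [
    TraceEvent("edit", "p1", "opus-4.6", "degraded", "tc1", "agent/fix", "abc123", "sb1", 1, False),
    TraceEvent("edit", "p2", "sonnet-4.6", "degraded", "tc2", "agent/fix", "def456", "sb1", 0, True),
    TraceEvent("pr_created", "p3", "sonnet-4.6", "degraded", "tc3", "agent/fix", "def456", "sb1", 0, True),
    TraceEvent("test_run", "p4", "opus-4.6", "operational", "tc4", "agent/fix", "def456", "sb2", 0, False),
]
answers = review_summary(events, "agent/fix")
print(answers)
```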

User-facing language affects trust during the outage. "The AI is thinking" hides the difference between long reasoning, a provider error, an internal retry, and a stuck tool call. A better interface tells the user whether the provider is degraded, how many retries remain, what will be preserved if they cancel, and whether the next attempt could use a different model. In a general chat product, a cancel button mostly stops a conversation. In an agent product, the same button may decide what happens to a branch, log, ticket, or audit trail.

Postmortems should separate provider responsibility from internal agent design. The June 2 record belongs to Anthropic's model service. But if a customer saw duplicate tickets, corrupted branches, runaway token spend, or fallback models with broader permissions, those are product-layer failures. A useful internal postmortem should list the provider incident URL, internal impact window, affected task classes, failed retry policy, exposed customer message, manual recovery time, and any side effects that needed cleanup.

Runbooks need to evolve accordingly. A traditional API outage runbook might say to check the status page, widen retry intervals, and notify support. An agent runbook needs decisions about whether to lock active branches, preserve partial diffs as draft pull requests, pass transcripts to fallback models, stop external writes, and ask humans to re-approve shell commands. Those are not model-evaluation questions. They are workflow recovery procedures.

Contract and security reviews should ask the same questions before production deployment. When a provider status page changes to resolved, does the customer's SLA clock stop immediately, or only after queued work is recovered? Can prompts, source code, and tool outputs move to a fallback provider during degraded service? How long are outage-era logs retained? Which model permission profiles are allowed to execute writes? A fallback policy is also a data-sharing and authorization policy.

Cost control is part of the same failure path. Aggressive retries can add request cost, token cost, and human cleanup time even when the original job did not finish. A fallback model may have a lower per-token price, but the full recovery can require longer prompts, extra verification, and another review pass. For agent products, the expensive part of an outage is often not the model price table. It is the number of times a partially completed workflow has to be reconstructed.
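
A retry policy can cap both how often it retries and how many tokens it is willing to spend doing so. The sketch below is deliberately simple: the function names and every number are made up for illustration, and jitter, which production backoff usually adds, is omitted to keep the output deterministic.

```python
from typing import List


def backoff_schedule(base_seconds: float, factor: float,
                     cap_seconds: float, attempts: int) -> List[float]:
    """Exponential backoff delays, each capped at cap_seconds."""
    return [min(base_seconds * factor ** i, cap_seconds) for i in range(attempts)]


def affordable_attempts(tokens_per_attempt: int, token_budget: int, max_attempts: int) -> int:
    """How many retries fit inside the token budget, never exceeding max_attempts."""
    if tokens_per_attempt <= 0:
        raise ValueError("tokens_per_attempt must be positive")
    return min(max_attempts, token_budget // tokens_per_attempt)


delays = backoff_schedule(base_seconds=2, factor=2, cap_seconds=60, attempts=6)
attempts = affordable_attempts(tokens_per_attempt=12_000, token_budget=50_000, max_attempts=6)
print(delays)             # prints [2, 4, 8, 16, 32, 60]
print(delays[:attempts])  # prints [2, 4, 8, 16]
```

In this example the token budget, not the attempt limit, is what stops the retries: the fifth attempt would push spend past 50,000 tokens.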

Security teams should also prevent fallback from becoming a permission escalation path. If repository writes are normally allowed only for a specific model and runner profile, an outage should not automatically pass the same write privileges to a different provider with a different policy surface. Organizations that separate planning models, code-writing models, review models, and shell execution should carry those boundaries into incident handling. Provider fallback is also permission-profile fallback.
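
One way to enforce that boundary is to check a task's required actions against the fallback's own permission profile instead of copying the primary's grants. This is a minimal sketch; the `PermissionProfile` type, the action names, and the provider and model labels are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class PermissionProfile:
    provider: str
    model: str
    allowed_actions: FrozenSet[str]


def can_continue_on(task_actions: FrozenSet[str], fallback: PermissionProfile) -> bool:
    """Allow fallback only if every action the task needs is in the fallback's own profile."""
    return task_actions <= fallback.allowed_actions


def effective_actions(task_actions: FrozenSet[str], profile: PermissionProfile) -> FrozenSet[str]:
    """A task never gains actions beyond what the executing profile already allows."""
    return task_actions & profile.allowed_actions


primary = PermissionProfile("provider-a", "code-writer", frozenset({"read", "repo_write"}))
fallback = PermissionProfile("provider-b", "reviewer", frozenset({"read"}))
needed = frozenset({"read", "repo_write"})

print(can_continue_on(needed, primary))   # prints True
print(can_continue_on(needed, fallback))  # prints False
```

Here the task can degrade to a read-only continuation on the fallback, but the write step waits until the primary profile is available or a human grants the fallback that permission explicitly.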

Claude's June 2 incident was not accompanied by a public root-cause report, so the timeline is more useful than speculation. The 243 minutes between identified at 06:39 and monitoring at 10:42 are enough to ask how an agent runner should behave while the provider is degraded. Should it start new work? Should it pause writes? Should it show users a saved checkpoint? Should it continue only read-only tasks? Those are product decisions that cannot be improvised during the outage.

Developers can audit their systems with four immediate questions. Does the agent store the last tool call and distinguish it from external side effects? Can the same task restart from an idempotency key or branch checkpoint? Is fallback allowed by task class rather than globally? Are provider status and internal queue state shown in the same user-visible place? If the answer to any of them is no, the next model outage will be handled by ad hoc human memory.

This is not a Claude-only conclusion. OpenAI, Google, Anthropic, Azure, AWS, and smaller model providers all operate status pages because frontier model services depend on deployment, routing, capacity, policy, billing, and tool integration layers. As agents spend more time inside work systems, model outages become part of workflow design rather than rare external API failures.

The practical takeaway is not to change providers every time a status page turns yellow. It is to reduce the amount of execution state stored in a single provider, single model, and single transcript. Claude Status gave the industry a 5 hour 45 minute window to examine. Agent products should answer what they preserve, what they retry, what they hand to a human, and where they resume when that window appears again.