SRE agents stall at 47% on real Kubernetes incidents
Artificial Analysis ITBench-AA shows that even leading SRE agents remain below 50% on Kubernetes root-cause analysis, exposing the reliability gap in operational AI.
- What happened: Artificial Analysis published ITBench-AA, an SRE-agent benchmark based on IBM ITBench.
- The evaluation runs 59 Kubernetes incident tasks three times each and checks whether the agent identifies root-cause entities in structured JSON.
- The numbers: The public leaderboard tops out at 46.7% for Claude Opus 4.7, followed by GPT-5.5 at 45.8% and Qwen3.7 Max at 42.5%.
- Why it matters: SRE agents need to be evaluated on operational diagnosis, not conversational polish.
- The scoring gives zero credit when a false negative remains, so a plausible incident story is not enough if the agent misses a real cause.
- Watch: ITBench-AA is an offline snapshot benchmark, so it does not prove readiness for live on-call permissions, rollback approval, or real-time remediation.
The dangerous illusion around AI in operations is that a fluent explanation means competent incident response. During an outage, fluent prose does not restore the service. The useful work is narrower and harsher: decide which deployment is misconfigured, which namespace contains the faulty ConfigMap, which pod is only a symptom, and which network policy is actually contributing to the failure. That judgment usually has to be made from incomplete alerts, events, traces, metrics, and topology fragments.
The Artificial Analysis ITBench-AA leaderboard, published around May 27, 2026, aims directly at that gap. It is an independent implementation of IBM's ITBench focused on Kubernetes SRE incident root-cause analysis. The headline is cold. On the public page, the best score is Claude Opus 4.7 Adaptive Reasoning Max Effort at 46.7%. GPT-5.5 xhigh follows at 45.8%, and Qwen3.7 Max reaches 42.5%. Even frontier-model agents do not clear half the tasks.
That number is too useful to reduce to a joke about models being bad at operations. The sharper point is that the evaluation target is changing. In coding benchmarks, the question is whether a patch passes tests. In customer-support benchmarks, it is whether the assistant follows policy and resolves the user problem. In SRE benchmarks, the question is whether the agent can name the true causal entities without missing a cause or over-attributing symptoms. That difference changes agent product design, approval boundaries, and the cost of failure.

ITBench-AA does not ask for an incident explanation
According to Artificial Analysis, ITBench-AA independently implements the SRE portion of IBM ITBench. The agent receives an offline Kubernetes incident snapshot containing alerts, events, traces, metrics, and application topology. It can inspect the snapshot through shell access, then writes its final diagnosis as a structured JSON output file. The answer is not a human-friendly postmortem. It is a list of Kubernetes entities that actually caused the incident.
The benchmark contains 59 tasks. Artificial Analysis says 40 come from IBM's public release and 19 are private tasks shared by the ITBench team. Each task is repeated three times. The execution environment is built on Stirrup, Artificial Analysis's open-source agent harness for sandboxed code execution. In other words, this is not a chat prompt that asks "what caused the outage?" It is closer to a constrained investigation loop where the agent has files, commands, and a final schema to satisfy.
The scoring also looks like operations work. The headline metric is average precision at full recall. If the diagnosis has no false negatives, the task score is TP / (TP + FP). If a false negative remains, the task gets zero. Missing even one real cause is fatal. Adding unrelated entities also hurts precision. That mirrors SRE pressure. If the true cause is missed, recovery slows down or the same incident returns. If every noisy symptom is treated as causal, mitigation becomes wider and riskier than necessary.
Source: Artificial Analysis ITBench-AA leaderboard description and result summary
The sample tasks show why this is hard
The public examples explain why the benchmark is not a simple observability summary test. In task 3, a feature-flag misconfiguration in the OpenTelemetry Demo application increases CPU and latency for the ad service. The visible symptoms appear around the ad deployment and pods. The expected root cause, however, is not the ad deployment. It is the flagd-config ConfigMap in the otel-demo namespace. If an agent names the "slow service" as the cause, it misses the actual causal entity.
Task 16 is a shipping-service incident where the service points to the wrong quote-service port. The QUOTE_ADDR environment variable has been changed to quote:0000, and checkout traces show connection failures. The agent has to read the caller chain and find the shipping Deployment with the wrong environment variable. Simply choosing the service with the most visible trace failures is not enough.
Task 20 involves a product-catalog Deployment that references a nonexistent container image and therefore cannot become Ready. A KubePodNotReady alert appears, and dependent services fail downstream. The expected root cause is not one of those downstream services. It is the product-catalog Deployment with the invalid image reference. All three examples share the same structure: the observed symptom and the real cause are separated. A useful SRE agent is not the one that reads the most logs. It is the one that narrows the causal graph.
That is stricter than many "AI SRE" demos. Product demos often summarize alerts, create Slack threads, find runbooks, and explain probable causes. Those are useful capabilities. ITBench-AA asks for a lower-level and more unforgiving output: the exact causal entities. That distinction sets the deployment boundary. A summarization agent can start as a read-only assistant. An RCA agent that suggests remediation needs precision and recall. An agent that can roll back or patch configuration needs a much higher reliability threshold.
The original paper warned about this in 2025
ITBench-AA is a new leaderboard, but the foundation goes back to the IBM ITBench paper, released in February 2025. The paper argues that critical IT automation needs measurable and understandable effectiveness, not just general model capability. The initial release covered SRE, CISO, and FinOps tasks. In the arXiv abstract, ITBench included 94 real-world scenarios, and then-current state-of-the-art model agents reached resolution rates of 13.8% for SRE, 25.2% for CISO, and 0% for FinOps.
Those figures should not be directly compared with ITBench-AA's 46.7%. The task mix, harness, model generation, and scoring design are different. The directional message still matters. In 2025, the original paper showed that enterprise IT automation is much rougher than general QA or code-generation tests. In 2026, ITBench-AA narrows the problem to Kubernetes SRE RCA and puts newer frontier models on a common leaderboard. The result has improved, but it is still far from the point where teams can comfortably hand over operational authority.
The IBM ITBench GitHub repository also clarifies the benchmark's intent. ITBench provides Kubernetes-based scenario environments, push-button deployment tooling, SRE and CISO baseline agents, and a managed leaderboard. Its use cases cover availability and resiliency for SRE, compliance and security enforcement for CISO, and cost and ROI optimization for FinOps. The question is not whether a model knows operations vocabulary. The question is whether an agent can perform operational tasks in an environment.
| Evaluation axis | Original ITBench paper | ITBench-AA |
|---|---|---|
| Scope | 94 SRE, CISO, and FinOps scenarios | 59 SRE Kubernetes RCA tasks |
| Input | Kubernetes-based scenario environments | Offline incident snapshots with shell access |
| Output | Task resolution and domain metrics | Root-cause entity JSON diagnosis |
| Core signal | Enterprise IT automation still has low resolution rates | Modern frontier agents remain below 50% |
Sources: IBM ITBench paper, IBM ITBench GitHub, and Artificial Analysis ITBench-AA description
Why the top models are still below 50%
The first explanation is the task itself. Kubernetes incident RCA is not a single-answer question. The agent has to explore multiple files and signals inside the snapshot. Alerts can point to symptoms. Events can be closer to causes, but they include noise. Traces show propagation paths, not necessarily the first cause. Topology shows dependencies, but evidence is still needed to decide which dependency actually mattered in this incident.
The second explanation is the instability of the agent loop. The same model can reach a different diagnosis depending on which command it runs first, which logs it reads longer, and what it discards in intermediate summaries. Under a scoring rule that turns false negatives into zero, a small exploration miss becomes decisive. In a general chat benchmark, a roughly correct explanation may still receive partial credit. In SRE RCA, one missed entity can make the whole diagnosis fail.
The third explanation is cost and time pressure. Artificial Analysis also exposes axes such as average turns, output tokens, and cost per task. In production operations, longer investigation increases both cost and incident response time. An agent near a 100-turn cap leaves humans waiting while the incident continues. An agent that jumps too quickly may mistake a symptom for the cause. SRE agents need to optimize accuracy, latency, token budget, and escalation timing together.
The fourth explanation is that the benchmark measures the agent system, not only the model. Allowed shell commands, workspace structure, observation compression, final JSON validation, retries, and harness design all influence the result. That matches the broader pattern in agent benchmarks. Stronger models matter, but the environment protocol and agent harness can change the same model's practical performance.
Teams need automatic verification before automatic repair
ITBench-AA does not imply that AI has no place in SRE. The practical conclusion is narrower. The first useful deployment target is investigation assistance and verification, not autonomous remediation. An agent that groups alerts, collects related traces and events, proposes candidate root causes, and packages evidence for human review can already reduce response time. Automatic rollback, configuration patches, and network-policy edits need separate approval boundaries.
The first lesson for SRE teams is to avoid receiving final output only as postmortem prose. Diagnosis should be structured. Entity type and name should be separated: namespace, Deployment, Service, Pod, ConfigMap, NetworkPolicy, and so on. Evidence should be attached. The system should preserve which commands the agent ran, which files it read, and which signals supported the conclusion. That makes human review faster and turns incidents into regression tests later.
The second lesson is to build test sets that separate symptoms from causes. Many production incidents begin with noisy symptoms. The service with high CPU may not be the root cause. The endpoint with the worst error rate may be downstream. Feature flags, environment variables, secrets, image tags, rollouts, network policies, DNS, quotas, storage, and dependency outages all need synthetic incident coverage. The value of ITBench is not only the public ranking. It provides an evaluation grammar for operational automation.
The third lesson is to define an SLO for the agent before placing it inside service SLO workflows. A team should decide whether, for a P1 incident, the agent must surface a candidate cause within three minutes, provide confidence and evidence within ten minutes, or escalate immediately when evidence is thin. Some situations benefit from a slower but more exhaustive agent. Others only need fast triage. One generic "AI SRE" layer should not be expected to cover every response stage.
The benchmark has limits
ITBench-AA is an important signal, but it is not the whole story of operational reliability. First, it is an offline snapshot benchmark. In real production, metrics change over time, other engineers take action, new alerts appear, and rollback windows shrink. Snapshots are necessary for reproducibility and comparison, but they cannot capture every live-incident interaction.
Second, diagnosis and remediation are different problems. An agent that identifies the cause may still make an unsafe fix. Conversely, an agent that does not achieve perfect RCA may still create a useful evidence bundle that shortens human response time. A leaderboard score is part of product value, not the whole thing. Teams should treat ITBench-AA as a starting point and extend it with their own runbooks and incident classes.
Third, the public results reflect a specific combination of model, effort level, harness, and task set. Models change quickly, and agent frameworks are evolving just as fast. Higher reasoning effort may improve accuracy while increasing cost and latency. In operations, the most useful result is not the global top score. It is repeatable performance on the company's incident classes within its budget and escalation rules.
Community skepticism fits that balance. Some benchmark discussions argue that even a better leaderboard does not provide the trust profile of the specific agent instance that will handle tomorrow's deploy. That is fair. But it does not make benchmarks useless. Good benchmarks reduce marketing fog and expose failure modes in a comparable form.
The next bottleneck for SRE agents
Agentic development is moving from code generation into operational workflows. Coding agents create pull requests, observability tools produce agent timelines, and cloud providers are building sandboxed agent runtimes. The natural next question is whether incident response can also be handed to agents. ITBench-AA's answer is cautious: agents can help, but they are not yet autonomous operators.
The next bottleneck is not only model size. It is how agents handle operational evidence. The needed system is not an agent that skims a file tree and writes a confident paragraph. It is an agent that explicitly builds topology and trace dependencies, forms hypotheses, searches for disconfirming evidence, and narrows candidate root causes. Retrieval over runbooks and prior incidents matters, but validating retrieved knowledge against the current snapshot matters more.
Human handoff is another bottleneck. An incident commander does not only want "product-catalog is probably the issue." A better handoff is: the product-catalog Deployment has an invalid image reference, the KubePodNotReady alert and downstream checkout failure are consistent with that, and the agent found no supporting evidence for ConfigMap or network-policy causes. That sentence is long, but it can be structured. Agent interfaces should show evidence graphs and alternative hypotheses, not just confidence scores.
The final bottleneck is organizational evaluation data. ITBench-AA gives the public market a reference point, but each company's Kubernetes clusters, naming conventions, observability stack, deployment patterns, and incident taxonomy differ. Real adoption of AI SRE requires internal benchmarks built from past postmortems and synthetic failure injection. The agent has to learn failure modes that general Kubernetes knowledge cannot provide.
The 47% number can look disappointing. It is useful precisely because it reveals where SRE agents remain unreliable. Operations is not a test of eloquence. It is a test of causal tracing, evidence linking, complete diagnosis, and safe handoff. ITBench-AA puts that reality on a public leaderboard. The question for engineering teams is not only which model ranks first. It is whether their own operational automation has been tested against this kind of standard.