Devlery

620,000 attacks expose a 35-point safety gap in reasoning models

TELUS Digital tested 34 AI models with more than 620,000 adversarial attacks. The benchmark shows why enterprise AI safety is now an operating discipline.

AI summary
  • What happened: TELUS Digital published the second edition of its GenAI Safety Model Benchmark after testing 34 AI models with more than 620,000 adversarial attacks.
    • The evaluated providers include Anthropic, OpenAI, Google, Meta, Alibaba, Baidu, ByteDance, Zhipu AI, 01.AI, and Mistral.
  • The numbers: Attack vulnerability rates ranged from 1.3% to 93%, with reasoning models averaging 19.9% versus 55.1% for non-reasoning models.
  • Why it matters: Model selection is not enough; deployed AI applications need continuous red teaming after prompts, tools, data, and providers change.
    • TELUS also flags refuse-but-engage failures, where a model says no but still gives surrounding information that helps the attacker.
  • Watch: The benchmark is tied to TELUS's banking-assistant scenario and scoring method, so the main lesson is the testing pattern, not just the ranking.

AI model safety is often reduced to a leaderboard question. Which model is safest? Which vendor blocks jailbreaks best? Are open models inherently riskier than closed APIs? TELUS Digital's GenAI Safety Model Benchmark, published on May 26, 2026, pushes the discussion toward a more operational question: does safety end when a team chooses a model, or does an AI application have to be tested like software that remains under attack after launch?

The headline numbers are large. TELUS Digital says it ran more than 620,000 adversarial attacks against 34 AI models from 10 providers. The set spans Claude, GPT, Gemini, Llama, Qwen, ERNIE, Seed, GLM, Yi, and Mistral families. The models were not evaluated only as abstract chat endpoints. TELUS configured them as banking AI assistants, gave them a role, policy boundaries, and expected refusal behavior, then attacked the application-like setup. That detail matters because most teams do not ship a raw model. They ship a model wrapped in a system prompt, retrieval layer, tool permissions, guardrails, and product-specific tone.

That is where the benchmark becomes useful for builders. Enterprise AI risk is not a static model-card attribute. It is a state that changes when the system prompt changes, when a retrieval corpus absorbs new documents, when a provider silently updates the backend model, when a fine-tune adjusts tone, or when a tool is connected to customer data. Safety failures are no longer limited to one bad sentence. In an agentic workflow, they can become privacy exposure, fraud assistance, cybersecurity misuse, regulatory breach, or a tool call that should never have happened.

34 models evaluated
620K+ adversarial attacks
15 vulnerability categories

Source: TELUS Digital GenAI Safety Model Benchmark 2026 and PRNewswire release

A safety range from 1.3% to 93%

In the PRNewswire release, TELUS says attack vulnerability rates ranged from 1.3% to 93%. The report landing page gives the upper bound as 92.9%. Even without the full per-model table, the message is already clear: a generic statement such as "modern LLMs are mostly safe" is not useful enough for enterprise deployment.

TELUS says five of the ten least vulnerable models were Claude models, and that the lowest vulnerability rate came from a Claude model. But the story should not collapse into "use Claude and move on." TELUS also warns that Claude models still had weaknesses, and that even a single-digit failure rate may be unacceptable in use cases involving money, health, reputation, or regulated workflows. A 3% failure rate can look excellent on a leaderboard. In a banking assistant handling 100,000 conversations a day, it means 3,000 potential failures. If those failures touch account recovery, fraud, identity verification, investment guidance, or private data, the business impact is not theoretical.
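The arithmetic behind that example is easy to reuse for capacity planning. What follows is a minimal sketch, assuming the failure rate applies uniformly to every conversation. That is a simplification, because the benchmark rate was measured on adversarial attacks rather than ordinary traffic. The function name is illustrative and does not come from TELUS.

```python
def expected_daily_failures(conversations_per_day: int, failure_rate: float) -> float:
    """Expected failures per day if every conversation fails at `failure_rate`."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return conversations_per_day * failure_rate


# The example from the text: a 3% failure rate across 100,000 daily conversations.
print(round(expected_daily_failures(100_000, 0.03)))
```

Swapping in the share of traffic that is actually adversarial, if a team can estimate it, gives a more realistic planning number than the raw conversation count.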

The sharper result is the gap between reasoning and non-reasoning models. TELUS reports an average attack success rate of 19.9% for reasoning models and 55.1% for models that respond without an explicit reasoning step. That is a spread of roughly 35 percentage points. It does not mean every slower model is safe, or that visible chain-of-thought is required. It suggests that models or product layers that spend more effort evaluating the request before answering can materially change attack resistance.

Builders should still be careful with the conclusion. Reasoning models may refuse unsafe instructions more reliably, but they can also create more elaborate tool plans. In an agent workflow, safety is not only "did the model answer?" It is also "which tool did it call?", "which data did it retrieve?", and "where did it require human approval?" The benchmark still sends a practical signal: routing user-facing or tool-adjacent requests to cheap non-reasoning models can leave less safety margin than teams assume.

| Metric | TELUS figure | What builders should read into it |
| --- | --- | --- |
| Overall vulnerability range | 1.3% to 93% | Model-level safety differs enough that reputation alone is a weak selection method. |
| Reasoning models | 19.9% average ASR | A structured answer process can correlate with stronger attack resistance. |
| Non-reasoning models | 55.1% average ASR | Low-latency and low-cost routing needs stronger guardrails and regression testing. |
| Vulnerable categories | Privacy, fraud, cybersecurity | These are priority surfaces when AI touches customer data or business tools. |

Source: TELUS Digital public report page and PRNewswire figures, summarized for builders

Small models can turn cost savings into security debt

TELUS's second axis is model size. The release says smaller models were consistently more vulnerable across both proprietary and open-source groups. That maps directly onto a common product architecture. Teams do not want every request to hit the most expensive frontier model, so they route work: small models for classification, summarization, FAQ answers, query rewriting, ticket triage, or retrieval compression; larger or reasoning models for complex and sensitive tasks.

The architecture is sensible. The risk depends on where the smaller model sits. A classifier may look harmless until it determines whether the next step calls an account lookup tool. A query rewriter may look harmless until an attacker uses it to inject sensitive identifiers into retrieval. A context compressor may look harmless until it preserves malicious instructions and discards the policy text that would have blocked them. In an agent pipeline, the weakest stage can define the security posture of the whole system.

This does not mean small models should disappear. It means their authority should be narrow. A small model can summarize low-risk text, but its output should not be trusted as a policy decision. It can classify intent, but a sensitive tool boundary should still have deterministic validation or a second model check. It can rewrite a query, but the rewritten query should be filtered for data leakage and prompt injection. Cost routing is useful only when the security boundary is explicit.
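One way to picture a narrow authority boundary is a deterministic gate that sits between the small model's output and any sensitive tool. The sketch below is illustrative only: the tool names, the `RoutingDecision` fields, and the rule itself are assumptions, not something described in the TELUS report.

```python
from dataclasses import dataclass

# Hypothetical tool names; a real deployment would load these from policy config.
SENSITIVE_TOOLS = {"account_lookup", "card_replacement"}


@dataclass(frozen=True)
class RoutingDecision:
    intent: str           # label proposed by the small model
    requested_tool: str   # tool the pipeline wants to call next
    user_verified: bool   # set by the authentication layer, never by a model


def allow_tool_call(decision: RoutingDecision) -> bool:
    """Deterministic gate: the small model proposes, this check decides."""
    if decision.requested_tool not in SENSITIVE_TOOLS:
        return True
    # Sensitive tools require a verified user regardless of the model's label.
    return decision.user_verified


print(allow_tool_call(RoutingDecision("balance_question", "account_lookup", False)))
```

The design choice is that the model's intent label never appears in the branch that unlocks a sensitive tool, so a manipulated classification cannot widen access by itself.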

The open-source question is too simple

AI safety debates often place open models under suspicion. The weights are available, attackers can probe them more easily, and post-training safety may vary. Open-model advocates answer with transparency, auditability, on-prem deployment, and the ability to tune behavior for local policy. TELUS's result makes that argument more specific.

According to the release, open models were exploited more often on average than proprietary models. TELUS still cautions that model source is not what drives risk by itself. It points to Zhipu AI's GLM 4.7 as a large open-source model that outperformed several proprietary alternatives in the benchmark. That is the important lesson for enterprise buyers: vendor category is not a risk model.

The real difference is operational responsibility. With open weights, the organization can control deployment, data residency, fine-tuning, and audit logs, but it also owns evaluation, patching, and regression testing. With a closed API, the provider safety layer may help, but provider updates can also change behavior underneath the application. In both cases, the deployed system is the unit that needs testing.

Fine-tuning complicates this further. TELUS notes that fine-tuning can weaken safety alignment even when the fine-tuning data does not contain harmful examples. That is an uncomfortable message for product teams. A fine-tune meant to improve domain accuracy, make a support assistant sound more helpful, or increase answer rate may also soften refusal boundaries. Fine-tuning is therefore not only a performance change. It is a security change.

The gray failure: refusing while helping

One of the most useful concepts in the benchmark is refuse-but-engage. This is the pattern where a model begins with a refusal, then offers surrounding information that still helps the user move toward the unsafe goal. A response might say, "I cannot help with that," and then describe general system design details, defensive steps, or contextual hints that are useful to an attacker.

That failure is harder to catch than a straightforward policy violation. A simple classifier may see the refusal sentence and mark the answer as safe. A human reviewer may also miss the cumulative effect across several turns. But attackers can use partial information. In a multi-turn conversation, small side details can accumulate into a workable path.

Agent workflows make the pattern more dangerous. A model may refuse to provide account data but still call a search tool. It may decline to give cyber-offense instructions but retrieve documentation that narrows the exploit path. It may say it cannot perform a regulated action, then summarize an internal procedure that makes social engineering easier. Output moderation alone cannot catch this. Tool calls, retrieval queries, intermediate artifacts, and final answers all need to be part of the trace.

Practically, refusal policy should define prohibited paths of assistance, not only prohibited final strings. Teams need to decide which surrounding details are off limits, when a tool call should stop, and when a conversation must escalate to a human. Security review should inspect conversation traces, not just prompt templates.
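A trace-level check can surface candidates for that kind of review. The following is a rough heuristic sketch, not TELUS's scoring method: the refusal phrases, the `Turn` structure, and the word threshold are all assumptions, and a flag here means "send to a reviewer", not "confirmed failure".

```python
import re
from dataclasses import dataclass, field

# Assumed refusal phrasings; real systems would need a broader, tested list.
REFUSAL_PATTERN = re.compile(
    r"\b(?:i can(?:no|')t|i am unable to|i won't)\b", re.IGNORECASE
)


@dataclass
class Turn:
    answer: str
    tool_calls: list[str] = field(default_factory=list)


def flag_refuse_but_engage(turn: Turn, max_words_after_refusal: int = 25) -> bool:
    """Flag a turn that refuses in words but may keep engaging in practice."""
    match = REFUSAL_PATTERN.search(turn.answer)
    if match is None:
        return False
    # A refusal paired with any tool call deserves a closer look.
    if turn.tool_calls:
        return True
    # A long tail after the refusal may carry surrounding detail.
    words_after = len(turn.answer[match.end():].split())
    return words_after > max_words_after_refusal


print(flag_refuse_but_engage(Turn("I cannot share account data.", ["search_accounts"])))
```

Running a check like this over full conversation traces, rather than single answers, is what lets reviewers see the cumulative effect across turns.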

One launch review is not enough

TELUS's main conclusion is continuous testing. Many organizations red-team an AI feature once before release: run jailbreak prompts, check a policy list, complete legal review, then ship. That model is weak for any software system, and it is especially weak for AI applications. The application is made of code, prompts, models, retrieval data, tool permissions, and guardrail policy. Any one of those can change behavior.

TELUS's FAQ notes that the same model can show statistically significant safety changes over a quarter. The release frames the issue plainly: a system that passes today may be vulnerable tomorrow. That is not alarmism. Teams already run regression tests when dependencies change. A model update, prompt edit, retrieval corpus refresh, or new tool permission should receive the same treatment.
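Treating those changes like dependency changes starts with noticing them. A minimal sketch, assuming the team can list the model identifier, system prompt, tool names, and a retrieval corpus version for each deployment; all field and function names here are hypothetical.

```python
import hashlib
import json


def deployment_fingerprint(
    model_id: str, system_prompt: str, tools: list[str], corpus_version: str
) -> str:
    """Hash the safety-relevant configuration of a deployed AI application."""
    payload = json.dumps(
        {
            "model_id": model_id,
            "system_prompt": system_prompt,
            "tools": sorted(tools),  # tool order should not change the hash
            "corpus_version": corpus_version,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_retest(previous_fingerprint: str, current_fingerprint: str) -> bool:
    """Any change to the fingerprint should trigger the adversarial suite."""
    return previous_fingerprint != current_fingerprint


before = deployment_fingerprint("model-a", "You are a banking assistant.", ["faq_search"], "v1")
after = deployment_fingerprint("model-a", "You are a banking assistant.", ["faq_search", "account_lookup"], "v1")
print(needs_retest(before, after))
```

A fingerprint only catches changes the team can see. A provider-side update behind an unchanged model identifier leaves the hash untouched, so scheduled test runs are still needed alongside change-triggered ones.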

Continuous testing is not just running a bigger prompt list. It means mapping attack goals to product-specific risk categories, preserving reproducible traces, sampling multiple runs, testing multi-turn behavior, and setting thresholds for regression. A banking assistant might prioritize account access, dispute flows, KYC, card replacement, and fraud assistance. A developer tool might prioritize secret exposure, destructive commands, dependency poisoning, credential handling, and unsafe code execution. The taxonomy has to fit the product.
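A regression threshold can be as simple as comparing per-category attack success rates against a stored baseline. The sketch below assumes each attack attempt has already been scored as a success (`True`) or a block (`False`). The category names, the two-point tolerance, and the function names are illustrative, and the comparison is a plain difference rather than a statistical test.

```python
def attack_success_rate(outcomes: list[bool]) -> float:
    """Share of attack attempts that succeeded."""
    if not outcomes:
        raise ValueError("at least one outcome is required")
    return sum(outcomes) / len(outcomes)


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, list[bool]],
    max_increase: float = 0.02,
) -> dict[str, float]:
    """Return categories whose success rate rose more than `max_increase`."""
    regressions = {}
    for category, outcomes in current.items():
        rate = attack_success_rate(outcomes)
        # Categories without a baseline are compared against zero.
        if rate - baseline.get(category, 0.0) > max_increase:
            regressions[category] = rate
    return regressions


baseline = {"fraud_assistance": 0.05, "privacy": 0.10}
current = {
    "fraud_assistance": [True] * 2 + [False] * 18,
    "privacy": [True] * 2 + [False] * 18,
}
print(find_regressions(baseline, current))
```

With small samples, a raw difference like this is noisy, which is one reason to sample multiple runs before treating a change as a real regression.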

The spending imbalance is the business signal

TELUS also highlights an investment gap. The release says worldwide AI spending is projected at $2.52 trillion in 2026, while AI trust, risk, and security management spending is projected at $3.43 billion. TELUS frames that as roughly $1 of AI security spending for every $735 spent on AI capability. The exact absolute values depend on the underlying market estimates, but the ratio captures a familiar pattern: organizations fund AI adoption faster than they fund AI safety operations.
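The ratio follows directly from the two projections quoted in the release. As a quick check, using only those two figures (the function name is illustrative):

```python
def capability_to_security_ratio(total_ai_spend_usd: float, security_spend_usd: float) -> float:
    """Dollars of overall AI spending per dollar of AI security spending."""
    if security_spend_usd <= 0:
        raise ValueError("security_spend_usd must be positive")
    return total_ai_spend_usd / security_spend_usd


# $2.52 trillion in AI spending versus $3.43 billion in AI trust, risk, and security.
ratio = capability_to_security_ratio(2.52e12, 3.43e9)
print(f"about ${ratio:.0f} of AI spending per $1 of AI security spending")
```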

Inside development teams, the same imbalance appears at smaller scale. Model API budgets, vector databases, observability tools, agent frameworks, and product UI receive funding. Adversarial test sets, red-team automation, tool-call policy, failed-trace triage, and regression thresholds often arrive later. That delay is how safety becomes incident response instead of engineering discipline.

Responsibility also has to spread beyond the security team. ML engineers choose models. Product engineers edit prompts. Data teams maintain retrieval sources. Platform teams open tool permissions. Operations teams evaluate conversation quality. They need a shared failure taxonomy. "The model gave a weird answer" is too vague to fix. "Privacy exploitation with refuse-but-engage behavior reproduced after the latest retrieval update" is actionable.

Do not read this as only a leaderboard

There are caveats. TELUS's benchmark is important, but it is not the final map of AI safety. The public report page and release do not expose every prompt, sampling setting, full model ranking, scoring rubric, or guardrail configuration. The banking-assistant scenario is valuable, but healthcare triage, legal drafting, HR support, security agents, code agents, and research tools will have different risk distributions.

The practical lesson is structural. Reasoning architecture and answer process can influence safety outcomes. Small models need tighter authority boundaries when used for cost routing. Open versus proprietary is less important than deployed configuration, evaluation responsibility, and continuous testing. Fine-tuning and prompt changes should be treated as security-relevant changes. Refusal should be evaluated by the help it still provides, not only by whether the answer contains a refusal sentence.

For builders, the operational questions are now concrete. Which model handles which request class? Which requests can safely go to a small model? Which tool calls require human approval? How is safety regression measured after fine-tuning? How does the team detect provider-side model updates? Does the audit log show what the model refused, what it revealed anyway, and which tools it touched?
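The last of those questions implies a log schema. One possible shape, offered as an assumption rather than a standard: every field name below is hypothetical, and a real schema would follow the team's own failure taxonomy.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AuditRecord:
    conversation_id: str
    model_id: str
    refused: bool                                             # did the model state a refusal?
    revealed_topics: list[str] = field(default_factory=list)  # what it disclosed anyway
    tools_called: list[str] = field(default_factory=list)     # which tools it touched
    human_approval_required: bool = False


def to_log_line(record: AuditRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = AuditRecord(
    conversation_id="conv-001",
    model_id="model-a",
    refused=True,
    revealed_topics=["account recovery procedure"],
    tools_called=["faq_search"],
)
print(to_log_line(record))
```

Keeping the refusal flag, the disclosed topics, and the tool calls in the same record is what makes a refuse-but-engage turn visible in a single query.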

The benchmark is less a final ranking than a warning about the unit of responsibility. Early LLM safety looked like a provider problem: align the model, publish a policy, and expose a safer API. Enterprise AI and agents move part of that responsibility to the application owner. The model is the foundation. The actual risk comes from the prompts, data, tools, permissions, users, and workflows built on top of it.

TELUS's 620,000 attacks show that model differences are real. They also show that the bigger story begins after model selection. AI applications change. Models change. Attackers change. The test suite has to change with them. The ordinary software principle still applies: if a system is important enough to deploy, it is important enough to keep testing.