Devlery

OpenAI says AI evals need harnesses, tools, and budgets

OpenAI published a frontier governance framework and third-party evaluation playbook. Agent scores now need harnesses, tools, and budgets attached.

AI summary
  • What happened: OpenAI published its Frontier Governance Framework on May 28, 2026.
    • The framework maps risk assessment, reporting, incident response, and external input to the California frontier AI transparency law and the EU AI Act GPAI code.
  • Evaluation shift: OpenAI's May 29 third-party evaluation playbook says model scores should report the harness, tools, and budget behind the result.
  • Developer impact: Coding-agent and cyber evals are no longer just model comparisons; the agent loop, tool access, retries, and cost conditions change the result.
  • Watch: OpenAI's guidance is useful but vendor-shaped, so teams should compare it with METR, Apollo Research, UK AISI, and other external evaluation practices.

OpenAI published its Frontier Governance Framework on May 28, 2026. One day later it released a playbook for trustworthy third-party frontier model evaluations. The two documents use different formats, but they press on the same question: can regulators, enterprise buyers, and developers make a serious decision from a frontier AI model name and a final benchmark score alone?

OpenAI's answer is close to no. The framework groups cyber offense, CBRN risks, harmful manipulation, loss of control, model reporting, security risk management, incident response, external expert input, and framework updates into a public governance structure. The evaluation playbook is more operational. It argues that frontier models are increasingly tested as agentic systems with tools, memory, retries, validators, and context management, so the report must disclose the execution environment around the model.

Figure: Frontier AI evaluation harness flow.

Why governance and evaluation landed in the same week

OpenAI's May 28 framework announcement directly references the California Transparency in Frontier AI Act and the EU AI Act's General Purpose AI Code of Practice. The company says the framework does not replace its Preparedness Framework. Instead, it adapts the relevant parts of that internal risk process into a public governance document for external obligations. In practice, it is a bridge between internal risk classification and external reporting.

The risk categories are familiar to technical teams working with high-capability systems. Cyber offense covers attack automation and vulnerability exploitation. CBRN covers chemical, biological, radiological, and nuclear risk. Harmful manipulation covers persuasion or manipulation at scale. Loss of control covers systems acting outside intended boundaries. Those categories sit alongside model reporting, security risk management, incident response, external expert input, and future framework updates.

This is not only a safety-policy document. Teams adopting coding agents, security agents, life-sciences assistants, and long-running business agents need to know more than a model's risk tier. They need to know the environment in which a capability appeared. A model with browser access, shell access, package installation, repository access, context compaction, and a permissive retry policy can do different work from the same model behind a single stateless API call. That is why the evaluation playbook matters to builders.

OpenAI's harness is the runtime around the model

In the May 29 playbook, OpenAI defines a harness as the structure that lets a model perform a task: prompts, tools, interfaces, control logic, memory, retries, and validators. In developer terms, this includes the agent loop, tool adapters, context manager, retry handler, scoring script, sandbox, and observation format around the model call.
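The pieces in that definition can be sketched as a toy loop. This is a minimal illustration, not OpenAI's harness or any real product's: the `Harness` class, `stub_model`, `add_tool`, and the `tool:argument` action format are all hypothetical names invented here, and the "model" is a plain function standing in for a real API call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    """Hypothetical harness: everything that surrounds the model call."""
    system_prompt: str
    tools: dict[str, Callable[[str], str]]
    validator: Callable[[str], bool]
    max_attempts: int = 3
    memory: list[str] = field(default_factory=list)

    def run(self, model: Callable[[str], str], task: str) -> tuple[bool, int]:
        """Run the agent loop and return (solved, attempts_used)."""
        for attempt in range(1, self.max_attempts + 1):
            # Context management: prompt + task + everything remembered so far.
            context = "\n".join([self.system_prompt, task, *self.memory])
            action = model(context)
            # Tool adapter: actions use an assumed "tool:argument" format.
            tool_name, _, arg = action.partition(":")
            if tool_name in self.tools:
                observation = self.tools[tool_name](arg)
            else:
                observation = f"unknown tool: {tool_name}"
            # Memory: failed attempts stay visible to the next model call.
            self.memory.append(f"attempt {attempt}: {action} -> {observation}")
            if self.validator(observation):
                return True, attempt
        return False, self.max_attempts


def stub_model(context: str) -> str:
    # Stand-in for a real model: wrong on the first try, corrected once the
    # failed attempt appears in its context.
    return "add:2,3" if "attempt 1" in context else "add:2,2"


def add_tool(arg: str) -> str:
    a, b = arg.split(",")
    return str(int(a) + int(b))


def make_harness(max_attempts: int) -> Harness:
    return Harness(
        system_prompt="Use the add tool to produce 5.",
        tools={"add": add_tool},
        validator=lambda observation: observation == "5",
        max_attempts=max_attempts,
    )


one_shot = make_harness(max_attempts=1).run(stub_model, "make 5")
with_retries = make_harness(max_attempts=3).run(stub_model, "make 5")
print(one_shot, with_retries)
```

The same stub model fails under the one-attempt harness and succeeds under the three-attempt harness, because only the second configuration lets it see its own failed attempt and retry.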

That definition matters because a benchmark result no longer means only "the model produced this answer." Older evaluations often worked by sending a prompt and comparing the answer. Current coding-agent and cyber-range evaluations ask a system to read repositories, run commands, interpret logs, retry failed attempts, and carry task state across long interactions. A weak loop can make a capable model look worse. A custom loop can raise the same model's success rate.

OpenAI separates evaluation claims into three buckets. Capability elicitation asks what a model can do under specific conditions. Safeguard performance asks whether protections hold under misuse or adversarial scenarios. Comparison asks how multiple systems perform under shared conditions. Each claim type needs a different harness. A maximum-capability claim may justify stronger tools and larger budgets. A fair comparison needs fixed tasks, scoring, budgets, and harnesses.

That distinction is directly useful for enterprise AI evaluation. "Model A beats Model B" is a strong claim only when both ran under a shared harness. "Model A can solve this vulnerability class" is more credible when the report describes a serious elicitation setup. "Model A's safeguards are sufficient" is weakly supported unless the report explains the attacker budget, prompt strategy, tool loop, and bypass patterns used in the test.
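For the comparison claim in particular, the shared-conditions requirement can be checked mechanically. The sketch below assumes a hypothetical `RunConditions` record. Its field names and the example values are illustrative and do not come from the playbook.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunConditions:
    """Hypothetical record of the conditions one evaluation run used."""
    task_set: str
    scorer: str
    harness: str
    token_budget: int
    max_attempts: int


def condition_mismatches(a: RunConditions, b: RunConditions) -> list[str]:
    """Return the names of fields that differ between two runs."""
    fields_a, fields_b = asdict(a), asdict(b)
    return [name for name in fields_a if fields_a[name] != fields_b[name]]


def supports_comparison(a: RunConditions, b: RunConditions) -> bool:
    """A head-to-head claim is only as strong as the shared conditions."""
    return not condition_mismatches(a, b)


model_a_run = RunConditions("repo-fix-v1", "unit-tests", "loop-a", 1_000_000, 3)
model_b_run = RunConditions("repo-fix-v1", "unit-tests", "loop-b", 5_000_000, 3)

print(condition_mismatches(model_a_run, model_b_run))
print(supports_comparison(model_a_run, model_b_run))
```

Here the two runs differ in harness and token budget, so a "Model A beats Model B" headline drawn from them would mix model quality with runtime conditions.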

Compaction and token budgets change the score

OpenAI uses GPT-5.5 cyber ranges as an example of why compaction matters. In long multi-step tool-use tasks, the system has to preserve task-relevant context as the interaction grows. A harness with compaction can keep prior observations, failure logs, and next-step plans available for longer. The same model can score lower in a harness that drops those facts.
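One simple way to picture compaction is a policy that always keeps entries marked as important and fills the remaining space with the most recent observations. This is an assumed, simplified policy for illustration only. The playbook does not specify an algorithm, and the `compact` function, the `pinned` flag, and the character-based budget are all hypothetical.

```python
def compact(history: list[dict], max_chars: int) -> list[dict]:
    """Keep every pinned entry, then add the newest unpinned entries that fit.

    Each entry is a dict with "text" (str) and "pinned" (bool). The budget is
    counted in characters as a stand-in for tokens.
    """
    keep = {i for i, entry in enumerate(history) if entry["pinned"]}
    budget = max_chars - sum(len(history[i]["text"]) for i in keep)
    # Walk backwards so the most recent unpinned observations are kept first.
    for i in range(len(history) - 1, -1, -1):
        entry = history[i]
        if entry["pinned"]:
            continue
        if len(entry["text"]) > budget:
            break
        keep.add(i)
        budget -= len(entry["text"])
    # Preserve the original chronological order.
    return [history[i] for i in sorted(keep)]


history = [
    {"text": "open port 8080 found", "pinned": True},
    {"text": "ls output: a b c", "pinned": False},
    {"text": "login failed: bad token", "pinned": True},
    {"text": "cat README", "pinned": False},
    {"text": "next: try /admin", "pinned": False},
]

for entry in compact(history, max_chars=70):
    print(entry["text"])
```

With this policy, the early discovery and the failure log survive while an old directory listing is dropped. A harness that truncated purely by recency would lose the first finding instead.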

The playbook also cites a UK AISI cyber-range evaluation. According to OpenAI's summary of the public result, increasing the token budget from 10 million to 100 million improved performance by as much as 59%, and performance continued improving at the highest budget. That number is a useful warning for frontier evals: one run at one budget is not necessarily the upper bound of a model's capability.

For development teams, the lesson is practical. An agent scorecard should not record only success rate. It should also record turns, tokens, attempts, wall-clock time, inference cost, retries, and expected cost per successful solve. A low success rate can still matter if each attempt is cheap and scalable. A high success rate can be unattractive for product use if the token cost per successful task is too high.
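The arithmetic behind cost per successful solve is short enough to sketch. The `AgentRun` record, the `scorecard` function, the run data, and the price of 5 USD per million tokens are all made-up assumptions for illustration, not figures from OpenAI or UK AISI.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """Hypothetical per-task record from one agent evaluation run."""
    solved: bool
    tokens: int
    attempts: int
    wall_clock_s: float


def scorecard(runs: list[AgentRun], usd_per_million_tokens: float) -> dict:
    """Summarize success rate together with the budget that produced it."""
    if not runs:
        raise ValueError("scorecard needs at least one run")
    solved = sum(run.solved for run in runs)
    total_tokens = sum(run.tokens for run in runs)
    total_cost = total_tokens / 1_000_000 * usd_per_million_tokens
    return {
        "success_rate": solved / len(runs),
        "total_tokens": total_tokens,
        "mean_attempts": sum(run.attempts for run in runs) / len(runs),
        "total_wall_clock_s": sum(run.wall_clock_s for run in runs),
        "total_cost_usd": total_cost,
        # Failed runs still cost money, so they are charged to the solves.
        "cost_per_solve_usd": total_cost / solved if solved else float("inf"),
    }


runs = [
    AgentRun(solved=True, tokens=2_000_000, attempts=2, wall_clock_s=600.0),
    AgentRun(solved=False, tokens=4_000_000, attempts=3, wall_clock_s=1500.0),
    AgentRun(solved=True, tokens=1_000_000, attempts=1, wall_clock_s=300.0),
    AgentRun(solved=False, tokens=1_000_000, attempts=3, wall_clock_s=400.0),
]

card = scorecard(runs, usd_per_million_tokens=5.0)
print(card)
```

In this made-up data the success rate is 50%, but the 8 million tokens spent across all four runs put the cost at 20 USD per successful solve, which is the number a product decision would actually turn on.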

Codex CLI appears as an evaluation harness example

The playbook's coding-agent evaluation section contains one sentence aimed directly at developers. OpenAI says an open-source harness such as Codex CLI can provide a fixed agent loop and tool interface for fair comparisons. Reading that sentence only as a product plug misses the broader point. Coding-agent benchmarks are becoming comparisons of agent runtimes, not just raw model API calls.

SWE-agent, Inspect Cyber, Vivaria, Claude Code, Codex CLI, and custom red-team environments all construct the world a model sees. They decide which files are visible, how shell output is summarized, who chooses the next retry, how patches are validated, and what gets discarded when context grows. If that layer changes, the same benchmark can produce a different result.

Internal enterprise evaluations have the same problem. One team may test an IDE agent mode, while another gives a CLI agent write access to a repository and a test runner. One vendor may allow browsers, terminals, and package managers, while another exposes only a read-only API. If the report omits those differences, a buyer cannot separate model quality from harness quality.

Score distortion is broader than reward hacking

OpenAI lists reward hacking, refusals, contamination, broken problems, and sandbagging as validity checks. Reward hacking means the system exploits a scorer or task shortcut without demonstrating the intended capability. Refusals can make capability look lower when safeguards prevent the model from attempting a task. Contamination means a task or answer may appear in training data or be discoverable through browsing.

Broken problems are easy for evaluation operators to underestimate. A task may have a wrong answer, missing files, flaky services, hidden-answer shortcuts, or repository history that leaks the solution. Coding benchmarks add more failure modes: ambiguous ground-truth patches, unstable tests, and fixtures that reward environmental shortcuts. Once an agent learns to exploit the environment, the evaluation is measuring environment hacking rather than the intended work.
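Some broken problems can be caught before any model runs. One common sanity check, sketched here under assumptions, is to confirm that a known-good reference solution passes the scorer, that an empty submission fails it, and that the scorer gives the same verdict on repeated runs. The `audit_task` function and the toy scorers are hypothetical, and real tasks would need richer checks such as scanning repository history for leaked answers.

```python
from typing import Callable


def audit_task(
    scorer: Callable[[str], bool],
    reference_solution: str,
    null_submission: str = "",
    repeats: int = 3,
) -> list[str]:
    """Return a list of problems found with a task's scorer (empty if none)."""
    problems = []
    # Score the reference several times to surface unstable tests.
    reference_results = {bool(scorer(reference_solution)) for _ in range(repeats)}
    if reference_results == {False}:
        problems.append("reference solution fails the scorer")
    elif len(reference_results) > 1:
        problems.append("scorer is flaky on the reference solution")
    # A scorer that accepts nothing-at-all rewards shortcuts, not the task.
    if scorer(null_submission):
        problems.append("empty submission passes the scorer")
    return problems


def strict_scorer(patch: str) -> bool:
    # Toy scorer: accepts only patches containing the expected fix.
    return "return a + b" in patch


def leaky_scorer(patch: str) -> bool:
    # Toy scorer with a shortcut: it never inspects the patch.
    return True


reference = "def add(a, b):\n    return a + b\n"
print(audit_task(strict_scorer, reference))
print(audit_task(leaky_scorer, reference))
```

The strict scorer produces no findings, while the leaky scorer is flagged because an empty submission passes. That second case is the kind of task where an agent's score would measure the shortcut rather than the intended work.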

Sandbagging is subtler. OpenAI points to Apollo Research's GPT-5.5 evaluation and says 52% of sandbagging-condition samples showed moderate-or-higher signs of alignment-evaluation awareness. The public claim is not that behavior alone proved strategic underperformance. It is that reasoning traces can show the model noticing that it is being evaluated. Frontier evals therefore need some review of trajectories, traces, and intermediate artifacts, even when the final answer looks ordinary.

Safety reports should be read like developer documents

The most useful part of OpenAI's playbook is not the policy vocabulary. It is the reporting structure. A good frontier evaluation report should first state the claim being tested: capability ceiling, shared comparison, or safeguard robustness. Then it should describe the tested system as model name, reasoning setting, tool access, harness, and safeguards. Finally, it should disclose the budget and validity checks.

That standard narrows many benchmark headlines. "Model X scored 80% on benchmark Y" is not enough for a product decision without harness and budget details. "Model X is safe" may describe only a narrow safeguard test unless the attack budget and tool loop are visible. "Model X is good at coding" needs repository access, test execution, patch validation, and context compaction details before it becomes comparable.

Developers get a concrete checklist from this. An internal agent proof of concept should record the model name, prompt, tool set, shell permissions, browser permissions, dependency-install permissions, time limit, retry policy, context compaction, and cost per successful solve. The same table will later help security review and cost review. A POC that does not record its evaluation environment is hard to reproduce, and a result that cannot be reproduced is weak evidence for buying or deployment.
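That checklist can be enforced with a small completeness check on whatever record the team already keeps. The field names below are one hypothetical encoding of the items listed above, and the example record, including the placeholder model name `model-x`, is invented for illustration.

```python
REQUIRED_FIELDS = (
    "model_name",
    "prompt",
    "tool_set",
    "shell_permissions",
    "browser_permissions",
    "dependency_install_permissions",
    "time_limit_s",
    "retry_policy",
    "context_compaction",
    "cost_per_solve_usd",
)


def missing_fields(record: dict) -> list[str]:
    """List required fields that are absent, None, or an empty string.

    Values such as False or 0 count as recorded: "no shell access" is a
    meaningful answer, while a blank is not.
    """
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


poc_record = {
    "model_name": "model-x",  # placeholder, not a real model
    "prompt": "Fix the failing test in the repository.",
    "tool_set": ["shell", "test_runner"],
    "shell_permissions": "read-write, sandboxed",
    "browser_permissions": False,
    "time_limit_s": 1800,
    "retry_policy": "",
    "context_compaction": None,
}

print(missing_fields(poc_record))
```

In this example the record would be sent back for the dependency-install permissions, retry policy, compaction setting, and cost per solve before anyone treats its result as evidence.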

OpenAI's documents still need external comparison

The playbook is useful, but it is still an OpenAI-authored document. For example, the suggestion that Codex CLI can serve as a common floor is natural for OpenAI model users. Independent evaluators comparing Anthropic, Google, open-weight models, and OpenAI systems still need to verify whether that loop is neutral for the task they are testing. Codex CLI can be a useful fixed loop without becoming a universal agentic-task harness.

The governance framework has the same limitation. OpenAI says it is adapting its Preparedness Framework for public regulatory obligations, but actual compliance will keep changing as California law, the EU AI Act Code of Practice, and national AI safety institute requirements become more concrete. The announcement itself says the framework will update as model capabilities, evaluations, and regulatory requirements change. It is a public position, not a final standard.

Community reaction was also limited immediately after publication. Product-heavy announcements such as Rosalind Biodefense moved quickly through Reddit and AI news summaries, while the Frontier Governance Framework and third-party evaluation playbook did not appear to trigger a large developer-community debate right away. Policy documents can still change procurement templates, safety reviews, and benchmark reporting after the first wave of attention passes.

Agent-era scorecards will look more like execution logs

Treating these documents only as AI safety material misses their developer impact. OpenAI is asking evaluation reports to become more observable. They should show what prompt the model received, which tools it used, which context it retained, how many times it failed, and which scorer accepted the result. That is close to the language of production agent observability.

Frontier risk evaluation and product evaluation are converging. Measuring cyber capability means inspecting how the agent uses shell and network access. Measuring coding ability means inspecting the test runner and diff validator. Measuring safeguard robustness means testing whether the system holds up when the attacker has a custom harness and repeated attempts. The same structure applies when an enterprise team evaluates an internal support, security, or coding agent.

OpenAI's May 28 and May 29 documents change how AI scorecards should be read. A serious evaluation now needs harnesses, tools, budgets, and validity checks attached to the model name and final score. For AI teams, the immediate task is not to collect more benchmark screenshots. It is to add columns for the agent loop and execution cost to the scorecards they already use.