Devlery

Vera Ships First 88-Core CPU Systems to OpenAI for Agent Workloads

NVIDIA delivered the first Vera CPU systems to OpenAI, Anthropic, SpaceXAI, and OCI. The launch points to CPU bottlenecks in tool calls, sandboxes, Python, and long-running agents.

AI Summary
  • What happened: NVIDIA delivered the first 88-core Vera CPU systems to Anthropic, OpenAI, SpaceXAI, and Oracle Cloud Infrastructure.
    • The public announcement landed on May 18, 2026, and Vera is NVIDIA's first data center CPU built on internally designed cores.
  • Why builders should care: Agent workloads are exposing CPU-side bottlenecks in Python, tool calls, sandboxes, orchestration, and KV cache handling.
  • Watch: Early Phoronix benchmarks were strong, but NVIDIA limited the test set and did not expose power, frequency, or pricing data.

NVIDIA said on May 18, 2026 that it had delivered the first Vera CPU systems to Anthropic, OpenAI, SpaceXAI, and Oracle Cloud Infrastructure. At first glance this looks like another server-chip milestone. For AI builders, the narrower question is more useful: when agents run code, call tools, read files, start browsers, and keep state inside sandboxes for minutes at a time, the bottleneck does not end at the GPU. Vera is NVIDIA's first data center CPU built on internally designed cores, and NVIDIA is explicitly positioning it for that off-GPU work.

The delivery schedule was specific. NVIDIA said the first shipments reached Anthropic, OpenAI, and SpaceXAI on Friday, May 15, 2026, followed by OCI on Monday, May 18. Ian Buck, NVIDIA's vice president for hyperscale and high-performance computing, personally delivered the systems. NVIDIA calls Vera a CPU for "agentic AI." That phrase can sound broad, but the listed workloads are concrete: agent sandboxes, tool calls, orchestration, long-context state management, data analysis, and reinforcement learning pipelines all wake the CPU between GPU-heavy inference steps.

Official NVIDIA photo of the first Vera CPU customer delivery

The headline specifications aim straight at data center CPU competition. NVIDIA's product page lists 88 NVIDIA-designed Olympus cores, up to 176 threads, Armv9.2 compatibility, up to 1.2TB/s of LPDDR5X memory bandwidth, up to 1.5TB of memory, second-generation Scalable Coherency Fabric, NVLink-C2C, and confidential computing support. NVIDIA also claims Vera can run software environments up to 50% faster than existing CPU infrastructure while doubling energy efficiency. A Vera CPU Rack can combine up to 256 Vera CPUs and run more than 22,500 concurrent environments.

For developers, the first numbers to inspect are not only core count. Memory bandwidth, single-thread behavior, process startup, filesystem access, package installation, and many small control-flow decisions matter in agent systems. A coding agent that opens a browser, runs tests, generates Python, installs packages, reads logs, and folds tool output back into a model context is not just doing matrix multiplication. It creates many small processes and I/O edges around the model call. That is why Vera is being framed less as a CPU that "feeds the GPU" and more as a CPU that keeps agent execution environments alive.
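One way to make the "many small processes" point measurable is to time how long a host takes to spawn a trivial child process. The sketch below is a minimal illustration, not a benchmark from NVIDIA or Phoronix. The function name `time_process_startup` and the choice of an empty Python child are assumptions made for this example.

```python
import statistics
import subprocess
import sys
import time


def time_process_startup(runs: int = 20) -> list[float]:
    """Return wall-clock seconds for spawning a trivial Python child, per run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # The child does no work, so each sample is dominated by process
        # creation, interpreter startup, and teardown.
        subprocess.run([sys.executable, "-c", "pass"], check=True)
        samples.append(time.perf_counter() - start)
    return samples


if __name__ == "__main__":
    samples = time_process_startup()
    print(f"median startup: {statistics.median(samples) * 1000:.1f} ms")
    print(f"worst startup:  {max(samples) * 1000:.1f} ms")
```

An agent that launches hundreds of short-lived tools per task pays this cost every time, which is why the median and the worst case are both worth tracking on any candidate host.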

OpenAI and Anthropic being on the first customer list is not just a reference-logo story. Both companies have put coding agents and long-running agents closer to the center of their 2026 product strategy. Claude Code, Cowork, Codex, and API-based tool-calling flows all require real execution environments beyond the model response. When an agent clones a repository, runs a test suite, reads logs, and applies patches, the GPU handles inference while the operating system, CPU, storage, and sandbox policy handle the work around it. Vera's first customers make sense because research labs and cloud operators are now dealing with the same execution-layer pressure.

OCI's statement makes the scale more explicit. In NVIDIA's blog post, OCI senior vice president Karan Batta said Oracle plans to deploy hundreds of thousands of Vera CPUs beginning in 2026. OCI described Vera as an architecture for high-throughput reasoning workloads in enterprise agentic AI infrastructure. That number still needs follow-through in procurement, deployment timing, and customer-facing instance types. Even so, "hundreds of thousands of CPUs" places Vera closer to an AI cloud supply-chain component than a research evaluation board.

NVIDIA Vera system delivery at OpenAI's Mission Bay office

Vera is also different from NVIDIA's earlier CPU story. Grace already existed, and cloud providers have operated Arm-based server CPUs for years. Vera's new signal is that NVIDIA is no longer relying on Arm Neoverse cores for this part of the system. It designed the Olympus core itself. Phoronix summarized Vera as using NVIDIA's own Olympus cores with Armv9.2 ISA support, FP8 support, 176 threads, 2MB of L2 cache per core, 164MB of shared L3 cache, PCIe Gen 6, and CXL 3.1.

Phoronix followed the launch with early Linux benchmark numbers on May 26, 2026. Michael Larabel tested a Vera system at NVIDIA's Santa Clara headquarters and reported that, across the permitted workloads, Vera's geometric mean came in 10% ahead of AMD's 5.0GHz high-frequency EPYC 9575F. The same conclusion page put Vera at 1.63x Grace and 1.55x Intel's Xeon 6980P. The tested workloads included compilation, Python, Java, ClickHouse, regular expressions, and compression. Those categories map more closely to agent infrastructure than to a pure GPU benchmark.
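For readers unfamiliar with how a geometric mean condenses a benchmark suite into one number, here is a minimal sketch. The per-workload ratios are hypothetical placeholders chosen for illustration. They are not Phoronix's measurements, and the helper name `summarize` is an assumption of this example.

```python
from statistics import geometric_mean

# Hypothetical per-workload speedups (candidate score / baseline score).
# These values are invented for illustration, NOT Phoronix's results.
speedups = {
    "compilation": 1.20,
    "python": 1.05,
    "java": 0.95,
    "clickhouse": 1.30,
    "regex": 1.10,
    "compression": 1.00,
}


def summarize(ratios: dict[str, float]) -> float:
    """Collapse per-workload speedup ratios into one geometric mean."""
    return geometric_mean(list(ratios.values()))


if __name__ == "__main__":
    forward = summarize(speedups)
    # Swapping candidate and baseline inverts every ratio; the geometric
    # mean then inverts as well, which an arithmetic mean would not do.
    inverse = summarize({name: 1 / r for name, r in speedups.items()})
    print(f"candidate vs baseline: {forward:.3f}x")
    print(f"baseline vs candidate: {inverse:.3f}x")
    print(f"product (should be ~1): {forward * inverse:.6f}")
```

The geometric mean is the usual choice for ratio data because the summary does not depend on which system is treated as the baseline. It also means a single headline figure can hide individual workloads where the candidate loses, so the workload list matters as much as the mean.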

Area | Official claim or early result | Still unresolved
Architecture | 88 Olympus cores, Armv9.2, 176 threads | Clock behavior, power draw, and cooling conditions by production server chassis
Memory | Up to 1.2TB/s of LPDDR5X bandwidth | Sustained performance and cost under large customer workloads
Early benchmark | 10% ahead of EPYC 9575F in Phoronix's geometric mean | NVIDIA-selected workload scope, with no public power or frequency telemetry
Developer impact | CPU headroom for compilation, Python, sandboxes, and tool execution | Pricing, cloud availability, and access beyond hyperscalers and top AI labs

The benchmark caveats matter as much as the performance claim. Phoronix noted that NVIDIA allowed only a specific set of workloads for this early round and did not allow CPU power consumption or frequency monitoring. The test platform was also a pre-release open-platform system, not necessarily the final tuning of a production server chassis. The careful reading is not "Vera beats every server CPU." It is that Vera produced competitive numbers against high-end x86 CPUs in the class of workloads NVIDIA wants to associate with agentic data centers.

Community reaction repeated the same caution. Reddit r/hardware and Phoronix commenters treated the performance as notable while asking about NVIDIA-selected benchmarks, missing power numbers, final server pricing, AMD EPYC Venice timing, and whether ordinary developers will ever get direct access. Some readers interpreted Vera less as a general-purpose CPU and more as a specialized CPU for Python-heavy orchestration, agentic inference, and reinforcement learning environments inside AI servers. That criticism also clarifies the target market: NVIDIA is not trying to build an ordinary web-server CPU first. It is trying to own the CPU slice of the AI factory.

Linux support is a positive signal for that ambition. Phoronix reported that Vera follows Arm server standards, uses ACPI rather than Device Tree, and already has a path through major AArch64 Linux distributions. GCC 16.1 and newer, along with LLVM Clang 21 and newer, include Olympus core optimization support. A fast AI server still becomes expensive to operate if kernel support, compilers, drivers, and distribution readiness lag behind the hardware. Vera appears to be entering with more upstream preparation than Grace had at launch.

For teams building agents, Vera's message sits outside the model comparison table. Costs are often tracked as input tokens, output tokens, GPU inference, storage, and network. The failures, however, often show up somewhere drier: tests take too long, browser sandboxes crash, package installation blocks progress, logs are incomplete, or multiple agents compete over the same workspace and file locks. These problems are closer to CPU time, I/O, isolation policy, and orchestration than to model quality. A faster model still feels slow when the execution environment stalls.

NVIDIA's pairing of Vera with reinforcement learning points in the same direction. RL workloads create many parallel environments, run evaluation loops, save state, and move between inference and model updates. NVIDIA says a Vera CPU Rack can run more than 22,500 concurrent environments. That figure is aimed at workloads that run many small environments at once, such as robotics simulation, code execution evaluation, and agent behavior evaluation. While the GPU calculates policies or model outputs, the CPU has to keep the environments and control flow moving.
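A back-of-envelope check puts the rack claim in context. It uses only figures quoted in this article: up to 256 CPUs per rack, 88 cores and 176 threads per CPU, and more than 22,500 concurrent environments. Reading the result as roughly one environment per core is an inference made here, not a mapping NVIDIA has stated, and the function name `rack_summary` is an assumption of this sketch.

```python
CPUS_PER_RACK = 256            # "up to 256 Vera CPUs" per rack
CORES_PER_CPU = 88
THREADS_PER_CPU = 176
CLAIMED_ENVIRONMENTS = 22_500  # "more than 22,500", used as a lower bound


def rack_summary() -> dict[str, float]:
    """Derive per-rack totals from the figures quoted in the article."""
    total_cores = CPUS_PER_RACK * CORES_PER_CPU        # 22,528
    total_threads = CPUS_PER_RACK * THREADS_PER_CPU    # 45,056
    return {
        "total_cores": total_cores,
        "total_threads": total_threads,
        "envs_per_cpu": CLAIMED_ENVIRONMENTS / CPUS_PER_RACK,
        "cores_per_env": total_cores / CLAIMED_ENVIRONMENTS,
    }


if __name__ == "__main__":
    for key, value in rack_summary().items():
        print(f"{key:14s} {value:,.2f}")
```

A fully populated rack has 22,528 cores, so the 22,500 figure works out to about one core, or two hardware threads, per environment. Whether NVIDIA actually pins environments to cores that way is not stated in the sources cited here.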

KV cache management is another CPU-side task in NVIDIA's Vera story. Long-context and multi-turn agents are not just one large model call. Prior messages, file search results, tool outputs, generated code, and test logs move between the model context and external storage. NVIDIA says Vera handles data movement and control so GPUs remain utilized. Read without exaggeration, Vera is not the brain of the LLM. It is closer to the workbench and conveyor around the model, keeping the rest of the system supplied and observable.

OCI will be the important place to watch for cloud economics. Developer access depends less on the chip name than on instance types, hourly pricing, region availability, startup latency, storage attachment, network behavior, and container launch time. Phoronix also called out price and availability outside hyperscalers, AI companies, and large customers as open questions. If Vera remains mostly inside AI labs and large cloud racks, most developers will experience it indirectly through faster agent infrastructure rather than through an instance they choose by name.

The competitive field is broader than AMD and Intel. EPYC and Xeon are the direct server CPU comparisons. In AI agent infrastructure, Vera also competes with AWS Graviton, custom Arm CPUs at Google and Microsoft, data-center-specific CPUs, serverless sandboxes, and the orchestration layers built around GPU clouds. A developer usually does not ask "Vera or EPYC?" in isolation. The practical question is how cheaply and reliably a platform can run 1,000 agents at once while creating workspaces, installing packages, executing tests, preserving logs, and enforcing permissions.

NVIDIA has one advantage in making that platform argument. It can combine Vera CPUs, Rubin GPUs, BlueField DPUs, Spectrum-X networking, MGX rack architecture, and NVLink-C2C inside one roadmap. NVIDIA says Vera and Rubin are connected through a unified memory architecture to keep GPU utilization high. That integration can create lock-in and pricing pressure, but it also gives major AI labs and cloud providers a rack-level optimization path. As agent workloads grow, the latency, power, cooling, observability, and policy behavior of the whole rack can matter more than the benchmark score of one chip.

Most development teams will not buy Vera systems directly in the near term. The operational lesson is still immediate. Teams building coding agents, data-analysis agents, or enterprise task agents should measure CPU time, container startup time, file I/O, package cache behavior, test duration, log retention, and sandbox reset cost alongside model latency. Users do not distinguish between "the model is slow" and "the tool execution layer is slow." They experience one wait.

Security adds another reason to treat the CPU as more than a secondary component. Once an agent executes code or touches the network, isolation, permissions, and audit logs enter the runtime. Anthropic's recent discussions of containment, VM isolation, and egress control belong to the same operating layer. Vera's confidential computing and server-standard positioning matter because AI execution environments are gaining more sensitive authority. The need for fast CPU execution and the need for strong CPU-backed isolation arrive in the same sandbox.

Three caveats should stay attached to the launch. First, the Phoronix results are early and omit public power and frequency data. Second, actual customer deployment patterns will become clearer only as 2026 ramp plans turn into cloud products and production workloads. Third, AMD EPYC Venice and Intel's next Xeon generation may change the general-purpose CPU comparison chart again. Vera's stronger claim is not "all server workloads." It is agent execution, RL environments, AI factory control planes, and the CPU-heavy work surrounding GPU inference.

The first Vera shipment shifts one baseline in the AI infrastructure discussion. If agents become work executors rather than answer generators, the bottleneck cannot be described only with model quality, GPU memory, and tokens per second. CPUs handle compilation, Python, browsers, databases, regular expressions, file search, network policy, and sandbox lifecycles. OpenAI and Anthropic receiving Vera systems suggests that NVIDIA sees those off-GPU bottlenecks as a market in their own right.

The practical question for builders is direct: can you separate your agent latency into model calls, tool execution, sandbox preparation, file I/O, package installation, and logging overhead? If Vera succeeds, cloud providers will package some of those measurements into new instance families and price sheets. If it does not, the measurement problem still remains. The perceived speed of agent products is already being decided partly outside the GPU.
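Separating agent latency into phases does not require special hardware. The sketch below shows one minimal approach using a context manager that accumulates wall-clock time per named phase. The `PhaseTimer` class, the phase names, and the `time.sleep` stand-ins are all assumptions made for illustration. A real agent would wrap its actual model client and tool runner instead.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock time per named phase of an agent run."""

    def __init__(self) -> None:
        self.totals: dict[str, float] = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the phase raised.
            self.totals[name] += time.perf_counter() - start

    def shares(self) -> dict[str, float]:
        """Return each phase's fraction of the total recorded time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: secs / total for name, secs in self.totals.items()}


if __name__ == "__main__":
    timer = PhaseTimer()
    with timer.phase("sandbox_prep"):
        time.sleep(0.02)  # stand-in for creating a workspace
    with timer.phase("model_call"):
        time.sleep(0.05)  # stand-in for an inference request
    with timer.phase("tool_execution"):
        time.sleep(0.03)  # stand-in for running a test suite
    for name, share in sorted(timer.shares().items(), key=lambda kv: -kv[1]):
        print(f"{name:15s} {share:6.1%}")
```

Wall-clock time per phase is the number users feel. Pairing it with CPU time for the same phases would show whether a slow step is compute-bound or waiting on I/O.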
