NVIDIA Vera arrives, and agent infrastructure gets a CPU bottleneck

NVIDIA’s first Vera CPU deliveries show that agent infrastructure bottlenecks are spreading from GPU inference into CPU orchestration.

AI summary

What happened: NVIDIA delivered its first Vera CPU systems to Anthropic, OpenAI, SpaceXAI, and OCI.
- Vera is a CPU for agentic AI with 88 Olympus cores and 1.2 TB/s of memory bandwidth.
Why it matters: The cost bottleneck for agents is expanding beyond GPU inference into sandbox execution, tool calls, orchestration, and long-context state.
Watch: NVIDIA’s claim is directionally important, but real efficiency still depends on workload-level validation by AI labs and cloud providers.
- OCI says it plans to deploy hundreds of thousands of Vera CPUs starting in 2026, but that remains a provider plan quoted in NVIDIA’s own blog.

NVIDIA has pulled the Vera CPU back into the center of the AI infrastructure story. This time, the news is not a keynote slide. It is a delivery event. On May 18, 2026, NVIDIA said the first Vera CPU systems had been delivered to Anthropic, OpenAI, SpaceXAI, and Oracle Cloud Infrastructure (OCI). The CPU introduced at GTC in March as a processor for "agentic AI" is now moving into the hands of AI labs and cloud providers for validation.

That is more interesting than simply saying NVIDIA is selling a CPU. For the last two years, the AI infrastructure conversation has been compressed into GPUs, HBM, inference throughput, training clusters, and accelerator supply. Agentic AI changes where some of the pressure lands. If a product only generates one answer from one model call, the GPU is the obvious center of gravity. If the product runs code, opens files, searches, launches a browser, starts a sandbox, verifies results, and revises its plan, the work around the model becomes much larger. NVIDIA is now describing that surrounding work in the language of CPUs.

NVIDIA’s framing is blunt. Its blog says AI agents do not run on GPUs alone, then lists agentic sandboxes, tool calls, orchestration layers, and long-context retrieval as CPU work. That is a product pitch, but it also matches what many builders now see in practice. Claude Code, Codex, Copilot coding agent, Google Antigravity, and similar systems need more than model inference. They check out repositories, install dependencies, run tests, drive browsers, inspect logs, and keep enough state around to recover from mistakes. During those steps, CPU, memory bandwidth, I/O, and networking all become part of the user experience.

Image: NVIDIA Vera CPU system delivered to OpenAI

According to NVIDIA, the first deliveries happened on a Friday at Anthropic, OpenAI, and SpaceXAI, followed by OCI the next Monday. At Anthropic, a compute leader described Vera as part of the ecosystem needed for agentic workloads. At OpenAI, a compute infrastructure leader inspected the system. SpaceXAI is evaluating Vera for reinforcement learning workloads and agent-based simulation pipelines. OCI said it plans to deploy hundreds of thousands of Vera CPUs starting in 2026.

The important point is that Vera is not being positioned as a CPU that replaces GPUs. NVIDIA’s story is role separation. Vera can run as a standalone CPU system, and it also becomes the host CPU inside Vera Rubin NVL72 with Rubin GPUs. NVIDIA says Vera connects to Rubin through second-generation NVLink-C2C and provides 1.8 TB/s of coherent bandwidth. In the "AI factory" architecture NVIDIA keeps describing, the CPU is not a peripheral device. It is the control plane for data movement, tool execution, orchestration, memory coordination, and the work that keeps accelerators fed.

- 88 Olympus CPU cores
- 1.2 TB/s memory bandwidth
- 256 CPUs in one rack
- 22,500+ concurrent CPU environments

The numbers from NVIDIA’s March announcement make the target workload clearer. Vera has 88 NVIDIA-designed Olympus cores, an LPDDR5X-based memory subsystem, and up to 1.2 TB/s of memory bandwidth. NVIDIA says it can produce results at twice the efficiency and 50% faster than traditional rack-scale CPUs in the workloads it cites. It also says a rack of 256 liquid-cooled Vera CPUs can keep more than 22,500 concurrent CPU environments running at full performance. That message is not about one model producing one answer. It is about many execution environments staying alive at the same time.
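One way to read the rack figure is to multiply it out. The short calculation below uses only the numbers quoted above. The reading that this works out to roughly one core per environment is an inference from the arithmetic, not something NVIDIA states in the material cited here.

```python
# Figures quoted from NVIDIA's announcement, as reported in this article.
CORES_PER_CPU = 88
CPUS_PER_RACK = 256
CLAIMED_ENVIRONMENTS = 22_500  # NVIDIA says "more than 22,500"

cores_per_rack = CORES_PER_CPU * CPUS_PER_RACK
cores_per_environment = cores_per_rack / CLAIMED_ENVIRONMENTS

print(f"cores per rack: {cores_per_rack}")                        # 22528
print(f"cores per environment: {cores_per_environment:.3f}")      # 1.001
```

If that reading holds, the claim is less about packing many environments onto each core and more about giving each concurrent environment a core that does not slow down under load.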

This connects directly to how AI product teams operate. Many teams are no longer measuring only model quality. They also track total agent session cost, browser sandbox lifetime, background job retries, repository checkout time, test runner wait time, tool execution timeouts, and observability gaps. Users may describe the product as a coding agent, research agent, or office agent, but the perceived quality often comes down to whether the work finishes, whether it stalls halfway through, and how quickly it opens a useful pull request or produces a verifiable result. In that zone, CPU capacity becomes a real cost axis again.

NVIDIA bringing Vera first to Anthropic and OpenAI is symbolic for that reason. These companies are not only frontier model labs. They are also pushing coding agents, enterprise workflows, and long-horizon tool use into products. Anthropic is expanding Claude Code and office-oriented workflows. OpenAI is extending Codex and ChatGPT-based work environments. Those products need more than strong model weights. They need persistent execution environments, file systems, browsers, network egress, permissions, observability, and cost control.

The pitch still needs careful reading. "CPU for agentic AI" is a broad label. Some agents remain mostly inference-bound. Some are code-execution-bound. Others are blocked by database latency, retrieval latency, network policy, or tool reliability. NVIDIA’s performance numbers also come from workloads and reference architectures selected by NVIDIA. Community reaction reflects that caution. Hardware discussions around Vera generally accept that AI-oriented workloads may benefit, while warning that this does not automatically mean Vera displaces AMD EPYC or Intel Xeon across the general server market.

Still, the direction is hard to ignore. As AI systems move from answering to acting, they have to process more small operations in parallel. Tool calls add API latency, serialization, retries, and permission checks. Sandboxes add process, filesystem, package installation, and cleanup costs. Reinforcement learning rollouts need many environment instances. Long-context agents touch memory and storage layers repeatedly. If the CPU cannot prepare, verify, clean up, and retry while the GPU generates tokens, the whole pipeline loses utilization. NVIDIA’s "extreme codesign" language is an attempt to close that gap.
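The utilization argument can be made concrete with a toy steady-state model. This is only a sketch: the function name, its parameters, and the example timings are hypothetical. It also assumes there are enough concurrent sessions that CPU work for one session can overlap with token generation for another.

```python
def gpu_utilization(gpu_s: float, cpu_s: float, overlapped: bool) -> float:
    """Fraction of wall-clock time the GPU is busy during one agent step.

    gpu_s: seconds of token generation per step (hypothetical input)
    cpu_s: seconds of tool, sandbox, and verification work per step
    overlapped: True if CPU work runs concurrently with GPU generation
    """
    if gpu_s <= 0 or cpu_s < 0:
        raise ValueError("gpu_s must be positive and cpu_s non-negative")
    # Serial: the GPU idles while the CPU works, so the stage times add up.
    # Overlapped: the slower of the two stages sets the pace of the pipeline.
    step_time = max(gpu_s, cpu_s) if overlapped else gpu_s + cpu_s
    return gpu_s / step_time


# Illustrative timings only, not measurements from any real system.
print(gpu_utilization(2.0, 6.0, overlapped=False))  # 0.25
print(gpu_utilization(2.0, 1.5, overlapped=True))   # 1.0
```

In this model, overlapping alone does not rescue the GPU when the CPU stage is the slower one. The GPU stays fully busy only when the CPU side keeps pace, which is the gap the codesign pitch is aimed at.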

Agent workload	GPU-centered bottleneck	CPU-centered bottleneck
Coding agent	Planning, code-change proposals, review reasoning	Checkout, install, compile, test, sandbox lifetime
Research agent	Summaries, hypothesis generation, long-context reasoning	Search, data cleanup, Python execution, result verification
RL and simulation	Policy and value model training or inference	Mass environment rollout, orchestration, state sync

OCI’s statement matters for cloud competition. NVIDIA’s blog quotes OCI as planning to deploy hundreds of thousands of Vera CPUs from 2026 and describes OCI as the first cloud provider to deploy Vera at hyperscale. If that becomes real service SKUs and pricing, AI infrastructure competition in the cloud expands beyond GPU quota. The question becomes which CPU, GPU, memory, storage, and sandbox combination can run agentic workloads cheaply and reliably. Developers will compare not only model API prices, but also agent runtime duration, sandbox lifetime, tool execution concurrency, cold start behavior, and failure recovery.

That shift also reaches AI coding tools directly. NVIDIA’s March materials said Cursor is adopting Vera to improve throughput and efficiency for AI coding agent experiences. The key lesson is not simply that a better model makes a coding agent faster. The throughput of the entire work graph matters. The same model can feel different depending on repository size, dependency graph, test suite behavior, browser automation, and CI integration. Vera is NVIDIA’s attempt to pull more of that lower layer into its own platform.

At the current stage, however, it would be premature to call this a change in who leads the overall CPU market. According to NVIDIA, Vera has entered full production and partner availability, but public large-scale operating data is still limited. Benchmarks are sensitive to workload selection. General-purpose databases, web services, storage systems, and enterprise backend workloads are separate questions. NVIDIA’s strongest argument is inside an integrated AI factory architecture where GPUs, CPUs, NVLink, memory, software, and cloud partners are designed together.

The more accurate reading is this: agents are making CPU infrastructure first-class again. The first bottleneck of the modern AI boom was GPU supply. The next bottleneck for agentic products may be execution density and orchestration. If thousands of agent sessions are running code, calling tools, keeping state, recovering from failures, and coordinating with model inference, the CPU is part of the product surface.

The practical questions for development teams are also changing. Model choice and prompt quality remain important, but they are no longer enough. Teams need to ask which sandbox their agent runtime uses, where tool calls execute, how long-running session state is stored, where tests and browser automation bottleneck, and whether cost observability stops at tokens or reaches infrastructure. Vera is NVIDIA’s answer to those questions, but it is not the only possible answer.
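As a rough illustration of cost observability that reaches past tokens, the sketch below splits one agent session's cost into a token part and an infrastructure part. Every name, field, and price here is a hypothetical placeholder chosen for the example. None of it reflects a real provider's pricing or API.

```python
from dataclasses import dataclass


@dataclass
class SessionUsage:
    input_tokens: int
    output_tokens: int
    sandbox_cpu_seconds: float  # CPU time spent in sandboxes, tests, browsers
    tool_calls: int


@dataclass
class Prices:
    # Placeholder unit prices; substitute your own measured or contracted rates.
    per_million_input_tokens: float
    per_million_output_tokens: float
    per_cpu_second: float
    per_tool_call: float


def session_cost(usage: SessionUsage, prices: Prices) -> dict[str, float]:
    """Return the token cost, the infrastructure cost, and their sum."""
    tokens = (
        usage.input_tokens * prices.per_million_input_tokens
        + usage.output_tokens * prices.per_million_output_tokens
    ) / 1_000_000
    infrastructure = (
        usage.sandbox_cpu_seconds * prices.per_cpu_second
        + usage.tool_calls * prices.per_tool_call
    )
    return {
        "tokens": tokens,
        "infrastructure": infrastructure,
        "total": tokens + infrastructure,
    }


usage = SessionUsage(400_000, 50_000, sandbox_cpu_seconds=1800.0, tool_calls=120)
prices = Prices(3.0, 15.0, per_cpu_second=0.0001, per_tool_call=0.001)
for part, cost in session_cost(usage, prices).items():
    print(f"{part}: {cost:.2f}")
```

Tracking the two parts separately makes it visible when a session's cost is driven by sandbox time or tool traffic rather than by the model call itself.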

The bigger story is that an AI agent is not only an inference product. It is a distributed system. Once the product becomes a distributed system, bottlenecks do not stay inside the model. The first Vera systems going to Anthropic, OpenAI, SpaceXAI, and OCI put that reality into hardware supply-chain language. The agent infrastructure race is no longer just about who can buy the most GPUs. It is also about who can design the CPU, memory, storage, network, and sandbox layers tightly enough to keep those GPUs useful.