
NVIDIA Is Targeting the RL Training Loop

NVIDIA and Ineffable Intelligence are pointing the model race toward RL infrastructure for agents that learn from experience.

AI Summary
  • What happened: NVIDIA and Ineffable Intelligence announced an engineering collaboration around large-scale reinforcement learning infrastructure.
    • Ineffable is the London AI lab founded by AlphaGo researcher David Silver, and the work starts on NVIDIA Grace Blackwell while exploring Vera Rubin.
  • Why it matters: The model race is shifting from pretraining on human data toward learning loops that generate, evaluate, and update from experience.
  • Builder impact: Agent-era infrastructure is not just about GPU FLOPS. Simulation, interconnect, memory bandwidth, serving, and evaluators have to be designed together.
  • Watch: The "superlearner" framing is still a research vision. Its success will depend less on public benchmark claims than on environment design and iteration cost.

NVIDIA and Ineffable Intelligence announced on May 13, 2026, that they are working together on infrastructure for reinforcement learning, or RL, at large scale. At first glance it reads like another AI infrastructure partnership. But the announcement captures a sharper shift in the AI race: what comes after models have absorbed ever-larger piles of web text, code, papers, images, video, and human interaction data? NVIDIA and Ineffable are placing that next step inside systems that learn from experience.

Ineffable Intelligence is the London AI research lab founded by David Silver, best known for leading core AlphaGo research at DeepMind. The company's own framing is ambitious: it wants to build a "superlearner" that does not merely repeat knowledge learned from human data, but continually discovers knowledge and skills from its own experience. That phrasing deserves caution. This is not yet a public product or open model. It is a research thesis. Still, the thesis is not floating in a vacuum. Silver's history in reinforcement learning, NVIDIA's training and inference stack, and the industry's search for learning methods beyond static human data all point in the same direction.

Ineffable Intelligence official logotype

In NVIDIA's announcement, Jensen Huang described the next AI frontier as superlearners that continuously learn from experience. Silver's argument is more direct: researchers have made substantial progress on the easier problem of building AI systems that know what humans already know, but now need to solve the harder problem of systems that discover new knowledge for themselves. The practical interpretation is this: AI is moving from compressing existing human knowledge toward interacting with environments, running experiments, and turning feedback into new capability.

That claim matters because the dominant assumption behind today's LLM race is still data. Large language models learn from human-created text, code, images, audio, video, preferences, and interaction logs. Synthetic data and self-play are already part of the stack, but most production systems still sit inside a familiar shape: pretraining, supervised fine-tuning, preference optimization, and evaluation sets. Improving performance often means finding better data, training a larger model, extending context, or adding more careful post-training. As human data becomes more expensive, legally sensitive, duplicated, contaminated, and harder to curate, the strategic question becomes harder to avoid: can models keep advancing by learning mainly from traces humans already left behind?

RL has long offered a different answer. An agent acts, the environment responds, a reward or score is assigned, and the policy is updated. AlphaGo and AlphaZero were powerful not because they memorized human game records, but because self-play let them explore parts of the state space humans had not directly supplied. Of course, Go is not the real world. Go has explicit rules, clean state, and relatively legible rewards. Software engineering, scientific research, robotics, long-horizon planning, and organizational work are much messier. That is why the core of this announcement is not simply "we will use RL." It is "we will co-design the infrastructure that can carry RL at this scale."

NVIDIA and Ineffable describe the difference between pretraining and RL workloads in infrastructure terms. In pretraining, a fixed human dataset flows through the system. In RL, data is produced on the fly. The system has to act, observe, score, and update in a tight loop. That loop puts different pressure on interconnect, memory bandwidth, and serving than classic pretraining does. Ineffable's target of richer experience may also involve data very different from ordinary language corpora: simulations, code execution, game-like environments, physical models, tool-use traces, or robot sensor streams. Once those enter the loop, model architecture and training algorithms may need to change as well.

  • Act: the agent chooses the next experiment in an environment
  • Observe: collect simulation, tool, sensor, or code execution results
  • Score: grade outcomes with rewards, verifiers, tests, or judge models
  • Update: adjust the policy, memory, dataset, or environment difficulty
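
In code, the loop itself is small; the difficulty is everything around it. A minimal Python sketch, in which env, policy, and scorer are hypothetical interfaces rather than anything NVIDIA or Ineffable has published, looks like this:

```python
# Illustrative act -> observe -> score -> update loop.
# env, policy, and scorer are hypothetical stand-ins, not a real API.
from dataclasses import dataclass, field

@dataclass
class Transition:
    observation: object   # state the action was taken from
    action: object        # what the agent chose to do
    reward: float         # score assigned by a verifier or judge

@dataclass
class ReplayBuffer:
    items: list = field(default_factory=list)

    def add(self, transition: Transition) -> None:
        self.items.append(transition)

def run_experience_loop(env, policy, scorer, steps: int) -> ReplayBuffer:
    """One tight act/observe/score/update cycle per step."""
    buffer = ReplayBuffer()
    obs = env.reset()
    for _ in range(steps):
        action = policy.act(obs)              # Act: choose the next experiment
        next_obs, outcome = env.step(action)  # Observe: tool, test, or sensor result
        reward = scorer(outcome)              # Score: reward, verifier, or judge model
        buffer.add(Transition(obs, action, reward))
        policy.update(buffer)                 # Update: learn from the experience
        obs = next_obs
    return buffer
```

Each line of that loop maps to one of the four steps above; at scale, each also maps to a different piece of infrastructure.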

This loop is harder to run than ordinary batch training. Pretraining can be simplified as the problem of pushing a massive data pipeline and matrix operations through a stable cluster. That is already extremely hard, but the data is mostly prepared before it enters the training run. In RL, the model's action changes the next data point, and the environment's response changes the training path. Running many parallel environments can raise throughput, but it can also increase delay between observation, evaluation, and update. If the loop is too slow, learning becomes sluggish. If it is too expensive, the system cannot afford enough experiments. GPU speed alone is not the whole story.
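
A toy simulation makes the throughput-versus-staleness tension concrete. In the sketch below every name and number is invented for illustration: 64 parallel environments return trajectories after variable delays, each trajectory records which policy version produced it, and the gap between that version and the current one is the staleness the learner has to tolerate.

```python
# Toy model of policy staleness under parallel rollouts.
# All names and numbers are illustrative assumptions.
import random

def rollout(env_id: int, policy_version: int) -> dict:
    # Each environment takes a variable number of update cycles
    # to return a finished trajectory.
    return {"env": env_id, "born_at": policy_version,
            "delay": random.randint(1, 5)}

policy_version = 0
pending = [rollout(i, policy_version) for i in range(64)]  # parallel envs
staleness = []

for _ in range(100):
    policy_version += 1  # the learner applies one update per cycle
    for traj in [t for t in pending if t["delay"] <= 0]:
        # Staleness: updates applied since this experience was generated.
        staleness.append(policy_version - traj["born_at"])
        pending.remove(traj)
        pending.append(rollout(traj["env"], policy_version))
    for traj in pending:
        traj["delay"] -= 1

print(f"mean staleness: {sum(staleness) / len(staleness):.1f} updates")
```

If rollouts are slow relative to the update cadence, the average gap widens, which is why loop latency is an infrastructure property rather than just an algorithmic detail.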

The way NVIDIA connects this announcement to its own platform roadmap is the most revealing part. The work begins on NVIDIA Grace Blackwell, with early exploration of the upcoming NVIDIA Vera Rubin platform. Grace Blackwell is already positioned as a current-generation platform for large-scale AI training and inference. Vera Rubin is the next platform NVIDIA has been framing around agentic workloads and long-context inference. Ineffable entering that path means the RL lab is not just a GPU customer. It becomes a partner helping define what the next AI workload should look like.

The context becomes clearer when this announcement is read alongside NVIDIA's May 5, 2026 technical blog post about agentic systems and "extreme co-design." In that post, NVIDIA argued that agentic workloads differ sharply from ordinary chatbots. A chatbot normally alternates between a user message and a model response. Tool calling makes the pattern less predictable because tool output enters the context. Agentic systems go further: the model decides how many tools to call, in what order, whether to create sub-agents, and when the task is complete. The workload itself becomes structurally probabilistic.

NVIDIA's example Claude Code session made the issue concrete. Over 33 minutes, the session generated 283 inference requests, including 58 main-agent turns and 225 sub-agent invocations. Context grew from 15K tokens to 156K tokens before compaction reduced it to roughly 20K. That example is not the same experiment as the Ineffable RL collaboration, but it explains why NVIDIA treats agent workloads as a co-design problem. When an agent works for a long time, creates sub-tasks, and keeps interacting with an environment, tokens, cache, networking, memory, and latency all become part of product quality.

  • 283 inference requests in NVIDIA's example agentic coding session
  • 225 sub-agent invocations
  • 156K peak context tokens
  • 15x token increase noted in Anthropic's multi-agent system report

An RL-based superlearner could be even more demanding. A coding agent at least leaves behind files, tests, diffs, and logs. In scientific discovery, robot control, or complex simulation, the system must also define which actions are meaningful, how rewards should be shaped, how failures should be classified, and how unsafe exploration should be blocked. The phrase "learns from experience" sounds elegant, but implementing it means building environments, evaluators, failure triage, replay or reuse mechanisms, and infrastructure that can survive repeated failure at high speed.

That is why this news is more structural than "NVIDIA will sell more GPUs." NVIDIA is already central to training and inference infrastructure. If LLM pretraining becomes more standardized, the next layer of differentiation moves toward workload-specific optimization. Agentic inference needs prompt caching, KV cache management, context storage, low-latency fabric, and scheduling tuned for irregular execution. RL training needs environment rollouts, scoring, online data generation, distributed updates, simulation throughput, and evaluation loops. In both cases, system-level coordination matters more than the performance of a single accelerator in isolation.

NVIDIA packages this as "extreme co-design." Its platform story combines pieces such as Vera Rubin NVL72, Vera CPUs, NVLink 6, ConnectX-9, BlueField-4, Spectrum-X, Dynamo, NVFP4, TRT-LLM WideEP, and speculative decoding. Each piece is meant to attack a different bottleneck. There is no need to accept every vendor framing at face value; infrastructure announcements naturally include an idealized view of future platforms. The direction, however, is credible. Agents and RL become expensive and slow when compute, memory, networking, storage, serving schedulers, cache policy, and environment runtime are designed as separate systems.

For Ineffable, the NVIDIA collaboration is more than a procurement story. Silver's message creates distance from human-data-centered AI. Ineffable's homepage says superintelligence will come from experience, not human data. It separates the problem of collecting what humans already know from the problem of discovering what humans do not know yet. That distinction can easily become overdrawn, but it hits a real frustration in the field. LLMs are astonishingly useful, yet their ability to reliably discover new algorithms, run long experiments, verify scientific claims, or acquire physical skills through trial and error remains limited.

The competitive landscape is broader than one startup and one chip company. Google DeepMind is still the natural reference point for RL and self-play, with a line from AlphaGo and AlphaZero to later work such as AlphaDev and AlphaEvolve. OpenAI and Anthropic keep using reinforcement learning and evaluation loops across RLHF, RLAIF, tool-use agents, coding agents, and safety evaluations. xAI, Meta, and major Chinese labs are also leaning heavily on RL-like methods for reasoning models and agent training. Ineffable's distinction is not that it wants to use RL as one post-training technique. It is placing RL itself at the center of its route toward more capable AI.

There are clear reasons for skepticism. First, RL gets harder to validate as environments become less clean. It is powerful in domains with clear rewards, such as games or code tests, but in scientific discovery or general intelligence, a poorly designed reward function can push models toward shortcuts. Second, if a simulation does not represent reality well enough, the learned policy can fail outside the simulator. Third, RL is exploration-heavy. Turning trial and error into knowledge means absorbing many failures and containing them safely. Fourth, "discovering new knowledge" can remain a marketing phrase unless discoveries are independently validated by humans and institutions.

For developers and AI teams, the useful question is not whether Ineffable is about to build AGI. The more practical question is: how automated is your own agent learning loop? Many teams still build an agent, inspect logs manually, and adjust prompts by hand. More mature teams add evaluation datasets and regression tests. The next step is a closed loop that classifies failure traces from real environments, generates synthetic scenarios, improves policies or tool specifications, and reruns evaluation. At small scale, that is how coding agents improve. At large scale, it becomes experience-driven RL.
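
Spelled out as control flow, that closed loop is unglamorous. The sketch below assumes placeholder classify, synthesize, patch, and evaluate functions that do not exist under these names in any particular framework; only the shape of the loop is the point.

```python
# Skeleton of a closed agent-improvement loop: classify failures,
# synthesize regression scenarios, patch the policy, re-evaluate.
# Every interface here is a hypothetical placeholder.

def improvement_cycle(traces, classify, synthesize, patch, evaluate, baseline):
    failures = [t for t in traces if not t["passed"]]

    # Group failure traces by type so fixes can target a pattern.
    by_type: dict[str, list] = {}
    for trace in failures:
        by_type.setdefault(classify(trace), []).append(trace)

    # Expand each failure class into synthetic test scenarios.
    scenarios = [s for kind, ts in by_type.items() for s in synthesize(kind, ts)]

    # Propose a new prompt set, tool spec, or policy and re-run evaluation.
    candidate = patch(by_type)
    score = evaluate(candidate, scenarios)
    return candidate if score > baseline else None  # keep only real gains
```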

Category | Human-data-centered learning | Experience-centered RL learning
Data | Web, code, documents, conversations, and preference data | Simulations, tool results, sensor streams, self-play, and execution traces
Bottleneck | Data quality, duplication, copyright, and pretraining cost | Environment design, rollout cost, reward validation, and tight-loop latency
Infrastructure | Large-scale batch training and stable data pipelines | Co-designed interconnect, memory bandwidth, serving, evaluators, and simulators

From that perspective, NVIDIA's position is strong. Whatever algorithm model labs choose, they need infrastructure that can run more experiments at lower latency and lower cost. RL makes this especially important because failure and repetition are not incidental; they are the method. A single large training run matters, but the speed and cost of repeated loops may become even more important. That is why NVIDIA wants labs like Ineffable on the Grace Blackwell to Vera Rubin path. The infrastructure supplier is trying to read the next learning paradigm early and formalize that workload around its own platform.

This does not mean every company should think about buying a Vera Rubin cluster. For most development teams, the immediate need is a smaller feedback loop. Record where the agent failed. Tag the failure type. Create a reproducible test. Prevent the next version from regressing. Feed real traces into better evals. Those practices are far from the frontier lab version of RL, but they share the same discipline: agents improve when experience is captured, evaluated, and reused instead of being treated as disposable logs.
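
Even the minimal version of that discipline fits in a few lines. The sketch below uses invented field names rather than any specific framework's schema; it records each failure as a reproducible case and re-runs the whole set against a new agent version.

```python
# Minimal failure-trace record that doubles as a regression test case.
# Field names are illustrative, not from any specific framework.
import json
from dataclasses import dataclass, asdict

@dataclass
class FailureTrace:
    task: str          # what the agent was asked to do
    failure_type: str  # hand-applied tag, e.g. "wrong_tool" or "bad_parse"
    inputs: dict       # everything needed to reproduce the run
    expected: str      # what a correct run should have produced

def save_trace(trace: FailureTrace, path: str) -> None:
    with open(path, "a") as f:
        f.write(json.dumps(asdict(trace)) + "\n")

def regression_check(run_agent, path: str) -> list[str]:
    """Re-run every recorded failure; return tasks that still fail."""
    still_failing = []
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            if run_agent(case["inputs"]) != case["expected"]:
                still_failing.append(case["task"])
    return still_failing
```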

Community reaction remains relatively muted. Hacker News had several links earlier in 2026 about Ineffable's seed funding, David Silver's interviews, and the company's RL-centered vision, but those did not become large debates. On the announcement day, there did not appear to be a major discussion focused on the NVIDIA collaboration itself. That may simply mean the topic still sits closer to the research and infrastructure layer than to everyday product users. But the underlying question of what comes after human data is likely to spread. In domains where data is scarce, rewards are clearer, or simulations can be built, RL-based iterative learning can move back to the foreground.

The main point is not just what NVIDIA and Ineffable promised. It is where the AI race is widening. In 2023 and 2024, the dominant keywords were LLM scale and chatbot products. In 2025 and 2026, they became agents, tool use, long context, coding automation, and enterprise control planes. The next layer is infrastructure that helps agents learn better from experience. Ineffable may succeed, Google DeepMind may stay ahead, or OpenAI and Anthropic may turn their RL and evaluation loops into products faster. The question remains the same: if models are going to move beyond mirroring human data and instead experiment, fail, and improve, who builds the experience, who evaluates it, and what infrastructure keeps that loop running? NVIDIA is using this collaboration to stake a claim on the systems side of that answer.