Furiosa and Broadcom are designing a 2nm token factory
FuriosaAI and Broadcom’s third-generation inference chip plan shows how agentic AI is shifting the bottleneck from raw GPU speed to token density, networking, and power.
- What happened: FuriosaAI announced a partnership with Broadcom to co-develop a third-generation AI inference accelerator.
- The announcement landed on May 27, 2026, with a roadmap around 2nm compute dies, HBM4/4E, multi-die chiplets, and an Ethernet fabric.
- Why it matters: Agentic inference is moving the competition from raw GPU throughput toward token density, rack-scale communication, and energy efficiency.
- Watch: Sampling is planned for the first half of 2028, so hyperscale performance is still an unproven claim.
- The strongest evidence today is RNGD mass production, validation references from Samsung SDS and LG AI Research, and the architecture claims in the official release.
FuriosaAI announced on May 27, 2026, that it will work with Broadcom on a third-generation AI inference accelerator. On the surface, this is a partnership between an AI semiconductor startup and one of the largest companies in semiconductor infrastructure. The more interesting part is the problem statement. Furiosa and Broadcom frame the collaboration around "agentic AI" and "high-volume token requirements." This is not just another story about attaching more GPUs. It is a story about how to design an inference factory that keeps producing tokens.
The center of gravity in AI products is moving away from chatbots that answer once and stop. The new center is the long-running agent. An agent plans, calls tools, searches, runs code, checks results, and invokes the model again. That loop multiplies inference calls. It stretches context windows. It makes MoE routing, post-training sampling, tool output parsing, and verification part of the production workload. As models get more capable, total compute does not automatically fall. Products hand more work to the model, and the total amount of generated and processed tokens rises.
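The way an agent loop multiplies token work can be sketched with a few lines of arithmetic. The function below is a minimal illustration, not a measurement: it assumes every step re-sends the full history to the model and ignores prompt caching. The function name, parameters, and all token counts are hypothetical.

```python
def agent_token_totals(steps, system_tokens, tool_output_tokens, generated_tokens):
    """Return (input_tokens, output_tokens) processed over one agent run.

    Assumption: each step re-reads the whole history, i.e. the system prompt
    plus every earlier generation and tool output. Caching is ignored.
    """
    context = system_tokens
    total_in = 0
    total_out = 0
    for _ in range(steps):
        total_in += context             # the model reads the full history again
        total_out += generated_tokens   # the model emits a plan or a tool call
        context += generated_tokens + tool_output_tokens  # history grows each step
    return total_in, total_out


# Hypothetical numbers: a one-shot chat answer vs. a ten-step agent run.
single_shot = agent_token_totals(steps=1, system_tokens=1000,
                                 tool_output_tokens=0, generated_tokens=200)
agent_run = agent_token_totals(steps=10, system_tokens=1000,
                               tool_output_tokens=500, generated_tokens=200)
print("single shot (in, out):", single_shot)
print("agent run   (in, out):", agent_run)
```

Under these made-up numbers, the ten-step run reads 41,500 input tokens against 1,000 for the single-shot answer, while generating only ten times as many output tokens. KV-cache reuse can avoid recomputing much of that history, but the cached context still has to sit in memory, which is one reason memory capacity and bandwidth enter the discussion.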
That is why the core of this announcement is not "a new chip is coming." The core is Furiosa's plan to combine its Tensor Contraction Processor (TCP) architecture with Broadcom's XPU Technology and IP Platform, Ethernet scale-up, fabric switches, and advanced packaging. Furiosa describes the result as a multi-die chiplet system. The third-generation platform is expected to use 2nm compute dies, HBM4/4E, high-speed inter-chip networking, and an all-to-all-capable topology.

For builders, that sentence means the axis of competition is changing. Many teams still start with questions such as which model is smarter, or which GPU produces more tokens per second. Those questions matter, but agentic workloads make them incomplete. The new questions are about how reliably many concurrent sessions produce tokens, how efficiently data moves across and beyond the rack, whether memory bandwidth can keep up with long contexts and MoE routing, and how quickly a compiler can map a new model family onto silicon. The Furiosa and Broadcom announcement is a hardware answer to those questions.
In the official announcement, Charlie Kawwas of Broadcom's Semiconductor Solutions Group says inference performance is no longer defined only by raw compute. He points to data reuse and communication efficiency across servers and racks as rising constraints. That may read like press release language, but it maps closely to real AI infrastructure. Large-model inference does not end inside one accelerator. Weights, activations, KV cache, expert routing, batching, speculative decoding, and request scheduling all touch memory and network paths.
Furiosa's near-term foundation is RNGD. The company describes RNGD as a data-center inference chip manufactured on TSMC 5nm and currently in mass production. The Business Wire release describes RNGD as a 180W PCIe accelerator aimed at LLM and agentic AI workloads. It also says Samsung SDS and LG AI Research have validated RNGD in production environments. That matters because the third-generation platform is still a roadmap. Furiosa is first trying to show that it already has a production chip and customer validation before asking the market to believe the next platform story.

It would be a mistake to read this announcement as an immediate GPU replacement story. AI accelerator announcements almost always use strong language. Claims that GPUs carry a legacy tax, or that a different architecture can deliver higher performance per watt, depend heavily on workload and comparison conditions. Furiosa's own blog argues that the GPU SIMT model struggles with irregular memory patterns and high-frequency communication. But total cost of ownership, model-specific latency, batching efficiency, compiler maturity, and operational reliability cannot be judged from an announcement alone.
Still, the collaboration is hard to dismiss. Broadcom is not only a manufacturing partner. It has a deep position in AI data-center networking, custom silicon, advanced packaging, and Ethernet fabric. In hyperscaler custom-chip discussions, Broadcom often appears alongside Google TPU, Meta MTIA, and several non-public custom accelerator programs. For Furiosa, pairing with Broadcom turns the narrative from "we have a good chip" into "we have a supply-chain and interconnect strategy that can extend into the rack and cluster."
| Layer | RNGD foundation | Third-generation direction |
|---|---|---|
| Process and packaging | TSMC 5nm, PCIe accelerator, 180W class | 2nm compute die, multi-die chiplet, advanced packaging |
| Memory | Memory subsystem for data-center LLM inference | HBM4/4E for frontier inference and MoE routing |
| Scaling model | Server-level inference efficiency and mass-production validation | Rack-scale token factory based on Ethernet, PCIe, and fabric switches |
| Software | Furiosa SDK, PyTorch mapping, compiler-centered approach | Portability that absorbs new frontier models and optimization methods quickly |
The terms Furiosa keeps repeating are "token density" and "performance per watt." Those phrases line up with how AI product teams feel cost. Whether the product is a general assistant such as ChatGPT, a coding agent such as Claude Code or Codex, or an enterprise workflow agent, the real cost is not captured by a single per-call model price. It depends on how many model calls an agent needs to finish one task, how much context it carries, how often it reinterprets tool results, and how many retries happen after failures. If tokens are the unit of production, the data center starts to look like a token factory.
GPUs remain central in that factory. But "just add GPUs" is becoming a narrower answer. As agents multiply, CPUs, memory, storage, networks, sandboxes, queues, and schedulers all get pulled into the hot path. Nvidia's push around Vera as a CPU for agentic AI, and cloud providers' emphasis on managed agent runtimes and sandboxes, point in the same direction. Furiosa and Broadcom are pushing the discussion toward inference accelerators and Ethernet fabric. They are not only selling the moment when a model emits a token. They are selling the cluster structure that keeps tokens flowing.
For developers, the more direct question is software portability. Furiosa criticizes legacy platforms for requiring too many hand-tuned kernels whenever models change, and says its SDK uses a general compiler to map high-level PyTorch code onto silicon. That claim is important, but it needs proof. AI model architectures move quickly. Attention optimizations, KV-cache layouts, quantization, speculative decoding, MoE routing, and multimodal preprocessing keep changing. The speed at which a compiler follows those changes can become the real barrier to accelerator adoption.
CUDA's strength is not simply that Nvidia GPUs are fast. It is the combination of kernels, framework integrations, profiling tools, operator coverage, community experience, and cloud availability. Furiosa's argument that copying the CUDA library ecosystem is a strategic dead end makes sense in that context. The company is not trying to play the same game by cloning every layer. It is trying to design the architecture and compiler differently so new models can land with less manual work. For that to matter in production, it has to survive the less glamorous work: integration with PyTorch, vLLM-like serving stacks, Kubernetes, observability, model registries, and benchmark harnesses.
The 2nm and HBM4/4E roadmap details are eye-catching, but the most important phrase may be "all-to-all-capable topology." As frontier inference shifts toward mixture-of-experts models, each token may be routed to different experts, and inter-chip communication patterns become more complex. Increasing the compute inside one accelerator is not enough. Expert routing, batching, and context reuse can create bottlenecks across the whole cluster. Broadcom's Ethernet scale-up and fabric switches enter the story at exactly that point.
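A toy simulation can show why expert routing pushes traffic onto the interconnect. This is a sketch under loudly simplified assumptions, not a model of any Furiosa or Broadcom design: experts and tokens are spread evenly across devices, and the router picks experts uniformly at random instead of using learned scores. The function name and all sizes are hypothetical.

```python
import random


def count_dispatches(num_tokens, num_experts, num_devices, top_k, seed=0):
    """Count expert dispatches that stay on-device vs. cross to another device.

    Assumptions (illustrative only): expert e lives on device e % num_devices,
    token t lives on device t % num_devices, and each token is sent to top_k
    distinct experts chosen uniformly at random.
    """
    rng = random.Random(seed)
    local = 0
    remote = 0
    for token in range(num_tokens):
        token_device = token % num_devices
        for expert in rng.sample(range(num_experts), top_k):
            if expert % num_devices == token_device:
                local += 1    # expert weights are on the same device
            else:
                remote += 1   # activation must travel over the interconnect
    return local, remote


# Hypothetical setup: 16 experts over 8 devices, each token routed to 2 experts.
local, remote = count_dispatches(num_tokens=1000, num_experts=16,
                                 num_devices=8, top_k=2)
print(f"local dispatches: {local}, remote dispatches: {remote}")
print(f"remote share: {remote / (local + remote):.1%}")
```

With uniform routing, any given expert sits on the token's own device only about one time in eight in this setup, so most dispatches leave the device, and any device may need to reach any other. That is the communication pattern an all-to-all-capable topology is meant to serve; real routers are not uniform, so actual traffic will differ.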
Data Center Knowledge read the partnership as a bet by Broadcom and Furiosa on Ethernet AI fabrics. Data Center Dynamics covered it through the lens of a third-generation AI inference platform. Deep debate has not yet formed across Hacker News or developer forums. Related Reddit posts are mostly stock-news links and press release shares. That silence does not make the infrastructure story unimportant. Infrastructure news often does not feel immediate, but it comes back later through cloud SKUs, price sheets, agent runtime latency, and the shape of managed AI platforms.
The roadmap also needs a sober reading. The Business Wire release says sampling is expected to begin in the first half of 2028. That means this is not a product developers can use today. In 2026, the practical options for most AI product teams are still Nvidia and AMD GPUs, Google TPUs, AWS Trainium and Inferentia, some specialty accelerators, and model APIs. Furiosa and Broadcom's third-generation chip is closer to a declaration about inference demand after 2028. Its significance is less about short-term purchasing and more about direction.
That direction is clear enough. First, inference has become a strategic layer on par with training. Second, agentic AI can create a steep increase in token demand. Third, power envelope and rack density are becoming as important as model quality. Fourth, the network fabric and compiler are becoming part of the accelerator. Fifth, custom silicon competition is becoming more multipolar, with a Korean-founded AI semiconductor company aiming directly at the hyperscale market beside Broadcom.
For readers outside Korea, FuriosaAI's national origin is not the main story. The more important shift is that AI infrastructure competition is moving from "model company versus model company" toward "token production chain versus token production chain." As OpenAI, Anthropic, Google, Meta, and others keep improving models, somebody has to build the factory that runs those models cheaply and quickly. That factory includes accelerators, HBM, switches, compilers, schedulers, cooling, power contracts, and cloud pricing.
Seen that way, the Furiosa and Broadcom combination is a detour around the dominant road. It does not try to replicate Nvidia's CUDA and NVLink ecosystem head-on. It points instead to TCP architecture, Broadcom fabric, a general compiler, and HBM4/4E chiplets as a way to attack a different set of bottlenecks. We do not yet know whether that detour becomes a highway. But if agents keep creating more inference calls, enterprises keep asking for private and sovereign inference, and data-center power keeps becoming one of the scarcest resources in AI, more detours like this will appear.
The questions development teams should track are changing as well. Model benchmarks are not enough. Teams will need to ask what tokens per watt looks like at the same quality level, how cache and memory behave during long-running agent sessions, whether interconnects become the bottleneck while serving MoE models, how fast compilers support new model families, and how cloud vendors price accelerator capacity. The user experience of agentic AI will be built by both smarter models and execution infrastructure that does not stall.
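The "tokens per watt" question can be made concrete with simple arithmetic. Since a watt is a joule per second, tokens per second divided by watts gives tokens per joule. The sketch below compares two entirely hypothetical accelerators inside a fixed rack power budget; none of the figures describe RNGD, the planned third-generation chip, or any real GPU, and the overhead fraction is an assumption.

```python
def tokens_per_joule(tokens_per_second, watts):
    """Energy efficiency: tokens produced per joule of accelerator energy."""
    return tokens_per_second / watts


def rack_throughput(rack_power_watts, accelerator_watts,
                    tokens_per_second_each, overhead_fraction):
    """Return (accelerator_count, total_tokens_per_second) for one rack.

    Assumption: overhead_fraction of rack power goes to CPUs, networking,
    and cooling, and the remainder is split among identical accelerators.
    """
    usable_watts = rack_power_watts * (1 - overhead_fraction)
    count = int(usable_watts // accelerator_watts)
    return count, count * tokens_per_second_each


# Hypothetical devices: A is slower per chip but draws far less power than B.
print("A tokens/joule:", tokens_per_joule(1000, 200))
print("B tokens/joule:", round(tokens_per_joule(2500, 700), 2))

# Hypothetical 40 kW rack with 25% of power reserved for everything else.
print("A rack:", rack_throughput(40_000, 200, 1000, 0.25))
print("B rack:", rack_throughput(40_000, 700, 2500, 0.25))
```

In this made-up comparison, device B is 2.5 times faster per chip, yet the rack filled with device A produces more tokens per second because power, not chip count, is the binding limit. The comparison only holds at equal output quality and latency targets, which is exactly the condition the paragraph above asks teams to check.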
FuriosaAI and Broadcom's announcement is not an answer yet. It is a set of questions with a 2028 sampling date attached. Real performance, supply, software maturity, and cloud availability still need to be proven. But the question itself matters. The bottleneck in the agent era is not only the TFLOPS of a single GPU. It is the full system that can generate many tokens, for a long time, reliably, inside a power budget. This partnership is an attempt to design that system as a token factory.
Sources: FuriosaAI official announcement, Business Wire release, Data Center Knowledge report, Data Center Dynamics report.