Decoding Nvidia’s Groq-powered LPX and the rest of its new rack systems

Summary

Nvidia used a costly acquisition of Groq IP and engineers to field Groq-derived LP30 LPUs quickly, producing LPX racks focused on ultra-low-latency token generation. An LPX rack holds 256 LPUs (32 trays × 8 LPUs) built on a data‑flow, SRAM-only Groq 3 design fabricated by Samsung. Each LP30 offers very high FP8 throughput and enormous SRAM bandwidth (up to ~150 TB/s) but only modest on‑chip capacity (≈500 MB per LPU). LPX is designed to be paired with Vera‑Rubin NVL72 GPUs: the GPUs handle prefill/attention and large memory needs, while the LPUs handle bandwidth‑heavy decode work. Nvidia positions LPX for hyperscalers and model builders serving trillion‑parameter models at very high token rates; most enterprises are priced out of full LPX deployments.
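
As a quick sanity check, the rack-level aggregates fall straight out of the per-chip figures above. The sketch below is my own back-of-the-envelope arithmetic using only the numbers quoted in the piece; the per-rack totals are not vendor-published specifications.

```python
# Back-of-the-envelope LPX rack aggregates, from the per-LPU figures quoted above.
# Per-rack totals here are simple multiplication, not official Nvidia numbers.

LPUS_PER_TRAY = 8
TRAYS_PER_RACK = 32
SRAM_PER_LPU_GB = 0.5           # ~500 MB of on-chip SRAM per LP30
BW_PER_LPU_TBS = 150            # up to ~150 TB/s SRAM bandwidth per LP30
FP8_PER_LPU_PFLOPS = 1.2        # ~1.2 petaFLOPS FP8 per LP30

lpus_per_rack = LPUS_PER_TRAY * TRAYS_PER_RACK               # 256 LPUs
rack_sram_gb = lpus_per_rack * SRAM_PER_LPU_GB               # ~128 GB SRAM per rack
rack_bw_pbs = lpus_per_rack * BW_PER_LPU_TBS / 1000          # ~38.4 PB/s aggregate
rack_fp8_eflops = lpus_per_rack * FP8_PER_LPU_PFLOPS / 1000  # ~0.3 EFLOPS FP8

print(f"{lpus_per_rack} LPUs, ~{rack_sram_gb:.0f} GB SRAM, "
      f"~{rack_bw_pbs:.1f} PB/s bandwidth, ~{rack_fp8_eflops:.2f} EFLOPS FP8 per rack")
```

The striking figure is the ~128 GB of total SRAM per rack, less capacity than a single modern HBM-equipped GPU carries, which is why prefill and large-memory work stays on the Rubin GPUs while LPX handles decode.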

Nvidia also showcased complementary racks: Vera CPU racks for agentic workloads, BlueField‑4 STX storage racks for KV‑cache offload, and Spectrum‑6 SPX network racks. Together these form a disaggregated inference stack (Dynamo orchestration) that routes prefill, decode and cache offload across specialised hardware.
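
To make the division of labour concrete, here is a minimal routing sketch. It is not Dynamo's actual API (the article does not describe one); the pool names and the route_request() helper are hypothetical, and it only illustrates which rack type each inference phase is steered towards.

```python
# Illustrative only: how a disaggregated stack might steer each inference phase.
# Pool names and route_request() are hypothetical, not Dynamo's real interfaces.

from dataclasses import dataclass

@dataclass
class Pools:
    gpu_rack: str      # Vera-Rubin NVL72: prefill/attention, large memory
    lpu_rack: str      # LPX: bandwidth-heavy decode / token generation
    storage_rack: str  # BlueField-4 STX: KV-cache offload

def route_request(phase: str, pools: Pools) -> str:
    """Pick a target rack for one phase of an inference request."""
    if phase == "prefill":      # long-context ingest, attention-heavy
        return pools.gpu_rack
    if phase == "decode":       # token-by-token generation, SRAM-bandwidth bound
        return pools.lpu_rack
    if phase == "kv_offload":   # park cold KV cache off the accelerators
        return pools.storage_rack
    raise ValueError(f"unknown phase: {phase}")

pools = Pools("vera-rubin-nvl72", "lpx-rack", "stx-storage")
for phase in ("prefill", "decode", "kv_offload"):
    print(phase, "->", route_request(phase, pools))
```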

Key Points

  • Nvidia paid roughly $20bn for Groq IP and engineers to accelerate time‑to‑market rather than re‑design from scratch.
  • LPX racks pack 256 LP30 LPUs (32 trays × 8 LPUs) with on‑chip SRAM and a data‑flow architecture optimised for the decode/token generation phase.
  • LP30 characteristics: ~1.2 petaFLOPS FP8, ~500 MB SRAM per chip, ~150 TB/s bandwidth, 96+ interconnects per chip (112 Gbps SerDes) and Samsung fabrication.
  • LP30 lacks NVLink, NVFP4 support and CUDA compatibility at launch — Nvidia prioritised shipping speed over full feature parity.
  • LPX is intended to be used alongside Vera‑Rubin GPUs: GPUs for prefill/attention and large memory; LPUs for bandwidth‑intensive feed‑forward ops during decode.
  • Nvidia expects test‑time scaling and speculative decode workflows to benefit most from LPX; Huang hinted at high token pricing (e.g. up to ~$150 per million tokens for premium low‑latency inference).
  • For trillion‑parameter models you need many LPUs: roughly 4–8 LPX racks (1,024–2,048 LPUs) depending on weight precision and model partitioning; see the sizing sketch after this list.
  • LPX is aimed at hyperscalers, cloud service providers and model builders; most enterprises will find full LPX deployments out of reach financially.
  • CPX (Rubin CPX) has been deprioritised for now — it targeted faster time‑to‑first‑token for large contexts but is not replaced by LPX; the concepts address different pipeline phases.
  • Nvidia’s new rack family (Vera CPU, STX storage, SPX network) creates a disaggregated assembly line for agentic AI: agents (Vera CPU) call models (Rubin + LPX), KV caches offload to STX, interconnected by SPX.
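
As promised above, a rough sizing sketch shows where the 4–8 rack range comes from: divide the model's weight footprint by the ~500 MB of SRAM on each LP30. The headroom-free division is my own simplification; real deployments also need SRAM for activations, KV state and partitioning overhead, so the true count would be higher.

```python
# Rough sizing: LPX racks needed to hold a trillion-parameter model's weights in SRAM.
# Simplified: ignores activations, KV cache and partitioning overhead (my assumption).

import math

SRAM_PER_LPU_GB = 0.5   # ~500 MB per LP30
LPUS_PER_RACK = 256     # 32 trays x 8 LPUs

def racks_needed(params_billions: float, bytes_per_weight: float) -> tuple[int, int]:
    weight_gb = params_billions * bytes_per_weight      # 1e9 params * bytes/param = GB
    lpus = math.ceil(weight_gb / SRAM_PER_LPU_GB)
    return lpus, math.ceil(lpus / LPUS_PER_RACK)

for label, nbytes in (("4-bit weights", 0.5), ("8-bit weights (FP8)", 1.0)):
    lpus, racks = racks_needed(1000, nbytes)            # 1 trillion parameters
    print(f"{label}: ~{lpus} LPUs -> {racks} LPX racks")
```

At 4-bit weights a trillion parameters needs roughly 1,000 LPUs (4 racks); at 8 bits, roughly 2,000 LPUs (8 racks), matching the 1,024–2,048 LPU range above.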

Why should I read this?

Because this explains, in plain terms, why Nvidia spent big on Groq tech and how that money actually shows up in hardware. If you care about who will run trillion‑parameter models fast and cheaply (or not), this piece tells you where Nvidia is placing its bets: shave latency with LPUs, glue systems with a new backplane and sell a whole rack ecosystem. Short version — it’s about speed, cash and keeping cloud customers hooked. We did the heavy reading so you don’t have to.

Context and relevance

This is important for anyone tracking AI infrastructure, cloud providers, and model deployment economics. Nvidia’s Groq purchase and LPX arrival signal a shift: specialised SRAM/data‑flow silicon will be used alongside GPUs to optimise different inference phases. That matters because the balance between throughput and per‑user interactivity is driving new architectures, pricing models, and vendor partnerships (e.g. AWS, Cerebras). Expect more disaggregated stacks, novel orchestration (Dynamo), and a clearer hyperscaler vs enterprise divide for cutting‑edge inference hardware.

Source

Source: https://go.theregister.com/feed/www.theregister.com/2026/03/19/nvidia_lpx_deep_dive/