Nvidia slaps $20B Groq tech into massive new LPX racks to speed AI response time

Summary

Nvidia is integrating Groq’s language processing units (LPUs), hardware it acquired for about $20 billion, into new LPX rack systems to accelerate low-latency LLM token generation. The company will pair its Vera Rubin GPU racks with LPX racks containing up to 256 Groq 3 LPUs each, connected via a custom Spectrum-X interconnect. GPUs will handle the compute-heavy prefill stage, while LPUs, with far higher SRAM bandwidth, will handle the decode stage, pushing token rates into the hundreds or thousands per second per user.
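To make the prefill/decode split concrete, here is a minimal Python sketch of the hand-off. All class and method names are hypothetical stand-ins, since Nvidia has not published an API for the LPX systems; the point is only the division of labour, with a compute-bound prefill pass on the GPU producing the KV cache and a bandwidth-bound decode loop on the LPU streaming tokens from it.

```python
# Illustrative sketch of disaggregated prefill/decode serving. All names
# here are hypothetical stand-ins; Nvidia has not published an LPX API.

class RubinGPU:
    """Stand-in for a Vera Rubin GPU handling the prefill stage."""
    def prefill(self, prompt_tokens):
        # Process the whole prompt in parallel (compute-bound, FLOP-heavy)
        # and hand back the KV cache the decode stage will read from.
        return {"kv_cache": list(prompt_tokens)}

class GroqLPU:
    """Stand-in for a Groq 3 LPU handling the decode stage."""
    def decode_step(self, state):
        # Generate one token. In practice this step is dominated by
        # reading weights and KV cache, which is why SRAM bandwidth
        # (rather than raw FLOPS) sets the token rate.
        token = len(state["kv_cache"])        # dummy next-token choice
        state["kv_cache"].append(token)
        return token

def serve(prompt_tokens, max_new_tokens=8):
    gpu, lpu = RubinGPU(), GroqLPU()
    state = gpu.prefill(prompt_tokens)        # stage 1: GPU prefill
    return [lpu.decode_step(state) for _ in range(max_new_tokens)]  # stage 2: LPU decode

print(serve([101, 102, 103]))                 # -> [3, 4, 5, 6, 7, 8, 9, 10]
```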

Key Points

  • Nvidia will use Groq LPUs as decode accelerators alongside Vera Rubin GPUs rather than replacing GPUs.
  • Groq 3 LPUs offer about 150 TB/s of memory bandwidth versus the Rubin GPU’s ~22 TB/s, making them well suited to bandwidth-bound, low-latency token generation.
  • Each LPU delivers ~1.2 petaFLOPS (FP8) but carries only 500 MB of on-board memory, so trillion-parameter models require thousands of chips (see the back-of-envelope sketch after this list).
  • Nvidia plans 256 LPUs per LPX rack; multiple LPX racks can be ganged to support very large models, but memory remains a constraint.
  • Nvidia expects to charge premium prices for high-rate inference, with estimates of up to ~$45 per million tokens for premium tiers.
  • Initial shipments and software access will target model builders and service providers; LPUs do not yet natively support CUDA and attach as accelerators to CUDA-based systems.
  • Competitors are pursuing similar pairings: AWS is collaborating with Cerebras to combine Trainium 3 with wafer-scale SRAM chips for low-latency inference.
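The back-of-envelope sketch referenced above, using only figures from the article. The assumptions are mine, not the article’s: FP8 weights at 1 byte per parameter, and no allowance for KV cache or activations, which would only push the counts higher.

```python
# Rough sizing from the figures above. Assumptions (mine, not the
# article's): FP8 weights at 1 byte/parameter, no KV-cache overhead.

PARAMS = 1e12                    # trillion-parameter model
LPU_SRAM = 500e6                 # 500 MB of on-board memory per Groq 3 LPU
LPUS_PER_RACK = 256              # LPUs per LPX rack
LPU_BW, GPU_BW = 150e12, 22e12   # ~150 TB/s per LPU vs ~22 TB/s per Rubin GPU

weights_bytes = PARAMS * 1                    # ~1 TB of FP8 weights
lpus_needed = weights_bytes / LPU_SRAM        # ~2,000 LPUs for weights alone
racks_needed = lpus_needed / LPUS_PER_RACK    # ~8 LPX racks

print(f"{lpus_needed:,.0f} LPUs (~{racks_needed:.0f} LPX racks) to hold weights")
print(f"per-device bandwidth advantage for decode: {LPU_BW / GPU_BW:.1f}x")
```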

Context and relevance

This move marks a shift in Nvidia’s inference strategy: instead of shipping its own dedicated prefill CPX processors, it has chosen to integrate Groq’s SRAM-heavy LPU tech for decode acceleration. The architecture splits work between Rubin GPUs (prefill) and Groq LPUs (decode), addressing growing demand for ultra-low-latency token generation from service providers running trillion-parameter models. It also positions Nvidia directly in competition with combined-platform approaches from AWS+Cerebras and other boutique accelerator vendors.

For datacentre and service operators, the announcement matters because it changes the performance and cost calculus for serving large LLMs at scale: extreme token rates demand specialised hardware in the form of large numbers of small-memory, ultra-fast LPUs, while software, interconnects and pricing models will determine commercial viability.
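As a quick illustration of that calculus: the ~$45-per-million-token figure is the estimate mentioned above, while the 1,000 tokens/sec per-user rate is taken from the headline speeds rather than any quoted service tier.

```python
# What premium per-token pricing could mean per user, using the
# article's estimated ceiling. The sustained per-user rate is an
# assumption for illustration, not a quoted service tier.

PRICE_PER_M_TOKENS = 45.0        # USD per million tokens (article estimate)
TOKENS_PER_SEC = 1_000           # assumed sustained per-user decode rate

tokens_per_hour = TOKENS_PER_SEC * 3600               # 3.6M tokens/hour
cost_per_hour = tokens_per_hour / 1e6 * PRICE_PER_M_TOKENS
print(f"${cost_per_hour:.2f}/hour per user at full rate")  # -> $162.00/hour
```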

Why should I read this?

Short version: Nvidia just bolted a speedy $20bn decode engine onto its rocket. If you care about serving huge LLMs fast (or about the economics of inference), this shapes who wins the low-latency race. It’s a pragmatic play that tells you where the industry is putting its bets: SRAM-heavy accelerators for token spewing, GPUs for the heavy lifting. Read it if you want a heads-up on how inference stacks and costs are likely to change.

Source

Source: https://go.theregister.com/feed/www.theregister.com/2026/03/16/nvidia_lpx_groq_3/