Nvidia slaps $20B Groq tech into massive new LPX racks to speed AI response time

Summary

Nvidia is integrating Groq’s language processing units (LPUs), hardware it acquired for about $20 billion, into new LPX rack systems to accelerate low-latency LLM token generation. The company will pair its Vera Rubin GPU racks with LPX racks containing up to 256 Groq 3 LPUs each, connected via a custom Spectrum-X interconnect. GPUs will handle the compute-heavy prefill stage, while LPUs, with far higher SRAM bandwidth, will handle the decode stage, pushing token rates into the hundreds or thousands per second per user.
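To make the prefill/decode split concrete, here is a minimal Python sketch of the hand-off. All class and method names are hypothetical stand-ins, since Nvidia has not published an API for the LPX systems; the point is only the division of labour, with a compute-bound prefill pass on the GPU producing the KV cache and a bandwidth-bound decode loop on the LPU streaming tokens from it.

```python
# Illustrative sketch of disaggregated prefill/decode serving. All names
# here are hypothetical stand-ins; Nvidia has not published an LPX API.

class RubinGPU:
    """Stand-in for a Vera Rubin GPU handling the prefill stage."""
    def prefill(self, prompt_tokens):
        # Process the whole prompt in parallel (compute-bound, FLOP-heavy)
        # and hand back the KV cache the decode stage will read from.
        return {"kv_cache": list(prompt_tokens)}

class GroqLPU:
    """Stand-in for a Groq 3 LPU handling the decode stage."""
    def decode_step(self, state):
        # Generate one token. In practice this step is dominated by
        # reading weights and KV cache, which is why SRAM bandwidth
        # (rather than raw FLOPS) sets the token rate.
        token = len(state["kv_cache"])        # dummy next-token choice
        state["kv_cache"].append(token)
        return token

def serve(prompt_tokens, max_new_tokens=8):
    gpu, lpu = RubinGPU(), GroqLPU()
    state = gpu.prefill(prompt_tokens)        # stage 1: GPU prefill
    return [lpu.decode_step(state) for _ in range(max_new_tokens)]  # stage 2: LPU decode

print(serve([101, 102, 103]))                 # -> [3, 4, 5, 6, 7, 8, 9, 10]
```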

Key Points

  • Nvidia will use Groq LPUs as decode accelerators alongside Vera Rubin GPUs rather than replacing GPUs.
  • Groq 3 LPUs offer about 150 TB/s of memory bandwidth versus the Rubin GPU’s ~22 TB/s, making them well suited to bandwidth-bound, low-latency token generation.
  • Each LPU delivers ~1.2 petaFLOPS (FP8) but carries only 500 MB of on-board memory, so trillion-parameter models require thousands of chips (see the back-of-envelope sketch after this list).
  • Nvidia plans 256 LPUs per LPX rack; multiple LPX racks can be ganged to support very large models, but memory remains a constraint.
  • Nvidia expects to charge premium prices for high-rate inference, with estimates of up to ~$45 per million tokens for premium tiers.
  • Initial shipments and software access will target model builders and service providers; LPUs do not yet natively support CUDA and attach as accelerators to CUDA-based systems.
  • Competitors are pursuing similar pairings: AWS is collaborating with Cerebras to combine Trainium 3 with wafer-scale SRAM chips for low-latency inference.
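The back-of-envelope sketch referenced above, using only figures from the article. The assumptions are mine, not the article’s: FP8 weights at 1 byte per parameter, and no allowance for KV cache or activations, which would only push the counts higher.

```python
# Rough sizing from the figures above. Assumptions (mine, not the
# article's): FP8 weights at 1 byte/parameter, no KV-cache overhead.

PARAMS = 1e12                    # trillion-parameter model
LPU_SRAM = 500e6                 # 500 MB of on-board memory per Groq 3 LPU
LPUS_PER_RACK = 256              # LPUs per LPX rack
LPU_BW, GPU_BW = 150e12, 22e12   # ~150 TB/s per LPU vs ~22 TB/s per Rubin GPU

weights_bytes = PARAMS * 1                    # ~1 TB of FP8 weights
lpus_needed = weights_bytes / LPU_SRAM        # ~2,000 LPUs for weights alone
racks_needed = lpus_needed / LPUS_PER_RACK    # ~8 LPX racks

print(f"{lpus_needed:,.0f} LPUs (~{racks_needed:.0f} LPX racks) to hold weights")
print(f"per-device bandwidth advantage for decode: {LPU_BW / GPU_BW:.1f}x")
```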

Context and relevance

This move marks a shift in Nvidia’s inference strategy: instead of shipping its own dedicated prefill CPX processors, it has chosen to integrate Groq’s SRAM-heavy LPU tech for decode acceleration. The architecture splits work between Rubin GPUs (prefill) and Groq LPUs (decode), addressing growing demand for ultra-low-latency token generation from service providers running trillion-parameter models. It also positions Nvidia directly in competition with combined-platform approaches from AWS+Cerebras and other boutique accelerator vendors.

For datacentre and service operators, the announcement matters because it changes the performance and cost calculus for serving large LLMs at scale: extreme token rates demand specialised hardware in the form of large numbers of small-memory, ultra-fast LPUs, while software, interconnects and pricing models will determine commercial viability.
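As a quick illustration of that calculus: the ~$45-per-million-token figure is the estimate mentioned above, while the 1,000 tokens/sec per-user rate is taken from the headline speeds rather than any quoted service tier.

```python
# What premium per-token pricing could mean per user, using the
# article's estimated ceiling. The sustained per-user rate is an
# assumption for illustration, not a quoted service tier.

PRICE_PER_M_TOKENS = 45.0        # USD per million tokens (article estimate)
TOKENS_PER_SEC = 1_000           # assumed sustained per-user decode rate

tokens_per_hour = TOKENS_PER_SEC * 3600               # 3.6M tokens/hour
cost_per_hour = tokens_per_hour / 1e6 * PRICE_PER_M_TOKENS
print(f"${cost_per_hour:.2f}/hour per user at full rate")  # -> $162.00/hour
```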

Why should I read this?

Short version: Nvidia just bolted a speedy $20bn decode engine onto its rocket. If you care about serving huge LLMs fast (or about the economics of inference), this shapes who wins the low-latency race. It’s a pragmatic play that tells you where the industry is putting its bets: SRAM-heavy accelerators for token spewing, GPUs for the heavy lifting. Read it if you want a heads-up on how inference stacks and costs are likely to change.

Source

Source: https://go.theregister.com/feed/www.theregister.com/2026/03/16/nvidia_lpx_groq_3/