Amazon primed to fuse Nvidia’s NVLink into 4th-gen Trainium accelerators
Summary
AWS has teased Trainium4 accelerators at re:Invent, saying the next-generation chips will adopt Nvidia’s NVLink Fusion interconnect to enable high-speed chip-to-chip communication across accelerators, Graviton CPUs and EFA networking in MGX racks. Amazon claims major performance uplifts for Trainium4 (3x FP8 FLOPS, 6x FP4, 4x memory bandwidth), though it hasn’t clarified whether those gains apply per chip or per UltraServer rack.
The announcement follows Nvidia’s move to open NVLink Fusion to other vendors earlier this year. AWS also launched Trainium3 EC2 offerings: a 144‑GB HBM3E chip delivering ~2.5 petaFLOPS dense FP8 (and up to 10 petaFLOPS with 16:4 structured sparsity), assembled into UltraServers of 144 chips with ~706 TB/s of aggregate memory bandwidth. AWS says its new fabrics and EFA networking could support clusters of up to a million accelerators.
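To make that chip-vs-rack ambiguity concrete, here’s a speculative back-of-envelope that takes Trainium3’s per-chip figures as the baseline. AWS hasn’t confirmed the baseline or how many chips a Trainium4 rack will hold, so treat these numbers as illustrations only:

```python
# Why "3x FP8, 4x memory bandwidth" is ambiguous. Baselines are the
# Trainium3 per-chip figures quoted above; whether Trainium4's multipliers
# apply per chip or per rack is unconfirmed, so this is speculation.

t3_fp8_pflops = 2.5   # Trainium3 dense FP8 per chip, petaFLOPS
t3_hbm_tb_s = 4.9     # Trainium3 HBM3E bandwidth per chip, TB/s

# Reading the claims as per-chip multipliers:
print(f"Per-chip reading: {3 * t3_fp8_pflops} PF dense FP8, {4 * t3_hbm_tb_s:.1f} TB/s")
# -> 7.5 PF and 19.6 TB/s per Trainium4 chip

# Read per rack instead, the per-chip uplift depends entirely on how many
# chips a Trainium4 UltraServer holds -- which AWS hasn't said.
```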
Key Points
- AWS will integrate Nvidia’s NVLink Fusion into Trainium4 accelerators to enable NVLink-based chip-to-chip communications.
- Nvidia’s NVLink fabrics currently offer up to 1.8 TB/s per GPU (900 GB/s each direction) and are expected to double to 3.6 TB/s next year.
- AWS claims Trainium4 will deliver 3x FLOPS at FP8, 6x at FP4, and 4x memory bandwidth — specifics (chip vs rack) remain unclear.
- Trainium3 is now on EC2: each chip has 144 GB HBM3E (4.9 TB/s) and ~2.5 petaFLOPS of dense FP8; with 16:4 sparsity it can hit ~10 petaFLOPS for supported workloads.
- Trainium3 UltraServers pack 144 chips in an all-to-all fabric via NeuronSwitch-v1, giving ~706 TB/s of aggregate memory bandwidth and ~360 petaFLOPS of dense FP8, rising to ~1.44 exaFLOPS with 16:4 sparsity (the sketch after this list walks through the arithmetic).
- AWS positions its interconnect and EFA networking to scale to roughly one million accelerators for production deployments.
- AWS will also offer Nvidia GB300 NVL72-based instances for customers staying with Nvidia hardware.
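The rack-level figures in the list above are straightforward multiples of the per-chip specs. A minimal sketch of the arithmetic, assuming ideal linear scaling across all 144 chips (real fabrics rarely achieve this):

```python
# Back-of-envelope check of the Trainium3 UltraServer figures.
# Per-chip numbers come from the article; aggregation assumes ideal
# linear scaling, which real deployments won't quite reach.

CHIPS_PER_ULTRASERVER = 144

hbm_per_chip_tb_s = 4.9    # HBM3E bandwidth per chip, TB/s
dense_fp8_pflops = 2.5     # dense FP8 per chip, petaFLOPS
sparsity_multiplier = 4    # 16:4 structured sparsity: ~4x effective throughput

rack_bw = CHIPS_PER_ULTRASERVER * hbm_per_chip_tb_s
rack_dense = CHIPS_PER_ULTRASERVER * dense_fp8_pflops
rack_sparse = rack_dense * sparsity_multiplier

print(f"Aggregate HBM bandwidth: {rack_bw:.0f} TB/s")                 # ~706 TB/s
print(f"Dense FP8:               {rack_dense:.0f} petaFLOPS")         # ~360 petaFLOPS
print(f"16:4 sparse FP8:         {rack_sparse / 1000:.2f} exaFLOPS")  # ~1.44 exaFLOPS
```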
Context and relevance
Punchy take: this is a notable shift. NVLink Fusion being used outside Nvidia-branded accelerators is material — it signals greater interop between dominant accelerator and cloud silicon ecosystems. For cloud architects, chip designers and HPC teams, NVLink on Trainium4 could change performance and deployment math for large-scale training and inference clusters, especially for bandwidth-bound workloads and inference at low-precision datatypes like FP4.
The move also highlights ongoing industry trends: tighter hardware-software integration, emphasis on interconnects as a differentiator, and increasing use of structured sparsity to boost effective throughput. If AWS really pairs NVLink Fusion with its Graviton CPUs and EFA networking, customers could see more heterogeneous racks that blend AWS-designed silicon with Nvidia’s interconnect advantages, potentially avoiding single-vendor lock-in for top-tier interconnect performance.
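To see why memory bandwidth and low-precision datatypes dominate that math, here’s a rough roofline-style sketch of single-chip decode throughput. The 70B-parameter model is a hypothetical example, and the assumption that every weight is streamed once per token (ignoring KV-cache traffic) is a deliberate simplification, not anything AWS has published:

```python
# Roofline-style upper bound for bandwidth-bound inference on one chip.
# The model size is hypothetical; the estimate assumes decode streams
# every weight once per token and ignores KV-cache and activation traffic.

PARAMS = 70e9      # hypothetical 70B-parameter model
HBM_BW = 4.9e12    # Trainium3 per-chip HBM bandwidth, bytes/s

for dtype, bytes_per_param in [("FP8", 1.0), ("FP4", 0.5)]:
    tokens_per_s = HBM_BW / (PARAMS * bytes_per_param)
    print(f"{dtype}: ~{tokens_per_s:.0f} tokens/s ceiling per stream")

# FP8 -> ~70 tokens/s, FP4 -> ~140 tokens/s: halving the weight footprint
# roughly doubles the bandwidth-limited ceiling, which is why FP4 support
# and fatter interconnects both matter for inference economics.
```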
Why should I read this?
Short answer: because it matters if you care about the future of large-scale AI infrastructure. NVLink on Trainium4 could shake up cluster designs, performance claims and procurement choices — or at least make AWS’s chips far more compatible with existing NVLink-centric software stacks. We’ve skimmed the hype and pulled the bits that affect budgets, scaling and architecture choices, so you don’t have to read the full keynote transcript unless you want the fluff.
Source
Source: https://go.theregister.com/feed/www.theregister.com/2025/12/02/amazon_nvidia_trainium/
