At GTC 2026, NVIDIA confirmed that its Vera Rubin platform is now in full production. The platform brings together seven tightly integrated chips (the Vera CPU, Rubin GPU, NVLink 6, ConnectX-9 SuperNIC, BlueField-4 DPU, Spectrum-6 Ethernet and Groq 3 LPU) into a single, rack-scale system designed to handle both training and inference as one coherent fabric. This marks a strategic pivot away from loosely assembled GPU clusters towards vertically integrated AI “factories” in which compute, networking and storage are co-designed for sustained utilisation and token throughput. For boards and CIOs, the message is clear: competitive AI infrastructure is becoming a system-level decision, not a GPU procurement exercise.
NVIDIA’s NVL72 rack lies at the heart of this shift, unifying 72 Rubin GPUs and 36 Vera CPUs via NVLink 6 to maximise density and cut cost per token, with the company claiming up to 10x higher inference throughput per watt and materially fewer GPUs needed for mixture-of-experts training versus the prior generation. The architecture is explicitly optimised for agentic AI and long-context workloads, signalling that future-ready data centres must be architected around large, persistent context windows and continuous reasoning rather than batch-style jobs.
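To make that economics claim concrete, the toy calculation below shows how a 10x gain in tokens per second per watt would flow through to energy cost per million tokens. Every number in it (rack power envelope, electricity price, baseline efficiency) is an illustrative assumption for the arithmetic, not an NVIDIA-published figure.

```python
# Hypothetical, illustrative numbers only -- not NVIDIA-published figures.
# Shows how a claimed 10x gain in tokens/sec/watt flows through to the
# energy component of cost per million tokens.

RACK_POWER_KW = 120.0        # assumed rack power envelope
ENERGY_COST_PER_KWH = 0.10   # assumed electricity price, USD

def cost_per_million_tokens(tokens_per_sec_per_watt: float) -> float:
    """Energy cost (USD) to generate one million tokens at a given efficiency."""
    tokens_per_sec = tokens_per_sec_per_watt * RACK_POWER_KW * 1000
    seconds = 1_000_000 / tokens_per_sec
    kwh = RACK_POWER_KW * seconds / 3600
    return kwh * ENERGY_COST_PER_KWH

baseline = cost_per_million_tokens(2.0)     # assumed prior-generation efficiency
vera_rubin = cost_per_million_tokens(20.0)  # 10x the baseline, per the claim
print(f"baseline: ${baseline:.4f}/M tokens, new: ${vera_rubin:.4f}/M tokens")
```

The absolute dollar figures are arbitrary; the point is that a 10x efficiency gain divides the energy cost of every generated token by ten at any given power envelope.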
Vera CPU and Groq 3 LPU Formalise a Two-Stage Inference Model
A notable departure from traditional designs is the Vera CPU’s promotion from supporting role to orchestration engine for the reinforcement learning environments, tools and agent workflows that run alongside GPU workloads. With 88 custom cores and high-bandwidth LPDDR5X memory, NVIDIA positions the CPU as the substrate for running thousands of simulated environments and system agents, claiming roughly twice the efficiency and 50% higher performance than conventional rack-scale CPUs on these tasks. Strategically, this reflects an emerging view that the economic value in AI will hinge as much on orchestration, simulation and tool use as on raw model size.
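As a rough illustration of what CPU-side orchestration of thousands of environments looks like in software, the sketch below interleaves many toy simulated environments with asyncio. The ToyEnv class and its reward logic are invented for the example; they stand in for real RL environments and agent tooling, not any NVIDIA API.

```python
# A minimal sketch of CPU-side orchestration of many simulated RL
# environments. Everything here is illustrative: ToyEnv and its random
# reward are placeholders for real simulators and agents.
import asyncio
import random

class ToyEnv:
    """Stand-in for one simulated environment an agent interacts with."""
    def __init__(self, env_id: int):
        self.env_id = env_id
        self.steps = 0

    def step(self, action: int) -> float:
        self.steps += 1
        return random.random()  # placeholder reward

async def rollout(env: ToyEnv, horizon: int) -> float:
    total = 0.0
    for _ in range(horizon):
        total += env.step(action=0)
        await asyncio.sleep(0)  # yield so thousands of envs interleave
    return total

async def main(num_envs: int = 1000, horizon: int = 50) -> None:
    envs = [ToyEnv(i) for i in range(num_envs)]
    returns = await asyncio.gather(*(rollout(e, horizon) for e in envs))
    print(f"ran {num_envs} envs, mean return {sum(returns) / len(returns):.3f}")

asyncio.run(main())
```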
On the inference side, NVIDIA is delivering on its December 2025 deal with Groq by integrating the Groq 3 LPU into the Vera Rubin stack, creating an LPX rack with 256 LPUs tuned for ultra-low-latency, long-context decoding. The result is a formalised two-stage inference pipeline: GPU-heavy prefill on Rubin for context construction, followed by LPU-driven decode to accelerate token generation at lower power and higher throughput. NVIDIA claims the combined system can deliver up to 35x higher inference throughput per megawatt for trillion-parameter models with million-token context, redefining the economics of large-scale inference for foundation model providers.
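A minimal sketch of that two-stage split follows, using invented placeholders (KVCache, prefill_on_gpu, decode_on_lpu) rather than any real NVIDIA or Groq API: prefill builds the cache in one compute-heavy pass over the whole prompt, and decode then extends it one token at a time, which is the latency-sensitive phase the LPU targets.

```python
# A minimal sketch of the prefill/decode split described above. The names
# and the fake "model" are invented placeholders; real systems move an
# attention KV cache between accelerators, not a Python list.
from dataclasses import dataclass, field

@dataclass
class KVCache:
    """Stands in for the per-layer key/value tensors built during prefill."""
    tokens: list[int] = field(default_factory=list)

def prefill_on_gpu(prompt: list[int]) -> KVCache:
    # Stage 1: compute-heavy pass over the whole prompt (GPU-friendly).
    return KVCache(tokens=list(prompt))

def decode_on_lpu(cache: KVCache, max_new_tokens: int) -> list[int]:
    # Stage 2: sequential, one-token-at-a-time loop (latency-bound,
    # LPU-friendly). Each step reads and extends the same cache.
    out = []
    for _ in range(max_new_tokens):
        next_token = (sum(cache.tokens) + len(out)) % 50_000  # fake "model"
        out.append(next_token)
        cache.tokens.append(next_token)
    return out

cache = prefill_on_gpu(prompt=[101, 2023, 2003, 1037])
print(decode_on_lpu(cache, max_new_tokens=5))
```

The design point the split captures: prefill parallelises across the whole context and rewards raw FLOPs, while decode is a serial loop whose economics are set by per-token latency and power.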
Standardising AI Factories and Decoupling Bottlenecks
Beyond compute, NVIDIA is addressing previously peripheral bottlenecks that are now central at AI factory scale. BlueField-4 STX introduces a dedicated KV-cache storage layer to reduce memory pressure on GPUs, while Spectrum-6 is engineered for high-throughput east-west traffic across large clusters, enabling sustained utilisation rather than sporadic bursts. This points towards infrastructure that is tightly coupled in design yet functionally specialised, aligning with how hyperscalers already treat networking, storage and security as first-class optimisation domains.
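The spirit of a dedicated KV-cache tier can be sketched in a few lines: hot cache blocks stay in fast (here, simulated GPU) memory and cold ones spill to a slower store, roughly as a DPU-attached storage layer might behave. The capacities and the LRU eviction policy below are assumptions for illustration, not BlueField-4 specifics.

```python
# A toy tiered KV-cache: hot blocks in (simulated) GPU memory, cold blocks
# spilled to a slower store. Policy and sizes are illustrative assumptions.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_capacity_blocks: int):
        self.gpu = OrderedDict()  # block_id -> bytes, kept in LRU order
        self.storage = {}         # spill tier (stand-in for DPU-attached storage)
        self.capacity = gpu_capacity_blocks

    def put(self, block_id: int, data: bytes) -> None:
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.capacity:
            evicted_id, evicted = self.gpu.popitem(last=False)  # evict coldest
            self.storage[evicted_id] = evicted

    def get(self, block_id: int) -> bytes:
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)  # hit: refresh recency
            return self.gpu[block_id]
        data = self.storage.pop(block_id)   # miss: fetch from spill tier
        self.put(block_id, data)
        return data

cache = TieredKVCache(gpu_capacity_blocks=2)
for i in range(4):
    cache.put(i, f"kv-block-{i}".encode())
print(sorted(cache.gpu), sorted(cache.storage))  # [2, 3] hot, [0, 1] spilled
print(cache.get(0))                              # refetch promotes block 0
```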
To reduce deployment friction, NVIDIA has introduced the Vera Rubin DSX reference design, defining standard blueprints for compute, power, cooling and networking in AI factories, complemented by DSX Max-Q and DSX Flex for power optimisation and grid integration. In parallel, Omniverse-powered digital twins allow operators to simulate thermals, power draw and workload behaviour before build-out, effectively moving failure discovery and capacity planning into a virtual pre-production stage. For operators in power-constrained geographies, this upstream validation and grid-aware design are increasingly non-negotiable.
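In the same vein as that digital-twin validation, the toy check below simulates projected rack power draw against a site budget before build-out, flagging an over-budget plan in software rather than on the data centre floor. All ratings, counts and the utilisation model are hypothetical.

```python
# A toy pre-deployment power check in the spirit of digital-twin validation:
# catch an over-budget build plan before committing capital. All figures
# and the simple utilisation model are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class RackPlan:
    name: str
    rated_kw: float         # nameplate power per rack
    count: int
    avg_utilisation: float  # expected sustained utilisation, 0..1

def site_draw_kw(racks: list[RackPlan]) -> float:
    return sum(r.rated_kw * r.count * r.avg_utilisation for r in racks)

def validate(racks: list[RackPlan], grid_budget_kw: float, headroom: float = 0.15):
    draw = site_draw_kw(racks)
    limit = grid_budget_kw * (1 - headroom)  # keep margin for transients
    status = "OK" if draw <= limit else "OVER BUDGET"
    print(f"projected draw {draw:.0f} kW vs usable {limit:.0f} kW -> {status}")

plan = [
    RackPlan("compute (NVL-class)", rated_kw=120, count=20, avg_utilisation=0.85),
    RackPlan("inference (LPX-class)", rated_kw=80, count=10, avg_utilisation=0.70),
]
validate(plan, grid_budget_kw=3000)  # prints OVER BUDGET: plan needs rework
```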
Strategic Implications for Cloud, Model Developers and Enterprises
Cloud providers including AWS, Google Cloud, Microsoft Azure and Oracle, along with OEMs such as Dell and Supermicro, are expected to roll out Vera Rubin-based systems, while leading model labs such as OpenAI, Anthropic, Meta and Mistral AI are aligning with the platform for training and long-context inference. Anthropic’s Dario Amodei has already framed Vera Rubin as providing the compute, networking and system design needed to support more complex reasoning and agentic workflows, underscoring how foundational infrastructure choices are becoming intertwined with safety and reliability commitments at the application layer.
For senior leaders, three strategic themes stand out. First, AI infrastructure is consolidating into “factory” paradigms, where integrated systems and reference designs replace ad hoc clusters. Second, architectures optimised for agentic AI, long context and two-stage inference will shape the economics of both model training and commercial deployment. Third, alignment with emerging de facto platforms like Vera Rubin could influence ecosystem access—from model partnerships to cloud pricing—making early decisions on architecture and partners a material lever for AI competitiveness through the second half of this decade.
