As artificial intelligence shifts from experimental models to full-scale production, inference, the economic engine powering it, is entering a new phase. A new set of independent benchmarks published this week places NVIDIA’s Blackwell platform at the centre of this evolution, with performance metrics that redefine what AI factories can expect from their compute investments.
The InferenceMAX v1 benchmarks, developed by SemiAnalysis, are the first to evaluate AI systems based on total cost of compute across real-world scenarios. NVIDIA’s Blackwell architecture swept the board, delivering not only the highest throughput but also the best overall efficiency, lowest cost per token, and most compelling return on investment. These are not academic distinctions. They signal a structural shift in how enterprises evaluate and scale generative AI.
A single $5 million deployment of the NVIDIA GB200 NVL72 system, for example, is projected to generate $75 million in token-based revenue, a 15x return. While such figures are modelled on current demand and use cases, they highlight the financial logic now driving infrastructure choices as AI transitions from research to revenue.
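The arithmetic behind such a projection is easy to model. The sketch below reproduces a roughly 15x multiple from hypothetical inputs; the throughput, utilisation, and per-token pricing figures are illustrative assumptions, not numbers reported by InferenceMAX.

```python
# A minimal sketch of the token-economics projection. Only the $5M
# deployment cost comes from the article; every other input is an
# illustrative assumption.
CAPEX_USD = 5_000_000           # GB200 NVL72 deployment cost (from the article)
AGG_TOKENS_PER_SEC = 1_500_000  # assumed aggregate system throughput
UTILISATION = 0.8               # assumed fraction of time serving paid traffic
USD_PER_M_TOKENS = 2.00         # assumed revenue per million tokens

SECONDS_PER_YEAR = 365 * 24 * 3600
tokens_per_year = AGG_TOKENS_PER_SEC * UTILISATION * SECONDS_PER_YEAR
revenue_usd = tokens_per_year / 1e6 * USD_PER_M_TOKENS

print(f"annual revenue: ${revenue_usd:,.0f}")              # ~$75,700,000
print(f"return multiple: {revenue_usd / CAPEX_USD:.1f}x")  # ~15.1x
```

Under these assumptions the deployment returns its cost many times over within a year, and small changes in throughput or utilisation move revenue directly, which is precisely the sensitivity the benchmark’s ROI framing is designed to expose.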
The new economics of inference
Unlike traditional AI benchmarks focused on training speed or peak FLOPS, InferenceMAX prioritises sustained throughput, responsiveness, and operational cost. These metrics reflect the growing importance of inference (running trained models to produce outputs in real time) in powering enterprise AI applications.
Blackwell’s architecture is optimised for precisely this. It combines hardware innovations like NVFP4 low-precision formats and fifth-generation NVLink interconnects with advanced software stacks including NVIDIA TensorRT-LLM and vLLM. This tight integration allows for near-linear scaling across massive GPU clusters and unlocks real-time responsiveness even on large language models with tens of billions of parameters.
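As a concrete illustration of the software side of that stack, a dense model such as Llama 3.3 70B might be served through vLLM as below. The model name, sharding degree, and quantisation setting are illustrative choices rather than the benchmark’s actual configuration, and vLLM’s FP8 path stands in here for Blackwell’s NVFP4 format.

```python
# A minimal vLLM serving sketch (illustrative configuration, not the
# InferenceMAX setup). Tensor parallelism shards the model across GPUs
# over NVLink; low-precision weights reduce memory traffic and raise
# throughput.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # dense 70B model cited below
    tensor_parallel_size=8,                     # assumed: shard across 8 GPUs
    quantization="fp8",                         # assumed: FP8 standing in for NVFP4
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarise the economics of AI inference."], params)
print(outputs[0].outputs[0].text)
```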
In practical terms, the system can deliver over 60,000 tokens per second per GPU on gpt-oss while sustaining interactivity of 1,000 tokens per second per user. For dense models such as Llama 3.3 70B, Blackwell delivers four times the per-GPU throughput of the previous-generation H200.
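Taken together, those two figures imply a rough concurrency ceiling per GPU, found by dividing aggregate throughput by the per-user rate. The calculation below is a simplification that ignores batching overhead and prompt prefill.

```python
# Rough per-GPU concurrency estimate from the two cited figures.
# A simplification: ignores batching overhead, prefill cost, and scheduling.
gpu_tokens_per_sec = 60_000   # aggregate per-GPU throughput on gpt-oss
user_tokens_per_sec = 1_000   # per-user interactivity target

concurrent_streams = gpu_tokens_per_sec // user_tokens_per_sec
print(f"~{concurrent_streams} interactive streams per GPU")  # ~60
```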
These gains are not confined to one-off scenarios. NVIDIA attributes them to hardware-software co-design, and notes that performance has continued to improve after launch through software updates alone.
Cost, power and scale are the new benchmarks
As AI models become more sophisticated, the infrastructure behind them must balance speed with economic and energy constraints. One of the most significant findings from InferenceMAX is that Blackwell enables 10 times the throughput per megawatt compared with its predecessor. In environments where power is a limiting factor, particularly in hyperscale deployments, this kind of efficiency could be decisive.
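To make that concrete, the sketch below compares token output at a fixed power budget. The 10x ratio comes from the benchmark coverage; the baseline rate and site power are illustrative assumptions.

```python
# Illustrative power-efficiency comparison. Only the 10x multiplier is
# from the benchmark coverage; the other figures are assumptions.
BASELINE_TOKENS_PER_MW_SEC = 150_000  # assumed previous-generation rate
BLACKWELL_MULTIPLIER = 10             # reported throughput-per-megawatt gain
SITE_POWER_MW = 20                    # assumed facility power budget

for name, rate in [
    ("previous gen", BASELINE_TOKENS_PER_MW_SEC),
    ("Blackwell", BASELINE_TOKENS_PER_MW_SEC * BLACKWELL_MULTIPLIER),
]:
    print(f"{name:>12}: {rate * SITE_POWER_MW:,} tokens/sec at {SITE_POWER_MW} MW")
```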
The cost per million tokens, a crucial measure for AI economics, has dropped by a factor of 15 on the Blackwell platform. This allows organisations to expand AI services without incurring proportionate cost increases, and to align pricing models more closely with real-time usage.
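That metric falls out of two quantities: the all-in hourly cost of running a GPU and the tokens it produces in that hour. The calculation below uses assumed prices; only the 15x improvement ratio comes from the benchmark coverage.

```python
# Cost per million output tokens = hourly GPU cost / millions of tokens
# produced per hour. Both inputs are illustrative assumptions.
gpu_cost_per_hour = 3.00   # assumed all-in $/GPU-hour (capital + power)
tokens_per_sec = 60_000    # per-GPU throughput figure cited above

m_tokens_per_hour = tokens_per_sec * 3600 / 1e6        # 216 M tokens/hour
cost_per_m_tokens = gpu_cost_per_hour / m_tokens_per_hour
print(f"${cost_per_m_tokens:.4f} per million tokens")  # ~$0.0139
```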
Crucially, Blackwell’s results are not limited to proprietary models. NVIDIA has worked with the open-source community, optimising for models such as OpenAI’s gpt-oss, Meta’s Llama 3, and DeepSeek AI’s releases, so that high-performance inference is not confined to closed ecosystems. This positions Blackwell as a key player in the emerging AI infrastructure stack, where openness, performance and scale must coexist.
From platform to production
The wider implication of these benchmarks is that inference is no longer a backend detail; it is the front line of AI’s commercial viability. Enterprises building AI factories must now evaluate infrastructure not just by capacity, but by how well it converts power and investment into real-world outcomes.
Ian Buck, vice president of hyperscale and high-performance computing at NVIDIA, put it succinctly: “Inference is where AI delivers value every day.” As AI becomes embedded in products, services, and operations, the systems capable of sustaining that value will shape the next phase of competition.
Blackwell’s dominance in the InferenceMAX benchmarks suggests that full-stack optimisation, from chip design to open-source kernel integration, may now be the price of entry into the industrial AI era. For enterprises navigating the economics of tokens, latency and energy use, the message is clear: inference is no longer a constraint. It is the differentiator.