NVIDIA has unveiled an open-source software platform designed to improve the efficiency of artificial intelligence inference, the process that enables AI models to generate responses in real time. The system, called NVIDIA Dynamo, is aimed at reducing computational costs while scaling AI reasoning models across vast GPU infrastructures. As the complexity of AI applications increases, the demand for inference efficiency is becoming a major focus for businesses deploying large-scale AI models.
AI reasoning models, which work through problems step by step before producing an answer, require substantial computing resources and can generate thousands of tokens for a single query. At that volume, orchestrating inference efficiently across GPUs has become critical to sustaining commercial AI operations. NVIDIA Dynamo seeks to address these challenges with intelligent scheduling, routing, and memory management, dynamically optimising GPU usage and reducing redundant computation.
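To make the scale concrete, a rough back-of-envelope calculation shows how per-query token counts translate into the size of a GPU fleet. Every figure below is an illustrative assumption, not a measurement from NVIDIA or any real deployment.

```python
# Back-of-envelope sizing: how reasoning-model token volume becomes GPU demand.
# All constants are illustrative assumptions, not vendor numbers.

TOKENS_PER_QUERY = 4_000        # assumed: a reasoning model "thinks" in thousands of tokens
QUERIES_PER_SECOND = 500        # assumed aggregate traffic for a deployed service
GPU_TOKENS_PER_SECOND = 10_000  # assumed decode throughput of a single GPU

required_tokens_per_second = TOKENS_PER_QUERY * QUERIES_PER_SECOND
gpus_needed = required_tokens_per_second / GPU_TOKENS_PER_SECOND

print(f"Cluster must sustain {required_tokens_per_second:,} tokens/s")
print(f"That is roughly {gpus_needed:.0f} GPUs at the assumed per-GPU throughput")
```

Even small scheduling or routing inefficiencies multiply across a fleet of this size, which is why inference orchestration has become an economic problem as much as an engineering one.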
Jensen Huang, founder and chief executive of NVIDIA, described the initiative as a step towards making AI more scalable and cost-efficient. “Industries around the world are training AI models to think and learn in different ways, making them more sophisticated over time,” he said. “To enable a future of custom reasoning AI, NVIDIA Dynamo helps serve these models at scale, driving cost savings and efficiencies across AI factories.”
Optimising AI infrastructure
Dynamo succeeds NVIDIA’s Triton Inference Server and introduces several techniques designed to enhance AI processing efficiency. One of its key capabilities is disaggregated serving, which separates the computational phases of AI reasoning models across GPUs: the prefill phase, which processes the user’s prompt, and the decode phase, which generates the response, can each be batched, scaled, and optimised independently, improving performance and resource allocation.
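The pattern can be sketched in a few lines of Python. The class and method names below are hypothetical illustrations of disaggregated serving in general, not Dynamo’s actual API: the essential idea is that prefill and decode become separately sized worker pools connected by a handle to the attention (KV) cache.

```python
# Minimal sketch of disaggregated serving: the prompt-processing (prefill) and
# token-generation (decode) phases run on separate worker pools, so each can be
# batched and scaled independently. Names are hypothetical, not Dynamo's API.
from dataclasses import dataclass

@dataclass
class PrefillResult:
    kv_cache_id: str   # handle to the attention state produced by prefill
    first_token: str

class PrefillWorker:
    """Compute-bound phase: ingest the entire prompt in one pass."""
    def run(self, prompt: str) -> PrefillResult:
        cache_id = f"kv-{hash(prompt) & 0xFFFF:04x}"
        return PrefillResult(kv_cache_id=cache_id, first_token="<tok0>")

class DecodeWorker:
    """Memory-bandwidth-bound phase: generate tokens one at a time."""
    def run(self, state: PrefillResult, max_tokens: int) -> list[str]:
        # A real system transfers or shares the KV cache between pools;
        # here only its handle is carried, to show the dependency.
        return [state.first_token] + [f"<tok{i}>" for i in range(1, max_tokens)]

# Pools are sized independently: e.g. more decode workers for long generations.
prefill_pool = [PrefillWorker() for _ in range(2)]
decode_pool = [DecodeWorker() for _ in range(6)]

state = prefill_pool[0].run("Explain disaggregated serving in one paragraph.")
print(decode_pool[0].run(state, max_tokens=5))
```

Because the two phases stress hardware differently, with prefill bound by compute and decode by memory bandwidth, splitting them lets an operator size each pool to its own bottleneck.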
Another innovation is the ability to reassign workloads dynamically based on real-time demand. This allows AI factories (data centres dedicated to AI model training and deployment) to adapt their infrastructure by reallocating GPUs as traffic shifts, reducing unnecessary energy consumption and costs. The system can also route each query to the GPU nodes best placed to serve it, typically those already holding relevant cached computation from earlier requests, minimising recomputation and freeing up resources for new tasks.
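A minimal sketch of this kind of cache-aware routing appears below. The data structures are assumptions made for illustration rather than Dynamo internals, but they show why sending a query to the node with the largest matching cached prefix avoids repeating work.

```python
# Sketch of cache-aware routing: send a request to the worker that already
# holds the largest matching prefix of the prompt, so the least work is
# recomputed. Illustrative only; these are not Dynamo's internal structures.

def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the common token prefix between two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt_tokens: list[str], workers: dict[str, list[str]]) -> str:
    """Pick the worker whose cached prefix overlaps the prompt the most."""
    return max(workers, key=lambda w: shared_prefix_len(prompt_tokens, workers[w]))

# Each worker advertises the token prefix it still holds in its KV cache.
workers = {
    "gpu-0": ["system:", "you", "are", "a", "helpful", "assistant"],
    "gpu-1": ["system:", "you", "are", "a", "coding", "tutor"],
    "gpu-2": [],  # cold cache
}

prompt = ["system:", "you", "are", "a", "helpful", "assistant", "summarise", "this"]
print(route(prompt, workers))  # -> gpu-0
```

Here gpu-0 wins because six of the prompt’s eight tokens are already in its cache, so only the remaining two need fresh computation.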
The efficiency gains NVIDIA claims are significant. When serving DeepSeek-R1, a widely used open-source reasoning model, Dynamo increased the number of tokens generated per GPU by up to 30 times, the company says. For models running on NVIDIA’s current Hopper platform, it has doubled the performance of inference workloads. The ability to scale reasoning AI efficiently is particularly crucial as businesses seek to deploy generative AI applications at a global level.
Several leading AI companies and cloud service providers, including AWS, Google Cloud, Meta, Microsoft Azure, and Together AI, are set to integrate NVIDIA Dynamo into their AI inference frameworks. Cohere, a company developing AI models for enterprise applications, plans to use the platform to enhance the performance of its Command model series.
“Scaling advanced AI models requires sophisticated multi-GPU scheduling, seamless coordination and low-latency communication,” Saurabh Baji, senior vice president of engineering at Cohere, said. “We expect NVIDIA Dynamo will help us deliver a premier user experience to our enterprise customers.”
The development of inference-optimised software reflects growing pressure on AI companies to improve efficiency as the cost of running AI models continues to rise. The industry also faces mounting concerns about energy consumption and sustainability, with AI models demanding increasingly power-intensive computational resources. By introducing a platform that enables better resource utilisation, NVIDIA aims to balance the need for performance with the economic realities of AI deployment.
With AI applications expanding into industries such as healthcare, finance, and manufacturing, the ability to serve AI models efficiently is expected to become a competitive differentiator. NVIDIA Dynamo’s open-source approach may also encourage further innovation, allowing researchers and developers to refine inference-serving strategies. As AI adoption accelerates, software-driven optimisation of compute infrastructure could play a crucial role in determining the future of large-scale AI operations.