Artificial intelligence has long been defined by advances in model capability, but a new phase is emerging where the economics of running those models is becoming just as significant as the intelligence they deliver. A new announcement centred on inference providers and NVIDIA’s latest hardware platform suggests that the next wave of AI competition may be decided less by model architecture and more by how cheaply and efficiently tokens can be generated at scale.
According to the announcement, leading inference providers including Baseten, DeepInfra, Fireworks AI and Together AI are reporting substantial reductions in inference costs by deploying open source models on the NVIDIA Blackwell platform. In some cases, cost per token is said to fall by as much as tenfold compared with previous deployments built on the Hopper platform.
The development highlights a broader shift in the AI landscape. As organisations move from experimentation to production, the volume of tokens generated by applications, from medical assistants and games to customer service agents, becomes a critical financial factor. The implication is clear: if AI systems are to scale across industries, token economics must improve dramatically.
From model performance to token economics
The announcement frames this change as a question of efficiency rather than capability. Recent research referenced in the release suggests that infrastructure and algorithmic improvements are driving annual reductions in inference costs, meaning that businesses can deploy more AI interactions without proportional increases in spending.
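As a rough illustration of what that arithmetic looks like in practice, the short Python sketch below works through a hypothetical deployment. The traffic volumes and per-million-token prices are illustrative assumptions, not figures from the announcement.

# Back-of-the-envelope sketch of how per-token pricing drives total spend.
# All traffic and price figures here are illustrative assumptions.

def monthly_spend(tokens_per_request, requests_per_day, usd_per_million_tokens):
    tokens_per_month = tokens_per_request * requests_per_day * 30
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

# Hypothetical workload: 2,000-token responses, 100,000 requests per day.
baseline = monthly_spend(2_000, 100_000, usd_per_million_tokens=5.00)
reduced  = monthly_spend(2_000, 100_000, usd_per_million_tokens=0.50)  # ~10x cheaper per token

print(f"baseline: ${baseline:,.0f}/month, reduced: ${reduced:,.0f}/month")
# With the per-token price cut tenfold, the same budget covers roughly ten
# times the interaction volume, or leaves headroom for heavier reasoning work.

On those assumed numbers, the monthly bill falls from about $30,000 to about $3,000, which is the kind of shift that turns a marginal use case into a viable one.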
In healthcare, the collaboration between Baseten and Sully.ai illustrates how these gains translate into operational outcomes. By moving from proprietary closed models to open source models running on Blackwell hardware, Sully.ai reported a 90 per cent reduction in inference costs and improved response times in clinical workflows. The company says this has returned millions of minutes to physicians by automating routine tasks such as medical coding and note-taking.
Gaming offers a different perspective on the same trend. Latitude, the developer behind AI-driven interactive experiences, uses open source models on DeepInfra’s platform to manage the constant inference demands created by player interactions. The shift to Blackwell hardware reportedly reduced the cost per million tokens significantly while maintaining response speed, a critical factor for real-time gameplay.
Meanwhile, in agentic AI systems, Sentient Foundation deployed inference through Fireworks AI to support multi-agent workflows. According to the announcement, this resulted in cost efficiencies of between 25 and 50 per cent compared with earlier infrastructure, enabling the system to support large spikes in demand during launch periods.
Infrastructure becomes the competitive edge
The common thread across these examples is not simply hardware performance but the interaction between open source models, optimised inference stacks and infrastructure designed specifically for AI workloads. In customer service, Decagon worked with Together AI to deliver voice AI responses in under 400 milliseconds while reducing overall query costs. Techniques such as speculative decoding and automated scaling were cited as contributors to those gains.
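Speculative decoding, one of the techniques cited, pairs a small, fast draft model with the large target model: the draft proposes several tokens at once, the target verifies them together, and only the agreeing prefix is kept, so the expensive model does less work per generated token. The toy Python sketch below shows the control flow of a simplified greedy variant; the "models" are stand-in functions rather than real networks.

# Toy sketch of greedy speculative decoding: a cheap "draft" model proposes
# several tokens, an expensive "target" model verifies them, and only the
# agreeing prefix is kept. Model behaviour is faked with simple functions
# purely to demonstrate the control flow.

def draft_next(tokens):          # stand-in for a small, fast draft model
    return (tokens[-1] + 1) % 50

def target_next(tokens):         # stand-in for the large, accurate model
    # Occasionally disagrees with the draft to exercise the rejection path.
    nxt = (tokens[-1] + 1) % 50
    return nxt if nxt % 7 else (nxt + 3) % 50

def speculative_step(tokens, k=4):
    """Propose k draft tokens, verify with the target, return accepted tokens."""
    proposal = list(tokens)
    drafted = []
    for _ in range(k):
        t = draft_next(proposal)
        drafted.append(t)
        proposal.append(t)

    accepted = []
    context = list(tokens)
    for t in drafted:
        verified = target_next(context)   # one target "call" per position here;
        if verified == t:                 # in practice all k positions are scored
            accepted.append(t)            # in a single batched forward pass
            context.append(t)
        else:
            accepted.append(verified)     # take the target's own token and stop
            context.append(verified)
            break
    return accepted

if __name__ == "__main__":
    seq = [0]
    for _ in range(8):
        seq += speculative_step(seq, k=4)
    print(seq)

When the draft model agrees with the target most of the time, several tokens are emitted for each expensive verification pass, which is how the technique lowers cost per token without changing the model's output quality.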
Taken together, the examples suggest that AI’s next efficiency frontier lies in system-level optimisation. The Blackwell platform is described as achieving higher throughput per dollar and lower token costs by combining hardware, networking and software design into a unified stack. This approach reflects a growing belief that inference, rather than training, will drive long-term economics as AI moves into continuous, real-world use.
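The link between throughput per dollar and token cost is simple to sketch: divide the hourly price of the hardware by the number of tokens it can serve in an hour. The figures below are illustrative assumptions rather than benchmark results from the announcement.

# Rough sketch relating throughput per dollar to cost per million tokens.
# GPU-hour prices and tokens-per-second figures are illustrative assumptions.

def cost_per_million_tokens(tokens_per_second, usd_per_gpu_hour):
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_gpu_hour / tokens_per_hour * 1_000_000

# Hypothetical comparison: higher throughput at a higher hourly price can
# still make every served token cheaper.
prev = cost_per_million_tokens(tokens_per_second=1_000, usd_per_gpu_hour=2.50)
new  = cost_per_million_tokens(tokens_per_second=5_000, usd_per_gpu_hour=4.00)
print(f"previous: ${prev:.2f}/M tokens, new: ${new:.2f}/M tokens")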
A new phase of AI adoption
The announcement also signals a broader industry trend toward open source models reaching what is described as frontier-level intelligence. If businesses can achieve comparable performance while significantly lowering operating costs, the balance between proprietary and open systems may shift.
For enterprises, the implications extend beyond technical benchmarks. AI systems operating in healthcare, gaming or customer support rely on predictable cost structures to remain viable at scale. Reducing the price of each token effectively widens the scope of what AI can be deployed to do, from more frequent interactions to more complex reasoning workloads.
As AI adoption spreads, the discussion is moving away from model novelty toward infrastructure efficiency and deployment economics. The message emerging from this latest set of deployments is that the future of AI may be shaped less by who builds the smartest model, and more by who can afford to run it continuously in the real world.