CoreWeave sets new pace for AI performance with groundbreaking benchmark

The infrastructure powering the world’s most demanding artificial intelligence workloads is changing fast. This week, CoreWeave, one of the leading providers of specialised cloud infrastructure, announced new AI inference benchmark results that could reshape expectations for high-performance generative AI deployment.

Using the latest NVIDIA GB200 Grace Blackwell Superchips, CoreWeave achieved 800 tokens per second on Meta’s open-source Llama 3.1 405B model, one of the largest and most complex large language models (LLMs) publicly available. The result, published as part of the MLPerf Inference v5.0 benchmark suite, marks a significant leap in inference speed and highlights the accelerating competition among infrastructure providers racing to support next-generation AI.

Inference, the process of running trained machine learning models to generate results, remains a critical bottleneck in scaling generative AI to production. While training a model grabs headlines, it is the ability to serve millions of users with consistent speed and reliability that determines its commercial viability. In that context, CoreWeave’s benchmark performance may prove a bellwether for the capabilities enterprises will soon demand as standard.
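For readers unfamiliar with the metric behind these results, the sketch below shows how tokens-per-second throughput is typically measured for a single inference call. It is a toy illustration rather than CoreWeave’s benchmark harness; `measure_throughput` and `fake_generate` are placeholder names standing in for a real model-serving API.

```python
import time

def measure_throughput(generate, prompt, max_new_tokens=128):
    """Time a single generation call and return tokens produced per second."""
    start = time.perf_counter()
    output_tokens = generate(prompt, max_new_tokens)  # expected to return a list of tokens
    elapsed = time.perf_counter() - start
    return len(output_tokens) / elapsed

def fake_generate(prompt, max_new_tokens):
    """Toy stand-in for a real model endpoint so the sketch runs end to end."""
    time.sleep(0.01 * max_new_tokens)  # pretend each token takes roughly 10 ms
    return ["tok"] * max_new_tokens

if __name__ == "__main__":
    tps = measure_throughput(fake_generate, "Hello, world", max_new_tokens=64)
    print(f"Throughput: ~{tps:.0f} tokens/second")
```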

“CoreWeave is committed to delivering cutting-edge infrastructure optimised for large-model inference,” said Peter Salanki, Chief Technology Officer at CoreWeave. “These MLPerf benchmark results reinforce our position as a preferred cloud provider for leading AI labs and enterprises.”

CoreWeave also released results from NVIDIA H200 GPU instances, showing a 40 percent improvement in throughput over H100 chips when running the Llama 2 70B model. In that test, the company achieved 33,000 tokens per second, a figure that underscores how incremental GPU advancements are having compounding effects on performance when integrated into optimised infrastructure.
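As a rough back-of-the-envelope check, the two reported figures can be combined to see what the uplift implies for the older hardware; this is purely illustrative arithmetic, not a published H100 result.

```python
# Back-of-the-envelope check of the reported figures (illustrative only).
h200_tps = 33_000                    # reported H200 throughput on Llama 2 70B, tokens/second
implied_h100_tps = h200_tps / 1.4    # what a 40 percent uplift implies for the H100 baseline
print(f"Implied H100 baseline: ~{implied_h100_tps:,.0f} tokens/second")
```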

The rapid evolution of hardware capabilities, particularly in the data centre, is now tightly coupled with the needs of AI models that are growing in both size and complexity. CoreWeave’s early adoption and general availability of NVIDIA GB200 NVL72-based instances suggest that the company is not merely keeping pace with technological developments but actively positioning itself as a platform for the next phase of AI deployment.

While the MLPerf benchmarks are synthetic, they are widely regarded as the gold standard for comparing AI systems under consistent conditions. The benchmarks not only test raw speed but also reflect real-world use cases such as speech recognition, image classification, object detection, and LLM inference.

What these latest results make clear is that the performance ceiling for AI inference is being pushed upward at speed. For developers and enterprises seeking to deliver responsive, reliable AI services, from chatbots to copilots and enterprise search tools, the infrastructure layer is increasingly a strategic concern.

It is also a differentiator. As cloud providers diversify their offerings, many are now marketing purpose-built AI platforms, with low-latency interconnects, custom scheduling software, and tightly coupled CPU-GPU architectures. The results published by CoreWeave highlight not only the promise of NVIDIA’s GB200 architecture but also the critical role of system-level optimisation in realising that promise.

The AI infrastructure race is no longer only about who can offer the most GPUs. It is about who can orchestrate them most effectively.

Founded in 2017, CoreWeave has evolved from a niche compute provider into one of the most closely watched players in the AI infrastructure space. With a growing footprint of data centres across the US and Europe, and recent recognition in the TIME100 and Forbes Cloud 100 rankings, the company appears poised to play a central role in shaping the next generation of compute for artificial intelligence.

The benchmark results offer more than a technical milestone. They provide a glimpse into how AI workloads may soon be powered – faster, more efficiently, and with a level of specialisation that separates traditional cloud computing from the demands of AI at scale.
