A fundamental breakthrough in language model inference could reshape how developers deploy generative AI, making large models faster, cheaper and more flexible without needing proprietary pairings or retraining. Researchers at Intel Labs and Israel’s Weizmann Institute of Science have unveiled a new class of algorithms that allow any small draft model to accelerate the output of any large language model (LLM), regardless of how or where each model was trained.
Presented this week at the International Conference on Machine Learning (ICML) in Vancouver, the research significantly expands the application of a technique known as speculative decoding. In simple terms, speculative decoding allows a small, fast model to generate a draft response that a larger, slower model verifies, dramatically reducing the computational burden of large-scale inference.
But until now, that collaboration between models was constrained: it worked effectively only between models that shared a similar vocabulary or were trained as part of the same family. Intel and Weizmann’s research changes that.
Performance without lock-in
At its core, speculative decoding is an optimisation strategy for reducing the latency and cost of generating responses from LLMs. In the conventional approach, the main model generates every word in sequence, each step requiring substantial compute. Speculative decoding shifts that burden: a lighter-weight model quickly drafts a run of likely words, which the heavier model then verifies, rather than generating each one itself.
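The draft-then-verify loop can be sketched with toy stand-in "models". Everything here is invented for illustration — real implementations work on tokens, verify a whole draft in one batched forward pass, and sample from the large model's probabilities — but the control flow, and the guarantee that the output matches what the large model alone would produce, is the same:

```python
import random

# Toy stand-ins for real models: each maps a prefix to the next "token"
# (a single character here). The large model is authoritative; the small
# model is a fast guesser that agrees with it most of the time.

def large_model(prefix):
    # Deterministic toy rule standing in for an expensive forward pass.
    return str(len(prefix) % 10)

def small_model(prefix):
    # Cheap draft model: usually right, sometimes wrong.
    guess = large_model(prefix)
    return guess if random.random() < 0.8 else "x"

def speculative_decode(prefix, n_tokens, draft_len=4):
    """Generate n_tokens after prefix: the small model proposes
    draft_len tokens at a time; the large model keeps the longest
    correct run and supplies one corrected token on a mismatch."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        # 1. Draft phase: small model guesses a short continuation.
        ctx, draft = out[:], []
        for _ in range(draft_len):
            t = small_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify phase: large model checks each drafted token in turn.
        for t in draft:
            if len(out) - len(prefix) >= n_tokens:
                break
            if large_model(out) == t:
                out.append(t)                  # accepted: a "free" token
            else:
                out.append(large_model(out))   # rejected: large model corrects
                break                          # discard the rest of the draft
    return "".join(out[len(prefix):])
```

Because every appended token is exactly what the large model would have produced at that position, the output is identical to generating with the large model alone — the draft model only changes how fast you get there.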
Previous implementations were limited by vocabulary overlap or architectural alignment. Intel and Weizmann’s contribution removes that barrier entirely, making speculative decoding vendor-agnostic and model-agnostic. That unlocks cross-ecosystem compatibility for the first time, allowing developers to pair best-in-class small models with state-of-the-art large models from different providers.
The speed gains are substantial. According to results presented at ICML, the approach delivers up to 2.8 times faster inference compared to baseline models, with no compromise in output quality. The algorithms are already integrated into Hugging Face’s Transformers library, making them accessible to millions of developers without custom coding.
Making generative AI affordable and open
As enterprises struggle to manage the cost and complexity of LLM deployments, particularly at the inference stage, this shift could have significant implications. The ability to decouple models and still optimise performance opens the door to a more modular, efficient approach to AI infrastructure, especially at the edge and in hybrid environments where compute is limited.
“This isn’t just a theoretical improvement,” said Oren Pereg, senior researcher in Intel’s Natural Language Processing Group. “Our research shows how to turn speculative acceleration into a universal tool. These are practical tools that are already helping developers build faster and smarter applications today.”
That universality is key. In a cloud market dominated by hyperscalers and tightly integrated software stacks, Intel and Weizmann’s approach puts powerful optimisation capabilities in the hands of independent developers and enterprises who might otherwise lack the resources to fine-tune or pair models internally.
For example, a firm might use a smaller open-source draft model hosted locally to accelerate inference from a larger proprietary model running in the cloud, reducing costs and maintaining control over data.
From paper to deployment
Behind the breakthrough are three new algorithms that decouple speculative decoding from shared vocabularies entirely. This means that even models using different tokenisation systems or trained on unrelated datasets can be paired for collaborative inference, sidestepping one of the most persistent bottlenecks in generative AI deployment.
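The vocabulary mismatch is easiest to see at the tokeniser level: the same text can map to entirely different token sequences in two models, so the draft model's token IDs mean nothing to the verifier. A minimal sketch, using invented toy tokenisers rather than the paper's actual algorithms, of how two such models can still cooperate by going through the shared text space:

```python
# Two toy tokenisers with incompatible vocabularies (both invented here
# for illustration -- real tokenisers are far more complex).

def draft_tokenize(text):
    # Character-level "vocabulary" for the small draft model.
    return list(text)

def draft_detokenize(tokens):
    return "".join(tokens)

def target_tokenize(text):
    # Character-pair "vocabulary" for the large target model.
    return [text[i:i + 2] for i in range(0, len(text), 2)]

def bridge_draft_to_target(draft_tokens):
    """Bridge mismatched vocabularies through the shared text space:
    detokenise the draft model's proposal back to plain text, then
    re-tokenise that text with the target model's own tokeniser so
    the target model can verify it natively."""
    text = draft_detokenize(draft_tokens)
    return target_tokenize(text)

# The draft model proposes four character-level tokens...
proposal = draft_tokenize("spec")        # ['s', 'p', 'e', 'c']
# ...which the target model sees as two tokens from its own vocabulary.
bridged = bridge_draft_to_target(proposal)  # ['sp', 'ec']
```

Text is the one representation every model shares, which is why routing the handoff through it removes the requirement that draft and target come from the same family.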
According to Nadav Timor, Ph.D. student at the Weizmann Institute, the result is a democratisation of access to LLM performance. “Our algorithms unlock state-of-the-art speedups that were previously available only to organisations that train their own small draft models,” he said.
This could have far-reaching consequences for how the next generation of AI applications is built and run. With speculative decoding no longer limited to vertically integrated systems, developers are free to optimise performance based on workload requirements rather than vendor ecosystems.
In the race to scale generative AI responsibly, affordably and efficiently, it may not be new models that matter most, but better ways to use the ones we already have.