Artificial intelligence is evolving at an extraordinary pace, reshaping industries and raising vital questions about sustainability, infrastructure, and cost. Mark Venables examines where AI lives, breathes, and evolves, exploring how technology leaders balance innovation with environmental and economic realities.
Artificial intelligence (AI) has shifted from a domain of scientific research to an integral part of modern business and industry. The evolution of AI infrastructure, which enables this transformation, has mirrored the development of information technology over the past few decades, with an increasing reliance on cloud services and advanced hardware. Yet, as Germain Masse, an expert in digital services eco-design at OVHcloud, explains, this rapid evolution brings significant challenges in sustainability, cost management, and technological adaptability.
“Since the emergence of ChatGPT two years ago, the AI landscape has fundamentally changed,” he says. “What was once confined to a small group of experts has become far more accessible.” Previously, AI development required deep technical knowledge of model training and deployment. Now, two distinct approaches coexist: one where companies train their own models, which requires significant infrastructure, and another where they build on pre-trained models. “It mirrors the evolution of software, where some focus on creating applications while others simply use them to meet specific needs,” he adds.
This shift has created an AI ecosystem akin to the client-server model that shaped early IT systems. Smaller AI models can run on local devices, improving privacy and reducing latency, while resource-intensive tasks remain in the cloud. “The ability to run models locally is a positive development for privacy, and as devices grow more powerful, this trend will continue,” Masse notes. “However, the cloud is indispensable for handling large datasets and complex AI computations.” He further explains that local processing can also help in regions with poor internet connectivity, enabling organisations to leverage AI capabilities even when cloud access is limited.
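To make that division concrete, the sketch below shows how little is needed to run a small pre-trained model entirely on a local machine. It is a minimal illustration, assuming Python with the open-source Hugging Face transformers library; the model named is simply one of many small, publicly available checkpoints, not one Masse refers to.

```python
# Minimal sketch: on-device inference with a small pre-trained model.
# Assumes `pip install transformers torch`; the weights are downloaded
# once, after which inference runs entirely on local hardware.
from transformers import pipeline

# DistilBERT is small enough (a few hundred MB) to run on a laptop CPU.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# No data leaves the machine: useful for privacy and poor connectivity.
print(classifier("The new water-cooled servers cut our energy bill."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```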
The hardware backbone of AI
AI training and inference require hardware beyond traditional CPUs. “GPUs are particularly suited for AI because they can process thousands of simple operations in parallel, unlike CPUs, which are optimised for complex tasks,” Masse explains. Emerging accelerators such as TPUs and NPUs show promise but remain far less widely adopted than GPUs. Training large models, such as GPT-4 or Meta’s LLaMA, necessitates thousands of GPUs working in tandem, drawing parallels to high-performance computing clusters. “Building this infrastructure involves significant costs and specialised data centres with low-latency networking,” Masse says. “It is not something existing data centres can easily repurpose.” Consequently, only a few organisations or governments can afford to train these massive models, leading to centralisation.
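The parallelism Masse describes is easy to demonstrate. The following is a minimal sketch, assuming PyTorch and a CUDA-capable GPU; exact timings vary by hardware, but the same matrix multiplication typically completes far faster on the GPU.

```python
# Minimal sketch of why GPUs suit AI: thousands of simple operations in
# parallel. Assumes PyTorch (`pip install torch`) and a CUDA-capable GPU.
import time
import torch

x = torch.randn(4096, 4096)

# CPU: a handful of cores, each optimised for complex sequential work.
t0 = time.perf_counter()
_ = x @ x
print(f"CPU matmul: {time.perf_counter() - t0:.3f}s")

if torch.cuda.is_available():
    xg = x.to("cuda")
    xg @ xg  # warm-up run so one-off initialisation cost is excluded
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    _ = xg @ xg
    torch.cuda.synchronize()  # wait for the asynchronous kernel to finish
    print(f"GPU matmul: {time.perf_counter() - t0:.3f}s")
```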
Expanding on the infrastructure demands, Masse highlights the importance of energy-efficient cooling systems. “Data centres running AI workloads consume vast amounts of energy not just for computation but also for cooling,” he says. “Innovative solutions like immersion cooling and advanced airflow systems can drastically reduce this footprint.” OVHcloud, for example, has invested in water-cooling technology that reduces the overall energy consumption of its facilities.
Balancing local and cloud-based AI
The choice between running AI workloads on-premises or in the cloud is a complex one. Running AI on-premises only makes sense if the hardware is fully utilised; otherwise, the costs and environmental impact are prohibitive. Cloud services offer flexibility and scalability unmatched by physical infrastructure, allowing businesses to scale GPU usage up and down as needed.
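That utilisation argument can be made concrete with simple break-even arithmetic. The sketch below is illustrative only: every figure in it is an assumption for the sake of the calculation, not a quoted rate from OVHcloud or any other provider.

```python
# Back-of-the-envelope break-even: buy a GPU server or rent in the cloud?
# All figures below are illustrative assumptions, not quoted prices.
ON_PREM_CAPEX = 30_000.0       # assumed purchase price of one GPU node
LIFETIME_YEARS = 3
ON_PREM_OPEX_PER_HOUR = 0.50   # assumed power, cooling and admin cost
CLOUD_PRICE_PER_HOUR = 3.00    # assumed on-demand cloud GPU rate

hours = LIFETIME_YEARS * 365 * 24
on_prem_per_hour = ON_PREM_CAPEX / hours + ON_PREM_OPEX_PER_HOUR

# Break-even utilisation: the fraction of hours a cloud GPU would need
# to be rented for the two options to cost the same overall.
break_even = on_prem_per_hour / CLOUD_PRICE_PER_HOUR
print(f"On-prem effective cost: {on_prem_per_hour:.2f}/h")
print(f"Break-even utilisation: {break_even:.0%}")
# Below this utilisation, idle on-prem hardware is the expensive option.
```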
Certain sectors, like healthcare and defence, require on-premises solutions for data privacy, especially when dealing with sensitive information such as genetic data. Video surveillance is another area where local inference is essential due to the volume of real-time data. “Despite these exceptions, the cloud remains the most practical choice for most use cases,” Masse adds.
AI as a Service (AIaaS) has gained traction, enabling organisations to use AI capabilities without managing the underlying hardware. “It follows the same trajectory as cloud computing adoption,” Masse explains. “Companies increasingly prefer to consume AI services rather than invest in rapidly evolving hardware. With GPU technology advancing quickly, substantial investments in a single hardware generation are risky. Our approach at OVHcloud is incremental investment. Older GPUs, like the V100 series from 2017, are still effective for certain inference tasks, so we continue to use them where appropriate.” This strategy not only maximises hardware lifespan but also maintains cost efficiency.
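In practice, consuming AI as a service often looks like the sketch below: a few lines of client code against a hosted endpoint, with no GPUs to own or manage. The endpoint URL and model name are placeholders, assuming a generic OpenAI-compatible API rather than any documented OVHcloud interface.

```python
# Minimal sketch of the AI-as-a-Service consumption model: the caller
# never touches GPUs. Assumes an OpenAI-compatible endpoint; the URL and
# model name below are placeholders, not a documented provider API.
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://example-provider.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="some-hosted-model",  # whichever model the provider exposes
    messages=[{"role": "user", "content": "Summarise our Q3 incident log."}],
)
print(response.choices[0].message.content)
# Billing is per token; the provider decides which GPU generation serves it.
```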
Masse also emphasises the importance of software optimisation. “Efficient code can significantly reduce the computational load,” he says. “Companies often overlook software improvements in favour of hardware upgrades, but optimising algorithms can lead to substantial energy and cost savings.”
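A small illustration of that point: the same computation written naively and then vectorised. The example assumes Python with NumPy; the exact speed-up depends on the machine, but the gap is typically large enough to matter for both cost and energy.

```python
# Minimal sketch of "efficient code over bigger hardware": the same
# sum of squares, first as a naive Python loop, then vectorised.
import time
import numpy as np

data = np.random.rand(10_000_000)

# Naive: one interpreted Python-level operation per element.
t0 = time.perf_counter()
total = 0.0
for v in data:
    total += v * v
print(f"loop:       {time.perf_counter() - t0:.2f}s")

# Vectorised: a single optimised native routine, typically orders of
# magnitude faster for the same result, with far fewer CPU cycles
# (and joules) consumed.
t0 = time.perf_counter()
total = float(np.dot(data, data))
print(f"vectorised: {time.perf_counter() - t0:.2f}s")
```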
Sustainability at the forefront
AI’s environmental impact, particularly during the training phase, is a growing concern. Training large models consumes vast amounts of energy, and there is a real question about whether the industry is heading in the right direction. OVHcloud addresses this by sourcing energy from low-carbon sources, primarily nuclear and renewables, in France. “We are committed to sustainability, but globally, the energy demands of AI remain a challenge,” Masse adds. “Reducing AI’s footprint requires careful consideration of its application. We need to ask if AI is the right tool for every task. Traditional algorithms can often achieve similar results with far less energy.”
Developing smaller, specialised models can also improve efficiency. OVHcloud focuses on deploying models tailored to specific tasks to optimise energy use. “It is about using the right model for the right job,” he says. “We are also exploring the potential of federated learning, a technique that allows models to be trained across multiple devices without centralising data, further reducing data movement and associated energy use.”
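For readers unfamiliar with the technique, the sketch below captures the federated-averaging idea in miniature: each device takes training steps on data that never leaves it, and only small weight vectors travel to be averaged centrally. It is a generic illustration of FedAvg, not OVHcloud’s implementation.

```python
# Minimal sketch of federated averaging: each device trains on its own
# data and only model weights travel, never the data itself.
import numpy as np

rng = np.random.default_rng(0)

def local_step(weights, local_data, lr=0.1):
    """One gradient step of linear regression on one device's private data."""
    X, y = local_data
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

# Three devices, each holding private data that never leaves the device.
devices = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(3)]
global_w = np.zeros(3)

for round_ in range(20):
    # Each device improves the model locally...
    local_ws = [local_step(global_w, d) for d in devices]
    # ...and only the small weight vectors are averaged centrally.
    global_w = np.mean(local_ws, axis=0)

print("Federated model weights:", global_w)
```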
The lack of standardised metrics for data centre sustainability complicates matters. Cloud providers report different figures, such as power usage effectiveness (PUE) or renewable-energy percentages, making like-for-like comparison difficult. The best indicator remains the location and its local energy mix: data centres in regions with abundant renewable or nuclear energy, like the Nordics or Quebec, are well suited to energy-intensive AI workloads. However, Masse concedes that an industry-wide standard is needed to measure and compare sustainability efforts.
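The weight of location is easy to quantify in outline. In the sketch below, the same hypothetical training job is costed against different grid carbon intensities; the gCO2/kWh figures are rough illustrative estimates, not authoritative grid data.

```python
# Minimal sketch of why location dominates: the same training job, costed
# in CO2 against different grid intensities. The gCO2/kWh figures are
# rough illustrative estimates, not authoritative grid data.
JOB_ENERGY_KWH = 50_000  # assumed energy for one training run

GRID_INTENSITY = {        # grams CO2 per kWh, illustrative
    "hydro-heavy (e.g. Quebec)": 35,
    "nuclear-heavy (e.g. France)": 60,
    "coal-heavy grid": 700,
}

for grid, g_per_kwh in GRID_INTENSITY.items():
    tonnes = JOB_ENERGY_KWH * g_per_kwh / 1_000_000
    print(f"{grid}: {tonnes:.1f} tCO2")
# Identical workload, roughly 20x difference in footprint from location alone.
```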
Cost considerations and cloud accessibility
Data movement costs present another challenge. For industries like animation, the expense of transferring data in and out of the cloud can be prohibitive. “OVHcloud addresses this by including bandwidth in its pricing, offering transparency and predictability,” Masse continues. “Many hyperscalers have complex pricing models, leaving customers unsure of their costs. We prioritise simplicity to support clients with large data needs.”
Long-term investments in AI infrastructure raise questions about return on investment. Companies like Microsoft are betting on AI technologies with a 15-year monetisation horizon, a strategy that worries some analysts. OVHcloud, however, focuses on supporting smaller model training, fine-tuning, and inference, which require less capital-intensive infrastructure. “It is about being realistic with investments and focusing on what delivers value now,” Masse says.
Moreover, Masse highlights the challenge of hardware obsolescence. “With GPUs evolving so quickly, companies can find themselves with outdated infrastructure within a year or two. Leasing hardware or using cloud-based GPUs can mitigate this risk,” he explains. This approach not only saves costs but also aligns with sustainability goals by extending the useful life of hardware across multiple users.
The future of AI infrastructure
As AI evolves, so too will its infrastructure. The growing capability of local devices will enable more on-device processing, reducing reliance on centralised data centres for certain tasks. Yet, the cloud will remain crucial for large-scale training and complex inference operations. “We will see a hybrid model emerge, balancing local and cloud-based processing depending on the use case,” Masse predicts. He adds that edge computing will play a vital role in sectors like autonomous vehicles and smart cities, where low-latency processing is essential.
Sustainability will continue to shape the development of AI infrastructure. Companies must justify the environmental impact of large-scale model training, while cloud providers must prioritise green energy and efficient hardware utilisation. “AI’s future depends on making smart choices about where and how it runs,” Masse concludes. He calls for greater collaboration between industry players, governments, and regulators to develop policies encouraging sustainable AI deployment.
Ultimately, the infrastructure supporting AI is evolving as rapidly as the technology itself. Balancing technological advancement with environmental responsibility and economic viability will be critical. As businesses increasingly rely on AI, understanding where it lives, breathes, and evolves will become essential to navigating the digital landscape responsibly and effectively.