The ecosystem engine behind the AI factory


An AI factory does not fail at full load. It fails much earlier, in the handoffs between systems that were never designed to operate as one. As power densities climb and deployment timelines collapse, the industry is discovering that the real risk is not whether the technology works, but whether the infrastructure can hold together under pressure.

There is a moment in every large-scale deployment where the design meets reality. It does not happen when the GPUs are switched on, or when the first workload is run, but much earlier in the process. It happens during commissioning, when electrical, thermal, and control systems are forced to operate together for the first time, exposing every inconsistency, every assumption, and every weakness in how the system has been assembled.

“The biggest issues are not necessarily the technologies themselves,” Martin Olsen, Vice President, Segment Strategy & Deployment, Vertiv, says. “It is all within the seams. It is all in between all these different products and how it all fits together. We have been used to bringing a lot of different components together and making them operate as a system, but that gets really challenging once you get to these kinds of densities and the problems that come with them. It is not the individual products; it is how everything connects.”

At the scale of modern AI infrastructure, those seams have become the defining constraint rather than a secondary engineering consideration. What was once a manageable integration challenge is now a systemic risk, amplified by the speed at which capacity must be delivered and the financial pressure attached to that delivery. The result is an environment where infrastructure must perform as a complete system from the outset, rather than being assembled incrementally and stabilised over time.

“Time to capacity is by far the number one driver,” Olsen adds. “We have not seen that slow down, and there are good reasons for that. Billions of dollars are being spent, and it needs to get up and running very, very quickly. At the same time, you are dealing with workloads that behave very differently. You get sharp spikes in power draw, going from zero to 140 percent of capacity very quickly, and that puts strain not just on the electrical system but also on the thermal side, which has to react at the same speed.”

The infrastructure challenge is further compounded by the imbalance between compute and the systems required to support it. Mechanical, electrical, and plumbing layers are expanding at a rate that fundamentally changes the nature of the problem, shifting the centre of gravity away from silicon and towards integration. This is not a marginal shift, but a structural change in how AI infrastructure must be designed and delivered.

“As we scale up in density, we are not just scaling compute,” Olsen continues. “Mechanical, electrical, and plumbing systems are increasing dramatically relative to compute. It is not a linear relationship. You are introducing complexity across every layer of the infrastructure, and that complexity must be managed as a system. Otherwise, it becomes the limiting factor.”

Density breaks the model

The increase in rack density is not simply an engineering milestone or a continuation of historical trends. It represents a point at which traditional approaches begin to fail under the weight of complexity and interdependence. As density increases, the tolerance for error decreases, and the consequences of misalignment between systems become more severe.

“We had the first test with systems going to about 140 kilowatts in a rack,” Olsen explains. “From there, it moves to over 300 kilowatts, then 600, and we are already designing for a megawatt in a rack. You can see how quickly that grows, and one of the biggest issues is not the portfolio or the technology itself. It is how everything fits together across those densities.”

The behaviour of AI workloads only intensifies this challenge, introducing volatility that infrastructure must absorb in real time. These workloads do not follow predictable patterns, and their impact is felt simultaneously across power and thermal systems. This creates a level of dynamic stress that legacy infrastructure models were never designed to handle.

“If you look at GPU training workloads, you get very sharp spikes in power draw,” Olsen says. “It is not just a megawatt. It is several hundred megawatts that will go from zero to 140 percent of capacity very quickly. That puts strain on the electrical system, but also on the thermal side. Power in equals heat out, so you see the corresponding output on the thermal side that you need to react to at the same speed, and thermal systems have never been designed to react that rapidly.”
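To make that coupling concrete, a minimal sketch, using a simple first-order cooling loop and entirely hypothetical figures, shows why the thermal side struggles to keep pace: the electrical draw can step from near idle to well above rating in seconds, while heat removal only catches up at the rate set by the loop's time constant.

```python
# Illustrative sketch only: a step power spike hitting a cooling loop that
# responds with a first-order lag. The time constant, rack rating, and load
# profile are assumed for illustration, not measured values.
tau_s = 30.0        # assumed thermal response time constant of the loop (s)
dt_s = 1.0          # simulation step (s)
rated_kw = 140.0    # assumed rack rating (kW)

heat_removed_kw = 0.0
for t in range(120):
    # Draw steps from near idle to a sharp spike above rating, then back down.
    power_in_kw = 1.4 * rated_kw if 20 <= t < 60 else 0.05 * rated_kw
    # Power in equals heat out, but the loop only closes the gap gradually.
    heat_removed_kw += (power_in_kw - heat_removed_kw) * (dt_s / tau_s)
    lag_kw = power_in_kw - heat_removed_kw  # heat piling up as temperature rise
    if t % 10 == 0:
        print(f"t={t:3d}s  draw={power_in_kw:6.1f} kW  removed={heat_removed_kw:6.1f} kW  lag={lag_kw:6.1f} kW")
```

During the spike the removal rate trails the draw by tens of kilowatts, which is the gap a real thermal chain has to absorb as temperature rise.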

This combination of density and volatility forces a shift in how infrastructure is conceived and validated. It can no longer be treated as a collection of systems that are integrated after the fact. Instead, it must be designed, tested, and proven as a coordinated whole before it is ever deployed.

The problem between the systems

The industry has already begun adopting new technologies such as liquid cooling, but this does not resolve the underlying issue. The real challenge lies not in individual technologies, but in how they are brought together and made to operate as a unified system. This is where traditional delivery models begin to break down.

“If you take something like the secondary fluid network, it looks simple once it is put together, but it is a combination of a lot of individual work and a lot of field work,” Olsen explains. “You have the CDU, the manifolds, the piping, everything going to the servers. When you go out and commission one of these systems, some of the things you find are debris, tools, all kinds of issues. It is simply because it has been built as a construction site rather than as a product.”

What appears to be a localised issue quickly becomes systemic as complexity increases. Small inconsistencies introduced during assembly can propagate through the system, affecting performance and reliability over time. At the scale of AI infrastructure, these effects are amplified rather than contained.

“If you do not get that right from day one, the reliability curve gets very narrow,” Olsen adds. “You start to see problems early, and those problems propagate throughout the system. That is just one part of the infrastructure, and it shows how the seams become the biggest challenge.”

From craftsmanship to industrialisation

The roots of the problem lie in how data centres have historically been designed and constructed. The industry evolved around a model that relied heavily on experience, tacit knowledge, and site-based problem solving. While effective at lower densities, this approach does not scale to the requirements of AI.

“Ten years ago, this industry was still very much a craftsmanship model,” Olsen says. “It relied on institutional knowledge to bring all these pieces together. It was product oriented, designed once and then repeated manually at every site. There was not a lot of repeatability in it.”

That model introduces variability at precisely the point where consistency is required. As systems become more complex, the margin for error narrows, and the impact of small deviations increases. This makes the reliance on manual processes increasingly problematic.

“If you look at a megawatt of AI factory infrastructure, about half of the cost is labour,” Olsen explains. “It is design work, engineering work, installation work. Half of it is people walking around doing individual jobs. Anytime you have that level of human interface, there is a propensity for problems, and that introduces variability into the system.”

The response is to shift towards industrialisation, where infrastructure is treated as a repeatable, engineered product. This involves moving work upstream into controlled environments and reducing reliance on site-based assembly. It also requires a fundamental change in how systems are designed and delivered.

“We need to drive repeatability into the products themselves and continue to industrialise more of the infrastructure,” Olsen continues. “That is how you reduce variability and increase reliability. Instead of building everything on site, you bring it into the factory, you productise it, and you deliver it as a controlled system.”

Designing the system before it exists

This transition changes not only how infrastructure is built, but how it is designed. The traditional balance between design and deployment is being reversed, with greater emphasis placed on validation before construction begins. This reduces risk and improves predictability at scale.

“Traditionally, about 20 percent of the time has been spent on design and 80 percent on the physical deployment. We are flipping that,” Olsen adds. “We are spending 80 percent of the time getting it right in a virtual environment, and then the last 20 percent on deployment.”

Digital twins and simulation environments enable this shift by allowing entire systems to be modelled and tested in advance. This includes performance validation, failure scenarios, and interactions between different subsystems. The result is a far higher degree of certainty before physical deployment.

“You can build the full virtual model and run it through its paces before ordering a single product,” Olsen says. “You can simulate performance, failure scenarios, and interactions across the entire system. That allows you to resolve issues before they become real-world problems.”
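As a rough illustration of the principle, rather than of any specific digital-twin tooling, even a toy model can sweep failure scenarios before a single product is ordered: describe each subsystem's capacity, remove units one or two at a time, and check whether what remains still carries the design load. All names, capacities, and the redundancy scheme below are hypothetical.

```python
# Toy failure-scenario sweep over a simplified system model. Unit names,
# capacities, and the design load are hypothetical, for illustration only.
from itertools import combinations

design_load_kw = 1000.0
units = {
    "ups_a": 600.0, "ups_b": 600.0,   # electrical blocks
    "cdu_1": 700.0, "cdu_2": 700.0,   # coolant distribution units
}

def survives(failed):
    """The load is covered only if both the electrical and thermal groups
    retain enough healthy capacity after the given failures."""
    elec = sum(kw for name, kw in units.items() if name.startswith("ups") and name not in failed)
    therm = sum(kw for name, kw in units.items() if name.startswith("cdu") and name not in failed)
    return min(elec, therm) >= design_load_kw

for n_failures in (1, 2):
    for failed in combinations(units, n_failures):
        status = "OK" if survives(set(failed)) else "LOAD AT RISK"
        print(f"fail {failed}: {status}")
```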

This approach also creates continuity between design and operation. The digital twin becomes a living model that supports optimisation and maintenance over time. This ensures that the system behaves as intended, not just at commissioning, but throughout its lifecycle.

“The digital twin becomes the design record,” he explains. “It lives with the system, not just for day zero, but through day one and day two operations. That is how you ensure the system behaves as intended and how you continue to optimise it over time.”

Orchestrating the AI factory

As infrastructure becomes more interconnected, coordination must extend beyond individual domains and technologies. No single provider controls every element of the system, yet all elements must work together seamlessly. This creates a need for orchestration across the entire infrastructure stack.

According to Olsen, there will always be technologies outside of what Vertiv directly provide, such as power generation. “Whether it is on-site generation, turbines, or other systems, it has to be integrated into the overall infrastructure,” he continues. “That requires an orchestration layer that can manage the entire system.”

Without this level of coordination, inefficiencies accumulate and performance is compromised. Systems are often overdesigned to compensate for uncertainty, leading to unused capacity and increased cost. These inefficiencies compound as scale increases.

“If you are not designing at that level, you typically end up with around 20 percent stranded capacity,” Olsen explains. “That is because one part of the system is oversized relative to another, and those inefficiencies compound as you move downstream. You end up with capacity that you cannot fully use.”
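The arithmetic behind stranded capacity is easy to sketch with made-up numbers: the usable capacity of the chain is bounded by its weakest subsystem, so anything provisioned above that minimum elsewhere cannot be used.

```python
# Back-of-envelope stranded-capacity arithmetic with illustrative figures only:
# usable end-to-end capacity is set by the smallest link in the chain, and
# anything provisioned above that minimum is stranded.
provisioned_kw = {"utility_power": 1200, "ups": 1000, "cooling": 950, "white_space": 1100}

usable_kw = min(provisioned_kw.values())
for name, kw in provisioned_kw.items():
    stranded_kw = kw - usable_kw
    print(f"{name:14s} provisioned {kw:5d} kW, stranded {stranded_kw:4d} kW ({stranded_kw / kw:.0%})")
print(f"usable end-to-end capacity: {usable_kw} kW")
```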

The same principle applies to energy efficiency and reuse, where opportunities are often lost due to fragmented design. Heat generated within the system can be captured and reused, but only if the infrastructure is designed with that objective in mind. This requires coordination across power, cooling, and control systems.

“There is an enormous amount of heat coming off these systems,” Olsen says. “If you think about power generation, there is heat that can be recaptured and reused. But you can only do that if you are thinking about the system as a whole, not as individual components.”
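As a back-of-envelope illustration of the scale of that opportunity, with assumed figures rather than design values: essentially all IT power ends up as heat, so even a partial capture fraction represents a large reusable energy stream per megawatt of load.

```python
# Rough heat-reuse arithmetic with assumed figures, for illustration only.
it_load_mw = 1.0          # hypothetical block of IT load
capture_fraction = 0.7    # assumed share of heat recoverable via the liquid loop
hours_per_year = 8760

recoverable_mwh = it_load_mw * capture_fraction * hours_per_year
print(f"~{recoverable_mwh:,.0f} MWh of recoverable heat per year per MW of IT load")
```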

Building infrastructure as a product

The culmination of this approach is the transition from bespoke construction to productised infrastructure. Systems are designed as integrated units, with defined interfaces and repeatable components. This reduces variability and enables consistent performance across deployments.

Olsen explains that Vertiv have taken the approach of building this as a converged physical infrastructure. “That means repeatable building blocks, defined interfaces, system orchestration, digital continuity, and lifecycle assurance,” he adds. “It is not about putting pieces together and hoping they work. It is about designing the system to operate as one from the beginning.”

This model allows infrastructure to be delivered as a complete, prefabricated system rather than assembled on site. Components are manufactured, tested, and validated before deployment, reducing risk and accelerating timelines. The result is a more predictable and scalable approach to building AI factories.

“We are productising the entire infrastructure, from the power train to the thermal chain to the white space itself,” Olsen says. “That includes prefabricated modules, factory-tested building blocks, and defined interfaces. Instead of constructing everything on site, you are deploying a system that has already been engineered and validated.”

The benefits are significant and measurable, particularly in terms of speed, efficiency, and cost. By removing variability and improving integration, organisations can deliver capacity more quickly and with greater confidence. This is critical in an environment where time to capacity directly impacts value.

“You can achieve up to 50 percent faster deployment, significantly less on-site work, a smaller footprint, and a lower total cost of ownership,” Olsen explains. “Those improvements come from removing variability and ensuring that the system works as intended before it is deployed.”

Scaling to the gigawatt

At scale, this approach becomes essential rather than optional. The complexity of gigawatt-scale AI factories cannot be managed through traditional methods. Standardisation, prefabrication, and system-level design are required to deliver and operate these environments effectively.

Olsen explains that Vertiv are designing for gigawatt-scale AI factories. “That requires a level of coordination and repeatability that cannot be achieved through traditional methods,” he continues. “You need standardised building blocks that can scale from smaller deployments all the way up to those levels.”

Olsen adds that at the same time, infrastructure must remain adaptable to future changes in technology. Compute will continue to evolve, and infrastructure must support that evolution without requiring fundamental redesign. This requires a separation between infrastructure and the systems it supports. “You design once and deploy everywhere, but you also maintain optionality,” he says. “You can upgrade the compute without fundamentally changing the infrastructure. You take the rack out and install the new one, and the system continues to operate.”

The AI factory is no longer a static asset, but an evolving system that must support continuous iteration. The organisations that succeed will be those that can manage this complexity while maintaining performance and reliability. The challenge is no longer whether the infrastructure can be built, but whether it can be built as a system.

“The question is no longer if this future will be built, but how quickly, and by whom,” Olsen concludes. “The organisations that can bring together the ecosystem and make it operate as a single system will be the ones that define the next phase of AI infrastructure.”
