AI infrastructure is being constrained not by software or silicon, but by the physical systems that connect them. Cabling, optics, and density are no longer implementation details; they are defining which AI architectures can actually be built.
The assumption that AI scales cleanly with compute is beginning to fail in places that are difficult to see until it is too late. Systems are designed around GPUs, switching fabrics, and software frameworks, yet the limiting factor often sits beneath that abstraction. It is found in the way systems are physically connected, in the tolerances of fibre, in the geometry of racks, and in the practical reality of installing and maintaining thousands of connections in confined spaces.
This is not a theoretical constraint that might appear at some future threshold. It is already influencing how clusters are built, how quickly they can be deployed, and whether they perform as expected once they are operational. The industry continues to talk about architectures in terms of software-defined capability, but the boundaries are increasingly being set by what can be constructed and sustained in the physical layer.
The problem is that these constraints do not announce themselves early. They surface at scale, often after significant investment has already been made. By that point, the architecture is no longer flexible. It has been fixed in place by decisions that seemed secondary at the time.
Where the network breaks
What has changed with AI is not simply the volume of data moving through networks, but the pattern of that movement. Training workloads distribute data across vast numbers of GPUs, each performing local computation before synchronising results with others. That synchronisation phase is where the system becomes exposed.
Patrick McCabe, Director of Marketing for AI Networks at Nokia, describes the behaviour in terms that remove any ambiguity. “You break up a lot of data across hundreds and thousands and tens of thousands and even hundreds of thousands of GPUs,” he says. “They process locally, but then that data has to be exchanged across the cluster, and that creates massive spikes in the network, well beyond what you would expect in traditional environments.”
Those spikes are not incidental. They are intrinsic to how large models are trained. The network is subjected to bursts of traffic that are both intense and unpredictable, creating conditions where congestion is not a risk but an expectation unless the system is designed to absorb it.
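The pattern is easier to picture with a rough sketch. The code below is a generic, simplified data-parallel training step written against PyTorch's torch.distributed interface; the framework, the batch layout, and the averaging scheme are assumptions chosen for illustration, not details of any vendor's deployment. The all_reduce call is the synchronisation phase McCabe describes, the moment when every GPU stops computing and exchanges results at once.

```python
# Illustrative sketch only: a generic data-parallel training step written
# against PyTorch's torch.distributed API. Framework, batch layout, and
# averaging scheme are assumptions for the example, not article details.
import torch
import torch.distributed as dist

def training_step(model, batch, loss_fn):
    # Assumes dist.init_process_group() has already run, one process per GPU.
    # Each GPU computes gradients on its local shard of the batch.
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()

    # Synchronisation phase: every gradient is summed across all ranks at the
    # same moment. At cluster scale, this collective exchange is the traffic
    # spike the network has to absorb without dropping packets.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size  # average the summed gradients
    return loss
```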
“The network needs to deliver that information in a timely and lossless manner,” McCabe continues. “If you cannot do that, you are costing the operation money, and a lot of money, because the cost of running these environments is already extremely high.”
The cost is not confined to inefficiency. Packet loss can force training processes to restart from earlier states, effectively discarding expensive computation. Reliability becomes a financial parameter rather than a technical metric, and the margin for error narrows accordingly.
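A minimal sketch shows why a restart is so expensive, assuming the common pattern of periodic checkpointing and building on the training step above; the interval and file naming here are invented for illustration. Everything computed since the most recent save is simply discarded when the job rolls back.

```python
# Illustrative sketch only: periodic checkpointing in a training loop. The
# interval, file names, and loop structure are invented for the example.
import torch

CHECKPOINT_EVERY = 500  # steps between saved states (assumed value)

def train(model, optimizer, data_loader, loss_fn):
    for step, batch in enumerate(data_loader):
        # Forward/backward pass plus the gradient all_reduce sketched above.
        training_step(model, batch, loss_fn)
        optimizer.step()
        optimizer.zero_grad()

        if step % CHECKPOINT_EVERY == 0:
            # Work done after this point is lost if the job has to restart,
            # which is how a network-induced failure turns into discarded
            # GPU hours rather than a transient slowdown.
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step},
                       f"checkpoint_{step:07d}.pt")
```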
There is a tendency to treat this as a problem of switching architecture or protocol design. That is only part of the picture. The ability of the network to behave in this way is tied directly to the physical infrastructure that carries the traffic. Latency, loss, and throughput are not abstract properties. They are influenced by the quality of the connections, the density of the deployment, and the practical limitations of how systems are assembled.
The reality of density
The conversation around scaling AI infrastructure often centres on bandwidth and speed, as if those were the primary obstacles. In practice, the difficulty lies elsewhere. Henry Franc, Digital Automation Consultant for data centres at Belden, draws attention to a constraint that is less visible but more immediate. “The cable itself is not typically the issue. We can make cables smaller. The problem is the connectorisation and the ability to deploy and maintain those connections at scale.”
This distinction alters how the problem should be understood. Fibre can be manufactured in extremely dense configurations, but the act of connecting, cleaning, and managing those fibres becomes increasingly complex as density rises. Each connection point introduces potential for failure, and each additional layer of density makes the system harder to work with once it is live.
“There is almost an arms race to build the biggest fibre cable,” Franc says. “But the question is why. You increase density, you increase complexity, and that brings cost and longer lead times. The real challenge is not the number of fibres; it is how you make that density constructible and maintainable.”
That emphasis on constructability is often missing from architectural discussions. It is possible to design a system that performs exceptionally well on paper but proves difficult to build or sustain in practice. The gap between those two states is where projects slow down, costs escalate, and performance degrades.
The issue becomes more pronounced as data centres evolve. GPU clusters are being deployed alongside legacy CPU infrastructure, with additional layers of complexity introduced by emerging technologies. The physical environment becomes heterogeneous, but the expectation is that it will operate with the precision of a uniform system.
Franc points to the risk of treating every deployment as a bespoke exercise. “People are putting infrastructure in ad hoc, saying they do not need structured cabling,” he adds. “But if you treat everything as variable, you lose any economies of scale. You need a balance between what is predictable and what is not.” That balance is difficult to maintain when the underlying technology continues to shift.
The end of comfortable assumptions
The transition away from copper is often discussed in terms of performance, but it is better understood as a consequence of physical limitation. Copper traces cannot be reduced indefinitely without compromising signal integrity. As data rates increase, those limitations become harder to work around.
This is pushing the industry towards optical solutions that were once considered advanced but are now becoming necessary. Co-packaged optics, where optical interfaces are integrated directly into silicon, are emerging as a response to constraints that can no longer be managed through incremental improvement.
“You can only make a copper trace so small before it loses effectiveness,” Franc continues. “That is why you are seeing co-packaged optics, where the optics are brought directly onto the chip. It helps with density, throughput, and power efficiency, which then feeds back into cooling requirements.”
What is often overlooked is how these changes ripple through the rest of the system. Improvements in interconnect efficiency affect power consumption, which in turn influences cooling design and overall facility planning. The physical layer is not isolated. It is entangled with every other aspect of the infrastructure.
At the same time, fibre technologies are evolving in ways that address some of the practical challenges. Expanded beam optics, for example, reduce sensitivity to contamination by widening the optical path, making connections less vulnerable to dust and handling errors. Multi-core fibres increase the amount of data that can be transmitted within a given physical footprint. These developments offer incremental relief, but they do not remove the underlying constraint. They shift it.
Designing without a fixed future
The idea that infrastructure can be future-proofed sits uneasily with the pace of change in AI. Hardware cycles are measured in years, sometimes less. Data centre construction operates on longer timelines. The two are not aligned. “People like to talk about designing for the future, but nobody knows what that will be,” Franc explains. “What will the next generation of hardware look like in three years? Nobody can answer that. So designing for it is often wasted effort.”
The alternative is not to abandon planning, but to change its focus. Instead of attempting to predict specific requirements, the emphasis shifts to making change manageable when it arrives. “The goal is not to reuse every piece of material,” he continues. “The goal is to make changes easy. If you build modular, scalable infrastructure, you can adapt even if you cannot reuse everything.”
This aligns with the direction of network architecture more broadly. McCabe describes a move towards flatter, wider designs that reduce latency and improve utilisation. The same principle applies at the physical layer. Systems must be able to expand and reconfigure without being constrained by rigid structures.
There is an inherent tension in this approach. Flexibility often requires upfront investment in standardisation and modularity, at a time when speed of deployment is prioritised. The temptation is to optimise for the immediate requirement, leaving future adaptation as a secondary concern. That decision carries consequences.
When infrastructure becomes the architecture
The distinction between infrastructure and architecture is becoming less meaningful. Decisions about cabling, optics, and physical layout are shaping the behaviour of systems in ways that were previously associated with higher layers of the stack. This is reflected in how networking vendors are engaging with the market. “We do not come in and say here is our network piece and leave,” McCabe explains. “Everything is becoming verticalised into a system. We work with partners to create an integrated solution, because the network is only one part of it.”
That integration extends across the entire environment. Optical transport, cabling strategies, switching architectures, and compute platforms are being designed together, not sequentially. The boundaries between them are becoming harder to define.
The move towards modular, repeatable designs is one response to this complexity. It introduces a level of standardisation that can support scale without requiring each deployment to be engineered from first principles.
“You would not build a skyscraper from scratch every time,” Franc says. “You use standard components to make it repeatable and easier to construct. Data centres need to move in the same direction.” The comparison is not perfect, but it highlights the underlying issue. The industry is attempting to build at scale using approaches that do not scale well.
The constraint that does not move
The physical layer has always been present, but it has rarely been treated as a primary concern. That is changing because it does not adapt at the same pace as the technologies it supports. Software can be updated. Architectures can be redesigned. Hardware can be replaced. The physical infrastructure, once installed, is far less forgiving. It defines the boundaries within which everything else must operate.
The challenge is that these boundaries are often invisible until they are encountered. They do not appear in design diagrams or performance models. They emerge in the act of building and operating systems at scale. AI has brought those boundaries into focus. The question is no longer how far models can scale in theory, but how far they can scale within the constraints of the physical systems that support them.
That distinction is beginning to reshape how infrastructure is designed. It is forcing a shift away from viewing the physical layer as an implementation detail and towards recognising it as a determining factor in what is possible.
The cables, the connectors, the pathways through which data moves: these are no longer secondary considerations. They are the limits within which AI must operate, and they are not moving at the same speed as the ambitions built on top of them.