The network revolution powering AI infrastructure

Artificial intelligence is transforming the economics and architecture of the modern data centre, but the most profound disruption may be occurring inside the network itself. As GPU clusters scale to unprecedented sizes and data flows accelerate, traditional networking assumptions are breaking down, forcing operators to rethink everything from switching fabrics to operational management.

The global race to build artificial intelligence infrastructure has focused largely on processors, GPUs and the extraordinary energy demand of next-generation data centres. Yet beneath those headline technologies lies another system undergoing rapid and consequential change. The data centre network, once considered a relatively stable layer of enterprise infrastructure, is rapidly emerging as one of the most critical components of the AI era. As organisations invest billions into accelerated computing environments, the ability of networks to move data efficiently between thousands of processors is becoming a defining factor in the performance and economics of AI systems.

For decades, data centre networks evolved in line with relatively predictable enterprise workloads. Most applications generated north-south traffic flows, moving data between users and centralised services. Architectures such as three-tier networks and later spine-leaf fabrics were designed around this pattern, delivering reliable performance as compute capacity gradually increased. Scaling the network typically meant adding ports, increasing link speeds and refining management tools, but the underlying assumptions about traffic behaviour remained largely unchanged.

Artificial intelligence is dismantling those assumptions. Training large language models and running advanced inference workloads requires thousands of accelerators to communicate continuously across the network, exchanging enormous volumes of data during every stage of computation. Instead of predictable application traffic, AI clusters generate intense east-west communication as GPUs exchange parameters, gradients and intermediate results across distributed systems. Network performance therefore becomes tightly coupled with the performance of the computing infrastructure itself.

Andrew Lerner, Distinguished Analyst at Gartner, believes the industry is only beginning to grasp the scale of the architectural shift underway. He explains that the data centre switching market has undergone substantial change over the past eighteen months as organisations attempt to build networks capable of supporting large-scale AI clusters.

“Through 2028, organisations that attempt to use existing data centre switching infrastructure to support AI and generative AI workloads will waste at least 30 percent of their processing capability,” Lerner says. “In practical terms, inefficient network architecture can directly limit the performance of expensive GPU clusters, increasing training times and reducing the economic efficiency of AI infrastructure.”

Demand for high-performance networking is also expanding well beyond hyperscale cloud providers. Sovereign AI platforms, neocloud providers and enterprises building their own accelerated computing environments are all investing in switching fabrics capable of supporting massive GPU clusters. These systems already rely on networks operating at 400 gigabits per second, with 800-gigabit technologies rapidly approaching mainstream deployment.

Meeting these demands is forcing vendors and operators to rethink almost every aspect of data centre networking. Traditional hierarchical architectures are giving way to fabrics designed specifically for large-scale parallel computing workloads, where massive east-west communication and extremely low latency determine how effectively AI infrastructure can operate.

AI workloads break traditional network assumptions

If artificial intelligence is forcing a rethink of data centre networking, the reason lies in the way AI systems communicate. Traditional enterprise applications generated traffic patterns that were relatively predictable and manageable. Data typically flowed between users and centralised applications, creating the north-south traffic patterns that shaped data centre architecture for decades. AI workloads behave very differently.

Henry Franc, Digital Automation Consultant for Data Centers at Belden, explains that the defining characteristic of AI infrastructure is the sheer volume of communication generated by distributed computing. “North-south architectures resemble a tree structure,” Franc explains. “But AI environments increasingly behave like a mesh, where everything needs to communicate with everything else.”

Networks designed for traditional enterprise workloads often struggle to distribute traffic evenly across thousands of simultaneous connections between GPUs. Even small inefficiencies in traffic distribution can introduce latency or congestion that slows training jobs across entire clusters.

The scale of the change becomes clearer when viewed through the lens of infrastructure density. Franc notes that power consumption within data centres has increased dramatically over the past two decades as computing systems have become more powerful. “In the late 1990s we were designing facilities that consumed roughly one kilowatt per cabinet,” he says. “Today it is common to see 15 or 20 kilowatts per cabinet in enterprise environments, and significantly higher levels in high performance computing facilities. In the most advanced AI environments, power densities of 100 kilowatts or more per rack are becoming increasingly common.”

These changes place enormous pressure not only on network architectures but also on the physical infrastructure that supports them. Cabling systems, switching fabrics, cooling systems and power distribution must all evolve together to support the scale and performance requirements of modern AI clusters.

Franc argues that this requires a more holistic approach to infrastructure design in which networking, power and cooling are treated as interconnected elements of a single architectural problem. “Every generation of computing demands more data,” he says. “In the early days people believed we would never need more than a megabit of bandwidth for office networks. Since then we have continuously expanded the amount of data we consume and process. AI is simply the latest and most dramatic expression of that trend.”

The result is a networking environment in which traditional assumptions are being replaced by architectures designed for massive parallel computing, where thousands of accelerators must exchange data simultaneously across distributed systems.

The emergence of AI network fabrics

If Belden highlights the architectural challenge created by AI workloads, networking vendors are responding with new fabric designs capable of supporting distributed accelerator clusters.

One of the defining characteristics of modern AI training environments is the constant cycle of computation and communication between processors. As models are trained, thousands of accelerators perform calculations locally, exchange data with their peers across the network and then begin the next stage of computation. As outlined in Arista’s AI Networking white paper, large-scale AI workloads typically follow a compute-exchange-reduce pattern in which processors repeatedly process data, exchange intermediate results with other systems and then combine those results before continuing the training process.

A substantial portion of overall training time in distributed AI environments can therefore be spent moving data between processors rather than performing the calculations themselves. In practice this means that bottlenecks in the network fabric can directly increase the time required to complete model training.
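
That cycle is easiest to see in code. The following is a minimal, purely illustrative Python sketch of data-parallel training, assuming a handful of simulated workers and a toy in-process all-reduce; real clusters would use collective communication libraries such as NCCL or MPI over the fabric, and the worker counts and gradient sizes here are placeholders rather than anything described in the white paper.

```python
# Minimal sketch of the compute-exchange-reduce cycle (illustrative only).
# Each "worker" stands in for an accelerator; all_reduce() is a toy,
# in-process stand-in for the collective traffic a real fabric would carry.

import random

NUM_WORKERS = 8   # stand-in for the accelerators in a cluster
STEPS = 3         # training iterations to simulate
GRAD_LEN = 4      # toy gradient size


def local_compute(worker_id: int, step: int) -> list[float]:
    """Compute phase: each worker derives a local gradient from its own data shard."""
    rng = random.Random(worker_id * 1000 + step)
    return [rng.uniform(-1.0, 1.0) for _ in range(GRAD_LEN)]


def all_reduce(gradients: list[list[float]]) -> list[float]:
    """Exchange + reduce phase: sum every worker's gradient and average it,
    so all workers continue from the same synchronised state."""
    num_workers = len(gradients)
    return [sum(vals) / num_workers for vals in zip(*gradients)]


for step in range(STEPS):
    # 1. Compute: every worker processes its shard in parallel.
    grads = [local_compute(w, step) for w in range(NUM_WORKERS)]

    # 2. Exchange and reduce: the network sits on the critical path here --
    #    no worker can begin the next step until the collective completes.
    synced = all_reduce(grads)
    print(f"step {step}: synchronised gradient {[round(g, 3) for g in synced]}")
```

Because every step blocks on that exchange, a slow or congested link anywhere in the fabric stalls the entire cluster, not just the workers attached to it.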

The scale of these environments is expanding rapidly. Arista’s analysis of large AI deployments shows that clusters used for training advanced models can contain tens of thousands of servers equipped with multiple accelerators operating as a single distributed system. In one scenario explored in the white paper, a cluster containing more than 16,000 AI servers would interconnect over 130,000 accelerators through a high-speed Ethernet fabric, requiring thousands of switches and hundreds of thousands of optical connections operating together as a unified network.
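
A rough back-of-envelope sketch shows how figures of that order arise. The assumptions below, eight accelerators per server, one fabric port per accelerator and 64-port leaf switches, are illustrative choices rather than numbers taken from the white paper, but they make clear why switch and optic counts escalate so quickly.

```python
# Back-of-envelope fabric sizing (hypothetical assumptions, not Arista's figures).
import math

servers = 16_384            # "more than 16,000 AI servers"
accel_per_server = 8        # assumed accelerators per server
switch_ports = 64           # assumed ports per leaf switch

accelerators = servers * accel_per_server                    # 131,072 endpoints

# Leaf layer: half of each switch's ports face accelerators, half face uplinks.
leaf_down_ports = switch_ports // 2
leaf_switches = math.ceil(accelerators / leaf_down_ports)    # 4,096 leaves

# Every accelerator link and every leaf uplink needs an optical transceiver
# at each end; spine and super-spine tiers add thousands more switches on top.
leaf_uplinks = leaf_switches * (switch_ports // 2)
links = accelerators + leaf_uplinks
optics = links * 2

print(f"accelerators:        {accelerators:,}")
print(f"leaf switches:       {leaf_switches:,}")
print(f"links (host+uplink): {links:,}")
print(f"optical modules:     {optics:,}")
```

Even before counting the spine tiers needed to interconnect those leaves, the totals land in the same order of magnitude the white paper describes: thousands of switches and hundreds of thousands of optical connections.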

These developments are driving renewed interest in Ethernet as the foundational technology for AI networking. High-performance computing clusters historically relied heavily on specialised interconnect technologies such as InfiniBand. However, Arista argues that Ethernet’s openness and vast ecosystem make it increasingly attractive as a unifying technology capable of supporting both enterprise networking and AI training fabrics within the same operational environment.

At the same time, the underlying protocols governing network communication are evolving to address the unique characteristics of AI workloads. In its Demystifying Ultra Ethernet white paper, Arista explains that large-scale AI environments generate extremely high-volume data flows that begin abruptly and terminate just as quickly. These patterns challenge traditional networking approaches that rely on distributing traffic flows along fixed paths through the network.
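
Why fixed-path load balancing struggles with this traffic is easy to demonstrate. The sketch below is a purely illustrative simulation, with made-up flow and path counts, comparing classic per-flow hashing against spreading the same traffic evenly; the small number of very large flows typical of AI collectives tends to land unevenly when each flow is pinned to a single path.

```python
# Illustrative only: hash a few very large "elephant" flows onto fixed
# equal-cost paths (per-flow hashing) and compare with spreading the same
# traffic evenly across all paths. Flow and path counts are made up.

import hashlib
import random

NUM_PATHS = 8            # equal-cost links between two switching tiers
NUM_FLOWS = 16           # AI collectives: few flows, each very large
FLOW_SIZE_GB = 10


def path_for(flow_id: int) -> int:
    """Pick a fixed path from a hash of the flow identity (a 5-tuple stand-in)."""
    digest = hashlib.sha256(str(flow_id).encode()).digest()
    return digest[0] % NUM_PATHS


random.seed(7)
flows = [random.getrandbits(32) for _ in range(NUM_FLOWS)]

# Per-flow hashing: each flow sticks to one path for its whole lifetime.
per_flow_load = [0] * NUM_PATHS
for flow in flows:
    per_flow_load[path_for(flow)] += FLOW_SIZE_GB

# Even spreading: the same volume divided across every path.
even_load = [NUM_FLOWS * FLOW_SIZE_GB / NUM_PATHS] * NUM_PATHS

print("per-flow hashing (GB per path):", per_flow_load)
print("even spreading   (GB per path):", even_load)
```

With only a handful of flows per path, some links typically end up carrying several flows while others carry none, and the busiest link then sets the pace for the whole collective.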

The Ultra Ethernet Consortium is therefore developing a communications stack designed specifically for AI and high-performance computing workloads. The initiative aims to enhance Ethernet with new transport mechanisms and congestion control capabilities that allow networks to handle massive bursts of synchronised traffic more efficiently.

Together these developments illustrate how rapidly networking technology is evolving in response to AI infrastructure requirements. As training clusters scale into tens of thousands of accelerators and bandwidth requirements approach terabit speeds, the network fabric itself is becoming one of the most critical elements of modern data centre design.

From one network to many networks

If the evolution of AI networking fabrics represents a technological shift, the architectural implications run deeper. For decades most data centres operated around a relatively simple networking model built on a single fabric connecting compute, storage and applications. AI infrastructure is dismantling that simplicity by introducing multiple specialised networks operating simultaneously within the same environment.

Murali Gandluru, Vice President of Product Management for Data Center Networking at Cisco, explains that the emergence of large-scale AI clusters is forcing architects to rethink networking design. “When we talk about networking for AI, we are no longer talking about a single data centre network,” Gandluru says. “Historically we had one Ethernet fabric that connected servers, storage and the wide area network. With AI we now have multiple specialised networks that serve very different roles within the same infrastructure.”

The first of these is the scale-out network, which connects servers and accelerators across an entire training cluster. “The scale-out network is where you connect large numbers of GPU servers together so that they can operate as a single training cluster,” Gandluru explains. “The communication between GPUs during model training requires far more capacity and far lower latency than the networks we historically built for enterprise workloads.”

Alongside the scale-out fabric sits the scale-up network, which operates inside individual server racks and connects accelerators directly to one another. “The scale-up network is about enabling GPUs to communicate with each other at incredibly high bandwidth,” Gandluru says. “In many cases this can involve communication speeds that are orders of magnitude higher than what we traditionally saw in data centre networks.”

This architectural layering means that modern AI environments must support multiple communication patterns simultaneously. Large distributed clusters rely on high-capacity scale-out fabrics, while tightly coupled accelerator systems depend on ultra-fast scale-up interconnects within racks.

When AI leaves the data centre

While much of the discussion around AI infrastructure focuses on massive training clusters inside hyperscale data centres, the next stage of the AI evolution is already pushing intelligence beyond those facilities. As enterprises move from experimentation to operational deployment, artificial intelligence is increasingly being embedded directly into the environments where decisions must be made.

Jeremy Foster, Senior Vice President and General Manager for Cisco Compute, argues that the industry is entering a phase where AI workloads are spreading across a far wider operational footprint than traditional cloud architectures ever required.

“AI is not waiting for the data centre anymore,” Foster says. “It is happening everywhere that data is created, in factories, in hospitals, in logistics hubs and in retail environments where milliseconds matter and where decisions directly shape operational outcomes.”

This shift is largely being driven by inference workloads. While training large models typically occurs inside highly specialised clusters located in hyperscale data centres, the application of those models increasingly happens much closer to the source of data generation. Industrial equipment, medical imaging systems, autonomous vehicles and retail platforms all produce vast streams of operational data that must be analysed in real time.

Moving that data back and forth between centralised data centres introduces latency, bandwidth consumption and potential security risks that many organisations now seek to avoid. “Across industries we are seeing a clear trend toward moving AI closer to where data is born,” Foster says. “Customers recognised that artificial intelligence was outgrowing the boundaries of the traditional data centre. They needed the power of data centre infrastructure to exist much closer to the front lines of the enterprise.”

This trend is giving rise to a far more distributed AI infrastructure landscape. Rather than relying solely on centralised facilities, organisations are building hybrid environments in which large training clusters coexist with regional inference infrastructure and edge deployments located inside operational environments.

From a networking perspective, this distribution dramatically increases complexity. Data must move efficiently between centralised training environments and decentralised inference locations, while maintaining low latency and high reliability across widely dispersed systems. Networks must therefore support not only massive east-west traffic inside training clusters but also wide-area communication between data centres, edge infrastructure and enterprise environments.

In practice this means that AI networking is no longer confined to the boundaries of the data centre itself. It now spans entire digital ecosystems.

Operating AI networks at scale

As AI infrastructure expands in both size and complexity, the challenge for operators is no longer simply building high-performance fabrics. Running these networks reliably is becoming an equally significant problem.

Ben Baker, Senior Director of Cloud and Data Center Marketing and Business Analysis at Juniper Networks, argues that operational complexity has always been one of the defining challenges of large data centre networks. The scale of modern AI infrastructure is amplifying those challenges dramatically.

“People are often afraid to touch their networks,” Baker says. “Many engineers feel nervous before, during and after making a change because they worry that something might break. Modern networks contain thousands of devices, dozens of protocols and enormous amounts of operational data.”

As AI clusters grow to include tens of thousands of accelerators connected through multiple switching tiers, the operational consequences of network disruption become far more significant. Small configuration errors or congestion events can affect training jobs worth millions of dollars in compute resources.

“The whole purpose of a data centre network is to support the applications running on top of it,” Baker explains. “But operators often have very limited visibility into what those applications are doing across the infrastructure. When something goes wrong the network team can end up drowning in telemetry while still lacking the insights needed to identify the root cause.”

This is one reason why artificial intelligence is increasingly being applied to network operations themselves. “AI is extremely powerful when it comes to analysing large volumes of operational data and identifying correlations that humans might miss,” Baker says. “But networking also requires deterministic behaviour. The real value comes from combining AI-driven insights with deterministic systems that ensure the infrastructure operates reliably.”

In other words, the rise of AI infrastructure is not only transforming how networks are designed. It is also reshaping how they are operated. Operators must manage environments where traffic patterns fluctuate rapidly according to training cycles, inference demand and system behaviour across distributed environments.

Automation, telemetry and predictive analytics are therefore becoming fundamental operational capabilities rather than optional enhancements.

The network becomes the system

Taken together, these developments illustrate how profoundly artificial intelligence is reshaping the role of networking inside the modern data centre. Traffic patterns are changing, new network fabrics are emerging and entirely new architectural layers are appearing as scale-up and scale-out networks operate alongside traditional enterprise infrastructure. At the same time, inference workloads are pushing intelligence toward the edge, extending AI infrastructure far beyond the boundaries of the data centre.

What emerges is a very different picture of networking from the one that existed only a few years ago. Rather than acting simply as the connective tissue between servers and storage, the network is increasingly becoming an integral part of the computing system itself.

The performance of AI infrastructure is no longer determined solely by the speed of individual processors or the density of accelerators inside server racks. It is shaped by the ability of thousands of computing systems to communicate continuously and reliably across distributed environments.

In this context, networking becomes a defining factor in the economics of AI infrastructure. Inefficient fabrics waste expensive compute resources, while poorly designed architectures limit the scale at which models can be trained and deployed. Conversely, high-performance network architectures enable organisations to extract greater value from their compute investments by ensuring that processors spend more time performing calculations and less time waiting for data.

The transformation now underway therefore represents more than an incremental evolution of data centre networking. It is a structural shift in how computing systems are built.

In the era of artificial intelligence, networking is no longer simply a background technology. It is rapidly becoming one of the foundational systems that will determine how effectively the next generation of digital infrastructure can operate.
