The race to scale AI has been framed as a compute problem, but that framing is beginning to fracture under pressure. As systems expand and performance gaps widen, it is the movement of data, not just its processing, that is becoming the defining constraint.
The industry has spent the better part of a decade fixated on compute, on the accelerating performance of GPUs, the size of training runs, and the competitive advantage that comes with scaling models faster than rivals. That focus has been justified, but it has also obscured a more fundamental tension building beneath the surface. Systems are no longer constrained solely by how fast they can process data, but by how efficiently they can move it, and that distinction is beginning to reshape the architecture of AI infrastructure in ways that are both subtle and profound.
Dr Julie Sheridan Eng, Chief Technology Officer at Coherent, describes this shift not as an incremental adjustment, but as a structural reorientation of the industry. Speaking at the OFC Conference in Los Angeles, she positions photonics as a technology that has moved from the margins of system design to its centre, not because of a change in its capabilities, but because of a change in what systems now demand from it.
“For decades, photonics has scaled communications, increasing bandwidth density, improving energy efficiency and decreasing costs, largely behind the scenes,” she says. “But today feels different. Optics is no longer in the background. It is front and centre as a key architectural consideration in the AI data centre.”
That shift is not simply about visibility. It reflects a deeper change in how infrastructure behaves, where interconnect is no longer a supporting layer, but a defining component of system performance, one that determines how effectively compute can be utilised at scale.
From connection to computation
Dr Eng frames the current moment through the lens of history, describing a progression of optical innovation shaped by the demands of each era. Early telecommunications networks prioritised reliability and performance, often at the expense of cost and scale, while the rise of enterprise data centres introduced a need for flexibility, interoperability, and rapid deployment. The transition from electrical to optical interconnect in those environments marked a turning point, but it remained fundamentally about connecting systems rather than shaping them.
“In the 80s and 90s, optics transformed long-haul telecommunications, and the most important parameters were performance and reliability,” she says. “Then in the 2000s, the enterprise data centre became large enough that it had to transition from electrical to optical, and that brought in the era of the pluggable transceiver, where flexibility and security of supply became critical.”
The hyperscale era extended those dynamics further, driving rapid increases in bandwidth density and forcing the industry to adopt new materials, integration techniques, and modulation strategies. Yet even as complexity increased, the role of optics remained consistent: it connected compute resources, but it did not define how those resources behaved as a system.
That distinction no longer holds. The AI data centre operates under fundamentally different conditions, where performance is dictated not just by individual devices, but by how effectively thousands of them can operate in parallel. “When you think about it, the AI data centre is now a massively distributed supercomputer, and optics is inside the machine,” Dr Eng says. “That is a very different role, because system performance now depends directly on how efficiently those elements communicate with each other.”
The scale of progress achieved to date provides context for how significant this transition is. Over two decades, the industry has delivered exponential gains in bandwidth density, dramatic improvements in energy efficiency, and sustained reductions in cost per bit, achievements that have underpinned the growth of modern digital infrastructure. “We improved bandwidth density by more than 150 times, energy efficiency by 40 times, and reduced the price per gigabit by 60 times,” she says. “Each of those would have been significant on their own, but what is remarkable is that we achieved all three simultaneously, generation after generation.”
Those gains, however, are no longer sufficient to meet the demands of AI at scale, particularly as the pace of innovation must now accelerate while physical limits become more pronounced.
The widening performance gap
At the heart of the challenge is a divergence between compute demand and the rate at which individual devices can improve. The computational requirements associated with training modern AI models are increasing at a pace that far exceeds the incremental gains delivered by semiconductor scaling, creating a widening gap that cannot be closed through traditional approaches.
“The compute required to train large language models is growing by about four and a half times per year,” Dr Eng says. “But Moore’s Law and Dennard scaling are giving us maybe two times improvement every two years, so there is a gap, and that gap is widening.”
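Those two growth rates diverge quickly. As a purely illustrative calculation using only the figures quoted above, roughly 4.5 times per year for demand against two times every two years for devices, the shortfall compounds year on year:

```python
# Illustrative only: compounds the growth rates quoted above to show how the
# gap between compute demand and single-device improvement widens over time.

demand_growth_per_year = 4.5        # training compute demand, ~4.5x per year
device_growth_per_year = 2 ** 0.5   # ~2x every two years, i.e. ~1.41x per year

for years in range(1, 6):
    demand = demand_growth_per_year ** years
    device = device_growth_per_year ** years
    print(f"after {years} year(s): demand x{demand:,.1f}, "
          f"device x{device:.1f}, shortfall x{demand / device:,.0f}")
```

On those assumptions the shortfall grows past a factor of three hundred within five years.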
The consequence of that gap is a structural shift towards parallelism, where performance is achieved not by making individual devices faster, but by distributing workloads across increasingly large clusters of processors. That approach changes the nature of the problem, moving the bottleneck away from compute and towards interconnect. “When that happens, system scaling shifts to parallelism, and parallel systems depend more on interconnect,” she says. “That interconnect is increasingly optical, because it is the only way to deliver the bandwidth and efficiency required at those scales.”
Dr Eng describes this in terms of three domains that together define the architecture of the AI data centre. The scale-up domain connects processors within a node, enabling them to operate as a single logical unit, while the scale-out domain links those nodes into a distributed system capable of handling large-scale workloads. The scale-across domain extends that connectivity between data centres, enabling geographic distribution of compute.
“The scale-up domain is the networking together of processors to act like a single compute node, the scale-out domain connects those nodes into a massively distributed system, and the scale-across domain connects data centres,” she says. “Across all three, system performance now depends directly on interconnect bandwidth and efficiency.” This reframing is critical, because it positions interconnect not merely as an enabler of compute, but as a determinant of its effectiveness, one that must scale at least as quickly as the workloads it supports.
The limits of flexibility
Within the data centre, this tension becomes most visible in the evolution of pluggable optical transceivers, which have long provided the flexibility and interoperability required for large-scale deployments. Their ability to support multiple technologies, reach distances, and vendor ecosystems has made them indispensable, but that flexibility is increasingly coming into conflict with the need for higher bandwidth density.
“The transceiver offers flexibility because it is a standardised, multi-vendor ecosystem, and it allows architectural decisions to be deferred,” Dr Eng says. “But that flexibility comes with an architectural consequence, because the size of the transceiver limits the bandwidth density.”
As demand for bandwidth continues to increase, the constraints imposed by physical form factors become more pronounced, forcing the industry to pursue two parallel strategies: increasing the data rate per lane, and increasing the number of lanes within each module. Both approaches introduce significant complexity, particularly as they push the limits of existing materials and device technologies.
“The per lane data rate is primarily limited by the laser and modulator technology, and as those limits are reached, we increase the number of lanes,” she says. “But that is not just adding more components, it becomes a complex system integration problem that requires innovation across multiple disciplines.”
The diversity of technologies within the data centre reflects this complexity, with different materials and architectures optimised for specific performance characteristics. “The gallium arsenide VCSEL is the highest volume and lowest energy per bit for short reach, silicon photonics provides a strong platform for integration, and indium phosphide offers the best performance for longer distances,” she says. “These technologies coexist because each solves a different problem.”
That coexistence underscores a broader point: scaling cannot rely on a single dominant approach, but must instead draw on a combination of technologies, each aligned to specific constraints and use cases.
Rearchitecting the system
As the limitations of pluggable architectures become more apparent, attention is shifting towards approaches that bring optics closer to the compute itself. Co-packaged optics represents a fundamental change in how systems are structured, redistributing optical components within the architecture to reduce electrical path lengths and improve efficiency.
“Co-packaged optics is not a new device category; it is an architectural repartitioning of the pluggable transceiver,” Dr Eng says. “By moving the optics closer to the switch or processor, we can reduce power consumption and increase bandwidth density.”
The benefits are clear, but they come with trade-offs that extend beyond technical considerations. “When you move the optical engine inside the system, you reduce flexibility, because the architecture and technology choices are locked in earlier, and serviceability becomes more difficult,” she says. “So it becomes a question of where efficiency outweighs flexibility.”
Different architectural approaches are emerging within this model, reflecting the same diversity seen in earlier generations of optical technology. “In a fast and narrow approach, you serialise signals to higher speeds and reduce the number of lanes, which reduces fibre count but increases demands on high-speed electronics,” she says. “In a slow and wide approach, you keep signals at lower speeds and increase the number of optical engines, which simplifies the electronics but increases packaging complexity.”
Neither approach is inherently superior, and both are likely to coexist as the technology matures. “The optimal choice depends on system constraints such as power, packaging density, and fibre management,” she says. “So we expect to see multiple approaches, just as we do today.”
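The trade-off she outlines is essentially arithmetic: aggregate bandwidth is the product of lane count and per-lane data rate, so the same total can be reached with very different lane counts. A minimal sketch, using an assumed 1.6 Tb/s optical engine and hypothetical lane rates rather than figures from the talk, makes the contrast concrete:

```python
# Hypothetical illustration of "fast and narrow" versus "slow and wide".
# The 1.6 Tb/s target and the per-lane rates are assumptions for this sketch,
# not figures from any specific product or roadmap.

TARGET_GBPS = 1600  # assumed aggregate bandwidth per optical engine, in Gb/s

def lanes_needed(gbps_per_lane: int) -> int:
    """Optical lanes (and, broadly, fibres) required to reach the target."""
    return TARGET_GBPS // gbps_per_lane

fast_and_narrow = lanes_needed(200)  # fewer lanes, harder high-speed electronics
slow_and_wide = lanes_needed(50)     # more lanes, harder packaging and fibre management

print(f"fast and narrow: {fast_and_narrow} lanes at 200 Gb/s")   # 8 lanes
print(f"slow and wide:   {slow_and_wide} lanes at 50 Gb/s")      # 32 lanes
```

Fewer, faster lanes push the burden onto the high-speed electronics; more, slower lanes multiply fibres and optical engines, which is the packaging complexity she refers to.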
This shift is accompanied by broader changes in network architecture, including the re-emergence of optical circuit switching as a means of dynamically reconfiguring connections within the data centre. “These workloads are large and predictable, which makes optical circuit switching useful for reconfiguring network topology,” she says. “It allows operators to allocate resources differently depending on the workload, and to improve utilisation and resilience.”
At the same time, thermal constraints are becoming increasingly important, as the power consumption of modern accelerators pushes the limits of existing cooling technologies. “A five to ten degree difference in temperature can significantly impact performance or reliability,” she says. “That is why new materials such as diamond and silicon carbide are becoming important, because they offer higher thermal conductivity.”
These developments reinforce the need to think of infrastructure as an integrated system, where compute, interconnect, power, and cooling are tightly coupled and must be optimised together.
A system defined by interconnect
The pressures shaping the AI data centre extend beyond its physical boundaries, as workloads increasingly span multiple facilities and require high-capacity, long-distance interconnect. In this context, technologies traditionally associated with telecommunications are becoming integral to AI infrastructure, bringing with them new challenges and constraints.
“These links can range from ten kilometres to over a thousand kilometres, and technologies we previously thought of as telecom are now part of the AI network,” Dr Eng says. “But we are reaching physical limits again, particularly in terms of spectral efficiency.”
As gains from increasing spectral efficiency diminish, scaling shifts towards expanding usable spectrum, increasing spatial parallelism, and integrating systems more tightly. “Each additional improvement in spectral efficiency requires a higher signal to noise ratio, so the returns diminish,” she says. “That means we have to look at expanding spectrum, using more fibre pairs, and improving system integration.”
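The diminishing returns she describes follow from the Shannon limit: for an idealised noise-limited channel, spectral efficiency grows only logarithmically with signal-to-noise ratio, so each additional bit per second per hertz roughly doubles the SNR required. A minimal sketch of that relationship, an idealisation rather than a model of any particular transmission system:

```python
import math

# Shannon limit for an idealised additive-noise channel: SE = log2(1 + SNR),
# so the minimum SNR needed to support a spectral efficiency SE is 2**SE - 1.

for se in range(2, 11, 2):                 # target spectral efficiency, bit/s/Hz
    snr_db = 10 * math.log10(2 ** se - 1)  # minimum SNR in decibels
    print(f"{se:2d} bit/s/Hz -> minimum SNR ~ {snr_db:4.1f} dB")
```

Moving from 4 to 8 bit/s/Hz, for example, demands roughly 12 dB more signal-to-noise ratio.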
The underlying pattern is consistent across all domains: scaling is becoming more difficult, and the solutions require increasingly sophisticated combinations of technologies and architectural approaches. “Scaling demands are intensifying across all domains of the AI network, precisely at a time when physical limits are tightening,” she says. “That means innovation in optics has to accelerate.”
The conclusion is not simply that optics is important, but that it is becoming central to how AI systems are conceived and built. “AI system scaling will depend increasingly on innovation in optics,” Dr Eng says. “And there is no one-size-fits-all solution; the different technologies are tools in an architectural toolbox.”
For an industry that has historically focused on individual components, this represents a shift towards system-level thinking, where performance is defined by the interaction of multiple technologies rather than the advancement of any single one. “We are in a new era for optics, an era in which optics is embedded inside the compute architecture itself,” she says. “And from my perspective, after more than three decades in this field, I cannot think of a more consequential and exciting time to be working in optics.”
That observation carries weight not because it is optimistic, but because it reflects the scale of the challenge ahead. The limits are real, the pace is accelerating, and the margin for error is narrowing. Optics is no longer a supporting layer within AI infrastructure.
It is becoming the foundation on which its future depends.