As power ceilings close in and AI models grow more complex, the combination of integrated observability and intelligent automation is emerging as the critical enabler for enterprise-scale experimentation and operational agility.
The leap from tailored language models to large-scale, general-purpose systems is not merely a technological transition; it is an infrastructure reckoning. DeepL, a Cologne-based AI company best known for its high-quality translation services, has spent the past eight years building its own vertically integrated language platform, training custom models, acquiring data, developing applications, and deploying at scale.
As the company expanded beyond text translation into voice and writing assistance, the limitations of its existing infrastructure became clear. To unlock the next phase of development, DeepL needed more than larger models. It required a reimagining of the entire foundation upon which those models would run.
“We had many thousands of GPUs at the time, but they were not nearly capable enough to build these much, much larger models,” says Stefan Mesken, VP of Research at DeepL. “The scale of models went up by roughly a factor of 100. So if you want to train them and do this in-house, you need some serious infrastructure to make that happen.”
The result was DeepL Mercury, one of the first DGX A100 SuperPODs to be deployed commercially, brought online in June 2023. Hosted at a modern, liquid-cooled, sustainability-focused facility in Sweden, Mercury has become the cornerstone of DeepL’s transition into LLM-powered services. This move has enabled DeepL to develop and launch new products, most notably DeepL Write Pro, a collaborative writing assistant for enterprise users, and Clarify, a translation feature designed to foster dynamic interaction with AI.
But Mesken is quick to emphasise that the cluster alone was not enough. “It is not just hardware. There is a lot of software and a lot of human expertise,” he explains. “Thousands of language experts around the globe give feedback to our models and really help them align with users’ expectations. If you are training on web data, you get something back that is roughly as good as the data itself. It is possible to move beyond that, but that is where we put most of our effort.”
Designing the AI factory floor
Building this next-generation AI infrastructure has necessitated a reimagining of the data center, not just as a physical space but as a programmable environment. Enterprises such as DeepL need more than raw compute: they need visibility, automation, and orchestration tools to eliminate downtime, accelerate experimentation, and optimise performance in real time. These demands have shaped NVIDIA’s approach to the modern AI factory.
Pradyumna Desale, Product Line Manager at NVIDIA, defines the problem plainly: “There is a skills gap,” he says. “There is a dearth of site reliability engineers across the world. So getting to a faster first time-to-train is what we are trying to address here.”
Central to this is integrated observability, a unified data plane that spans compute, storage, networking, power management, and cooling. NVIDIA Mission Control serves as the aggregation layer for all these systems, turning the data center into a single pane of glass for infrastructure operations.
Desale draws on DeepL’s own experience to illustrate the importance of this. “Stefan told me about a problem they ran into at one point in time; NCCL RDMA was not giving consistent performance across all of the GPUs,” he says. “They built custom tooling to diagnose it. What we are building is intended to give that kind of visibility out of the box.”
The key lies in intelligent automation. Features such as virtual rack management and AI-driven inventory control enable administrators to monitor liquid cooling systems and manage power sequencing across racks and service-specific compute trays with minimal intervention. When integrated with existing building management systems (BMS) and operational technology (OT) systems, these tools bridge the gap between IT and facilities, ensuring uptime without compromising power budgets or thermal integrity. “There is not a power-on button for these racks,” Desale adds. “Sequencing must happen through software, and it has to be aware of data center limitations. We are building hierarchical policies at tray, rack, and site level to support that.”
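In practice, hierarchical power policies can be pictured as nested budgets. The sketch below is purely illustrative rather than Mission Control code: the class names and wattage figures are hypothetical, and it simply models the rule that a tray may only power on while its rack and the site as a whole still have headroom.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical model of hierarchical power policies: each level owns a budget,
# and a tray may only be powered on if every level above it still has headroom.

@dataclass
class Tray:
    name: str
    draw_watts: int
    powered: bool = False

@dataclass
class Rack:
    name: str
    budget_watts: int
    trays: List[Tray] = field(default_factory=list)

    def current_draw(self) -> int:
        return sum(t.draw_watts for t in self.trays if t.powered)

@dataclass
class Site:
    budget_watts: int
    racks: List[Rack] = field(default_factory=list)

    def current_draw(self) -> int:
        return sum(r.current_draw() for r in self.racks)

    def power_on_sequence(self) -> None:
        """Bring trays up one at a time, respecting rack- and site-level budgets."""
        for rack in self.racks:
            for tray in rack.trays:
                fits_rack = rack.current_draw() + tray.draw_watts <= rack.budget_watts
                fits_site = self.current_draw() + tray.draw_watts <= self.budget_watts
                if fits_rack and fits_site:
                    tray.powered = True
                    print(f"powered on {rack.name}/{tray.name}")
                else:
                    print(f"deferred {rack.name}/{tray.name}: budget exceeded")

site = Site(budget_watts=120_000, racks=[
    Rack("rack-01", budget_watts=40_000,
         trays=[Tray(f"tray-{i}", draw_watts=12_000) for i in range(4)]),
])
site.power_on_sequence()
```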
Resilience as a software-defined function
Traditionally, resilience in HPC has meant failover, restarts, and manual triage. In the AI era, where a single job may consume thousands of GPUs for days or weeks, the stakes are significantly higher. A minor hardware fault can translate into massive opportunity costs. For this reason, NVIDIA’s latest capabilities, packaged within Mission Control, recast resilience as a software-defined function. At its core is a three-part engine: faster checkpointing and restart, unified error reporting, and automated anomaly attribution and repair.
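Checkpoint-and-restart itself is a familiar pattern in model training. The sketch below shows the generic idea in PyTorch, periodically saving model and optimiser state so a failed run can resume from the last completed step; the file path and interval are arbitrary choices, and this is not how Mission Control implements its accelerated checkpointing.

```python
import os
import torch
import torch.nn as nn

# Generic checkpoint/restart pattern: periodically persist model and optimiser
# state so a failed job can resume from the last step rather than from zero.
CKPT_PATH = "checkpoint.pt"  # illustrative path

model = nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
start_step = 0

if os.path.exists(CKPT_PATH):
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1  # resume after the last saved step

for step in range(start_step, 1000):
    batch = torch.randn(32, 1024)
    loss = model(batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 100 == 0:  # checkpoint every 100 steps (arbitrary interval)
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CKPT_PATH)
```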
“Automated anomaly attribution is something I am particularly excited about,” Desale continues. “We are collecting telemetry from system logs, out-of-band sensors, PMCs, and libraries. Today, we can detect more than 55 different types of known error signatures. That means our customers benefit from everything we learn across the entire global fleet, not just from their own deployments.”
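Signature-based attribution can be thought of as matching telemetry and log lines against a catalogue of known patterns. The sketch below illustrates the idea only; the regular expressions and labels are invented for the example and are not NVIDIA’s actual signature set.

```python
import re

# Illustrative error-signature matching: each known signature pairs a regex over
# log/telemetry text with an attribution label. The patterns below are made up
# for this sketch, not taken from any real signature catalogue.
KNOWN_SIGNATURES = [
    (re.compile(r"Xid (?:79|94)"),             "gpu_fell_off_bus"),
    (re.compile(r"uncorrectable ECC error"),   "gpu_memory_ecc"),
    (re.compile(r"NVLink.*error", re.I),       "nvlink_fault"),
    (re.compile(r"thermal (?:slowdown|trip)"), "thermal_event"),
]

def attribute(log_lines):
    """Return (label, line) pairs for every line matching a known signature."""
    hits = []
    for line in log_lines:
        for pattern, label in KNOWN_SIGNATURES:
            if pattern.search(line):
                hits.append((label, line))
    return hits

sample = [
    "kernel: NVRM: Xid 79: GPU has fallen off the bus",
    "dcgm: uncorrectable ECC error detected on GPU 3",
]
print(attribute(sample))
```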
When errors are detected, NVIDIA’s autonomous recovery engine initiates background triage. If hardware can be recovered through a restart or firmware update, it is returned to the pool. If not, it is flagged for RMA, with logs automatically uploaded to enterprise support. These features not only reduce downtime but also reduce dependency on highly specialised engineers. In high-density, dark data center environments, this capability could mean the difference between a repair taking days and one taking minutes.
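In outline, that triage flow reduces to a short decision ladder: restart, then firmware update, then RMA. The following sketch uses hypothetical placeholder functions to make the ordering explicit; it is not a real recovery API.

```python
# Simplified sketch of the triage flow described above: try a restart, then a
# firmware update, and flag the node for RMA only if neither recovers it.
# All callables here are hypothetical placeholders, not a real API.

def triage(node, try_restart, try_firmware_update, upload_logs):
    if try_restart(node):
        return "returned_to_pool"
    if try_firmware_update(node):
        return "returned_to_pool"
    upload_logs(node)          # attach diagnostics for enterprise support
    return "flagged_for_rma"
```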
“Administrators can double-click a node on the inventory dashboard and light up a service indicator on the hardware,” Desale adds. “That is how we reduce time to service. And by embedding firmware config checks and updates into the workflow, we make the entire environment more robust.”
Making every watt count
The push toward larger models is bumping up against real-world constraints. Power is becoming the limiting factor for AI expansion, particularly in Europe and other mature markets. This is not a temporary condition; it is a structural shift. Data center operators must optimise not just for performance but for power-aware scheduling at scale.
NVIDIA’s response is workload power profiling, effectively dynamic power policies for enterprise AI. Developers can select power profiles at runtime that balance compute, memory, and I/O needs based on known benchmarks and performance requirements. This allows operators to allocate more power where it is required or reduce draw in less critical areas. “This is like the performance modes you choose on a laptop, but at the data center scale,” Desale says. “It allows energy to be redirected based on workload sensitivity and environmental constraints. It is integrated with schedulers like SLURM and Kubernetes and can adapt in real time.”
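At the level of a single machine, the underlying mechanism resembles GPU power capping. The sketch below uses NVIDIA’s NVML bindings for Python (pynvml) to clamp each GPU to a profile value; the profile names and wattages are invented for illustration, and real workload power profiles are selected from benchmarked defaults rather than hand-set like this.

```python
import pynvml  # NVML bindings (package: nvidia-ml-py)

# Hypothetical power profiles in milliwatts per GPU; the names and values are
# made up for this illustration.
PROFILES = {
    "max_performance": 700_000,
    "balanced":        550_000,
    "efficiency":      450_000,
}

def apply_profile(name: str) -> None:
    """Clamp every GPU's power limit to the profile value (needs admin rights)."""
    target = PROFILES[name]
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
            limit = max(lo, min(hi, target))   # stay within the board's range
            pynvml.nvmlDeviceSetPowerManagementLimit(handle, limit)
    finally:
        pynvml.nvmlShutdown()

apply_profile("efficiency")
```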
Power-aware orchestration also supports tiered job restarts. Not all failures require a complete system reset. By segmenting errors into restart classes, such as process-level, job-level, or node-level, systems can recover more quickly and preserve energy and compute efficiency. Taken together, these developments form a framework for resilient, sustainable, and developer-friendly AI operations. With AI inventory tools, anomaly attribution, workload-aware profiling, and autonomous recovery, the data center becomes more than an environment. It becomes an active collaborator in AI development.
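In code, a tiered restart policy amounts to little more than a mapping from error class to restart scope. The sketch below mirrors the tiers named above (process, job, node); the specific error names and their assignments are illustrative rather than drawn from any real scheduler.

```python
# Illustrative mapping from error class to restart scope. The tiers mirror the
# ones named in the text; the error names themselves are made up.
RESTART_TIER = {
    "worker_crash":     "process",  # restart only the failed worker process
    "nccl_timeout":     "job",      # relaunch the job from its last checkpoint
    "gpu_fell_off_bus": "node",     # drain and cycle the node, then resume
}

def restart_scope(error: str) -> str:
    # Unknown errors fall back to the most conservative tier.
    return RESTART_TIER.get(error, "node")

print(restart_scope("nccl_timeout"))  # -> "job"
```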
Infrastructure as a competitive advantage
For DeepL, the benefits of this infrastructure transformation are already measurable. The launch of LLM-powered products, a significant jump in translation quality, and the shift to collaborative language interfaces all stem from its Mercury cluster and associated tooling.
But Mesken is looking further ahead. “We are expecting delivery of our Blackwell system very soon,” he says. “The question is, of course, why are we doing this? The short answer is that the LLM program worked. But more importantly, we are excited about the opportunities that we do not yet know about. Whenever we extended our capabilities to train more advanced models, product development was accelerated. Now that we have even more compute, our task is to figure out how to harness it, not to beat some artificial benchmark, but to address the real problems our users have today.”
The implications go far beyond translation. For enterprises building in the age of trillion-parameter models, the infrastructure itself becomes a source of strategic differentiation. Resilience, observability, and automation are no longer secondary concerns; they are essential. They are the enablers of velocity.
As Mesken concludes, “We want to enable true human-AI collaboration. That is not just a research goal; it is a product development philosophy. And it demands an infrastructure that keeps pace with ambition.”