Building scalable AI for control systems demands more than models; it demands infrastructure

Artificial intelligence has the potential to revolutionise control systems, but real transformation demands more than clever algorithms. It demands infrastructure capable of training, deploying, and validating AI at scale with the speed, reproducibility, and rigour required for mission-critical operations.

AI promises to automate complex control tasks, optimise performance, and even introduce entirely new capabilities. Yet the real challenge is not just building smarter models, but building models that can be trusted, tested, and deployed across a wide range of scenarios without prohibitive time or cost.

That was the overriding theme of the NVIDIA GTC 2025 presentation given by Jason Beck, Fellow, and Trevor McCants, Senior AI Research Engineer, both from Lockheed Martin. Demonstrating the development of an advanced control system using Astris AI Factory’s tech stack, they made it clear that scalable AI must be designed with enterprise-grade infrastructure at its foundation.

“One of the exciting things about AI is its ability to take products to the next level,” Beck explains. “We can automate complex tasks, optimise system performance and even create new functionality that we have not been able to before. These capabilities are allowing us to develop our products faster than ever before with AI job aids, delivering better functionality, greater capability for our customers, and making solutions more affordable.”

At the heart of this effort is Astris AI Factory, a commercial MLOps platform engineered for enterprise-scale AI development. Its flexibility, security, and scalability underpin the entire development lifecycle, from model creation to deployment within highly secure, closed networks.

The perfect catch and the physics of trust

To illustrate the power of their approach, Beck and McCants introduced an example problem: teaching a smart football to achieve the perfect catch. This was not simply a theoretical exercise. It provided a real-world demonstration of how Astris AI Factory infrastructure enables robust AI development for complex, dynamic control systems.

The objective was deceptively simple. The AI agent had to control the flight of a robotic football such that it landed within a one-foot radius of a moving player, with maximum retained velocity to reduce interception risk. Inputs included the football’s velocity, flight path, and position, as well as the player’s own dynamic position and speed. The system outputs were acceleration commands, up or down, dynamically adjusting the ball’s angle of attack.

Achieving this required far more than traditional control engineering. Reinforcement learning was used to optimise the system’s decision-making, underpinned by a carefully designed reward structure that balanced proximity to the target, energy conservation, and overall velocity preservation. The control law was not programmed directly; it was learned through iterative interaction between agent and environment.

“We distribute the player’s initial downfield position at 50 yards, plus or minus a standard deviation of seven yards, and their velocity centred at zero with a standard deviation of seven miles per hour,” McCants explains. “The agent must navigate this uncertainty, and through reinforcement learning, discover the control policy that maximises reward across this space.”
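The environment itself is proprietary, but a minimal Gymnasium-style sketch gives a feel for the shape of the problem: the observation and action spaces, the reward terms balancing proximity, energy, and retained velocity, and the initial-condition distributions McCants describes. The class name, dynamics, and reward weights below are illustrative assumptions, not ARISE code.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

# Hypothetical, simplified stand-in for the smart-football task described above.
# Names, dynamics, and reward weights are illustrative assumptions, not ARISE code.
class SmartFootballEnv(gym.Env):
    YARDS_TO_M = 0.9144
    MPH_TO_MPS = 0.44704

    def __init__(self):
        # Observations: ball position (x, z), ball velocity (vx, vz),
        # plus the player's downfield position and velocity.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(6,), dtype=np.float32)
        # Action: a single acceleration command (up or down) adjusting angle of attack.
        self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.dt = 0.05

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # Player initial downfield position: 50 yards, standard deviation 7 yards.
        self.player_x = self.np_random.normal(50.0, 7.0) * self.YARDS_TO_M
        # Player velocity: centred at zero, standard deviation 7 mph.
        self.player_v = self.np_random.normal(0.0, 7.0) * self.MPH_TO_MPS
        # Ball launched from the origin with an assumed initial speed and loft angle.
        self.ball = np.array([0.0, 2.0, 25.0, 8.0])  # x, z, vx, vz
        return self._obs(), {}

    def step(self, action):
        # Simplified point-mass dynamics: the action nudges vertical acceleration.
        ax, az = 0.0, -9.81 + 5.0 * float(action[0])
        self.ball[2:] += np.array([ax, az]) * self.dt
        self.ball[:2] += self.ball[2:] * self.dt
        self.player_x += self.player_v * self.dt

        terminated = self.ball[1] <= 0.0  # the ball reaches the ground
        miss = abs(self.ball[0] - self.player_x)
        speed = float(np.linalg.norm(self.ball[2:]))
        # Reward balances proximity to the player, velocity retention,
        # and a small penalty on control effort (energy conservation).
        reward = -0.01 * float(action[0]) ** 2
        if terminated:
            reward += -miss + 0.1 * speed + (10.0 if miss < 0.3048 else 0.0)  # one-foot radius
        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        return np.array([*self.ball, self.player_x, self.player_v], dtype=np.float32)
```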

Building trust from simulation to deployment

The football challenge was executed inside ARISE, Lockheed Martin’s simulation framework and a cornerstone of its strategy. ARISE provides a common library of validated physics models shared across programmes, creating a rare level of fidelity and trust in simulation-based development.

“The ARISE simulation framework is built around shared, validated physics models, supporting everything from mass properties to GPS, attitude control and aerodynamics,” Beck outlines. “Hardware tests on one programme strengthen the confidence in simulations across others, dramatically improving fidelity and speeding development cycles.”

Critically, ARISE is fully integrated with Astris AI Factory’s infrastructure. Workspaces provide containerised environments for rapid simulation development. Jobs interfaces enable large-scale distributed testing, running thousands or even millions of simulation episodes in parallel. Integration with Farama Gymnasium allows seamless Python-based reinforcement learning development, despite ARISE’s C++ roots, ensuring high performance without sacrificing flexibility.
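The actual Jobs interface is not public, but Gymnasium’s standard vector API conveys how episodes can be fanned out across parallel workers. The sketch below evaluates a policy over many episodes using the hypothetical SmartFootballEnv from earlier; it is a single-node stand-in, not the distributed Jobs mechanism itself.

```python
import numpy as np
import gymnasium as gym

# Run many episodes across parallel worker processes using the standard
# gymnasium vector API, as a stand-in for large-scale distributed testing.
# SmartFootballEnv is the hypothetical environment sketched earlier.
def evaluate(policy, n_envs: int = 64, episodes_per_env: int = 100):
    envs = gym.vector.AsyncVectorEnv([lambda: SmartFootballEnv() for _ in range(n_envs)])
    obs, _ = envs.reset(seed=0)
    returns, done_count = [], 0
    episode_return = np.zeros(n_envs)

    while done_count < n_envs * episodes_per_env:
        actions = policy(obs)  # shape (n_envs, 1)
        obs, reward, terminated, truncated, _ = envs.step(actions)
        episode_return += reward
        for i, done in enumerate(terminated | truncated):
            if done:
                returns.append(episode_return[i])
                episode_return[i] = 0.0
                done_count += 1
    envs.close()
    return np.array(returns)

# Example: a trivial "do nothing" policy, just to exercise the loop.
mean_return = evaluate(lambda obs: np.zeros((obs.shape[0], 1), dtype=np.float32)).mean()
```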

ARISE Optimus, a modular Flyte-based pipeline for reinforcement learning, provided the engine behind the AI agent’s training. The setup, training, testing, and deployment phases were all defined as independent, composable tasks, enabling reproducibility, modularity, and rapid scaling across infrastructure environments from laptops to HPC clusters.

“Optimus lets us think about the problem and not the infrastructure,” McCants explains. “You can develop locally, then scale up when needed, maintaining the same code, the same workflow, and the same reliability.”
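Optimus itself is not public, but the composable-task pattern it follows is plain Flyte. A minimal sketch, with hypothetical task bodies, paths, and thresholds, might look like this:

```python
from flytekit import task, workflow

# A minimal sketch of a Flyte-style RL pipeline: setup, train, test, and deploy
# as independent, composable tasks. The bodies and values below are hypothetical
# placeholders, not the actual ARISE Optimus implementation.

@task
def setup(config_path: str) -> str:
    # Resolve and validate the training configuration; return its resolved path.
    return config_path

@task
def train(config: str) -> str:
    # Run reinforcement learning training and return a checkpoint location.
    return "s3://models/smart-football/checkpoint-latest"

@task
def test(checkpoint: str, episodes: int) -> float:
    # Evaluate the trained policy across a large batch of simulation episodes.
    return 0.97  # e.g. probability of catch across the test envelope

@task
def deploy(checkpoint: str, score: float) -> str:
    # Package and register the model only if it clears the acceptance threshold.
    return "registered" if score > 0.95 else "rejected"

@workflow
def rl_pipeline(config_path: str = "configs/football.yaml") -> str:
    config = setup(config_path=config_path)
    checkpoint = train(config=config)
    score = test(checkpoint=checkpoint, episodes=10_000)
    return deploy(checkpoint=checkpoint, score=score)
```

Because each step is a typed task, the same workflow can run locally during development and on a cluster at scale, which is the property McCants highlights.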

Beyond training to validation at scale

Training an AI agent is only the beginning. Demonstrating performance across the full operating envelope is critical to real-world deployment. Here, ARISE Historian, an episodic time series database, becomes vital.

Test data from thousands of simulation runs were ingested, versioned, and stored at petabyte scale, enabling both rapid analysis and long-term traceability. Historian’s design allowed both raw data and metadata to be seamlessly linked, ensuring that each run could be analysed not just by outcomes but by its context, parameters, and environmental conditions.
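Historian’s API is not public, but the pattern it embodies, episodic time series kept linked to versioned metadata, can be sketched with ordinary tooling. The paths and field names below are a hypothetical illustration, not Historian itself.

```python
import json
import uuid
from pathlib import Path
import pandas as pd

# Hypothetical sketch of episodic storage: each run's time series is written
# alongside a metadata record, so outcomes stay linked to their context,
# parameters, and simulation version. Generic tooling, not the Historian API.
def store_episode(root: Path, telemetry: pd.DataFrame, params: dict, sim_version: str) -> str:
    run_id = str(uuid.uuid4())
    run_dir = root / run_id
    run_dir.mkdir(parents=True)

    # Raw time series: ball and player states sampled each simulation step.
    telemetry.to_parquet(run_dir / "telemetry.parquet")

    # Metadata: initial conditions, environment parameters, simulation version.
    (run_dir / "metadata.json").write_text(json.dumps({
        "run_id": run_id,
        "sim_version": sim_version,
        "parameters": params,
    }, indent=2))
    return run_id

# Example usage with a toy two-row telemetry frame.
df = pd.DataFrame({"t": [0.0, 0.05], "ball_x": [0.0, 1.2], "player_x": [45.7, 45.8]})
store_episode(Path("runs"), df, {"player_x0_yd": 50.0, "player_v0_mph": -3.2}, "arise-sim-1.0")
```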

Visualisation and interactive analysis were then performed using ARISE Insight, a server-side application tightly coupled to the database. Executives reviewing the system could immediately explore trajectory plots, velocity profiles, and catch success rates across thousands of runs, identifying failure modes, performance boundaries, and unexpected behaviours.

“For almost all player velocities where the player is running towards us, we can successfully throw the football to them,” McCants explains. “However, when the player is running away at high speed and is further downfield, performance begins to degrade. We can clearly see these regions in the probability of catch plots, guiding future improvements.”
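The kind of plot McCants describes can be reproduced in principle by binning episode outcomes over initial player position and velocity. The sketch below assumes a results DataFrame with hypothetical column names; it is not ARISE Insight, only an illustration of the underlying aggregation.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sketch of a probability-of-catch map over the test envelope. Column names
# (player_x0_yd, player_v0_mph, caught) are hypothetical assumptions.
def plot_probability_of_catch(results: pd.DataFrame):
    x_bins = np.arange(30, 75, 5)    # initial downfield position, yards
    v_bins = np.arange(-21, 24, 3)   # initial player velocity, mph
    results = results.assign(
        x_bin=pd.cut(results["player_x0_yd"], x_bins),
        v_bin=pd.cut(results["player_v0_mph"], v_bins),
    )
    p_catch = results.groupby(["v_bin", "x_bin"], observed=False)["caught"].mean().unstack()

    fig, ax = plt.subplots()
    im = ax.pcolormesh(x_bins, v_bins, p_catch.to_numpy(), vmin=0.0, vmax=1.0)
    ax.set_xlabel("Initial player position (yards downfield)")
    ax.set_ylabel("Initial player velocity (mph, positive running away)")
    fig.colorbar(im, ax=ax, label="Probability of catch")
    return fig
```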

Automating anomaly detection and risk discovery

Even with large-scale validation, subtle failure modes can escape manual review. To address this, the team deployed ARISE ML Analysis Pipelines, automating the detection of anomalies across massive datasets.

Data processing, hyperparameter tuning, distributed training of anomaly detection models, inference, and model scoring were all orchestrated within the Astris AI Factory infrastructure. Detected anomalies, such as unexpected loss of energy or abnormal trajectories, were automatically fed back into Historian and visualised within Insight.
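What the pipelines do at scale can be sketched on a single node with a standard detector. The IsolationForest and the per-episode feature names below are illustrative assumptions, not the models used in the ARISE ML Analysis Pipelines.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Sketch of anomaly detection over per-episode summary features. Feature names
# and the choice of IsolationForest are illustrative assumptions only.
def flag_anomalous_runs(features: pd.DataFrame, contamination: float = 0.01) -> pd.DataFrame:
    cols = ["energy_loss", "peak_oscillation", "miss_distance_ft", "final_speed_mph"]
    model = IsolationForest(contamination=contamination, random_state=0)
    labels = model.fit_predict(features[cols])  # -1 = anomaly, 1 = normal
    flagged = features.loc[labels == -1].copy()
    flagged["anomaly_score"] = model.decision_function(features.loc[labels == -1, cols])
    # In the described workflow, flagged runs would be written back to Historian
    # and surfaced in Insight for review.
    return flagged.sort_values("anomaly_score")
```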

“This approach lets us uncover hidden failure modes that might otherwise be missed,” Beck explains. “It is not just about finding catastrophic failures, but subtle patterns, oscillations, or rare conditions that could undermine performance or safety in operational use.”

By building anomaly detection into the workflow, risks are identified early, models can be refined proactively, and the confidence in deployed AI control systems can be significantly strengthened.

Infrastructure is the enabler of trustworthy AI

The demonstration with the smart football was not simply about showing a clever application of reinforcement learning. It was a statement about the importance of infrastructure in AI development.

Every element, from the common simulation framework to the scalable reinforcement learning pipelines, petabyte-scale data storage, automated anomaly detection, and interactive visualisation, was critical to moving from experimental curiosity to deployable, trustworthy AI control systems.

Without the right infrastructure, AI remains a set of promising algorithms in search of a delivery mechanism. With it, AI becomes an integrated, verifiable, and scalable part of the product lifecycle.

The broader lesson for enterprises is clear. Building AI to automate complex control systems is not just about better models. It is about designing better workflows, better data pipelines, better simulation environments, and better validation strategies. Only by doing so can AI move beyond isolated pilots and become a transformative force across industry.

As Beck concludes, “I hope we have demonstrated how AI, supported by the right infrastructure, can deliver advanced control systems with the speed, scale, and robustness that modern challenges demand.”
