AI infrastructure in financial services must now deliver at the scale, speed and security that advanced trading strategies demand. Purpose-built GPU platforms are replacing generic cloud solutions as institutions race to turn data into real-time competitive advantage.
Senior executives in financial services are realising that competitive advantage no longer comes from simply deploying AI but from the precision, scale and reliability of the infrastructure that underpins it. As the volume and complexity of data explode, traditional models and legacy systems are quickly falling behind. CoreWeave’s collaboration with global trading firm Jane Street offers a clear example of how purpose-built AI infrastructure is evolving to meet the sector’s exacting demands.
AI in finance demands new levels of flexibility
Trading strategies, fraud detection models, market risk simulations, and personalised wealth management tools are increasingly built and refined with AI at their core. However, as financial institutions expand these capabilities, the limitations of conventional infrastructure become more apparent. For AI to deliver, training times must fall, workloads must scale instantly, and security must be non-negotiable.
“Training, fine-tuning and inference at the level required by modern financial services is only possible when you are connecting thousands of GPUs at once, with low-latency, high-bandwidth networking and petabytes of secure storage,” says John Mancuso, Vice President of Solutions Architecture at CoreWeave. “These are no longer aspirational targets; they are table stakes.”
CoreWeave offers bare-metal infrastructure designed for accelerated computing. Every compute instance is a single tenant. There is no hypervisor layer, no shared virtual machine environment, and, therefore, no risk of virtual machine escape or resource contention. This design inherently improves performance while simultaneously reducing the attack surface.
Security remains central. All data is encrypted at rest, while traffic between nodes is fully isolated. For customers who operate at scale, CoreWeave offers flexible isolation models, ranging from single-tenant storage to entire private data centres. Crucially, financial institutions can audit everything. The infrastructure delivers full observability into logs and metrics at the hardware level, creating a verified and transparent environment for AI workloads.
Interconnect and bandwidth drive model performance
For Jane Street, performance at scale is not a theoretical concern. The firm’s researchers and traders require distributed training capabilities that minimise latency across thousands of GPUs. This is achieved through CoreWeave’s use of NVIDIA Quantum InfiniBand, delivering up to 3.2 terabits per second between GPU nodes. The cluster is constructed with a full fat-tree, non-blocking architecture that avoids congestion even under full load.
“We leverage eight NICs per server with GPUDirect RDMA to streamline data flow,” adds Peter Salanki, Chief Technology Officer at CoreWeave. “That ensures we are keeping CPUs out of the loop and allowing training operations like all-reduce to complete significantly faster.”
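The all-reduce operation Salanki mentions is commonly implemented as a ring: each GPU exchanges fixed-size chunks of its gradient with its neighbour, so every byte traverses the interconnect roughly twice regardless of cluster size, which is why per-node bandwidth dominates training throughput. The following is a minimal pure-Python sketch of that pattern under simplified assumptions (lists standing in for workers, one chunk per worker); production clusters use NCCL over RDMA, not this code:

```python
def ring_all_reduce(grads):
    """In-place ring all-reduce: every worker ends with the element-wise
    sum of all workers' gradient vectors. Assumes len(grads[0]) is a
    multiple of the number of workers, one chunk per worker."""
    n = len(grads)
    chunk = len(grads[0]) // n

    def span(c):
        c %= n
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 ring steps, worker i holds the
    # fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            dst = (i + 1) % n
            for j in span(i - step):
                grads[dst][j] += grads[i][j]

    # Phase 2, all-gather: circulate each finished chunk so every worker
    # ends up with the complete summed vector.
    for step in range(n - 1):
        for i in range(n):
            dst = (i + 1) % n
            for j in span(i + 1 - step):
                grads[dst][j] = grads[i][j]

    return grads
```

Each worker sends and receives only 2(n-1)/n of the vector in total, so the limiting factor is the slowest link in the ring, not the number of GPUs.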
Bandwidth extends beyond compute. Storage is engineered to keep pace with the demands of modern AI. Jane Street primarily works with file storage, and CoreWeave’s deployment of VAST Data’s scalable file system offers consistently high throughput, regardless of simultaneous workloads. The system supports features such as encryption in transit and at rest, cross-cluster replication, and seamless upgrades.
“What matters is that we can deliver file systems with petabyte scale, full tenant isolation, and zero-disruption maintenance,” explains Salanki. “That allows customers to use the same stack in the cloud as they do on-prem and move data between the two without breaking workflows.”
Scaling infrastructure must match the pace of research
AI adoption in finance is often constrained not by ambition but by infrastructure bottlenecks. The inability to access GPUs on demand or replicate training environments across locations can quickly derail experimentation. For Jane Street, infrastructure flexibility has become a foundational requirement.
“Financial institutions typically underestimate just how rapidly AI compute needs can change,” Mancuso says. “It is not about building the biggest system upfront. It is about being able to grow, shift architecture, or relocate workloads across data centres on short notice.”
CoreWeave operates around 300,000 Hopper GPUs across a growing global footprint, with new data centres brought online every month. Customers can align their capacity needs with the company’s deployment schedule, selecting preferred locations, architectures, and tenancy models that best suit their requirements.
“When a customer wants to test a new architecture, we can insert that into the roadmap without disrupting their operations,” Salanki adds. “Because our infrastructure is interconnected via backbone and storage replication, we can burst workloads across geographies and maintain data integrity.”
This flexibility is critical for firms like Jane Street, which require research compute in one region while supporting latency-sensitive trading workloads in another. The ability to test next-generation hardware at the earliest opportunity and then scale that environment globally has helped to drive faster iteration cycles and reduce time-to-market for AI-driven trading strategies.
Resilience comes from engineering discipline, not just scale
Operating at this scale is not simply a matter of adding more hardware; it requires precise orchestration and continuous verification of every node in the fleet. CoreWeave’s lifecycle controllers manage firmware updates, hardware burn-in, and predictive diagnostics for every server. This ensures that GPUs are not only working but performing to a standard that prevents slow nodes from dragging down distributed jobs.
“We have developed our own training simulations to stress-test clusters before they are deployed,” Salanki says. “Even nodes that pass manufacturer tests might show degraded performance in real-world AI workloads. We reject those nodes.”
This rigorous approach extends to operational monitoring. Controllers run inside customer clusters, tracking temperature curves, networking anomalies, and subtle changes in GPU behaviour. If a node is likely to fail, it is automatically removed and replaced. No ticket is raised, and no job is interrupted. “We maintain buffer capacity so that customers do not have to over-provision,” Mancuso explains. “When a node fails, it is rotated out and replaced with minimal disruption. That allows our customers to focus on what matters, building and testing better models.”
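The rotation pattern described above, where a degrading node is swapped for a spare before any job notices, can be sketched in a few lines. The class name, rolling-window size, and health threshold below are illustrative assumptions, not CoreWeave's actual tooling:

```python
# Illustrative sketch: a pool tracks active nodes plus a spare buffer;
# a node whose rolling health score degrades past a threshold is
# automatically swapped for a spare, with no ticket and no job restart.
from collections import deque
from statistics import mean

class NodePool:
    def __init__(self, active, spares, window=5, min_score=0.9):
        self.active = set(active)
        self.spares = deque(spares)          # buffer capacity held in reserve
        self.window = window
        self.min_score = min_score
        self.history = {n: deque(maxlen=window) for n in active}

    def record(self, node, score):
        """Record a health sample (e.g. normalised GPU throughput, 0-1).
        Only judge a node once a full window of samples has accumulated."""
        h = self.history[node]
        h.append(score)
        if len(h) == self.window and mean(h) < self.min_score and self.spares:
            self._rotate(node)

    def _rotate(self, bad):
        # Remove the degraded node and promote a spare in its place.
        spare = self.spares.popleft()
        self.active.discard(bad)
        self.active.add(spare)
        del self.history[bad]
        self.history[spare] = deque(maxlen=self.window)
```

The key design point is the spare buffer: because replacement capacity is already racked and burned in, rotation is a set operation rather than a procurement event.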
Observability unlocks better capacity planning
Transparency is not limited to security and logs. CoreWeave provides customers with extensive observability into cluster performance, workload efficiency, and resource utilisation. For financial firms where margins are defined by milliseconds, the ability to optimise GPU throughput and iterate faster on training pipelines is essential.
“Our observability platform supports capacity planning in a way that static infrastructure never could,” Salanki says. “Customers can understand not only what resources they used, but how efficiently they used them, and feed that data back into their procurement and deployment strategies.”
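That feedback loop, from utilisation data back into procurement, reduces to a simple calculation. A hedged sketch of the idea, where the field names and the 20% growth headroom are illustrative assumptions rather than any real platform's schema:

```python
# Sketch of observability-driven capacity planning: compare GPU-hours
# allocated against GPU-hours actually busy, then size the next
# procurement around real usage plus headroom.

def plan_capacity(jobs, headroom=0.2):
    """jobs: list of dicts with 'gpu_hours_allocated' and 'gpu_hours_busy'."""
    allocated = sum(j["gpu_hours_allocated"] for j in jobs)
    busy = sum(j["gpu_hours_busy"] for j in jobs)
    efficiency = busy / allocated if allocated else 0.0
    # Provision for the hours actually used, plus headroom for growth,
    # rather than blindly renewing the allocated figure.
    suggested = busy * (1 + headroom)
    return {"efficiency": round(efficiency, 3),
            "suggested_gpu_hours": round(suggested, 1)}
```

An efficiency well below 1.0 signals over-provisioning or stalled pipelines; feeding that number back into deployment strategy is the closed loop the article describes.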
The result is a closed-loop optimisation process. Model teams can fine-tune resource allocation, experiment with new architectures, and access new chips as soon as they become available. Infrastructure becomes a strategic enabler of research, not a constraint.
Proximity, support and partnerships matter more than promises
For all the technical sophistication, it is the human element that defines how well AI infrastructure performs in practice. The success of Jane Street’s AI initiatives has not only relied on architecture but also on responsiveness and communication. “When there is a problem, we connect customers directly with the engineers who built the system,” Mancuso continues. “There is no support ticket purgatory. There is a Slack channel. We talk daily. That is how we keep systems running and projects moving.”
The transparency extends to feature roadmaps and global buildouts, enabling customers to plan with confidence around infrastructure changes. More importantly, CoreWeave is not afraid to challenge customers when needed. “If a training pipeline is inefficient or a deployment pattern looks fragile, we will flag it,” Salanki concludes. “That is how you build partnerships that last.”
Financial services firms are no longer asking if they should use AI. They are asking how to build the fastest, most reliable and most secure foundation for it. The answer is not more cloud. It is infrastructure designed explicitly for accelerated computing, operated with transparency, and scaled in lockstep with innovation. That is where the next gains in trading speed, model accuracy and operational efficiency will be won.