When the cloud goes dark


The global AWS outage revealed more than a temporary loss of service. It exposed the fragility of centralised cloud systems and the urgent need for a more intelligent, distributed, and resilient approach to the infrastructure powering the AI economy.

When the world’s biggest cloud service stumbles, the entire digital economy feels the tremor. On the morning of October 20, a routine domain name system (DNS) issue inside Amazon Web Services triggered one of the largest outages in recent memory, taking down household names from Alexa and Snapchat to Lloyds Bank, Slack, Fortnite, and the UK’s tax authority. For hours, much of the internet simply went dark.

It was a stark reminder of how deeply our systems are entangled with a handful of hyperscale providers. Every connected device, app, and AI model that relies on cloud infrastructure depends on these unseen backbones of computation and data. The AWS outage may have lasted less than a day, but its implications will linger far longer, exposing an uncomfortable truth: the cloud has become too centralised, too concentrated, and too critical to fail.

This was not a story of technical misfortune, but one of systemic risk. In an era when businesses are racing to embed AI into every decision, dependency on a single cloud can quickly turn from a convenience into a liability.

The illusion of infinite uptime

The allure of the public cloud lies in its simplicity: elastic capacity, global reach, and the promise of near-perfect reliability. Yet, as the outage demonstrated, no cloud is immune to failure. “The incident is a stark reminder that even the largest and most reliable cloud providers can experience significant outages,” explains Jake Madders, Co-founder and Director at Hyve Managed Hosting. “The key lies in building resilience into your infrastructure from the outset. Diversifying across multiple cloud providers and geographic regions is essential to ensure redundancy and enable seamless failover when disruption occurs.”

That advice cuts to the core of modern digital architecture. AI workloads thrive on availability and continuity. Training models across distributed data sources, coordinating inference across regions, and ensuring uninterrupted access to data are all vital for enterprise operations. A single point of failure, whether caused by a technical fault or regional bottleneck, can cascade across global systems within seconds.

The most effective safeguard is to decouple critical services such as identity management, DNS, and core data layers from any single provider. In practical terms, this means ensuring that the AI pipeline, from data ingestion to model deployment, can survive the failure of any one ecosystem. “Effective mitigation also includes regular backup and recovery testing, automated failover processes, and a well-documented, frequently reviewed incident response plan,” Madders adds. “While large enterprises may have the internal resources to implement and manage these safeguards, smaller businesses may struggle. Partnering with trusted infrastructure specialists can bridge that gap.”
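To make the idea of automated failover concrete, here is a minimal sketch in Python. It is an illustration rather than anything Hyve or AWS prescribes: the provider names, the `is_healthy` callback, and the ordering by preference are all assumptions made for the example. The point is simply that a service asks each provider in turn whether it is healthy and routes to the first one that answers yes, so the loss of any single ecosystem does not stop the pipeline.

```python
from typing import Callable, Iterable, Optional


def pick_healthy_provider(
    providers: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first provider, in preference order, that passes its health check.

    `providers` and `is_healthy` are hypothetical: in practice the check might be
    an HTTP probe or a DNS lookup against an endpoint hosted outside the provider
    being tested.
    """
    for name in providers:
        try:
            if is_healthy(name):
                return name
        except Exception:
            # A health check that raises is treated the same as an unhealthy provider.
            continue
    return None  # Nothing is healthy: escalate to the incident response plan.


# Hypothetical preference order: primary cloud, second cloud, then colocation.
PROVIDERS = ["primary-cloud", "secondary-cloud", "colocation"]


def simulated_check(name: str) -> bool:
    """Simulated outage: the primary cloud is down, everything else is up."""
    return name != "primary-cloud"


active = pick_healthy_provider(PROVIDERS, simulated_check)
print(active)
```

In this simulated outage the call falls through to the second provider. The same function is easy to exercise in the regular failover tests Madders recommends, because the health check is passed in rather than hard-wired.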

Building resilience through diversity

The outage reignited debate over the dominance of Amazon and Microsoft, which together hold most of the UK’s cloud market. Such concentration creates a systemic vulnerability that extends beyond any one failure. “The AWS outage shows that this duopoly and dominance create huge risk,” observes Mike Hoy, CTO at Pulsant. “There is a pressing need for a regulatory framework that encourages diversity in cloud options. UK organisations should be empowered to choose services that best meet their needs, whether domestic or global.”

Recent findings from Pulsant show that 87 per cent of businesses plan to partially or fully repatriate workloads within the next two years, a dramatic rise from just 43 per cent in 2021. The motivation extends beyond resilience. Cost savings, lower latency, and greater control over where and how data is processed are all driving a rebalancing of infrastructure.

Hybrid cloud models, where workloads are distributed across public, private, and colocation environments, are becoming the new normal. However, hybrid resilience introduces its own complexity. “One platform might back up every five minutes, another every four hours,” says Hoy. “Without a unified recovery strategy, the slowest system sets the pace. Consistency is the biggest challenge in hybrid recovery. Disparate recovery points across platforms can undermine the entire plan.”
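Hoy's point that the slowest system sets the pace can be shown with a few lines of arithmetic. The sketch below is a simplification under stated assumptions: the platform names and timestamps are invented, and it treats the most recent moment at which every platform has a backup as the only point the whole estate can be restored to consistently. Real recovery planning also has to account for replication lag and application-level consistency.

```python
from datetime import datetime, timedelta

# Hypothetical backup schedules, using the intervals quoted above.
backup_intervals = {
    "platform_a": timedelta(minutes=5),
    "platform_b": timedelta(hours=4),
}

# Hypothetical timestamps of each platform's most recent successful backup.
last_backups = {
    "platform_a": datetime(2025, 10, 20, 11, 55),
    "platform_b": datetime(2025, 10, 20, 9, 0),
}


def worst_case_data_loss(intervals: dict[str, timedelta]) -> timedelta:
    """The longest backup interval bounds how much data the estate can lose."""
    return max(intervals.values())


def slowest_platform(intervals: dict[str, timedelta]) -> str:
    """Name of the platform whose schedule sets the pace for the whole plan."""
    return max(intervals, key=lambda name: intervals[name])


def consistent_recovery_point(backups: dict[str, datetime]) -> datetime:
    """The latest moment for which every platform has a backup.

    This is the oldest of the 'latest backup' timestamps.
    """
    return min(backups.values())


print(slowest_platform(backup_intervals))
print(worst_case_data_loss(backup_intervals))
print(consistent_recovery_point(last_backups))
```

With these example figures, the platform backing up every five minutes gains nothing for the estate as a whole: a consistent restore still falls back to the four-hourly platform's last backup.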

True resilience demands orchestration. AI-driven observability tools, for instance, can help synchronise multi-cloud environments, automatically rebalancing workloads and maintaining data integrity across regions. This blend of human oversight and machine intelligence will become central to managing future cloud ecosystems.

When outages become opportunities for attack

Beyond the operational and economic disruption, the AWS outage also opened another front: cybersecurity. “Cybercriminals and hackers can easily take advantage of these types of outages to deploy an array of social engineering attacks,” warns Stefanie Schappert, Senior Journalist at Cybernews. “Hackers who rely on mass urgency and panic will see this as an opportunity to take advantage of people’s heightened emotions with phishing emails offering to ‘fix’ the issue and get you back online.”

Periods of downtime are fertile ground for exploitation. Users desperate to restore access to cloud applications can be lured into fake login pages, malicious downloads, or fraudulent password resets. AI-enabled phishing campaigns can now generate personalised lures at scale, using natural language to impersonate IT teams or service providers.

The risk is compounded by automation. As AI agents begin to perform system administration tasks, their own permissions and actions could be manipulated through cleverly crafted prompts or data injections. For organisations dependent on AI-driven operations, resilience must therefore extend beyond infrastructure to include secure behaviour and governance.

Dr Martin Kraemer, CISO Advisor at KnowBe4, argues that the human-AI interaction layer is now a critical point of defence. “While AI agents offer efficiency gains, the inherent security and privacy challenges make rugged training for both humans and AI a necessity,” he explains. “Establishing strong AI and security governance, promoting literacy, and fostering secure behaviours are key to mitigating the cascading effects of malicious nodes in human-AI networks.”

A blueprint for AI-era resilience

The AWS outage will not be the last. As cloud platforms expand to accommodate increasingly data-hungry AI systems, the probability and impact of disruption will grow. The scale of computational demand, measured in megawatts of power and petabytes of data, will test every layer of digital infrastructure, from cooling and energy supply to software interoperability.

Resilience in this new era is not merely about uptime. It is about designing infrastructure that mirrors the adaptive qualities of AI itself: distributed, self-healing, and context-aware. This means embedding intelligence across the stack, from predictive monitoring of hardware to automated decision-making for workload migration. It also means rethinking governance, ensuring that transparency and competition underpin how cloud services are bought, integrated, and regulated.

For the UK, this is both a challenge and an opportunity. A more diverse cloud ecosystem could stimulate domestic innovation, encourage regional data sovereignty, and create space for new entrants that specialise in high-performance, AI-optimised workloads. But diversity must be matched with interoperability. The future cloud cannot simply be a collection of isolated silos; it must be a federation of systems capable of learning and responding as one.

The outage of October 2025 was a warning that resilience cannot be outsourced. The businesses that thrive in the coming AI decade will not be those that assume continuity but those that engineer for disruption. Whether through multi-cloud architectures, intelligent recovery planning, or secure AI governance, the message is the same: in an intelligent world, survival belongs to those who design for failure.
