The AWS Outage and the Return of the Single Point of Failure

When the internet was built, the promise was decentralisation

The internet was conceived as a distributed network of networks, with no single node, no single path and no central fail-switch. In the original design of TCP/IP, resilience through decentralisation was the defining principle. If a route failed, traffic could find another path. The loss of one link didn’t bring everything down.
That design carries a lesson for today’s cloud-era architecture. When we push all of our applications, data and services through a single provider, region or platform, we are re-creating the very fragility the internet was built to avoid. We are turning distributed systems into centralised dependencies, only this time they are hidden behind layers of convenience and abstraction.
It has been fifty-five years since ARPANET first went live. What began as an experiment in resilience and autonomy has evolved into an ecosystem dominated by a handful of global platforms. In some ways, we have come full circle. The technology has advanced beyond imagination, yet the decentralisation principle that made the internet survive in the first place has quietly been abandoned.
AWS outage shows single-platform concentration risk
On 20 October 2025 the AWS us-east-1 region suffered an “operational issue” that caused elevated error rates and latency across multiple AWS services. (The Verge)
The impacts included:
- Streaming and gaming platforms such as Fortnite, social apps like Snapchat, and cloud-based assistants like Alexa. (The Verge)
- Banking services: Lloyds Bank and other UK financial institutions faced login and access disruptions. (Financial Times)
- Smart-home devices (e.g., Ring doorbells) and other IoT systems were also affected. (The Guardian)
In short, one region’s problem at AWS rippled across many platforms, many verticals and many countries. That leads to the central message: we must never assume that “public cloud = bullet-proof”.
Why deploy to one of the hyper-scalers?
Over the past decade, AWS has become the default answer to “where should we deploy?” It’s safe, it’s modern, it looks good on your CV, and most importantly, everyone else is doing it.
Many corporate decisions to “go all-in on the cloud” weren’t born purely from technical reasoning; many were career decisions wrapped in strategy decks. I have personally seen the justification for a major bank’s move from on-premises to the cloud, and it was only a short paragraph! Public cloud became synonymous with innovation. Multi-cloud strategies were dismissed as “too complex”, “too expensive” or “anti-agile”, and nobody questioned them.
But no matter how shiny these platforms are, they remain software and hardware built and run by people. People make mistakes. People deploy bad code. Hardware fails. Networks partition.
And when one service (say, DynamoDB or S3) underpins dozens of higher-level AWS services, a fault in one component can cascade into global failure.
Public cloud is neither invincible nor infinitely resilient
AWS is an extraordinary platform, and we absolutely advocate using it where it makes sense. But at the end of the day, it is still a collection of hardware, networks and software run by teams of people. It is not magic, and it is not immune to failure.
The risks are structural. Every cloud service is built on other cloud services, creating deep layers of dependency. A managed database might rely on storage, networking, load balancers and the control plane beneath it. When one of those layers falters, the effect can ripple through everything above it.
There is also the regional risk. Many companies concentrate most of their infrastructure in a single region such as us-east-1. They have nowhere to go when that region has a problem.
Vendor concentration adds another layer of fragility. When too many business-critical workloads are tied to one provider, that provider’s problem becomes your outage.
And finally, there is the weight of complexity itself. Each additional abstraction or managed service introduces another place where things can go wrong, often in ways that are hard to see until it is too late. What looks efficient on the surface can, in practice, behave like a tightly coupled system where one small fault triggers a much larger cascade.
Some historical examples

- In December 2021, AWS suffered a large outage in us-east-1 (Dec 7 & 10) that lasted more than 8 hours and impacted consumer appliances, business services and cloud clients globally. (ThousandEyes)
- In February 2017, AWS’s S3 service in the Northern Virginia (us-east-1) region experienced a major outage. (Amazon Web Services)
- More broadly, downtime costs are eye-watering: the ITIC estimates that 90% of enterprises face costs exceeding US$300k per hour of downtime, and 41% report costs of US$1 million to over US$5 million per hour. (AWS Documentation)
- Another study by Gartner (via StatusCake) put the average cost of website downtime at US$5,600 per minute (≈ US$336k per hour) in 2014. (StatusCake)
The Lemming Effect: herd behaviour at scale
Another thing worth mentioning here, and one that many decision makers are unaware of, is what I call the Lemming Effect. When everyone builds on the same platform, we create an ecosystem that behaves like a colony of lemmings all charging toward the same cliff.
The cloud world has followed a pattern of herd adoption:
- Start-ups move to AWS because investors expect it.
- Enterprises follow because their competitors have.
- Government, healthcare, and finance join in because “cloud first” becomes policy.
When an outage hits, everyone scrambles in the same direction again: restarting instances, failing over within the same region, hammering status dashboards and APIs already under stress.
The result is a secondary wave of self-inflicted load. Recovery systems collapse under the surge. Even customers who weren’t directly affected feel the tremors because the control planes get flooded with retries and failovers.
This is the Lemming Effect in modern infrastructure: systemic synchronisation of panic.
In a distributed world, independent systems would fail and recover independently. In our current monoculture, everything moves up and down together, amplifying the damage.
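Part of that secondary surge is avoidable at the client side. Below is a minimal sketch, in Python, of retrying with exponential backoff and full jitter so that your own clients do not join the synchronised stampede. The `operation` callable and the commented-out usage line are hypothetical; this is an illustration of the idea, not a drop-in library.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a failing call with exponential backoff and full jitter,
    so thousands of clients do not hammer a recovering service at
    exactly the same moment."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Full jitter: sleep a random amount between 0 and the capped
            # exponential delay, de-synchronising the retry herd.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# Hypothetical usage:
# call_with_backoff(lambda: client.get_item(TableName="orders", Key={...}))
```

The point is not the specific numbers but the shape: bounded attempts, growing delays and randomisation, so recovery traffic spreads out instead of arriving as one wave.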
The cost to enterprises
When an outage hits a cloud provider (or your dependency on one), the business costs go far beyond just “system offline”. Consider:
- Revenue loss: e-commerce, consumer apps and SaaS platforms may lose transactions, subscriptions and ad revenue during downtime.
- Customer trust & churn: Users expect reliable service. A major outage undermines trust. Some users might switch to competitors.
- Operational cost: Recovery, forensic analysis, communication, remediation. Internal teams get pulled off roadmap work.
- Regulatory/compliance risk: For sectors like finance, healthcare, government, downtime might violate SLAs or compliance obligations.
- Brand damage: Public perception matters. “If our service is down, why should we trust you?”
- Hidden dependencies: Many firms depend on upstream services; one failure can cascade through an ecosystem of micro-services, third-party APIs and integrations.
- Decision accountability: When management mandates “single cloud provider” and a failure hits, there may be limited accountability for the choice of architecture, yet customers and users bear the brunt.
Why we’ve long championed multi-cloud (or hybrid/distributed) approaches
If the internet was built on the idea of decentralisation, then our cloud strategies should reflect the same principle. Real resilience comes from choice and from avoiding dependency on any single provider. It means having the freedom to move, reroute or recover when something goes wrong. Using multiple clouds, regions or a mix of on-premise and edge systems gives that flexibility and prevents one failure from bringing down everything you have built.
Managing a multi-cloud or hybrid environment is not easy. It takes planning, tooling and discipline. Yet the additional effort is minor compared with the cost of complete dependency. When a single region failure can stop your business, the debate about complexity quickly changes tone.
My observation from working with many enterprises is that the problem is cultural and rarely technical. Somewhere along the way, convenience became mistaken for reliability. Many teams have come to treat cloud platforms as if their scale alone guaranteed resilience. That belief is misplaced. Public clouds are powerful, but they are still vast, human-built systems with hidden dependencies and potential single points of failure. The right approach is not to avoid them, but to design with failure in mind and to assume it will happen.
Why this matters for mission-critical systems
When we talk about sectors such as government, financial services and healthcare, the stakes are not theoretical. These are the systems that people rely on every day. In healthcare, uptime can be a matter of life and death. In finance, people depend on uninterrupted access to their money, payments and savings. In government, citizens expect essential public services to remain available, regardless of what happens behind the scenes.
The problem is that many organisations in these sectors have come to equate the public cloud with high availability. They assume that a single cloud provider can deliver all the redundancy they need. That assumption creates systemic risk. When one of the major cloud providers fails, the effects spread quickly through the ecosystem, taking down dependent systems, partner services and customer applications. What begins as a single technical incident can escalate into a nationwide disruption simply because too many critical systems rely on the same platform.
So what should organisations and architects do? A checklist
Here are some actionable take-aways:
- Map your dependencies: know which services (internal & external) depend on provider X, region Y.
- Architect for failure: assume an availability zone or region may fail and plan how your system degrades gracefully.
- Multi-region + multi-cloud: where feasible, spread critical loads across more than one region/provider (see the failover sketch after this list).
- Chaos engineering: practise controlled failure injection to ensure your system behaves under provider/region failure; AWS themselves reference this practice (see the test sketch after this list). (AWS Documentation)
- Service-level agreements (SLAs): understand your cloud provider’s SLA and map it against your own business risk (cost of downtime, customer impact).
- Organisational buy-in: the decision to concentrate on one provider often comes from career/management incentives (“we have to pick AWS because everyone else does”). Challenge those assumptions with data and risk modelling.
- Runbooks & communication: have processes ready for when the provider has issues; your customers/users will expect transparency.
- Drills and recovery plans: know how you'll switch regions/providers or gracefully degrade if needed, and practise the switch.
- Cost vs risk balance: yes, multi-cloud and multi-region cost more in complexity and time, but the cost of failure (revenue, brand, trust) may be far higher.
- Review architectural decisions critically: are you using a cloud provider conveniently, or are you implicitly locked in? Are you aware of hidden single points of failure?
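To make the “architect for failure” and “multi-region” items concrete, here is a minimal failover sketch in Python. The endpoint URLs and the /health path are hypothetical; a real setup would also need data replication, DNS or load-balancer integration and careful timeout tuning, so treat this as an illustration of the decision logic only.

```python
import urllib.request

# Hypothetical endpoints: a primary region and a secondary region.
ENDPOINTS = [
    "https://eu-west-1.api.example.com",
    "https://eu-central-1.api.example.com",
]

def healthy(base_url, timeout=2):
    """Return True if the endpoint answers its health check in time."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def pick_endpoint():
    """Prefer the first healthy region; return None to signal degraded mode."""
    for base_url in ENDPOINTS:
        if healthy(base_url):
            return base_url
    return None  # caller should serve cached or read-only content instead

endpoint = pick_endpoint()
if endpoint is None:
    print("All regions unhealthy: serving cached, read-only responses")
else:
    print(f"Routing traffic to {endpoint}")
```

The design choice worth noting is the explicit degraded mode: when no region is healthy, the system still has a defined, reduced behaviour rather than an undefined failure.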
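And for the chaos-engineering item, a toy illustration of the mindset rather than a real fault-injection platform (dedicated tooling such as AWS Fault Injection Service does this properly): a test that deliberately makes a hypothetical storage dependency fail and asserts that the application degrades instead of crashing. All function and data names here are invented for the example.

```python
import unittest
from unittest.mock import patch

# Hypothetical application code: fetch a user profile, falling back to a
# cached (possibly stale) copy when the primary store is unavailable.
CACHE = {"alice": {"name": "Alice", "stale": True}}

def load_from_primary_store(user_id):
    raise NotImplementedError("stand-in for a real storage client call")

def get_profile(user_id):
    try:
        return load_from_primary_store(user_id)
    except Exception:
        return CACHE.get(user_id)  # degrade to cached data

class StorageOutageTest(unittest.TestCase):
    def test_degrades_when_primary_store_fails(self):
        # Inject a failure into the storage dependency and check that the
        # application still returns something useful.
        with patch(f"{__name__}.load_from_primary_store",
                   side_effect=TimeoutError("simulated outage")):
            profile = get_profile("alice")
        self.assertIsNotNone(profile)
        self.assertTrue(profile["stale"])

if __name__ == "__main__":
    unittest.main()
```

Run regularly, tests like this turn “we assume it degrades gracefully” into something you can actually verify before the provider proves it for you.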
Conclusion
Today’s AWS outage is a wake-up call. It reminds us that centralised dependency on a single cloud provider is a real risk. Just because the cloud is convenient, ubiquitous and (in many ways) robust does not mean it is invincible.
The original internet’s design principles of distribution, redundancy and resilience are still relevant. If anything, in the cloud era we must apply them more consciously: a single provider failure can ripple through a vast set of businesses, apps and consumers.
Taking the easier path (one provider, one region) may feel efficient until the provider/infrastructure has an issue. Then we face the full cost: operational, financial, reputational, regulatory.
If you’re responsible for architecture, DevOps, reliability or business continuity, today is the opportunity to revisit assumptions, re-map your dependencies and ensure you’re not locked into a single failure domain. Because your users, your business and your reputation cannot afford the assumption that “it will never go down”.