AWS cloud outage reveals vendor concentration risk

Summary
On 20 Oct 2025, Amazon Web Services (AWS) suffered a major outage centred on its US‑EAST‑1 region in Northern Virginia. A DNS resolution failure affecting DynamoDB endpoints cascaded into failures of IAM, EC2 instance launches and numerous other services, causing roughly nine hours of disruption followed by several hours of residual backlogs.
The outage affected thousands of services globally (Snapchat, Ring, Robinhood, McDonald’s mobile ordering, Signal, Fortnite and more) and took down organisations that relied indirectly on AWS through SaaS vendors, payment processors, authentication services and CDNs. Regulators in the UK and EU are treating major cloud providers as critical third parties, increasing operational resilience obligations for financial institutions.
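To make the indirect-dependence point concrete, here is a minimal sketch (Python, illustrative only; the services and edges in the graph are assumptions, not data from the incident) of how a dependency map can be walked to estimate the blast radius of one failed upstream component:

```python
# Minimal sketch: walk an illustrative dependency map to estimate which
# services are exposed when a single upstream component fails. The graph
# below is invented for this example; a real map would come from your own
# architecture inventory and vendor due diligence.
from collections import defaultdict, deque

# Edges point from a dependency to the things that rely on it.
DEPENDENTS = defaultdict(set)
EDGES = [
    ("aws:us-east-1:dynamodb", "aws:us-east-1:iam"),
    ("aws:us-east-1:iam", "saas:auth-provider"),
    ("saas:auth-provider", "internal:customer-portal"),
    ("aws:us-east-1:dynamodb", "saas:payment-processor"),
    ("saas:payment-processor", "internal:checkout"),
]
for upstream, downstream in EDGES:
    DEPENDENTS[upstream].add(downstream)

def blast_radius(failed: str) -> set[str]:
    """Breadth-first walk: everything reachable downstream of the failure."""
    impacted, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in DEPENDENTS[node]:
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted

print(sorted(blast_radius("aws:us-east-1:dynamodb")))
```

Even in this toy graph, the customer portal and checkout go down without holding an AWS contract of their own, which is exactly the supply-chain exposure described in the summary.
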
Key Points
- Root cause: DNS resolution failure for DynamoDB endpoints in US‑EAST‑1 that cascaded across AWS control plane services.
- Blast radius: Thousands of services and many organisations were impacted, including those without direct AWS contracts due to supply‑chain dependencies.
- Concentration risk: Market consolidation among AWS, Microsoft Azure and Google Cloud creates single points of failure at regional level.
- Business continuity blind spots: Traditional DR plans often miss external cloud dependencies and shared-service single points of failure such as IAM, CloudWatch and Systems Manager.
- Regulatory reaction: Financial regulators now classify major cloud providers as critical third parties, requiring mapping of dependencies and resilience measures.
- Practical steps: Improve visibility of dependencies, consider multi‑region/multi‑cloud diversification, elevate governance to board level, rehearse failovers (a minimal probe is sketched after this list) and refine crisis communications.
- Enterprise impact: Resilience is an organisation‑wide capability — finance, legal, customer relations and executives must own cloud risk planning.
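On the failover-rehearsal point, the sketch below (Python, not from the article) shows the kind of probe such a rehearsal might exercise: it checks whether regional DynamoDB endpoints still resolve in DNS, the failure mode at the centre of this incident, and prefers the first healthy region. The endpoint hostnames follow AWS's public naming pattern; the choice of us-west-2 as the secondary region and the simple fall-through policy are assumptions.

```python
# Minimal sketch: a DNS-resolution probe for regional endpoints, of the kind
# a failover rehearsal might exercise. Endpoint names follow AWS's public
# pattern; the secondary region and fallback policy are assumptions.
import socket

ENDPOINTS = {
    "us-east-1": "dynamodb.us-east-1.amazonaws.com",  # primary (assumed)
    "us-west-2": "dynamodb.us-west-2.amazonaws.com",  # secondary (assumed)
}

def resolves(hostname: str) -> bool:
    """Return True if the system resolver returns at least one address."""
    try:
        return bool(socket.getaddrinfo(hostname, 443))
    except socket.gaierror:
        return False

def pick_region() -> str:
    """Prefer the first region whose endpoint still resolves in DNS."""
    for region, host in ENDPOINTS.items():
        if resolves(host):
            return region
    raise RuntimeError("No regional endpoint resolving; escalate to the incident process")

if __name__ == "__main__":
    print(f"Routing to region: {pick_region()}")
```

A rehearsal would run this (or its production equivalent) against a deliberately blocked resolver or endpoint and confirm that traffic, runbooks and on-call alerts actually follow the fallback.
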
Why should I read this?
Short version: if your service touches the cloud — even indirectly — this matters. A single DNS fault in one region can cascade through your SaaS stack and take your apps offline. Read this to get a sharp checklist: map dependencies, test failovers, fund SRE and drag the board into the conversation. We’ve done the heavy reading for you.
