
AWS US-EAST-1 Outage

On October 20, 2025, a major outage in the AWS US-EAST-1 region caused widespread global service disruption, affecting hundreds of major platforms across gaming, finance, and smart home services. The incident originated with a DNS resolution failure affecting the DynamoDB API endpoint, which cascaded into dependent services such as EC2 and Lambda. The disruption lasted more than 20 hours and produced significant secondary waves of failure during the complex recovery process, underscoring the critical need for multi-region redundancy strategies.

1. Exact Timeline and Sequence of Events

| Time (PT) | Time (ET) | Event Description |
| --- | --- | --- |
| 12:11 AM | 3:11 AM | Incident Commencement: The major AWS outage begins with a DNS resolution issue that prevents AWS services from properly reaching the DynamoDB API endpoint. |
| ~12:11 AM onward | ~3:11 AM onward | Cascading Failure: DynamoDB experiences significant error rates, breaking countless dependent AWS tools and customer applications; EC2 instances and Lambda functions fail to launch or execute due to an internal network problem within EC2. |
| 2:01 AM | 5:01 AM | Root Cause Identified: AWS identifies the core DNS resolution issue and begins active mitigation. |
| 3:35 AM | 6:35 AM | Initial DNS Resolution: The immediate DNS issue is resolved. |
| 3:35 AM | 6:35 AM | Database Mitigation Declared: AWS declares the DynamoDB problem "fully mitigated," though services continue to struggle with recovery and resynchronization. |
| Monday evening* | - | AWS confirms that operations have "returned to normal," though minor issues persist. |
| October 21 | - | Secondary Disruption Waves: Services experience intermittent issues and multiple waves of disruption as systems work toward full recovery, affecting connectivity and EC2 instance launches throughout the day. |

*Monday here is October 20, 2025; residual issues carried into October 21.

2. Cause of the Outage

The primary cause was an internal subsystem issue within the US-EAST-1 region that manifested as a DNS resolution failure.

  1. Initial Failure: Internal AWS services were unable to resolve the domain name for the DynamoDB API endpoint (a minimal resolution check is sketched below).
  2. Cascading Failure (DynamoDB): DynamoDB, a highly critical and widely used database service, experienced massive error rates because clients could no longer reach it.
  3. Wider Service Breakdown (EC2/Lambda): Because countless other AWS services depend on DynamoDB, the failure set off a chain reaction: EC2 instances and Lambda functions could not launch or execute correctly, compounded by an internal network problem within EC2.
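
To make the failed dependency concrete, here is a minimal, hypothetical monitoring snippet (not AWS code) that checks whether the regional DynamoDB endpoint resolves at all; this name lookup is the step that broke at the start of the incident.

```python
import socket

# Public regional DynamoDB API endpoint whose DNS resolution failed
# during the incident.
DYNAMODB_ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def endpoint_resolves(hostname: str, port: int = 443) -> bool:
    """Return True if the hostname currently resolves to at least one address."""
    try:
        addresses = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        return len(addresses) > 0
    except socket.gaierror:
        # This is the failure mode described above: no usable DNS answer,
        # so every API call against the endpoint fails before it starts.
        return False

if __name__ == "__main__":
    if endpoint_resolves(DYNAMODB_ENDPOINT):
        print(f"{DYNAMODB_ENDPOINT} resolves normally")
    else:
        print(f"{DYNAMODB_ENDPOINT} is NOT resolving; DynamoDB calls will fail")
```

Because so many AWS internal services store state in DynamoDB, a failed lookup at this layer propagates far beyond ordinary database reads and writes.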

3. Affected Entities and Services

| Category | Affected Services/Platforms | User Impact |
| --- | --- | --- |
| AWS Services | DynamoDB, EC2 (virtual machines), Lambda (serverless code), and related APIs and internal networking components | Core compute and data persistence unavailable to customers |
| Amazon-owned | Amazon.com, Ring (smart home devices), Alexa (alarms and other functionality) | Direct impact on e-commerce, home security, and personal digital assistants |
| External Platforms | Fortnite (gaming), Snapchat, Signal, Coinbase (crypto exchange), Venmo, United Airlines, Reddit, Duolingo, the Starbucks app, Hinge, Canvas (education platform), and hundreds more globally | Widespread disruption to communication, finance, travel, education, and social media |
| Global Users | Over 8.1 million outage reports filed worldwide: 1.9M in the US, 1M in the UK, 418K in Australia | Inability to access daily platforms, conduct financial transactions, or use smart devices |
| Businesses | 2,000+ companies impacted globally | Lost revenue, operational downtime, and customer service failures |

4. Impact Highlights

| Metric | Detail |
| --- | --- |
| Actual Duration | 20+ hours and counting (as of October 21), characterized by multiple waves of disruption |
| Scope of Impact | Global, despite the root cause being a single regional issue (US-EAST-1) |
| Outstanding Issues (Oct 21) | EC2 instance launches still problematic; intermittent connectivity issues persisting across many services |
| Mitigation Action | AWS temporarily limited the number of incoming requests to stabilize the recovering environment (a client-side backoff sketch follows this table) |
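
On the client side, the usual way to cooperate with that kind of throttling is to retry failed calls with exponential backoff and jitter rather than hammering a recovering service in a tight loop. Below is a minimal, generic sketch; `operation` is a placeholder for any remote call, not a specific AWS API.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Run `operation` (a zero-argument callable), retrying transient failures
    with capped exponential backoff plus full jitter so many clients do not
    retry in lockstep against a recovering service."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError) as exc:  # retry only transient errors
            if attempt == max_attempts:
                raise  # out of attempts; surface the failure to the caller
            delay = random.uniform(0, min(max_delay, base_delay * (2 ** attempt)))
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Hypothetical usage, wrapping whatever call keeps failing during recovery:
# result = call_with_backoff(lambda: fetch_order_status("12345"))
```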

5. Key Takeaways and Insights

This incident provides stark lessons in the fragility of centralized infrastructure:

  • Cascading Failure Model: A seemingly localized DNS resolution failure in a single critical region (US-EAST-1) rapidly escalated into prolonged, global cascading failures across highly dependent services (DynamoDB → EC2/Lambda).
  • The Recovery Paradox: The recovery process itself proved to be almost as disruptive as the initial outage, causing secondary waves of disruption as systems struggled to fully resynchronize and recover.
  • Centralization Risk: The incident clearly highlights the dangers of relying on a single, centralized cloud region for critical operations. Even with Availability Zones (AZs) within a region, a systemic regional fault can cripple the entire global service mesh.
  • Mandate for Redundancy: This serves as a critical reminder for businesses to implement comprehensive multi-region redundancy and robust disaster recovery strategies that can fail over critical workloads to another geographic region (a minimal failover sketch follows this list).
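
As a concrete starting point, the sketch below shows one common failover pattern: try the primary region first and fall back to a replica region when the primary endpoint is unreachable. It assumes boto3 is installed with credentials configured, and that a hypothetical DynamoDB Global Table named `orders` is already replicated in both regions; it is an illustration of the idea, not a complete disaster-recovery strategy.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, EndpointConnectionError

# Hypothetical Global Table replicated across both regions.
TABLE_NAME = "orders"
REGIONS = ("us-east-1", "us-west-2")  # primary first, then failover

def get_item_with_failover(key: dict) -> dict | None:
    """Read an item from the primary region, falling back to the replica
    region if the primary endpoint cannot be reached (e.g. a DNS or
    connectivity failure like the one in this incident)."""
    for region in REGIONS:
        client = boto3.client(
            "dynamodb",
            region_name=region,
            # Fail fast so an unreachable region does not stall the request.
            config=Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1}),
        )
        try:
            response = client.get_item(TableName=TABLE_NAME, Key=key)
            return response.get("Item")
        except (EndpointConnectionError, ClientError) as exc:
            # In production you would distinguish throttling from hard failures;
            # here any error in the primary region simply triggers the fallback.
            print(f"Region {region} failed: {exc!r}; trying next region")
    return None

# Hypothetical usage (DynamoDB low-level attribute-value format):
# item = get_item_with_failover({"order_id": {"S": "12345"}})
```

Writes are harder than reads: a real failover plan also needs replicated write paths (for example, Global Tables' multi-region writes), health checks to decide when to shift traffic, and regular failover drills.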

6. Insights

The scope of the outage confirmed the increasing reliance of the entire digital economy on a few hyper-scale cloud providers:

“This incident highlights how heavily the internet depends on a few cloud giants - Amazon, Microsoft, and Google. A single regional outage can ripple across the globe in minutes.”

The scale of disruption from a single regional event emphasizes the systemic risk posed by the high concentration of the internet's core infrastructure within these centralized cloud environments.

Iftiaj Alom · October 21, 2025