AWS powers more than 90% of Fortune 100 companies and holds roughly 30% of the global cloud infrastructure market. When it goes down, the internet doesn't just slow — entire industries stop functioning.

Here is every major AWS outage since 2017 — what triggered it, what cascaded, who was affected, and what it cost. Every one of these incidents follows the same pattern: a single point of failure triggers a cascade through dependent services, and the organizations that suffered most were the ones that didn't know their dependency chain.

February 28, 2017 — The $150 Million Typo

us-east-1 · S3 · ~4 hours · $150M+ estimated

What happened: An authorized S3 team member was debugging a billing issue and executed a command intended to remove a small number of servers. A mistyped parameter removed far more servers than intended, taking down two critical S3 subsystems — the index subsystem (which manages metadata for all S3 objects in us-east-1) and the placement subsystem (which handles storage allocation). Both subsystems needed to be fully restarted, a process that hadn't been done in years due to S3's massive growth. (AWS Post-Event Summary)

What cascaded: S3's failure rippled upward into every AWS service that depends on it — EC2 instance launches, EBS volumes, Lambda functions, and the AWS Service Health Dashboard itself. AWS couldn't even update their own status page because it was hosted on the broken S3 infrastructure. They resorted to posting updates on Twitter. (The Register)

Who was affected: Slack, Trello, Quora, Medium, Coursera, Docker, Expedia, GitHub Pages, parts of Apple iCloud, and Nest security cameras (which stopped recording video). Apica reported that 54 of the internet's top 100 retailers saw performance degrade by 20% or more. (NPR)

Estimated cost: Cyence estimated the outage cost S&P 500 companies $150 million. US financial services companies alone lost an estimated $160 million. (MIT Technology Review)

The lesson: One mistyped command removed infrastructure that turned out to be foundational. The cascade path was not understood in advance — even by AWS themselves. The status dashboard being hosted on the same infrastructure it monitors meant AWS lost visibility into their own failure.

November 25, 2020 — Kinesis Takes Down CloudWatch, Then Everything Else

us-east-1 · Kinesis · ~7 hours

What happened: A relatively small addition of capacity to Kinesis front-end servers caused all servers in the fleet to exceed the maximum number of threads allowed by the operating system. This caused the Kinesis front-end fleet to become unavailable. (AWS Post-Event Summary)

What cascaded: Amazon CloudWatch, CloudWatch Logs, and Lambda all depend on Kinesis for real-time data streaming, so those services degraded as well. With CloudWatch impaired, AWS operators lost their primary monitoring capability — making it harder to diagnose and fix the original problem. This created a compounding loop: the failure impaired the tool needed to detect the failure.

The lesson: Internal AWS services share dependency chains that mirror what customers experience externally. Kinesis is a "Tier Zero" dependency — when it fails, the monitoring infrastructure fails with it, creating a blind spot during the most critical moment.
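
The "failure impairs the tool that detects the failure" loop can be checked for mechanically. Below is a minimal sketch: given a hand-written map of what each service depends on, it asks whether a monitoring service transitively depends on the thing it watches. All service names and edges here are hypothetical, loosely modeled on the incident above.

```python
# Sketch: detect when a monitoring service transitively depends on the
# service it is supposed to watch. Names and edges are illustrative.

def transitive_deps(graph, service):
    """All services `service` depends on, directly or indirectly."""
    seen = set()
    stack = [service]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# service -> list of services it depends on (hypothetical edges)
deps = {
    "cloudwatch": ["kinesis"],
    "cloudwatch-logs": ["kinesis"],
    "lambda": ["kinesis", "cloudwatch"],
    "my-app": ["lambda"],
}

failed, monitor = "kinesis", "cloudwatch"
if failed in transitive_deps(deps, monitor):
    print(f"Blind spot: {monitor} fails along with {failed}")
```

If the check fires, the monitoring path shares fate with the monitored service, and an outage there will be diagnosed blind.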

September 26, 2021 — EBS Stuck IO Cascades Across us-east-1

us-east-1 · EBS · ~8 hours

What happened: Amazon Elastic Block Store experienced a stuck IO issue in a single Availability Zone. The problem originated in one AZ but had cascading effects across services that depend on EBS for storage. (StatusGator)

What cascaded: EC2 instances — both existing and new — were impaired. Because RDS, ElastiCache, and Redshift all depend on EBS volumes, these services degraded as well.

The lesson: Storage is foundational infrastructure. A failure in a single storage subsystem in a single AZ can cascade upward through compute, database, and caching layers.

December 7, 2021 — AWS DDoS-es Itself During Holiday Shopping Season

us-east-1 · Internal network · ~4–9 hours

What happened: An automated scaling activity triggered unexpected behavior from a large number of internal network clients, effectively creating a self-inflicted distributed denial of service on AWS's internal network. The congestion immediately impaired monitoring visibility, which meant operators couldn't see the source of the problem. (AWS Post-Event Summary)

What cascaded: The internal network carries monitoring, DNS, authorization services, and parts of the EC2 control plane. Global services homed in us-east-1 — including AWS account root logins, Single Sign-On (SSO), and Security Token Service (STS) — were impaired even for customers in other regions. (Pluralsight)

Who was affected: Netflix, Disney+, Venmo, Instacart, Roku, Kindle, Ticketmaster, Ring doorbells, Amazon Prime Music, and Amazon's own retail logistics apps. Warehouse and delivery workers using Amazon Flex couldn't access their apps during peak holiday season. (Catchpoint)

The lesson: AWS's internal network is a shared dependency for dozens of services. When it's congested, operators lose the tools they need to diagnose problems, creating a feedback loop that extends recovery time. Global services with single-region control planes (us-east-1) affect customers worldwide.

December 15, 2021 — US-West Goes Down One Week Later

us-west-1 / us-west-2 · Network / ISP · ~1 hour

What happened: Eight days after the December 7 incident, AWS traffic engineering triggered network congestion between the AWS backbone and a subset of Internet Service Providers. (Catchpoint)

Who was affected: Auth0, Duo, Okta, DoorDash, Disney, PlayStation Network, Slack, Netflix, Snapchat, and Zoom.

The lesson: Monitoring providers themselves were affected — Datadog, ThousandEyes, New Relic, Dynatrace, and Splunk all reported degradation. When monitoring goes down alongside the infrastructure it monitors, everyone is flying blind.

December 22, 2021 — Power Failure Completes the December Trifecta

us-east-1 · Power · up to 17 hours

What happened: A data center power outage in us-east-1 caused a cascade of issues. Although the power outage itself was brief, related effects persisted for up to 17 hours for some customers. (Catchpoint)

Who was affected: Slack, Udemy, Twilio, Okta, Imgur, Jobvite, and the New York Court system website.

The lesson: December 2021 demonstrated that AWS outages are not isolated events. Three major incidents in a single month — across different regions and root causes — showed that the risk is continuous, not occasional.

July 28, 2022 — Power Outage in US-East-2 (Ohio)

us-east-2 · Power · ~3 hours

What happened: A power outage in Availability Zone 1 of us-east-2 caused network connectivity problems for EC2 instances. The power issue lasted about 20 minutes, but service recovery took up to 3 hours. (StatusGator)

Who was affected: Webex, Okta, Splunk, and BambooHR all experienced partial or complete downtime.

The lesson: A 20-minute power event created a 3-hour recovery window. The ratio between trigger duration and actual impact duration is often 5x to 10x — and most business continuity plans underestimate this.

June 13, 2023 — Lambda Outage Hits Media and Transit

us-east-1 · Lambda / Connect · several hours

What happened: Multiple AWS services reported increased error rates and latencies. Amazon Connect was particularly hard hit — callers couldn't connect, chats failed, and agents had login issues. Lambda experienced significant delays with asynchronous invocations. (StatusGator)

Who was affected: The Boston Globe, the New York MTA, the Associated Press, and a range of enterprise services dependent on Lambda and EventBridge.

July 30, 2024 — Kinesis Architecture Flaw Causes 7-Hour Outage

us-east-1 · Kinesis · ~7 hours

What happened: A routine deployment exposed a flaw in Kinesis Data Streams' newly upgraded architecture. The system mishandled a workload involving a large number of low-throughput shards, causing cascading shard redistribution that overwhelmed traffic processing. (StatusGator)

What cascaded: CloudWatch Logs, Data Firehose, S3 event notifications, Lambda, ECS, Redshift, and Glue all experienced increased latency and error rates.

The lesson: Kinesis remains a critical internal dependency. Four years after the November 2020 incident, a different Kinesis failure produced a similar cascade pattern — evidence that the dependency architecture hadn't fundamentally changed.

October 20, 2025 — The Largest Internet Outage Since CrowdStrike

us-east-1 · DNS / DynamoDB / EC2 · hours (cascading) · hundreds of billions (est.)

What happened: A DNS resolution failure in us-east-1 prevented customers from reaching AWS services. DynamoDB and EC2 — foundational database and compute services — were directly impacted. (CNN, PBS NewsHour)

What cascaded: Because DynamoDB and EC2 underpin thousands of applications, the cascade was enormous. SQS, Amazon Connect, and multiple control plane services were affected.

Who was affected: Snapchat, Fortnite, Roblox, Duolingo, Canva, Slack, monday.com, Zoom, Robinhood, Coinbase, Venmo, McDonald's app, Starbucks app, United Airlines, Ring doorbells, and Alexa devices. Banking services including Lloyds, Barclays, and Bank of Scotland were disrupted. Downdetector recorded over 6.5 million outage reports across more than 1,000 sites and apps. (FinTech Magazine)

Estimated cost: Catchpoint CEO Mehdi Daoudi estimated the total financial impact would reach into the hundreds of billions of dollars when accounting for lost productivity, delayed business operations, and cascading effects across airlines, factories, and payment systems. (CNN)

The lesson: In 2025, a single DNS failure in one region can still cascade globally across every industry. Eight years after the S3 typo, the fundamental risk hasn't changed — it's gotten larger as more services depend on AWS.

The Pattern: Why Every Outage Looks the Same

Every major AWS outage since 2017 follows the same cascade pattern:

  1. A foundational service fails — S3, Kinesis, EBS, DynamoDB, DNS, or the internal network. These are Tier Zero dependencies.
  2. Dependent services degrade — EC2, Lambda, CloudWatch, Connect, and dozens of other services that rely on the failed component begin throwing errors or slowing down.
  3. Monitoring goes blind — in multiple incidents (2017, 2020, 2021), AWS lost visibility into their own systems because the monitoring infrastructure shared dependencies with the failed component.
  4. Recovery takes longer than the trigger — a 20-minute power event creates a 3-hour outage. A mistyped command creates a 4-hour outage. A network congestion event takes 9 hours to fully stabilize. The cascade and backlog always extend the recovery far beyond the initial failure.
  5. The blast radius is unknown until it happens — in every post-mortem, the cascade path was wider than expected. Services that "shouldn't" have been affected were affected because of hidden or undocumented dependency chains.

This is not an AWS-specific problem. It's a dependency problem. And every organization running on AWS — or any cloud provider — has the same exposure in their own service architecture.
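
The cascade pattern itself is simple enough to sketch in a few lines: invert the dependency graph, then walk outward from the failed Tier Zero node. Services with a declared fallback degrade instead of failing outright and stop the cascade at their boundary. The graph below is a made-up toy, not anyone's real architecture.

```python
# Sketch: propagate a Tier Zero failure through a dependency graph.
# All service names, edges, and the fallback set are hypothetical.

from collections import deque

# service -> services it depends on
deps = {
    "ec2": ["internal-network"],
    "lambda": ["s3", "internal-network"],
    "cloudwatch": ["kinesis"],
    "checkout": ["lambda", "dynamodb"],
    "status-page": ["s3"],      # hosted on the thing it reports on
}
has_fallback = {"checkout"}     # e.g. a queue-and-retry-later path

def simulate(root):
    """Return (failed, degraded) sets after `root` goes down."""
    dependents = {}             # invert: who depends on each service?
    for svc, ds in deps.items():
        for d in ds:
            dependents.setdefault(d, set()).add(svc)
    failed, degraded = {root}, set()
    queue = deque([root])
    while queue:
        down = queue.popleft()
        for svc in dependents.get(down, ()):
            if svc in failed or svc in degraded:
                continue
            if svc in has_fallback:
                degraded.add(svc)    # survives in reduced form
            else:
                failed.add(svc)
                queue.append(svc)    # its failure cascades further
    return failed, degraded

failed, degraded = simulate("s3")
print(sorted(failed), sorted(degraded))
```

One deliberate simplification: degraded services do not propagate failure, on the assumption that their fallback keeps serving dependents. In the 2017 incident above, that assumption is exactly what the status page lacked.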

How to Simulate This for Your Own Infrastructure

You don't need to wait for the next us-east-1 outage to learn which of your services would fail. You can simulate it today.

Faultline is a business continuity intelligence platform that lets you model exactly the kind of cascade failures documented above — for your own infrastructure.

Simulate "AWS goes down" for your infrastructure

  1. Map your services. Build a visual dependency graph of every service, its criticality level, and what it connects to. Import from Terraform, Docker Compose, Kubernetes, or CloudFormation — or use the guided wizard with industry templates.
  2. Run the cascade simulation. Select your cloud infrastructure node, take it down, set the outage duration, and watch the failure propagate. Faultline shows you exactly which services fail, which degrade (because they have backups), and which survive.
  3. Get concrete numbers. Estimated revenue impact in dollars, maximum recovery time based on RTOs, a count of failed vs. degraded vs. unaffected services, and whether your critical operations boundary is breached.
  4. Generate a recovery playbook. A step-by-step incident response plan built from your actual dependency data — not a generic template.

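Step 3 above — turning a failure set into numbers — can be sketched as follows. This is not Faultline's actual API; the service catalog, revenue figures, and RTOs are invented for illustration, and a real model would pull them from your own inventory.

```python
# Sketch: turn a simulated failure set into rough business numbers.
# All figures below are made up for illustration.

services = {
    # name: (revenue_per_hour_usd, rto_hours, critical)
    "checkout":  (50_000, 1.0, True),
    "search":    (10_000, 2.0, False),
    "reporting": (0,      8.0, False),
}

def impact(failed, outage_hours):
    """Estimate lost revenue, total downtime, and critical-boundary breach."""
    lost = sum(services[s][0] for s in failed) * outage_hours
    # recovery runs after the trigger ends, bounded by the slowest RTO --
    # this is the trigger-vs-impact gap the 2022 Ohio outage illustrated
    recovery = max((services[s][1] for s in failed), default=0.0)
    breached = any(services[s][2] for s in failed)
    return lost, outage_hours + recovery, breached

lost, total_hours, breached = impact({"checkout", "search"}, 3.0)
print(f"${lost:,.0f} lost, ~{total_hours:.0f}h total, critical breach: {breached}")
```

Even this toy model reproduces the pattern from the incident list: total downtime is the outage plus the slowest recovery, so impact duration always exceeds trigger duration.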
Every AWS outage in this list would have been less damaging for organizations that knew their dependency chain in advance. The ones that suffered most were the ones that discovered their cascade paths in real time, under pressure, with monitoring down.

The next us-east-1 outage is not a question of if. It's a question of when — and whether you'll know what breaks before it does.