Module 4 · Lesson 8

Case Studies: DNS Failures and Lessons Learned

60 minutes

Dyn 2016. Facebook 2021. Cloudflare June 2022. And a TTL misconfiguration that turned a simple migration into a four-hour incident. What actually went wrong and what changed afterward.

Post-mortems are the best learning tool in operations because they capture failure modes that never appear in official documentation. Real systems fail in ways that test plans and architecture diagrams don't anticipate.

This lesson walks through four DNS failures — three large-scale incidents with published post-mortems, and one smaller but more relatable failure that most DNS operators have a version of. The goal is not to assign blame but to extract the patterns that appear across different organizations, different scales, and different failure modes.

Case Study 1: Dyn 2016 — The DR Angle

Date: October 21, 2016
Duration: Approximately 11 hours (intermittent throughout the day)
Affected: Twitter, GitHub, Reddit, Airbnb, PayPal, The New York Times, and hundreds of others

What Happened

Dyn, a major managed DNS provider, was hit by a large-scale DDoS attack using the Mirai botnet. Mirai infected IoT devices (DVRs, IP cameras, routers) with default credentials and coordinated them to flood Dyn's infrastructure with UDP traffic. Three waves of attack hit throughout the day. Dyn's authoritative servers were overwhelmed, SERVFAIL responses spiked globally, and queries for domains hosted at Dyn were unanswered.

The security angle (DDoS, botnet, IoT) has been covered extensively elsewhere. The reliability angle is what belongs here.

The DR Failure

The outage affected major services for hours not solely because Dyn was attacked, but because Dyn's customers had a single point of failure in their DNS provider.

Twitter's DNS at the time ran entirely through Dyn's nameservers. When Dyn was unreachable, twitter.com did not resolve. Full stop. The same was true for GitHub, Reddit, and hundreds of others. There was no secondary provider to absorb the traffic.

For many customers, the fix was not waiting for Dyn to recover — it was adding NS records from a secondary provider (Cloudflare, Route 53) and hoping the TTL on the old NS records would expire quickly. At the time, many of those records had TTLs of 24 hours or more. Until the caches expired, there was nothing to do but wait.

What Should Have Been Different

Multi-provider DNS was not new in 2016. RFC 2182 (Selection and Operation of Secondary DNS Servers) recommends geographically and topologically distributed nameservers. The architectural recommendation was available. Most of Dyn's large customers chose single-provider because it was simpler.

The actual lessons:

  1. Multi-provider is mandatory for production domains. Not optional. Not "on the roadmap." If all your NS records point to one provider, your domain's availability is bounded by that provider's availability.

  2. NS TTL matters for failover. If your NS records have a 24-hour TTL and you need to change providers during an incident, you're waiting up to 24 hours for the change to propagate. Keep NS TTLs at 3600 or less for production domains.

  3. Test your failover before the incident. Several companies discovered during the Dyn outage that their secondary DNS wasn't properly configured or wasn't serving the zone. Failover you haven't tested is not failover.
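The first two lessons can be mechanized into a pre-incident audit. A minimal sketch — the provider-grouping heuristic and the 3600-second threshold are illustrative assumptions, and a production version would use the Public Suffix List to extract registrable domains properly:

```python
def check_ns_records(ns_records, max_ttl=3600):
    """Audit a domain's NS records for single-provider risk and slow-failover TTLs.

    ns_records: list of (nameserver_hostname, ttl_seconds) tuples.
    Returns a list of human-readable warnings (empty means both checks pass).
    """
    warnings = []
    # Approximate the operator by the registrable domain of the NS hostname
    # (e.g. ns1.dynect.net -> dynect.net). A rough heuristic, not PSL-accurate.
    providers = {host.rstrip(".").split(".", 1)[-1] for host, _ in ns_records}
    if len(providers) < 2:
        warnings.append("single provider: all NS records point to one operator")
    for host, ttl in ns_records:
        if ttl > max_ttl:
            warnings.append(f"{host}: TTL {ttl}s exceeds {max_ttl}s failover budget")
    return warnings
```

Run against a single-provider setup with day-long TTLs, this flags exactly the configuration that left Dyn's customers with no move to make.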


Case Study 2: Facebook October 2021 — Why DNS Was the Last Problem

Date: October 4, 2021
Duration: 6 hours, 28 minutes
Affected: Facebook, Instagram, WhatsApp, Oculus — all Facebook properties globally

What Happened

Facebook's backbone network went down due to a misconfigured BGP route update during routine maintenance. The misconfiguration caused Facebook's edge routers to withdraw their BGP routes, disconnecting their data centers from the internet.

This is where DNS becomes interesting. When Facebook's BGP routes disappeared, the internet could no longer reach Facebook's authoritative DNS servers. facebook.com stopped resolving globally. But this was not the hard part.

The hard part: Facebook's internal network was also isolated. The engineers who needed to fix the BGP misconfiguration couldn't reach the internal tools to do so — those tools ran inside the network that was now unreachable. Physical access to data center equipment was required. Staff who had badge access were dispatched to data centers. Some had to drill through locked cabinets to reach consoles.

When they could finally make changes, they chose to bring the network back up cautiously — a sudden reconnection of all data centers simultaneously would have caused a massive traffic surge. They staged the recovery, which added hours.

DNS was unavailable for the duration. But DNS was a symptom, not the cause, and not the recovery bottleneck.

The DNS-Specific Failures

External resolvers showed the failure clearly. Within minutes of the BGP withdrawal, Cloudflare 1.1.1.1's SERVFAIL rate for Facebook queries spiked. Recursive resolvers tried to reach Facebook's authoritative servers, got no response, and returned SERVFAIL. Some resolvers kept retrying — flooding a network that was already receiving millions of queries for a domain that couldn't answer.

This created secondary effects: the retry flood of DNS queries contributed to network congestion during recovery, as resolvers worldwide hammered Facebook's nameservers the moment routes came back.
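The retry behavior that amplified the congestion has a well-known mitigation: exponential backoff with jitter, so failed lookups are retried at spreading intervals instead of in a synchronized hammer. A minimal sketch, with illustrative base and cap values:

```python
import random

def backoff_delays(attempts, base=1.0, cap=300.0, rng=random.random):
    """Yield one delay (seconds) per retry attempt: min(cap, base * 2^n),
    scaled by full jitter so retries from many clients don't synchronize."""
    for n in range(attempts):
        yield rng() * min(cap, base * (2 ** n))
```

A resolver (or any client) sleeping through these delays between SERVFAIL retries backs off to minutes-long intervals instead of flooding a nameserver the moment its routes return.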

What Changed

Facebook published a detailed post-mortem. Key changes:

  • Better audit controls on backbone configuration changes
  • Out-of-band management network for emergency access, not dependent on the production network
  • "More robust" (their words) BGP change rollout process with staged deployments

For DNS specifically, the lesson is about out-of-band operations: your DNS management plane must be reachable even when your data plane is down. If your DNS provider's control panel is hosted on your own infrastructure, you have a problem. Use a provider whose control plane is independent of your services.
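One way to catch this dependency before an incident is a transitive walk over a declared component graph: does the management plane reach the production network it is supposed to repair? A sketch — the graph and component names are hypothetical, invented for illustration:

```python
def depends_on(graph, start, target):
    """Return True if `start` transitively depends on `target` in a
    dependency graph mapping component -> list of its dependencies."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return False
```

If `depends_on(graph, "dns-control-panel", "production-network")` is true, your recovery path dies with your data plane — exactly the trap Facebook's engineers hit.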

The Recovery Metric That Matters

Six hours to restore a service used by 3.5 billion people. The recovery was not limited by technical capability — Facebook has some of the most capable infrastructure engineers in the world. It was limited by physical access and blast radius fear (bringing everything back at once risked cascading failures).

For DNS: recovery time is bounded by the time to reach your management plane. If your authoritative DNS is a server in a locked rack you can only access via a VPN that runs on the same network, your RTO is not "minutes."


Case Study 3: Cloudflare June 2022 — Routing Misconfiguration

Date: June 21, 2022
Duration: ~35 minutes
Affected: Cloudflare customers globally, including Cloudflare's own 1.1.1.1 resolver

What Happened

Cloudflare published a detailed post-mortem. During a network infrastructure project, a change was made to the routing configuration of their backbone. The change was intended to improve resilience. Instead, a change-verification step failed in a way that cascaded: the system restarted the BGP sessions on the backbone routers, which dropped 19 PoPs offline for external traffic. 28 data centers eventually went offline.

Cloudflare's own resolver (1.1.1.1) was affected. Services that depend on Cloudflare for DNS or CDN were unavailable from those 28 locations.

What's Interesting About This One

Cloudflare is one of the most sophisticated network operators in the world. They run anycast DNS at global scale. They have change management processes. They have automated rollback.

The failure happened anyway, because:

  1. The change verification logic had a bug that caused it to misinterpret a valid state as a failure
  2. The automated response to a perceived failure made things worse
  3. The rollback procedure restored the configuration, but the BGP session restarts added propagation delay

The recovery took 35 minutes — fast by most standards. The post-mortem is worth reading in full because it shows how a sophisticated, automated system can fail in non-obvious ways.

The Lessons

Automation can fail fast in both directions. Automated rollback is good when it works. When the automation itself has a bug, it can accelerate the impact of an incident. Test your automation the same way you test your code.

Blast radius must be bounded. Cloudflare's post-mortem notes that the change should have been staged with a smaller initial deployment. A change that affects 19 PoPs simultaneously is a change that can cause a 19-PoP outage simultaneously.

Even anycast has failure modes. Anycast is resilient, but it relies on BGP, and BGP can be misconfigured. The architecture doesn't eliminate the need for careful change management.
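The blast-radius lesson can be sketched as a staged-rollout gate: apply a change to progressively larger slices of the fleet and halt at the first unhealthy stage. The stage sizes and health-check hook are illustrative assumptions, not Cloudflare's actual process:

```python
def staged_rollout(targets, stages, apply_change, is_healthy):
    """Roll `apply_change` out to `targets` in cumulative stage sizes,
    stopping if any already-changed target is unhealthy.

    Returns the list of targets actually changed (the rollback set)."""
    changed = []
    for size in stages:            # e.g. [1, 3, 19] — cumulative counts
        for t in targets[len(changed):size]:
            apply_change(t)
            changed.append(t)
        if not all(is_healthy(t) for t in changed):
            break                  # halt: blast radius limited to `changed`
    return changed
```

With stages of 1, then 3, then all 19 PoPs, a bad change stops after three locations instead of nineteen — the difference between a degraded region and a global incident.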


Case Study 4: The TTL Misconfiguration — A Migration Gone Wrong

This one isn't famous. It didn't make the news. But it's the kind of incident most DNS operators live through a version of at some point in their career.

The Setup

A mid-size SaaS company is migrating from one hosting provider to another. The plan:

  1. Set up the new environment
  2. Test it thoroughly
  3. Update the DNS A records to point to the new IPs
  4. Validate
  5. Decommission the old environment

The migration lead checks the TTL on the A records: 86400 seconds. They note this should be lowered before the migration. They add it to the runbook.

During final prep, the runbook step gets skipped. Everyone is focused on the new environment being ready. The TTL is not lowered.

The Incident

The DNS record is updated. The monitoring shows the new environment is resolving correctly from the engineer's machine. The team marks the migration complete and the old environment is spun down two hours later.

Then the support tickets start coming in. Customers can't reach the service.

The problem: with a 24-hour TTL, resolvers cached the old IP for up to 24 more hours after the record was updated. When the old environment was decommissioned, any resolver that hadn't yet expired the cache was sending traffic to a server that no longer existed.

The Response

The team quickly realized the TTL issue. Options at this point:

  • Spin the old environment back up (takes 45 minutes, and the customer data is already migrated)
  • Wait for TTLs to expire (up to 22 hours remaining)
  • Hope that major resolvers like 8.8.8.8 and 1.1.1.1 refresh early (they sometimes do, but you can't rely on it)

They spun up a minimal proxy on the old IP that forwarded requests to the new environment. This took three hours to set up and validate. Total customer impact: four hours.

The Actual Fix

After the incident, the team implemented a mandatory pre-migration checklist:

## DNS Pre-Migration Checklist

- [ ] Identify all DNS records pointing to old infrastructure
- [ ] Lower TTL to 300 seconds at T-48h
- [ ] Confirm the TTL change has taken effect (after lowering, wait out the old TTL — cached copies still carry it)
- [ ] Proceed with migration
- [ ] Keep old environment running for TTL duration after DNS change
- [ ] Decommission old environment only after TTL confirmation
- [ ] Raise TTLs back to operational values

The critical step that was missing: keep the old environment running for at least 2x the original TTL after the DNS change. This is the window during which stale cache resolvers are still routing to the old IP.

The Numbers

With an original TTL of 86400 (24 hours), the old environment must stay up for at least 48 hours after the DNS change — one full TTL covers resolvers that cached the record just before the change, and the second TTL is safety margin for resolvers that stretch or ignore TTLs.

With a pre-lowered TTL of 300 (5 minutes), the old environment needs to stay up for only 10 minutes after the DNS change.

The cost of the checklist step: 5 minutes to lower the TTL, 48 hours to wait. The cost of skipping it: four hours of customer impact, three hours of emergency work, and an incident review.
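The arithmetic above reduces to a one-line rule of thumb. A sketch — the 2x safety factor follows this case study's lesson, not any standard:

```python
def decommission_wait(ttl_seconds, safety_factor=2):
    """Seconds to keep the old environment running after a DNS change:
    the record's TTL at change time, times a safety factor for resolvers
    that cached just before the change or hold records past their TTL."""
    return ttl_seconds * safety_factor

# The two scenarios from this case study:
#   decommission_wait(86400) -> 48 hours (TTL never lowered)
#   decommission_wait(300)   -> 10 minutes (TTL pre-lowered per the checklist)
```

The function's input is the TTL in effect when the record changes — which is why lowering it in advance, and waiting for the lowered value to take effect, is the step that actually shrinks the window.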


Cross-Incident Patterns

Looking across these four incidents, the failure modes repeat:

Single points of failure: Dyn. One provider. No failover.

Management plane on the same network as the data plane: Facebook. You can't fix it if you can't reach the tools.

Changes without staged rollout: Cloudflare. Wide blast radius.

Skipped pre-flight steps: TTL migration. Everyone knew it should be done. Nobody did it.

These patterns appear in incidents at every scale. The fixes are not technically difficult. They require process discipline and the organizational muscle to slow down when the schedule says to go fast.


Key Takeaways

  • Dyn 2016: single DNS provider + no tested failover = full outage. Multi-provider is non-negotiable.
  • Facebook 2021: DNS went down because routing went down. Your management plane must be independent of your data plane.
  • Cloudflare 2022: even sophisticated automation fails. Stage changes. Test rollback. Bound blast radius.
  • TTL migration: keep old infrastructure running for 2x the original TTL after a DNS change.
  • The patterns that cause DNS incidents are the same across organizations and scales. Checklists and staged changes prevent most of them.


This is the final lesson of Module 4. You now have a working framework for DNS reliability: caching strategy, anycast architecture, monitoring and alerting, debugging methodology, scalability patterns, DR planning, performance benchmarking, and a set of failure patterns to watch for. Module 5 covers DNS Security in depth.