Module 4 · Lesson 6
Disaster Recovery and Business Continuity for DNS
⏱ 55 minutes
What to do when your DNS provider goes down. If you're reading this for the first time in the middle of an incident, good luck. Come back when it's over and implement what you should have done beforehand.
In October 2016, Dyn — a major DNS provider — was taken down by a DDoS attack. The outage lasted most of a day. Twitter, GitHub, Reddit, Airbnb, and hundreds of other services were unreachable from large parts of the internet. The root cause wasn't a complicated technical failure. It was that these companies had a single point of failure in their DNS provider, with no failover.
The lesson from that incident was not "use a better DDoS-resistant DNS provider." It was "don't have a single DNS provider at all."
This lesson is about building DNS that survives when a provider goes down, when your infrastructure fails, or when someone makes a configuration change at the worst possible time.
The Single-Provider Problem
If all your NS records point to one provider's nameservers, that provider is a single point of failure. When they have an outage — planned, accidental, or attack-driven — your domain stops resolving for the duration.
The fix is multi-provider DNS: your NS delegation includes nameservers from at least two independent providers.
example.com. NS ns1.provider-a.net.
example.com. NS ns2.provider-a.net.
example.com. NS ns1.provider-b.net.
example.com. NS ns2.provider-b.net.
When a recursive resolver queries for example.com, it receives all four NS records and can try any of them. If provider-a is unreachable, the resolver automatically falls back to provider-b within one retry cycle (typically 1–3 seconds).
Constraints
The multi-provider setup has operational requirements:
- Both providers must serve identical zone data
- Zone changes must propagate to both providers simultaneously (or within the same TTL window)
- NS records at both providers must include all nameservers (or at minimum, each must be able to answer for the full zone)
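One way to check the first constraint is to compare SOA serials across providers. The sketch below shows the parsing logic on captured dig output; the provider hostnames and serial values are illustrative, and in a live check you would fill answer_a/answer_b from something like `dig +short @ns1.provider-a.net example.com SOA`.

```shell
# Sketch: confirm both providers serve the same zone version by
# comparing SOA serials (all names and values are illustrative).
soa_serial() {
  # "dig +short ... SOA" prints: mname rname serial refresh retry expire minimum
  echo "$1" | awk '{print $3}'
}
answer_a="ns1.provider-a.net. hostmaster.example.com. 2024061501 7200 900 604800 300"
answer_b="ns1.provider-b.net. hostmaster.example.com. 2024061501 7200 900 604800 300"
a=$(soa_serial "$answer_a")
b=$(soa_serial "$answer_b")
if [ "$a" = "$b" ]; then
  echo "in sync (serial $a)"
else
  echo "DRIFT: provider-a=$a provider-b=$b"
fi
```

A serial mismatch that persists past the SOA refresh interval means one provider is serving stale data and the second constraint is being violated.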
Secondary DNS Configuration
The standard architecture is a primary-secondary setup. Your primary zone is authoritative; secondary providers receive zone transfers and serve the data.
Option 1: Hidden Primary + Multiple Public Secondaries
Hidden Primary (your server, unicast)
|
+-- IXFR/AXFR --> Provider A (public NS)
+-- IXFR/AXFR --> Provider B (public NS)
+-- IXFR/AXFR --> Provider C (optional third provider)
The hidden primary is your source of truth. It notifies secondaries on every change. Secondaries transfer the zone and serve it publicly.
If your primary goes down, secondaries continue serving the last-transferred zone data until the zone's SOA expire interval runs out (commonly 604800 seconds / 7 days). That's your window to recover.
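As a minimal sketch, the hidden-primary side of this in BIND 9 might look like the fragment below; the secondary addresses are illustrative placeholders.

```conf
# named.conf on the hidden primary (secondary IPs are illustrative)
options {
    notify explicit;                    # notify only the listed secondaries,
                                        # not every server in the NS RRset
    also-notify { 198.51.100.10; 203.0.113.20; };
    allow-transfer { 198.51.100.10; 203.0.113.20; };
};

zone "example.com" {
    type primary;
    file "/etc/bind/example.com.db";
};
```

`notify explicit` is what keeps the primary hidden: only the listed secondaries learn about changes, and the primary itself never appears in the public NS delegation.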
Option 2: API-Based Multi-Provider Sync
If your primary DNS is managed through a provider's API (Route 53, Cloudflare, NS1), you can use a tool like octodns to push zone changes to multiple providers simultaneously:
# octodns config
providers:
  route53:
    class: octodns_route53.Route53Provider
    access_key_id: env/AWS_ACCESS_KEY_ID
    secret_access_key: env/AWS_SECRET_ACCESS_KEY
  cloudflare:
    class: octodns_cloudflare.CloudflareProvider
    token: env/CF_API_TOKEN
zones:
  example.com.:
    sources:
      - route53
    targets:
      - cloudflare
Running octodns-sync pushes changes from Route 53 to Cloudflare. Both providers serve the zone; if either goes down, the other handles the traffic.
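The CLI entry point is octodns-sync; without --doit it only prints a plan. A hedged sketch, assuming the config above is saved as config.yaml and the provider modules are installed (pip install octodns octodns-route53 octodns-cloudflare):

```shell
# Sketch: plan, then apply, zone changes to every configured target.
if command -v octodns-sync >/dev/null 2>&1; then
  octodns-sync --config-file config.yaml           # dry run: review the plan
  octodns-sync --config-file config.yaml --doit    # apply the changes
  status="synced"
else
  status="octodns-sync not installed"
fi
echo "$status"
```

Reviewing the dry-run plan before --doit is the main safety net here: it shows exactly which records will be created, updated, or deleted at each provider.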
TTL Strategy for Planned Maintenance
Before any planned maintenance that involves DNS changes or your DNS provider:
72 hours before: Lower TTLs on critical records to 300 seconds (5 minutes). If the records currently carry a 3600 or 86400 TTL, this lead time lets cached copies holding the old, longer TTL expire, so resolvers are honoring the 300-second value before the change window opens.
Change window: Make the DNS change. Wait 5–10 minutes. Verify globally.
After stabilization: Raise TTLs back to operational values.
Before planned provider maintenance: If you know provider-a is doing maintenance Friday night, shift all DNS traffic to provider-b by removing provider-a's NS records from the delegation temporarily.
This is an operational discipline that requires calendar reminders, not DNS automation. Most DNS incidents during maintenance are caused by someone skipping the TTL-lowering step.
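The required lead time follows directly from the largest TTL in play: a resolver can keep a record for up to its full old TTL, so the lowering step has to happen at least that long before the window. A back-of-the-envelope check (values illustrative):

```shell
# Sketch: resolvers may cache a record for up to its full old TTL, so
# lower TTLs at least one old-TTL (plus margin) before the change window.
old_ttl=86400    # current TTL on the record, in seconds (1 day)
margin=7200      # safety margin for stragglers
lead=$((old_ttl + margin))
echo "lower TTLs at least $((lead / 3600)) hours before the change window"
```

For a 1-day TTL this lands at 26 hours; the 72-hour guidance simply adds generous slack and survives records you forgot had long TTLs.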
RTO/RPO for DNS
Recovery Time Objective (RTO): how quickly can you restore DNS service after a failure?
Recovery Point Objective (RPO): how much data (zone changes) can you lose?
For multi-provider DNS with current zone transfers:
- RTO: effectively zero if a secondary provider is already serving the zone; resolvers retry the surviving nameservers within seconds, and anycast rerouting within a provider converges via BGP in roughly 0–90 seconds
- RPO: Time since last successful zone transfer (seconds to minutes with IXFR and NOTIFY)
For single-provider DNS with no DR:
- RTO: Duration of the provider's outage (hours, potentially)
- RPO: Irrelevant — you can't make changes while the provider is down anyway
The goal is RTO under 5 minutes and RPO under 15 minutes. Multi-provider with automated sync achieves this.
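Measuring your actual RPO is simple arithmetic against the transfer logs. The timestamps below are illustrative epoch seconds; in practice you would read the last successful IXFR/AXFR time from your monitoring.

```shell
# Sketch: RPO for a secondary is "now minus last successful zone transfer".
last_transfer=1718445600   # epoch seconds of last IXFR (illustrative)
now=1718446200
rpo=$((now - last_transfer))
echo "current RPO: ${rpo}s"
if [ "$rpo" -gt 900 ]; then
  echo "ALERT: 900s RPO objective exceeded"
fi
```

Wiring this into monitoring turns the RPO objective from a slide-deck number into an alert that fires when transfers silently stop.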
The Registrar Lock as a DR Control
One attack vector that's underappreciated: registrar-level changes. If an attacker gains access to your registrar account, they can change your NS records to point to their own servers. This is not a DNS provider problem — it's a registrar problem.
Controls:
- Registry Lock: Locks the domain at the registry level, preventing NS changes without an out-of-band verification step. This is stronger than the standard registrar-level Transfer Lock or Domain Lock, which mainly blocks transfers. Available from most registrars for an additional fee. Required for high-value domains.
- Two-factor authentication on your registrar account: Non-negotiable.
- Separate accounts for DNS management vs. billing: Limit blast radius if credentials are compromised.
- Alert on registrar changes: Some registrars offer webhooks or email alerts on NS record changes. Subscribe to them.
Registry Lock stops even your own team from making accidental NS changes without a deliberate unlock process. For a production domain, the friction is worth it.
DNS Outage Runbook
Print this. Put it somewhere you can find it when you can't access anything because DNS is down.
Step 1: Confirm the problem is DNS
# Can you reach the authoritative server directly?
dig @ns1.yourprovider.com yourdomain.com A +norecurse
# What are public resolvers returning?
dig @8.8.8.8 yourdomain.com A
dig @1.1.1.1 yourdomain.com A
# Check the delegation
dig +trace yourdomain.com A
Step 2: Identify the scope
Is it:
- Your specific domain? → zone problem or authoritative server problem
- Your entire DNS provider? → provider outage
- Only some resolvers returning bad data? → poisoning or propagation issue
- The registrar's NS records changed? → account compromise
Step 3: Mitigation options
Provider outage:
- If you have a secondary provider already serving the zone: traffic reroutes automatically within BGP convergence time. Monitor and wait.
- If you don't have a secondary: add emergency NS records pointing to a backup provider (Cloudflare, Route 53 free tier) and push your zone manually. Time to restore depends on how fast you can get the zone onto the backup provider and how quickly the new NS records propagate. With 300s TTL, 5–10 minutes. With 86400s TTL, up to 24 hours.
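If any authoritative server is still answering, you can pull a copy of the zone to seed the backup provider. This is a sketch using the placeholder hostnames from the runbook; AXFR is often restricted by source IP, so it may be refused.

```shell
# Sketch: pull a full zone copy (AXFR) from a still-reachable authoritative
# server, for import into the backup provider. AXFR must be permitted
# from your source IP; many providers restrict or disable it.
dig @ns1.yourprovider.com yourdomain.com AXFR > yourdomain.com.zone
# Sanity-check before importing: a complete dump starts and ends with the SOA.
wc -l yourdomain.com.zone
```

If AXFR is refused, fall back to your provider's export API or your last known-good zone file in version control; this is exactly why the zone source of truth should live outside any single provider.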
Authoritative server down:
- Restart the service: systemctl restart named (BIND) or systemctl restart pdns (PowerDNS)
- Check zone file integrity: named-checkzone example.com /etc/bind/example.com.db
- If the server is unreachable: fail over to the secondary provider
Registrar compromise:
- Contact registrar's emergency line immediately
- Initiate account recovery
- Request emergency NS revert
- Contact registry if necessary (for Registry Lock cases)
Step 4: Communication
DNS outages affect everything. Communicate early:
- Notify internal stakeholders within 15 minutes of confirmed DNS outage
- Update status page (if you can — note the irony that your status page might be down too if it's on the affected domain)
- Document the timeline in real-time
Key Takeaways
- Single DNS provider = single point of failure. Multi-provider is not optional for production domains.
- The hidden primary + multiple secondaries architecture gives you provider independence with a consistent zone source.
- octodns is a widely used tool for API-based multi-provider sync.
- Lower TTLs before maintenance. Every time. Without exception.
- Registry Lock on production domains prevents NS changes without out-of-band verification.
- Your DNS runbook should be accessible without DNS. Print it.
Further Reading
- octodns documentation
- ICANN Registry Lock overview
- Dyn 2016 post-mortem analysis
- RFC 1034 — SOA expire field semantics
- Managing Mission-Critical Domains and DNS — Chapters on registrar and registry relationships, zone redundancy architecture, and continuity planning for production domains.
- Michael Dooley & Timothy Rooney, DNS Security Management (2017) — Covers threat modeling for DNS infrastructure and mitigation architecture that overlaps with DR planning.
Up Next
DNS Performance Metrics and Benchmarking — how to measure what "fast" means for DNS, and how to prove it before production does.