Module 4 · Lesson 6
Disaster Recovery and Business Continuity for DNS
⏱ 55 minutes
What to do when your DNS provider goes down. If you're reading this for the first time in the middle of an incident, good luck. Come back when it's over and implement what you should have done beforehand.
In October 2016, Dyn — a major DNS provider — was taken down by a DDoS attack. The outage lasted most of a day. Twitter, GitHub, Reddit, Airbnb, and hundreds of other services were unreachable from large parts of the internet. The root cause wasn't a complicated technical failure. It was that these companies had a single point of failure in their DNS provider, with no failover.
The lesson from that incident was not "use a better DDoS-resistant DNS provider." It was "don't have a single DNS provider at all."
This lesson is about building DNS that survives when a provider goes down, when your infrastructure fails, or when someone makes a configuration change at the worst possible time.
The Single-Provider Problem
If all your NS records point to one provider's nameservers, that provider is a single point of failure. When they have an outage — planned, accidental, or attack-driven — your domain stops resolving for the duration.
The fix is multi-provider DNS: your NS delegation includes nameservers from at least two independent providers.
example.com. NS ns1.provider-a.net.
example.com. NS ns2.provider-a.net.
example.com. NS ns1.provider-b.net.
example.com. NS ns2.provider-b.net.
When a recursive resolver queries for example.com, it receives all four NS records and can try any of them. If provider-a is unreachable, the resolver automatically falls back to provider-b within one retry cycle (typically 1–3 seconds).
Constraints
The multi-provider setup has operational requirements:
- Both providers must serve identical zone data
- Zone changes must propagate to both providers simultaneously (or within the same TTL window)
- NS records at both providers must include all nameservers (or at minimum, each must be able to answer for the full zone)
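One way to check the first constraint is to compare SOA serials across providers. The sketch below shows the parsing logic on captured dig output; the provider hostnames and serial values are illustrative, and in a live check you would fill answer_a/answer_b from something like `dig +short @ns1.provider-a.net example.com SOA`.

```shell
# Sketch: confirm both providers serve the same zone version by
# comparing SOA serials (all names and values are illustrative).
soa_serial() {
  # "dig +short ... SOA" prints: mname rname serial refresh retry expire minimum
  echo "$1" | awk '{print $3}'
}
answer_a="ns1.provider-a.net. hostmaster.example.com. 2024061501 7200 900 604800 300"
answer_b="ns1.provider-b.net. hostmaster.example.com. 2024061501 7200 900 604800 300"
a=$(soa_serial "$answer_a")
b=$(soa_serial "$answer_b")
if [ "$a" = "$b" ]; then
  echo "in sync (serial $a)"
else
  echo "DRIFT: provider-a=$a provider-b=$b"
fi
```

A serial mismatch that persists past the SOA refresh interval means one provider is serving stale data and the second constraint is being violated.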
Secondary DNS Configuration
The standard architecture is a primary-secondary setup. Your primary zone is authoritative; secondary providers receive zone transfers and serve the data.
Option 1: Hidden Primary + Multiple Public Secondaries
Hidden Primary (your server, unicast)
|
+-- IXFR/AXFR --> Provider A (public NS)
+-- IXFR/AXFR --> Provider B (public NS)
+-- IXFR/AXFR --> Provider C (optional third provider)
The hidden primary is your source of truth. It notifies secondaries on every change. Secondaries transfer the zone and serve it publicly.
If your primary goes down, secondaries continue serving the last-transferred zone data until the zone's SOA expire interval runs out (commonly 604800 seconds / 7 days). That's your window to recover.
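As a minimal sketch, the hidden-primary side of this in BIND 9 might look like the fragment below; the secondary addresses are illustrative placeholders.

```conf
# named.conf on the hidden primary (secondary IPs are illustrative)
options {
    notify explicit;                    # notify only the listed secondaries,
                                        # not every server in the NS RRset
    also-notify { 198.51.100.10; 203.0.113.20; };
    allow-transfer { 198.51.100.10; 203.0.113.20; };
};

zone "example.com" {
    type primary;
    file "/etc/bind/example.com.db";
};
```

`notify explicit` is what keeps the primary hidden: only the listed secondaries learn about changes, and the primary itself never appears in the public NS delegation.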
Option 2: API-Based Multi-Provider Sync
If your primary DNS is managed through a provider's API (Route 53, Cloudflare, NS1), you can use a tool like octodns to push zone changes to multiple providers simultaneously:
# octodns config
providers:
  route53:
    class: octodns_route53.Route53Provider
    access_key_id: env/AWS_ACCESS_KEY_ID
    secret_access_key: env/AWS_SECRET_ACCESS_KEY
  cloudflare:
    class: octodns_cloudflare.CloudflareProvider
    token: env/CF_API_TOKEN
zones:
  example.com.:
    sources:
      - route53
    targets:
      - cloudflare
Running octodns-sync pushes changes from Route 53 to Cloudflare. Both providers serve the zone; if either goes down, the other handles the traffic.
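The CLI entry point is octodns-sync; without --doit it only prints a plan. A hedged sketch, assuming the config above is saved as config.yaml and the provider modules are installed (pip install octodns octodns-route53 octodns-cloudflare):

```shell
# Sketch: plan, then apply, zone changes to every configured target.
if command -v octodns-sync >/dev/null 2>&1; then
  octodns-sync --config-file config.yaml           # dry run: review the plan
  octodns-sync --config-file config.yaml --doit    # apply the changes
  status="synced"
else
  status="octodns-sync not installed"
fi
echo "$status"
```

Reviewing the dry-run plan before --doit is the main safety net here: it shows exactly which records will be created, updated, or deleted at each provider.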
TTL Strategy for Planned Maintenance
Before any planned maintenance that involves DNS changes or your DNS provider:
72 hours before: Lower TTLs on critical records to 300 seconds (5 minutes). If the records currently carry a 3600 or 86400 TTL, this lead time lets cached copies holding the old, longer TTL expire, so resolvers are honoring the 300-second value before the change window opens.
Change window: Make the DNS change. Wait 5–10 minutes. Verify globally.
After stabilization: Raise TTLs back to operational values.
Before planned provider maintenance: If you know provider-a is doing maintenance Friday night, shift all DNS traffic to provider-b by removing provider-a's NS records from the delegation temporarily.
This is an operational discipline that requires calendar reminders, not DNS automation. Most DNS incidents during maintenance are caused by someone skipping the TTL-lowering step.
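The required lead time follows directly from the largest TTL in play: a resolver can keep a record for up to its full old TTL, so the lowering step has to happen at least that long before the window. A back-of-the-envelope check (values illustrative):

```shell
# Sketch: resolvers may cache a record for up to its full old TTL, so
# lower TTLs at least one old-TTL (plus margin) before the change window.
old_ttl=86400    # current TTL on the record, in seconds (1 day)
margin=7200      # safety margin for stragglers
lead=$((old_ttl + margin))
echo "lower TTLs at least $((lead / 3600)) hours before the change window"
```

For a 1-day TTL this lands at 26 hours; the 72-hour guidance simply adds generous slack and survives records you forgot had long TTLs.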
RTO/RPO for DNS
Recovery Time Objective (RTO): how quickly can you restore DNS service after a failure?
Recovery Point Objective (RPO): how much data (zone changes) can you lose?
For multi-provider DNS with current zone transfers:
- RTO: effectively zero if a secondary provider is already serving the zone; resolvers retry the surviving nameservers within seconds, and anycast rerouting within a provider converges via BGP in roughly 0–90 seconds
- RPO: Time since last successful zone transfer (seconds to minutes with IXFR and NOTIFY)
For single-provider DNS with no DR:
- RTO: Duration of the provider's outage (hours, potentially)
- RPO: Irrelevant — you can't make changes while the provider is down anyway
The goal is RTO under 5 minutes and RPO under 15 minutes. Multi-provider with automated sync achieves this.
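Measuring your actual RPO is simple arithmetic against the transfer logs. The timestamps below are illustrative epoch seconds; in practice you would read the last successful IXFR/AXFR time from your monitoring.

```shell
# Sketch: RPO for a secondary is "now minus last successful zone transfer".
last_transfer=1718445600   # epoch seconds of last IXFR (illustrative)
now=1718446200
rpo=$((now - last_transfer))
echo "current RPO: ${rpo}s"
if [ "$rpo" -gt 900 ]; then
  echo "ALERT: 900s RPO objective exceeded"
fi
```

Wiring this into monitoring turns the RPO objective from a slide-deck number into an alert that fires when transfers silently stop.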
The Registrar Lock as a DR Control
One attack vector that's underappreciated: registrar-level changes. If an attacker gains access to your registrar account, they can change your NS records to point to their own servers. This is not a DNS provider problem — it's a registrar problem.
Controls:
- Registry Lock: Locks the domain at the registry level, preventing NS changes without an out-of-band verification step. This is stronger than the standard registrar-level Transfer Lock or Domain Lock, which mainly blocks transfers. Available from most registrars for an additional fee. Required for high-value domains.
- Two-factor authentication on your registrar account: Non-negotiable.
- Separate accounts for DNS management vs. billing: Limit blast radius if credentials are compromised.
- Alert on registrar changes: Some registrars offer webhooks or email alerts on NS record changes. Subscribe to them.
Registry Lock stops even your own team from making accidental NS changes without a deliberate unlock process. For a production domain, the friction is worth it.
DNS Outage Runbook
Print this. Put it somewhere you can find it when you can't access anything because DNS is down.
Step 1: Confirm the problem is DNS
# Can you reach the authoritative server directly?
dig @ns1.yourprovider.com yourdomain.com A +norecurse
# What are public resolvers returning?
dig @8.8.8.8 yourdomain.com A
dig @1.1.1.1 yourdomain.com A
# Check the delegation
dig +trace yourdomain.com A
Step 2: Identify the scope
Is it:
- Your specific domain? → zone problem or authoritative server problem
- Your entire DNS provider? → provider outage
- Only some resolvers returning bad data? → poisoning or propagation issue
- The registrar's NS records changed? → account compromise
Step 3: Mitigation options
Provider outage:
- If you have a secondary provider already serving the zone: traffic reroutes automatically within BGP convergence time. Monitor and wait.
- If you don't have a secondary: add emergency NS records pointing to a backup provider (Cloudflare, Route 53 free tier) and push your zone manually. Time to restore depends on how fast you can get the zone onto the backup provider and how quickly the new NS records propagate. With 300s TTL, 5–10 minutes. With 86400s TTL, up to 24 hours.
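If any authoritative server is still answering, you can pull a copy of the zone to seed the backup provider. This is a sketch using the placeholder hostnames from the runbook; AXFR is often restricted by source IP, so it may be refused.

```shell
# Sketch: pull a full zone copy (AXFR) from a still-reachable authoritative
# server, for import into the backup provider. AXFR must be permitted
# from your source IP; many providers restrict or disable it.
dig @ns1.yourprovider.com yourdomain.com AXFR > yourdomain.com.zone
# Sanity-check before importing: a complete dump starts and ends with the SOA.
wc -l yourdomain.com.zone
```

If AXFR is refused, fall back to your provider's export API or your last known-good zone file in version control; this is exactly why the zone source of truth should live outside any single provider.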
Authoritative server down:
- Restart the service: systemctl restart named (BIND) or systemctl restart pdns (PowerDNS)
- Check zone file integrity: named-checkzone example.com /etc/bind/example.com.db
- If the server is unreachable: fail over to the secondary provider
Registrar compromise:
- Contact registrar's emergency line immediately
- Initiate account recovery
- Request emergency NS revert
- Contact registry if necessary (for Registry Lock cases)
Step 4: Communication
DNS outages affect everything. Communicate early:
- Notify internal stakeholders within 15 minutes of confirmed DNS outage
- Update status page (if you can — note the irony that your status page might be down too if it's on the affected domain)
- Document the timeline in real-time
Key Takeaways
- Single DNS provider = single point of failure. Multi-provider is not optional for production domains.
- The hidden primary + multiple secondaries architecture gives you provider independence with a consistent zone source.
- octodns is a widely used tool for API-based multi-provider sync.
- Lower TTLs before maintenance. Every time. Without exception.
- Registry Lock on production domains prevents NS changes without out-of-band verification.
- Your DNS runbook should be accessible without DNS. Print it.
Further Reading
- octodns documentation
- ICANN Registry Lock overview
- Dyn 2016 post-mortem analysis
- RFC 1034 — SOA expire field semantics
- Managing Mission-Critical Domains and DNS — Chapters on registrar and registry relationships, zone redundancy architecture, and continuity planning for production domains.
- Michael Dooley & Timothy Rooney, DNS Security Management (2017) — Covers threat modeling for DNS infrastructure and mitigation architecture that overlaps with DR planning.
Up Next
DNS Performance Metrics and Benchmarking — how to measure what "fast" means for DNS, and how to prove it before production does.