Module 4 · Lesson 5

DNS Scalability: Handling High-Volume Traffic

50 minutes

What high-volume actually means, where the bottlenecks are, and how to scale past them without reinventing what Cloudflare already solved.

Cloudflare processes roughly one trillion DNS queries per day. That's about 11.5 million queries per second. Your operation probably isn't at that scale, but understanding where the bottlenecks are — and when you'll hit them — is part of designing infrastructure that doesn't fall over at 3x your current load.

What "High Volume" Actually Means

Scale is relative. Here's a rough taxonomy:

Scale        QPS range           Typical setup
Small        < 1,000             Single resolver, single auth server
Medium       1,000–50,000        Multiple resolvers, managed DNS
Large        50,000–1,000,000    Anycast auth, distributed resolvers, dedicated hardware
Hyperscale   > 1M                Custom DPDK/XDP stack, specialized hardware

For most organizations running their own infrastructure, "high volume" becomes a problem somewhere in the 10,000–100,000 QPS range, depending on the hardware.

A modern x86 server running Unbound or PowerDNS Recursor can handle 50,000–150,000 QPS with a warm cache. Authoritative servers (PowerDNS Authoritative, BIND, NSD) handle 100,000–500,000 QPS for simple A/AAAA lookups when zone data is served from memory or a fast local backend such as LMDB; a round trip to PostgreSQL on every query caps throughput well below that.
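The gap between a warm and cold cache is worth quantifying. This is a back-of-the-envelope sketch, not a benchmark — the per-query CPU costs below are illustrative assumptions:

```python
def effective_qps(hit_ratio: float, hit_cost_us: float = 5.0,
                  miss_cost_us: float = 200.0) -> float:
    """Sustainable queries/second for one CPU core, given assumed CPU
    microseconds per cache hit and per cache miss (a miss includes
    upstream resolution work, so it costs far more)."""
    avg_cost_us = hit_ratio * hit_cost_us + (1 - hit_ratio) * miss_cost_us
    return 1_000_000 / avg_cost_us

# A warm cache (90% hits) vs. a cold cache (30% hits):
print(f"warm: {effective_qps(0.90):,.0f} QPS/core")  # ~40,800
print(f"cold: {effective_qps(0.30):,.0f} QPS/core")  # ~7,100
```

The point: the same hardware delivers roughly 6x the throughput once the cache warms up, which is why benchmarks quote "warm cache" numbers.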

Where the Actual Bottlenecks Are

Before you scale out, identify where you're actually constrained.

CPU: DNS processing is CPU-intensive for new queries (DNSSEC validation especially). For cache hits, it's cheap. If CPU is saturating before your network bandwidth, you need more CPUs or smarter caching.

Network I/O: Each DNS query is small (typically 40–512 bytes), but at high volume, packet rate matters more than bandwidth. A 1Gbps NIC tops out at roughly 1.48 million packets per second for minimum-size frames, and only about 235,000 PPS at 512 bytes — and regardless of link speed, the kernel's per-packet interrupt and processing overhead saturates software networking stacks around 200,000–500,000 PPS. Above that, you need SR-IOV, DPDK, or XDP.
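The packet-rate ceiling above is simple arithmetic. This sketch computes the theoretical PPS for a link, accounting for the 20 bytes of preamble and inter-frame gap Ethernet spends on every frame:

```python
def max_pps(link_bps: float, frame_bytes: int) -> float:
    """Theoretical packets/second on an Ethernet link.

    frame_bytes is the full Ethernet frame (header + payload + FCS);
    each frame also costs 20 bytes on the wire (8-byte preamble +
    12-byte inter-frame gap)."""
    wire_bits = (frame_bytes + 20) * 8
    return link_bps / wire_bits

GBPS = 1_000_000_000
print(f"64-byte frames:  {max_pps(GBPS, 64):,.0f} pps")   # ~1.49M
print(f"512-byte frames: {max_pps(GBPS, 512):,.0f} pps")  # ~235k
```

Note that both figures exceed what a stock kernel networking stack can actually process per core, which is why the practical ceiling is the kernel, not the NIC.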

Backend latency (for authoritative): If your authoritative server queries a database for every lookup, that database becomes the bottleneck. Solution: in-memory zone storage (NSD, BIND with zone loaded in memory) or aggressive query caching in the DNS daemon.

TCP connection overhead: DNS over TCP (for large responses, AXFR, or DoT) is significantly more expensive than UDP. If you're serving large DNSSEC-signed responses, zone transfers, or DoT at scale, TCP connection handling becomes a bottleneck before query processing does.

Horizontal Scaling of Resolvers

Recursive resolvers scale horizontally by adding more nodes. Each node maintains its own cache (no shared cache required — cache coherence at DNS scale is not worth the complexity).

Load balancing options:

Anycast (best for inter-datacenter): Each PoP runs its own resolvers. BGP routes clients to the nearest PoP. Covered in Lesson 02.

DNS round-robin (simple, but not smart): Publish multiple resolver IPs via DHCP or list them in /etc/resolv.conf. Distribution is crude: most stub resolvers try the servers in listed order and only move to the next after a timeout, so load skews toward the first entry. There is no health checking — if a resolver goes down, clients pointed at it stall until their stub resolver fails over.

dnsdist (recommended for intra-datacenter): dnsdist is a DNS load balancer and proxy from PowerDNS. It does health checking, connection tracking, and can distribute queries across a pool of resolvers with multiple algorithms (round-robin, weighted random, hashed by query name for cache affinity):

-- dnsdist.conf
newServer({address="192.168.1.10:53", pool="recursive"})
newServer({address="192.168.1.11:53", pool="recursive"})
newServer({address="192.168.1.12:53", pool="recursive"})

setServFailWhenNoServer(true)
setMaxTCPClientThreads(1000)

-- Route everything to the recursive pool
addAction(AllRule(), PoolAction("recursive"))
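The "hashed by query name" policy mentioned above buys cache affinity: every query for a given name lands on the same backend, so that backend's cache stays hot for that name instead of the entry being duplicated (and going cold) across all nodes. A minimal Python sketch of the idea — dnsdist's own hashed policies use their own hash function, so this is illustrative only:

```python
import hashlib

# Hypothetical backend pool mirroring the dnsdist example above.
BACKENDS = ["192.168.1.10", "192.168.1.11", "192.168.1.12"]

def pick_backend(qname: str) -> str:
    """Map a query name to a backend by hashing it, so identical names
    always hit the same resolver. Lowercased first, since DNS names
    are case-insensitive."""
    digest = hashlib.sha1(qname.lower().encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    return BACKENDS[h % len(BACKENDS)]

# The same name always maps to the same backend, regardless of case:
assert pick_backend("example.com") == pick_backend("EXAMPLE.COM")
```

The trade-off: a hot name concentrates all its load on one backend. Round-robin spreads load evenly but triples cache memory for the same effective hit rate.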

Rate Limiting

At scale, you will see abuse: DNS amplification attempts, query floods from misconfigured clients, recursive queries from unauthorized networks.

Rate limiting in dnsdist:

-- Block clients querying more than 50 qps
addAction(MaxQPSIPRule(50, 32, 48), DropAction())

-- Per-subnet rate limit for recursive queries
addAction(MaxQPSIPRule(1000, 24, 32), SetNoRecurseAction())

PowerDNS Recursor itself leans on dnsdist in front for per-client rate limiting; its max-qperq setting caps the number of outgoing queries a single client query may spawn, which bounds one class of abuse at the resolver level.

For authoritative servers, Response Rate Limiting (RRL) is built into BIND, NSD, and PowerDNS Authoritative:

# BIND named.conf
rate-limit {
    responses-per-second 15;
    window 5;
};

RRL truncates or drops responses to clients that exceed the threshold, reducing amplification potential.
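At heart, RRL is per-client token-bucket accounting. A minimal sketch of the mechanism — not BIND's implementation, which keys on (client subnet, response type) and can answer with a truncated "slip" response instead of silently dropping:

```python
import time

class TokenBucket:
    """Allow `rate` responses/second with bursts up to `burst`.
    Simplified RRL-style accounting for one client."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens for the time elapsed, capped at the burst size.
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=15, burst=15)   # mirrors responses-per-second 15
allowed = sum(bucket.allow() for _ in range(100))
print(f"{allowed} of 100 back-to-back responses allowed")
```

A flood of 100 instantaneous queries gets 15 answers; the rest are suppressed until tokens refill at 15/second.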

The Hidden Primary Pattern for Authoritative Scale

For authoritative DNS at scale, the architecture is:

Hidden Primary (unicast, not public)
    |
    |-- IXFR/AXFR --> Anycast Node 1 (PoP: Frankfurt)
    |-- IXFR/AXFR --> Anycast Node 2 (PoP: Singapore)
    |-- IXFR/AXFR --> Anycast Node 3 (PoP: Virginia)
    |-- IXFR/AXFR --> Anycast Node 4 (PoP: São Paulo)

The hidden primary:

  • Is the only place zone changes are made
  • Notifies all secondaries immediately on change (NOTIFY message per RFC 1996)
  • Is never in public NS records
  • Should be behind a firewall allowing only zone transfer connections from known secondary IPs

Zone transfer protocol choice:

AXFR (full transfer): Transfers the entire zone every time. Safe, simple, works for small zones. Inefficient for large zones (millions of records) where only a few records changed.

IXFR (incremental transfer): Transfers only the changes since the secondary's last serial. Efficient for large zones. Requires the primary to keep a journal of changes; if the secondary's serial is older than the journal covers, the transfer falls back to AXFR.
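IXFR, NOTIFY-driven refresh, and "is the secondary behind?" checks all hinge on comparing SOA serials, which wrap around under RFC 1982 serial-number arithmetic rather than plain integer comparison. A sketch, assuming standard 32-bit serials:

```python
def serial_newer(a: int, b: int) -> bool:
    """True if serial a is newer than serial b under RFC 1982
    serial-number arithmetic (32-bit, wraps around)."""
    return 0 < ((a - b) & 0xFFFFFFFF) < 0x80000000

# Normal case: a date-based serial bumped by one is newer.
assert serial_newer(2024010102, 2024010101)
# Wraparound: a small serial is "newer" than one near the 32-bit ceiling.
assert serial_newer(5, 4294967290)
```

This is why a secondary that sees serial 5 after 4294967290 correctly treats the zone as updated instead of stale — and why manually winding a serial backwards by more than 2^31 breaks transfers.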

For BIND:

# Primary named.conf
zone "example.com" {
    type primary;
    file "example.com.db";
    notify yes;
    also-notify { 192.0.2.10; 192.0.2.11; };
    allow-transfer { 192.0.2.10; 192.0.2.11; };
};

For a zone with 500,000 records and frequent changes, configure IXFR to avoid full zone retransfers on every minor update.
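One way to do that in BIND — assuming the zone file is edited and reloaded rather than updated dynamically — is ixfr-from-differences, which has named compute and journal the diff on reload so secondaries can pull IXFR instead of the full zone:

```
# Primary named.conf — journal diffs on reload so secondaries can use IXFR
zone "example.com" {
    type primary;
    file "example.com.db";
    ixfr-from-differences yes;
    notify yes;
    also-notify { 192.0.2.10; 192.0.2.11; };
    allow-transfer { 192.0.2.10; 192.0.2.11; };
};
```

Zones maintained via dynamic update (nsupdate) get a journal automatically and support IXFR without this option.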

Capacity Planning

A simple capacity planning formula for recursive resolvers:

Required resolvers = (peak QPS) / (QPS per resolver) * (1 + headroom factor)

If your peak is 30,000 QPS, each resolver handles 50,000 QPS, and you want 50% headroom:

30,000 / 50,000 * 1.5 = 0.9 → rounds up to 1 resolver for load alone

In practice, always run at least 3 for redundancy (N+2 for DNS — two can go down and you're still serving). Size for peak, not average.
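The formula plus the N+2 rule fits in a few lines. A sketch, reading "N+2" as nodes needed for peak load (with headroom) plus two spares, so two can fail simultaneously while the rest still serve peak:

```python
import math

def resolvers_needed(peak_qps: float, qps_per_node: float,
                     headroom: float = 0.5, spares: int = 2) -> int:
    """Resolver count: size for peak QPS plus headroom, then add
    N+2 spares so two nodes can fail and capacity still covers peak."""
    for_load = math.ceil(peak_qps / qps_per_node * (1 + headroom))
    return for_load + spares

print(resolvers_needed(30_000, 50_000))    # 3: 1 for load + 2 spares
print(resolvers_needed(400_000, 50_000))   # 14: 12 for load + 2 spares
```

For the worked example above this lands on 3 nodes, matching the "at least 3" floor; at higher loads the spares stay constant while the load-sized fleet grows.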

For authoritative servers, the calculation is simpler because there's no cache — every query hits the data store. Benchmark with dnsperf (covered in Lesson 07) against your actual zone data before deployment.


Key Takeaways

  • A single modern server handles 50,000–500,000 QPS depending on role (recursive vs authoritative) and query mix.
  • The bottlenecks are CPU (DNSSEC), packet rate (kernel networking), and backend latency (authoritative with database backend).
  • dnsdist is the standard DNS load balancer for intra-datacenter distribution. Use it.
  • IXFR over AXFR for large zones with frequent changes. Always.
  • Size for N+2 redundancy, not N+1. Two resolvers going down simultaneously is a real scenario.

Up Next

Disaster Recovery and Business Continuity for DNS — what happens when your DNS provider goes offline, and how to not be the next Dyn story.