Module 4 · Lesson 1

DNS Caching Strategies and Optimizations

45 minutes

TTL is the most important lever in DNS performance. Most people set it once and forget it. Here's how to use it deliberately.

At 2am during a botched migration, someone is going to ask why the old IP is still resolving even though you updated the record an hour ago. The answer is TTL. The follow-up question — "why was it set to 86400?" — is the one that matters.

TTL (Time To Live) controls how long resolvers cache your DNS records. It's not a fire-and-forget setting. It's an operational dial with real consequences in both directions.

The Two Failure Modes

Too low: Every query hits your authoritative servers far more often. At a 30-second TTL with 10 million concurrently active clients, worst case each client re-resolves every 30 seconds: roughly 333,000 queries per second, before any retries. You're paying for that in compute and bandwidth, and your authoritative infrastructure needs to handle it. Downstream cache hit rates collapse. This is a query storm in slow motion.
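The arithmetic behind that estimate can be sketched in shell (worst-case assumption: every client re-resolves once per TTL, with no shared resolver cache absorbing queries):

```shell
#!/bin/sh
# Worst-case authoritative load: each of N active clients re-resolves
# once per TTL period, so QPS is roughly clients / ttl.
qps_for_ttl() {
    clients=$1
    ttl=$2
    echo $(( clients / ttl ))
}

qps_for_ttl 10000000 30     # TTL 30s -> 333333 QPS
qps_for_ttl 10000000 3600   # TTL 1h  -> 2777 QPS
```

Raising the TTL from 30 seconds to an hour cuts the worst-case authoritative load by two orders of magnitude.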

Too high: You made a change. Your users are still getting the old record. Your TTL is 24 hours and you changed the record 23 hours and 55 minutes ago. Good luck.

The right TTL depends on what state you're in.

TTL by Operational Mode

Normal Operations

For records that rarely change — MX, stable A records for infrastructure, NS records — 3600 to 86400 seconds (1 to 24 hours) is reasonable. The higher the TTL, the lower your authoritative query load and the more resilient you are to authoritative server unavailability.

A common pattern:

  • NS records: 86400 (24h) or higher
  • MX records: 3600–14400 (1–4h)
  • A/AAAA for stable services: 3600 (1h)
  • CNAME for CDN endpoints: 300–600 (5–10min, since CDN IPs change)
  • TXT for SPF/DKIM: 3600 (mail servers look these up through ordinary caching resolvers, so changes take up to an hour to land)
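Put together as a zone fragment, the pattern looks like this (all hostnames and addresses are illustrative placeholders):

```
; TTLs following the pattern above; names and IPs are examples
example.com.        86400   IN  NS      ns1.example.com.
example.com.        3600    IN  MX      10 mail.example.com.
app.example.com.    3600    IN  A       203.0.113.10
cdn.example.com.    300     IN  CNAME   cdn-endpoint.example.net.
example.com.        3600    IN  TXT     "v=spf1 mx -all"
```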

Pre-Change Window

Before you make any significant DNS change, lower the TTL in advance: at least one full current-TTL period (2x for safety) before the change, so every cached copy carrying the old TTL has expired. If your record is at 3600, drop it to 300 twenty-four hours before you need to flip it. Then, when you update the record, you wait at most 5 minutes for the world to see it.

This is the step most teams forget, and then they spend three hours explaining to a VP why "the DNS hasn't propagated yet."

The workflow:

  1. T-24h: Lower TTL to 300
  2. T-0: Make the change
  3. T+5min: Verify globally (use whatsmydns.net or a distributed checker)
  4. T+1h: Raise TTL back to 3600
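Scripted against a provider API, the workflow might look like the following pseudocode (`dnsctl` is a stand-in for whatever CLI or API your DNS provider exposes, not a real tool):

```
# dnsctl is hypothetical; substitute your provider's API calls
dnsctl set-ttl app.example.com 300               # T-24h: lower TTL
# ...wait 24 hours so all caches holding the 3600s TTL expire...
dnsctl set-record app.example.com A 203.0.113.42 # T-0: make the change
# ...wait 5 minutes (the new 300s TTL)...
dig +short app.example.com @8.8.8.8              # T+5min: spot-check a public resolver
dnsctl set-ttl app.example.com 3600              # T+1h: restore normal TTL
```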

Active Incident

Your service is down. The fix is a DNS change. Your TTL is 3600 and you updated the record 10 minutes ago.

You cannot speed up cache expiry in resolvers you don't control. What you can do:

  • Change the record (already done)
  • Wait it out: the worst case is one full TTL from the moment the last resolver cached the old answer, here up to 50 more minutes
  • Ask major public resolvers to flush: Google Public DNS and Cloudflare both offer web-based cache-flush tools, which clears the stale entry for users behind them
  • Test with dig yourdomain.com @8.8.8.8 and watch the answer's TTL field count down to see when Google's cache will expire

In some cases you can update the record to a new hostname and update your application config — sidestepping the stale cache entirely.

Negative Caching: The NXDOMAIN Problem

Negative caching is defined in RFC 2308. When a resolver gets NXDOMAIN (domain does not exist), it caches that negative result; the duration is the lesser of the SOA record's own TTL and the value in the SOA minimum field.

example.com. 3600 IN SOA ns1.example.com. admin.example.com. (
    2024010101 ; serial
    3600       ; refresh
    900        ; retry
    604800     ; expire
    300        ; minimum TTL (negative cache)
)

That last value — 300 in this example — is how long resolvers cache NXDOMAIN responses.
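RFC 2308's rule, that negative responses are cached for the lesser of the SOA record's own TTL and its minimum field, works out like this:

```shell
#!/bin/sh
# RFC 2308: negative responses are cached for min(SOA TTL, SOA MINIMUM).
negative_cache_ttl() {
    soa_ttl=$1
    soa_minimum=$2
    if [ "$soa_ttl" -lt "$soa_minimum" ]; then
        echo "$soa_ttl"
    else
        echo "$soa_minimum"
    fi
}

negative_cache_ttl 3600 300   # the SOA above -> 300
```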

Why it matters: if you're deploying a new subdomain and something queries it before the record exists, the NXDOMAIN gets cached. New queries for that name return NXDOMAIN from cache for minimum TTL seconds, even after you've added the record.

Common failure mode: CI/CD pipeline queries a hostname during deployment before DNS is ready. The resolver caches NXDOMAIN. Subsequent health checks fail for 5 minutes because the negative cache hasn't expired. Your deployment pipeline reports failure. You go investigate and find everything is fine. You wasted 45 minutes.

Keep the SOA minimum at 60–300 seconds; a 3600-second negative-cache TTL is too aggressive.

Resolver Cache Sizing

If you're running your own recursive resolvers (unbound, BIND, PowerDNS Recursor), cache sizing matters.

A back-of-envelope upper bound for useful cache size (worst case: every query inserts a distinct record):

cache_size = QPS * average_TTL * average_record_size

For a resolver handling 5,000 QPS with an average TTL of 300 seconds and average record size of ~100 bytes:

5,000 * 300 * 100 = 150,000,000 bytes ≈ 143 MiB
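The same estimate as a reusable function, for plugging in your own numbers:

```shell
#!/bin/sh
# Upper-bound cache size in bytes: records alive at once (qps * avg_ttl)
# times average record size, assuming every query is for a distinct name.
cache_size_bytes() {
    qps=$1
    avg_ttl=$2
    avg_record_size=$3
    echo $(( qps * avg_ttl * avg_record_size ))
}

cache_size_bytes 5000 300 100   # -> 150000000 (about 143 MiB)
```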

In practice, unbound defaults to 4MB. For any production resolver handling meaningful traffic, set msg-cache-size and rrset-cache-size to at least 256MB each. A well-cached resolver for a medium-size organization should have 1–2GB allocated.

# unbound.conf
server:
    msg-cache-size: 512m
    rrset-cache-size: 1024m
    cache-min-ttl: 30
    cache-max-ttl: 86400

cache-min-ttl: 30 clamps absurdly low TTLs upward so those records stay cached for at least 30 seconds. If a CDN publishes a 1-second TTL, you don't want every query for that name to be a cache miss. Note that this deliberately overrides what the zone owner published, so keep the clamp small.

Pre-Warming a Resolver

When you bring up a new recursive resolver, the cache is cold. Every query is a cache miss, hitting authoritative servers. Response times are slower, authoritative load spikes.

Pre-warming strategies:

Replay DNS logs: If you have query logs from an existing resolver, replay them against the new one before cutting traffic over. Tools like dnsreplay (part of PowerDNS suite) can do this.

Staged traffic shift: Shift 5% of traffic to the new resolver, let it warm for 30 minutes, then increase. Monitor cache hit ratio as you go.

Prefetch popular records: Extract the top 1,000 queried names from your logs and pre-query them:

# Warm the cache; short timeout so one dead name doesn't stall the loop
while read -r name; do
    dig +time=2 +tries=1 "$name" @new-resolver.internal > /dev/null
done < top-queries.txt

Crude but effective.

The Math on Cache Hit Rates

A well-configured resolver in a corporate environment should achieve 70–85% cache hit rate. Public resolvers serving diverse populations (8.8.8.8, 1.1.1.1) hit 90–95%+ because of the sheer query volume.

To measure your cache hit rate with unbound:

unbound-control stats | grep -E "total.num.queries|total.num.cachehits|total.num.cachemiss"
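Those counters turn into a hit rate with a little shell arithmetic (the counter values below are made up for illustration):

```shell
#!/bin/sh
# Cache hit rate as a percentage, from cachehits/cachemiss style counters.
hit_rate_pct() {
    hits=$1
    misses=$2
    echo $(( 100 * hits / (hits + misses) ))
}

hit_rate_pct 850000 150000   # -> 85
```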

In BIND:

rndc stats && grep -E "cache hits|cache misses" /var/named/data/named_stats.txt   # path is set by statistics-file; this is the RHEL-style default

If your hit rate is below 50%, investigate:

  • Cache size too small (records being evicted)
  • Too many unique names being queried (bots, crawlers, randomized subdomains)
  • TTLs on frequently-queried records set too low
  • NXDOMAIN responses for non-existent names (could indicate malware doing DNS beaconing)

Key Takeaways

  • Lower TTL before changes, not during. The window is 2x current TTL.
  • Negative caching (NXDOMAIN) trips up deployments. Keep SOA minimum TTL at 60–300 seconds.
  • Default resolver cache sizes are too small. Allocate 256MB–2GB depending on load.
  • A corporate resolver should sit at 70–85% cache hit rate; below 50%, investigate before assuming it's normal.
  • During incidents, you cannot force cache expiry in resolvers you don't control.

Up Next

Anycast DNS: Improving Resilience and Performance — how to route queries to the nearest healthy node without any client-side configuration.