Module 4 · Lesson 7

DNS Performance Metrics and Benchmarking

45 minutes

P99 resolution latency, TTFB impact, and how to run dnsperf before production runs it for you.

There's a version of DNS performance work where you run a benchmark, look at median latency, declare it "fast enough," and move on. That version misses the point. Median latency tells you what a typical lookup costs on a good day. P99 latency tells you what the slowest 1% of lookups cost. At any meaningful scale that is a lot of lookups, and because a browser makes many lookups per page, far more than 1% of page loads run into at least one of them.

This lesson is about measuring DNS performance correctly and understanding what the numbers mean.

The Metrics That Matter

Resolution Latency Percentiles

P50 (median): Half of queries resolve faster than this. For a well-warmed local resolver serving from cache, P50 should be under 2ms. For a public resolver (8.8.8.8, 1.1.1.1), P50 for cached responses is typically 5–15ms depending on distance.

P95: 95% of queries resolve within this time. A reasonable target for a local resolver: under 20ms. For cache misses requiring recursive resolution, P95 under 200ms.

P99: 1% of queries take this long or longer. This is your "worst typical" latency. Cloudflare 1.1.1.1 publishes P99 under 10ms. For a corporate recursive resolver, P99 under 100ms for cached queries is achievable. Above 500ms suggests a configuration or infrastructure problem.
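To make the percentile definitions concrete, here is a minimal sketch that computes P50, P95, and P99 from a list of measured latencies using the nearest-rank method. The sample data and the function name `percentile` are invented for illustration; other tools may interpolate between samples and report slightly different values.

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical sample: 100 lookups, mostly cache hits plus a few slow misses.
latencies_ms = [1.2] * 90 + [8.0] * 6 + [40.0] * 3 + [350.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean: {mean_ms:.2f} ms")
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

In this made-up sample the median is 1.2 ms while P99 is 40 ms, which is the gap a median-only report would hide.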

Why P99 matters more than P50 for DNS:

A browser loading a modern webpage makes 20–100 DNS lookups (HTML, CSS, JS, images, analytics, fonts — each domain). If your P99 is 2 seconds, some users hit that on multiple sequential lookups. This compounds directly into TTFB (time to first byte) and page load time.

Google's research showed that a 400ms delay in search results reduced queries by 0.44%. For a service your users interact with repeatedly, P99 DNS latency is a measurable business metric.
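A rough way to see the compounding: if each lookup independently has a 1% chance of landing in the P99 tail, the chance that a page load hits the tail at least once grows quickly with the number of lookups. The sketch below assumes independent, uncached lookups, which is a simplification (real lookups share caches and failure causes), so treat the numbers as an illustration rather than a prediction.

```python
def prob_at_least_one_tail_hit(lookups, tail_fraction=0.01):
    """Chance that at least one of `lookups` independent queries
    falls in the slowest `tail_fraction` of the latency distribution."""
    return 1 - (1 - tail_fraction) ** lookups

for n in (20, 30, 100):
    share = prob_at_least_one_tail_hit(n)
    print(f"{n} lookups: {share:.1%} of page loads hit P99 at least once")
```

Under these assumptions, roughly 18% of page loads with 20 lookups and about 63% of page loads with 100 lookups see at least one P99-or-worse resolution.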

Authoritative vs Recursive Latency

Measure both separately — they have different bottlenecks.

Authoritative latency: Time for an authoritative nameserver to answer a query from a resolver. Typically 1–10ms for in-memory zones. Rises with DNSSEC signing overhead (add 2–5ms) and database-backed zones (add backend query latency).

Recursive latency: Time for a recursive resolver to answer a client query. For cached responses: 1–5ms. For cache misses: sum of all authoritative queries needed in the resolution chain, plus network RTTs. A cache miss for a deeply delegated zone might chain 3–4 authoritative queries, each adding RTT.

Measuring them separately tells you whether your resolver is the problem or the authoritative infrastructure is.
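One way to take the two measurements separately is to time the same question against each server type: with recursion desired (RD=1) against the recursive resolver, and with RD=0 directly against the authoritative server. The sketch below uses only the standard library and builds a minimal query by hand; the function names are hypothetical, it does not check that the reply ID matches the query, and a dedicated tool will give you better statistics than single-shot timings.

```python
import random
import socket
import struct
import time

def build_query(name, qtype=1, recursion_desired=True):
    """Build a minimal DNS query packet (RFC 1035 wire format).
    qtype 1 is an A record; the class is always IN."""
    query_id = random.randint(0, 0xFFFF)
    flags = 0x0100 if recursion_desired else 0x0000  # only the RD bit is set
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    labels = [label for label in name.rstrip(".").split(".") if label]
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)

def time_query_ms(server, name, recursion_desired=True, timeout=2.0):
    """Send one UDP query to `server` and return the round trip in ms.
    Raises socket.timeout if no reply arrives within `timeout` seconds."""
    packet = build_query(name, recursion_desired=recursion_desired)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        sock.sendto(packet, (server, 53))
        sock.recvfrom(4096)  # reply content is ignored; only timing matters here
        return (time.perf_counter() - start) * 1000
```

For example, `time_query_ms("127.0.0.1", "example.com")` would time your local resolver, while `time_query_ms("192.0.2.53", "example.com", recursion_desired=False)` would time an authoritative server directly (192.0.2.53 is a placeholder documentation address, not a real nameserver). Repeat each call many times and compare percentiles rather than single values.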

TTFB Impact

Time to First Byte is the sum of TCP handshake + TLS handshake + server processing + first byte received. DNS resolution happens before this chain starts.

For a page load:

  1. DNS lookup: X ms
  2. TCP connect: ~1 RTT
  3. TLS handshake: ~1 RTT (TLS 1.3) or ~2 RTT (TLS 1.2)
  4. HTTP request: ~1 RTT plus server processing

If your DNS P99 is 500ms and your server P99 TTFB is 200ms, the DNS is the bottleneck. On subsequent requests to the same domain, DNS is cached and TTFB dominates. But the first request per domain — the first time a user visits, the first load of a third-party script — is DNS-dominated.
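The arithmetic behind that claim can be sketched as a simple budget for the first request on a cold connection. All the numbers below are hypothetical, and the model assumes a full TLS 1.3 handshake (one round trip) with no connection reuse or session resumption.

```python
def first_request_ms(dns_ms, rtt_ms, server_ms, tls_round_trips=1.0):
    """Time until the first response byte on a cold connection, DNS included."""
    tcp_ms = rtt_ms                      # SYN / SYN-ACK before data can flow
    tls_ms = tls_round_trips * rtt_ms    # assumption: 1 RTT (TLS 1.3); use 2 for TLS 1.2
    http_ms = rtt_ms + server_ms         # request out, processing, first byte back
    return dns_ms + tcp_ms + tls_ms + http_ms

# Hypothetical user: 40 ms RTT to the server, 200 ms server processing.
for label, dns_ms in (("typical DNS", 10), ("P99 DNS", 500)):
    total = first_request_ms(dns_ms=dns_ms, rtt_ms=40, server_ms=200)
    print(f"{label}: {total:.0f} ms total, DNS is {dns_ms / total:.0%} of it")
```

With these inputs a 10 ms lookup is a rounding error in a 330 ms request, while a 500 ms lookup accounts for most of an 820 ms one.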

Benchmarking Tools

dnsperf

dnsperf, originally developed by Nominum and now maintained by DNS-OARC, is the standard tool for DNS performance testing. It sends queries from a file, measures response times, and reports statistics.

Install:

apt install dnsperf  # Ubuntu/Debian
brew install dnsperf  # macOS

Create a query file (one query per line):

example.com A
www.example.com A
mail.example.com MX
example.com AAAA

Run a benchmark:

dnsperf -s 127.0.0.1 -d queries.txt -c 10 -l 30

Options:

  • -s target server
  • -d query data file
  • -c number of clients (concurrent connections)
  • -l run length limit in seconds (-t sets the per-query timeout, not the test duration)
  • -Q maximum QPS rate (useful for load testing without hammering)

Sample output:

DNS Performance Testing Tool
Version 2.11.2

[Status] Command line: dnsperf -s 127.0.0.1 -d queries.txt -c 10 -l 30
[Status] Sending queries (to 127.0.0.1:53)

Statistics:

  Queries sent:         450123
  Queries completed:    450119 (100.00%)
  Queries lost:         4 (0.00%)

  Response codes:       NOERROR 449834 (99.94%), NXDOMAIN 285 (0.06%)

  Average QPS:          15004.3 qps
  Average latency:      0.664 ms
  Latency std deviation: 1.234 ms
  Min latency:          0.044 ms
  Max latency:          45.234 ms

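This summary doesn't give you P99 directly. One option is to run with the -v (verbose) flag, which logs each query as it completes, and compute percentiles from the per-query latencies yourself. resperf, covered next, is a separate load-ramping tool rather than a post-processor for dnsperf output.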
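If you capture dnsperf's verbose output (-v) to a file, percentiles are a few lines of post-processing. The sketch below assumes each completed query is logged as a line starting with `>` whose last field is the latency in seconds (for example `> NOERROR example.com A 0.000664`). That format is an assumption, so check it against the output of your dnsperf version before relying on this. The sample lines are invented.

```python
import statistics

def latencies_from_verbose_log(lines):
    """Extract per-query latency in ms from verbose log lines.
    Assumed format: '> RCODE qname qtype seconds'. Other lines are skipped."""
    latencies_ms = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == ">":
            try:
                latencies_ms.append(float(fields[-1]) * 1000)
            except ValueError:
                continue  # last field was not a number; ignore the line
    return latencies_ms

def summarize(latencies_ms):
    """P50/P95/P99 using the standard library's interpolating quantiles."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Invented sample; in practice you would read the lines from the captured log file.
sample_log = [
    "> NOERROR example.com A 0.000664",
    "> NOERROR www.example.com A 0.000702",
    "[Status] Sending queries (to 127.0.0.1:53)",
    "> NXDOMAIN nope.example.com A 0.012500",
    "> NOERROR mail.example.com MX 0.045234",
]
print(summarize(latencies_from_verbose_log(sample_log)))
```

With only a handful of samples the percentiles are not meaningful; a real run should feed in tens of thousands of queries.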

resperf

resperf tests recursive resolver performance under increasing load, measuring latency at each load level. It ramps up QPS and shows you where latency starts to degrade.

resperf -s 127.0.0.1 -d queries.txt -r 60 -m 50000

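-r sets the ramp-up time in seconds and -m the maximum QPS. resperf writes its measurements to a plot data file (whitespace-separated columns intended for gnuplot; the file name can be set with -P). Plotting latency against QPS shows you the inflection point where your resolver starts struggling.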
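Once you have latency at each load level, finding the inflection point can be automated. The sketch below takes (QPS, latency) pairs, however you loaded them from resperf's output, and reports the first load level where latency exceeds a multiple of the low-load baseline. The 2x factor, the three-point baseline, and the data are all arbitrary choices for illustration.

```python
def find_knee(points, factor=2.0, baseline_points=3):
    """Return the first (qps, latency_ms) pair whose latency exceeds
    `factor` times the average of the first `baseline_points` samples.
    `points` must be sorted by increasing QPS. Returns None if no knee."""
    if len(points) <= baseline_points:
        raise ValueError("need more samples than baseline_points")
    baseline = sum(lat for _, lat in points[:baseline_points]) / baseline_points
    for qps, latency_ms in points[baseline_points:]:
        if latency_ms > factor * baseline:
            return qps, latency_ms
    return None

# Invented ramp: latency stays flat, then climbs as the resolver saturates.
ramp = [(5000, 1.1), (10000, 1.0), (15000, 1.2), (20000, 1.3),
        (25000, 1.6), (30000, 2.9), (35000, 9.5)]
print(find_knee(ramp))
```

In this invented ramp the knee lands at 30,000 QPS, so a sensible capacity target would sit comfortably below that.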

flamethrower

flamethrower is a newer DNS benchmarking tool with better multithread support and more output formats:

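# Install (flamethrower is a C++ project, so it is built with CMake rather than cargo;
# install the build dependencies listed in its README first)
git clone https://github.com/DNS-OARC/flamethrower.git
cd flamethrower && mkdir build && cd build
cmake .. && make

# Run: cap at 10,000 QPS for 30 seconds, reading queries from a file
flame -Q 10000 -l 30 -g file -f queries.txt 127.0.0.1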

It outputs percentile statistics (P50/P95/P99) directly.

Measuring from Multiple Locations

A benchmark from localhost measures the server's raw processing speed, not the latency clients experience. For a more realistic measurement, run dnsperf from a client machine that matches your user geography.

For external DNS (public authoritative servers), use synthetic monitoring tools:

  • DNSperf.com — global measurement from 200+ locations, publishes P50/P95/P99 for major providers
  • DNS Benchmark (Windows, GRC) — tests local resolver and public resolvers
  • dig from multiple locations (use public jump hosts or a cloud instance per region)

What Good Looks Like

Reference benchmarks from published data:

Provider                           P50 (ms)    P99 (ms)    Source
Cloudflare 1.1.1.1                 2–5         < 10        Cloudflare published
Google 8.8.8.8                     5–15        < 30        DNSperf.com
OpenDNS 208.67.222.222             10–25       < 50        DNSperf.com
Self-hosted (warm cache, local)    1–3         < 15        Typical
Self-hosted (cache miss)           50–200      < 500       Typical

What bad looks like and why:

P99 > 1 second: Resolver is saturated, backend queries are timing out, or network path has packet loss. Investigate SERVFAIL rate and upstream resolver health.

P50 > 50ms for cached queries: Cache is too small and evicting frequently, or resolver is under heavy CPU load from DNSSEC validation.

High variance (P99 more than 10x P50): Suggests cache misses are very expensive. Either the upstream authoritative servers are slow, or queries are hitting NXDOMAIN and the negative answers are not being cached.
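These rules of thumb are easy to encode as an automated check on your benchmark results. The sketch below is one possible encoding of the three thresholds above; the function name and messages are made up, and the thresholds should be tuned to your own baseline rather than taken as fixed limits.

```python
def diagnose(p50_ms, p99_ms, cached=True):
    """Return a list of warnings based on the rule-of-thumb thresholds."""
    findings = []
    if p99_ms > 1000:
        findings.append("P99 over 1 s: check saturation, upstream timeouts, packet loss")
    if cached and p50_ms > 50:
        findings.append("cached P50 over 50 ms: check cache size, eviction, CPU load")
    if p50_ms > 0 and p99_ms > 10 * p50_ms:
        findings.append("P99 more than 10x P50: check slow upstreams, negative caching")
    return findings

# Hypothetical results from three benchmark runs.
for p50, p99 in ((2, 12), (2, 300), (60, 1500)):
    print(f"P50={p50} ms, P99={p99} ms -> {diagnose(p50, p99) or 'looks healthy'}")
```

A check like this fits naturally at the end of a benchmark script, so a regression fails the run instead of waiting for someone to read the numbers.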


Key Takeaways

  • Measure P99, not just P50. Median latency hides the tail behavior that affects user experience.
  • Cloudflare P99 under 10ms is the current benchmark for public resolver performance. Match it for your internal resolvers on cached queries.
  • dnsperf for raw throughput, resperf for load curve analysis, flamethrower for P-percentile output.
  • DNS latency directly adds to TTFB on the first request to each domain. At 30 domains per page, P99 DNS latency compounds.
  • Always benchmark before deploying infrastructure changes. Production should not be your benchmark environment.

Up Next

Case Studies: DNS Failures and Lessons Learned — four real incidents, what went wrong, and what changed afterward.