Module 4 · Lesson 3

DNS Monitoring and Logging Best Practices

55 minutes

A Grafana dashboard with no thresholds is just a pretty graph. Here's what to measure, what numbers mean something's wrong, and what to do about it.

Most teams have DNS monitoring. Most of those teams have a dashboard that someone set up three years ago with no alert thresholds, that nobody looks at until the incident is already in progress.

This lesson is about building monitoring that actually works: specific metrics, specific thresholds, and enough log data to answer "when did this start?" during a 3am incident.

The Metrics That Actually Matter

Not all DNS metrics are created equal. Here's what to track and why.

Query Rate (QPS)

What it is: Queries per second, ideally broken down by query type (A, AAAA, MX, TXT, etc.) and by resolver or authoritative server.

Why it matters: A sudden spike in QPS can mean a DDoS, a misconfigured application making recursive queries in a loop, or a bot network warming up. A sudden drop can mean your resolver is unreachable or clients have stopped routing to it.

Alert threshold: Alert on deviation from baseline. If your normal is 1,000 QPS and you're seeing 10,000 QPS or 100 QPS, something is wrong. A good rule is: alert when QPS exceeds 3x the 30-minute rolling average or drops below 20% of baseline.
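That deviation rule is simple enough to express directly. A minimal Python sketch (function and variable names are illustrative, not from any monitoring product):

```python
from statistics import mean

def qps_alert(current_qps, window):
    """Apply the rule above: alert when current QPS exceeds 3x the
    rolling average of the window, or falls below 20% of it.
    `window` holds recent per-interval QPS samples (~30 minutes)."""
    baseline = mean(window)
    if current_qps > 3 * baseline:
        return "spike"   # possible DDoS, query loop, or bot ramp-up
    if current_qps < 0.2 * baseline:
        return "drop"    # resolver unreachable or clients rerouted
    return None          # within normal bounds

history = [980, 1010, 995, 1005, 1000]   # normal is ~1,000 QPS
print(qps_alert(10_000, history))  # spike
print(qps_alert(100, history))     # drop
```

In production the rolling window would come from your metrics store rather than an in-memory list, but the comparison is the same.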

NXDOMAIN Rate

What it is: Percentage of queries returning NXDOMAIN (non-existent domain).

Why it matters: Normal NXDOMAIN rate for a corporate resolver is 2–8%. Above 15% usually means one of: malware doing DNS beaconing to randomly-generated domains (DGA), a misconfigured application querying non-existent names, or a botnet. Above 40% is almost certainly malicious activity.

Alert threshold: Alert when NXDOMAIN rate exceeds 15% sustained over 5 minutes.
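The bands above translate into a trivial classifier; a sketch, with the caveat that the 2–8% "normal" baseline assumes a corporate resolver and should be tuned per environment:

```python
def classify_nxdomain_rate(nxdomain_count, total_queries):
    """Bucket the NXDOMAIN ratio using the bands described above."""
    ratio = nxdomain_count / total_queries
    if ratio > 0.40:
        return "almost certainly malicious"  # e.g. DGA beaconing
    if ratio > 0.15:
        return "investigate"                 # malware, misconfig, or botnet
    return "normal"

print(classify_nxdomain_rate(50, 1000))    # normal (5%)
print(classify_nxdomain_rate(200, 1000))   # investigate (20%)
```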

SERVFAIL Rate

What it is: Percentage of queries returning SERVFAIL (server failure — the resolver couldn't get an authoritative answer).

Why it matters: SERVFAIL means your recursive resolver couldn't reach the authoritative server, got a malformed response, or encountered a DNSSEC validation failure. Baseline SERVFAIL rate should be below 1%; above 5% indicates a connectivity problem or misconfiguration. During the October 2016 Dyn outage, SERVFAIL rates spiked globally for domains hosted on Dyn's authoritative infrastructure, including Twitter, GitHub, and Reddit.

Alert threshold: Alert when SERVFAIL rate exceeds 3% for more than 2 minutes.

Response Time by Resolver

What it is: P50, P95, and P99 resolution latency — the time from query sent to answer received.

Why it matters: P50 tells you what a typical user experiences. P99 tells you what your worst-affected users experience. For a well-configured local resolver with warm cache, P50 should be under 5ms and P99 under 50ms for cached responses. For cache misses requiring recursive resolution, P99 under 500ms is reasonable.

Alert threshold: Alert when P99 latency exceeds 500ms for more than 5 minutes.
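If you're computing these percentiles yourself from raw latency samples rather than reading them off a Prometheus histogram, the nearest-rank method is enough. A sketch:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at
    least p% of all samples are <= it."""
    ordered = sorted(samples)
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[max(0, k)]

# 100 synthetic latencies in ms: mostly cache hits, a small recursive tail
latencies = [2.0] * 90 + [40.0] * 8 + [600.0] * 2
print(percentile(latencies, 50))  # 2.0    (typical user: cache hit)
print(percentile(latencies, 99))  # 600.0  (worst-affected: slow recursion)
```

Note how P50 and P99 describe completely different user experiences over the same sample set; this is why alerting on the average alone hides tail problems.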

Cache Hit Ratio

What it is: Percentage of queries answered from cache vs. requiring recursive lookup.

Why it matters: Below 70% suggests your resolver is underconfigured (see Lesson 01) or that something unusual is happening with your query mix.

Alert threshold: Alert when cache hit ratio drops below 60% sustained over 10 minutes.

Tooling Stack

Prometheus + dnsdist

If you're using dnsdist as a DNS load balancer or query router (common with PowerDNS setups), it has a built-in Prometheus metrics endpoint:

-- dnsdist.conf
webserver("0.0.0.0:8083")
setWebserverConfig({password="your-password", statsRequireAuthentication=false})

Then Prometheus scrapes http://dnsdist-host:8083/metrics. You get:

  • dnsdist_queries_total
  • dnsdist_cache_hits_total
  • dnsdist_servfail_responses_total
  • dnsdist_nxdomain_responses_total
  • Per-pool and per-server latency histograms
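A minimal Prometheus scrape job for that endpoint might look like this (hostname and port carried over from the example above; adjust to your deployment):

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: 'dnsdist'
    static_configs:
      - targets: ['dnsdist-host:8083']
```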

For unbound, use the unbound_exporter:

docker run -p 9167:9167 \
  -e UNBOUND_HOST=tcp://unbound:8953 \
  mvance/unbound-exporter

For BIND, use bind_exporter from the Prometheus community.

Example Grafana Alert Rules (Prometheus)

groups:
  - name: dns-alerts
    rules:
      - alert: HighNXDOMAINRate
        expr: |
          rate(dnsdist_nxdomain_responses_total[5m]) /
          rate(dnsdist_queries_total[5m]) > 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NXDOMAIN rate above 15%"
          description: "{{ $value | humanizePercentage }} NXDOMAIN rate on {{ $labels.instance }}"

      - alert: HighSERVFAILRate
        expr: |
          rate(dnsdist_servfail_responses_total[2m]) /
          rate(dnsdist_queries_total[2m]) > 0.03
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "SERVFAIL rate above 3%"

      - alert: LowCacheHitRate
        expr: |
          rate(dnsdist_cache_hits_total[10m]) /
          rate(dnsdist_queries_total[10m]) < 0.60
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate below 60%"

Datadog DNS Monitoring

If you're on Datadog, the DNS check is included in the agent and requires minimal configuration:

# conf.d/dns_check.d/conf.yaml
init_config:

instances:
  - hostname: www.yourdomain.com
    nameserver: 8.8.8.8
    timeout: 5
    record_type: A

  - hostname: mail.yourdomain.com
    nameserver: your-internal-resolver
    timeout: 2
    record_type: MX

This creates dns.response_time and dns.can_resolve metrics that you can alert on directly.

Log Analysis: What a DNS Query Log Actually Tells You

A typical DNS query log exchange looks like this (unbound with log-queries and log-replies enabled; the intermediate "resolving" line appears at verbosity 2 or higher):

[1709123456] unbound[1234:0] info: 192.168.1.45 api.stripe.com. A IN
[1709123456] unbound[1234:0] info: resolving api.stripe.com. A IN
[1709123457] unbound[1234:0] info: 192.168.1.45 api.stripe.com. A IN NOERROR 0.042023 0 55

From this you can extract:

  • Client IP (192.168.1.45) — which host made the query
  • Query name (api.stripe.com) — what they were looking for
  • Query type and class (A IN)
  • Response code (NOERROR)
  • Resolution time (0.042023 seconds — this was a cache miss, took 42ms)
  • Cache flag and response size (0 55 — answered by recursion rather than from cache, 55-byte response)
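Pulling these fields out with a regex is straightforward; a sketch for the query line (log formats vary between unbound versions, so treat the pattern as illustrative):

```python
import re

# Matches the query line format:
# [ts] unbound[pid:tid] info: <client> <qname> <qtype> <qclass>
QUERY_RE = re.compile(
    r"\[(?P<ts>\d+)\] unbound\[[\d:]+\] info: "
    r"(?P<client>[\d.]+) (?P<qname>\S+) (?P<qtype>\S+) (?P<qclass>\S+)"
)

line = "[1709123456] unbound[1234:0] info: 192.168.1.45 api.stripe.com. A IN"
m = QUERY_RE.match(line)
print(m.group("client"), m.group("qname"), m.group("qtype"))
# 192.168.1.45 api.stripe.com. A
```

The client-IP character class also conveniently skips the "resolving ..." lines, which don't start with an address and therefore don't match.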

dnsdist can also log every response to a flat file with LogResponseAction:

-- dnsdist.conf
addResponseAction(AllRule(), LogResponseAction("/var/log/dnsdist/responses.log", false, true))

For fully structured output, dnsdist supports dnstap and protobuf remote logging instead. Either way, ship the result to your log aggregator (Loki, Elasticsearch, Splunk).

Anomaly Detection with Log Data

Simple anomaly detection you can do with log data:

Top queried domains per client: identifies noisy clients or potential DGA activity:

awk '{print $4, $5}' /var/log/unbound/queries.log | \
  sort | uniq -c | sort -rn | head -20

Query volume per client: flags anything generating unusual volume (here, more than 1,000 queries in whatever window the log covers):

awk '{print $4}' /var/log/unbound/queries.log | \
  sort | uniq -c | sort -rn | awk '$1 > 1000'

NXDOMAIN count per client: finds hosts with high failure counts:

grep NXDOMAIN /var/log/unbound/queries.log | \
  awk '{print $4}' | sort | uniq -c | sort -rn

(Field positions assume the unbound log format shown earlier: the client IP is field 4 and the query name is field 5.)
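That grep pipeline yields raw counts, not a ratio; to get the actual per-client failure rate you also need each client's total query volume. A sketch that computes both in one pass (assumes the rcode appears in reply-log lines and the client IP is the 4th field, per the format above):

```python
from collections import Counter

def nxdomain_ratio_per_client(lines):
    """Per-client NXDOMAIN ratio from resolver reply-log lines."""
    total, nx = Counter(), Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        client = fields[3]           # 4th whitespace-separated field
        total[client] += 1
        if "NXDOMAIN" in line:
            nx[client] += 1
    return {client: nx[client] / total[client] for client in total}

logs = [
    "[1709123457] unbound[1234:0] info: 10.0.0.5 a1b2c3.example. A IN NXDOMAIN 0.031 0 41",
    "[1709123458] unbound[1234:0] info: 10.0.0.5 d4e5f6.example. A IN NXDOMAIN 0.029 0 41",
    "[1709123459] unbound[1234:0] info: 10.0.0.9 api.stripe.com. A IN NOERROR 0.042 0 55",
]
print(nxdomain_ratio_per_client(logs))  # {'10.0.0.5': 1.0, '10.0.0.9': 0.0}
```

A client sitting at 100% NXDOMAIN across random-looking names, as 10.0.0.5 is here, is the classic DGA signature from the NXDOMAIN section above.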

If you're shipping logs to Grafana Loki, a LogQL query for SERVFAIL rate looks like:

sum(rate({job="dnsdist"} |= "SERVFAIL" [5m])) /
sum(rate({job="dnsdist"} [5m]))

External Monitoring

Don't only monitor from inside your network. Use external synthetic monitoring to verify what the rest of the internet sees.

Tools:

  • UptimeRobot (free tier) — basic DNS check, confirms your authoritative servers respond
  • NS1 Pulsar / Catchpoint — distributed DNS monitoring from hundreds of vantage points
  • dnscheck.tools — ad-hoc manual checks from multiple locations

The simplest external check is a synthetic monitor that resolves a known hostname against your authoritative servers every 60 seconds from multiple regions and alerts if the answer is wrong or unavailable.
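The check logic itself is small. A sketch with the actual DNS lookup injected as a function, so it works with whatever client you use (dnspython, a dig subprocess, a vendor API); the hostnames and addresses are placeholders:

```python
def check_dns(resolve, hostname, expected):
    """One synthetic probe: resolve the name, compare the answer set
    to what should be published, and report."""
    try:
        answers = set(resolve(hostname))
    except Exception as exc:
        return "ALERT: %s unresolvable (%s)" % (hostname, exc)
    if answers != expected:
        return "ALERT: %s answered %s, expected %s" % (
            hostname, sorted(answers), sorted(expected))
    return "OK"

# Stub resolver standing in for a real lookup against your authoritative servers
fake_resolve = lambda name: ["192.0.2.10", "192.0.2.11"]
print(check_dns(fake_resolve, "www.yourdomain.com",
                {"192.0.2.10", "192.0.2.11"}))  # OK
print(check_dns(fake_resolve, "www.yourdomain.com",
                {"192.0.2.99"}))                # ALERT: ...
```

Run this every 60 seconds from each region and page on anything other than "OK"; comparing the answer set, not just resolvability, is what catches hijacks and stale delegations.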


Key Takeaways

  • The four metrics that matter most: QPS, NXDOMAIN rate, SERVFAIL rate, and cache hit ratio. Everything else is secondary.
  • Alert thresholds: NXDOMAIN > 15%, SERVFAIL > 3%, cache hit ratio < 60%, P99 latency > 500ms.
  • dnsdist + Prometheus + Grafana is the current standard stack for self-hosted DNS monitoring.
  • Log data tells you who asked what — invaluable during incident investigation.
  • External synthetic monitoring catches the problems that internal monitoring can't see.

Up Next

Troubleshooting DNS Issues: Tools and Techniques — when the metrics say something is wrong, here's how to find out what.