Troubleshooting DNS Issues: Tools and Techniques

Here is the universal DNS troubleshooting experience: something is broken, your colleague says "it might be DNS," you run nslookup, it gives you an answer, you declare DNS is fine, and twenty minutes later you discover the problem was indeed DNS.

The tools matter. The methodology matters more. This lesson covers both.

The Toolkit

dig — The Primary Instrument

dig is to DNS what curl is to HTTP. Everything else is optional.

Basic query:

dig example.com A

The flags people don't use:

+trace — follows the entire delegation chain from root servers down:

dig +trace example.com A

This shows you exactly how resolution unfolds. If resolution is failing, +trace tells you at which delegation it breaks.

+norecurse — sends the query with the RD (recursion desired) bit unset. The server won't recurse; it returns what it has:

dig +norecurse @8.8.8.8 example.com A

Useful for checking what a specific resolver has cached without triggering a new lookup.

+nocache / +cd (checking disabled) — bypass DNSSEC validation at the resolver:

dig +cd example.com A

If you get an answer with +cd but SERVFAIL without it, you have a DNSSEC issue.

+short — returns just the answer, no fluff:

dig +short example.com A

+time and +tries — control timeout and retry behavior:

dig +time=2 +tries=1 @192.168.1.1 example.com A

Set these tight when scripting — you don't want to wait 30 seconds per failed query.

Check SOA serial across nameservers:

for ns in ns1.example.com ns2.example.com; do
    echo -n "$ns: "
    dig +short @$ns example.com SOA | awk '{print $3}'
done

Mismatched serials = zone transfer problem.

Check all authoritative nameservers for a domain:

dig +short example.com NS | while read ns; do
    echo "=== $ns ==="
    dig +norecurse @$ns example.com A
done

drill

drill from ldns is particularly useful for DNSSEC debugging:

drill -D example.com A        # perform DNSSEC validation
drill -k /etc/trusted-key.key example.com DNSKEY  # validate with specific key

nslookup (and why dig is better)

nslookup is interactive, harder to script, doesn't show TTLs by default, and gives you less information than dig. The only reason to know it is that it's available on Windows systems where dig isn't installed. If you're debugging from a Windows machine and can't install dig, nslookup -type=A -debug example.com will at least tell you something.

On any Unix system: use dig.

tcpdump for DNS

When you want to see actual DNS traffic on the wire:

tcpdump -i eth0 -n port 53

More useful with output:

tcpdump -i eth0 -n -s 0 -w /tmp/dns.pcap port 53

Then open in Wireshark, or use tcpdump -r /tmp/dns.pcap -n to read it back.

The display filter in Wireshark for DNS is simply dns. For specific query types:

dns.qry.type == 1   # A records
dns.flags.rcode == 2  # SERVFAIL
dns.flags.rcode == 3  # NXDOMAIN

Wireshark DNS Filters Cheat Sheet

dns.flags.response == 0       # DNS queries only
dns.flags.response == 1       # DNS responses only
dns.flags.rcode != 0          # Non-zero rcodes (any error)
dns.qry.name contains "evil"  # Name contains string
dns.resp.ttl < 60             # Short TTL responses

Systematic Debugging Methodology

The mistake most engineers make is starting in the middle. They check their authoritative server, their records look fine, they declare DNS is fine. But the resolver the client is using has stale cache, or there's a misconfiguration at the registrar level, or the NS record delegation is broken.

Always start at the root and work down.

Step 1: Verify the Delegation

Check that your domain is delegated from the root correctly:

dig +trace example.com NS 2>&1 | head -40

Look for:

The root servers returning your TLD's nameservers
The TLD servers returning your domain's nameservers (the NS glue records)
Whether the NS records at the TLD match what you've configured at your registrar

Common failure: you updated your nameservers at your registrar but the change hasn't propagated to the root zone yet. Check the WHOIS and compare with what the TLD nameservers return.

Step 2: Verify the Authoritative Answer

Query your authoritative nameservers directly, bypassing any recursive resolver:

dig @ns1.example.com example.com A +norecurse

If this returns SERVFAIL: your authoritative server is broken. If this returns NXDOMAIN: the record doesn't exist on the authoritative server. If this returns the correct answer: the authoritative server is fine.

Step 3: Check the Resolver

Query the resolver your client is using:

dig @resolver-ip example.com A

Compare this to the authoritative answer. If they differ:

The resolver may have a stale cache (wait for TTL expiry)
The resolver may be poisoned
The resolver may be overriding with a "split horizon" response

To see what the resolver has cached:

dig +norecurse @resolver-ip example.com A

Step 4: Check the Client

On the failing system:

# Check which resolver the system is using
cat /etc/resolv.conf

# Check if it can reach the resolver
dig @$(awk '/^nameserver/{print $2; exit}' /etc/resolv.conf) example.com A

# Check the system's own resolution
getent hosts example.com

The gap between dig and getent hosts (or nslookup) is where the system resolver (libc, nsswitch) comes in. /etc/hosts can shadow DNS. MDNS can interfere. systemd-resolved has its own cache.

The "It's Always DNS" Checklist

When someone reports that "the site is down" and you suspect DNS:

[ ] Does the domain resolve from an external resolver? (8.8.8.8, 1.1.1.1)
[ ] Does it resolve from the client's own resolver?
[ ] Does the authoritative server return the correct answer?
[ ] Is the SOA serial the same across all NS records?
[ ] Is the TTL on the record what you expect?
[ ] Is the delegation at the registrar pointing to the correct nameservers?
[ ] Is DNSSEC validation passing? (dig +dnssec example.com)
[ ] Is there a firewall blocking UDP/TCP port 53 from the client?
[ ] Is there a negative cache entry in the resolver? (NXDOMAIN cached for a record you just created)
[ ] Is /etc/hosts on the client machine overriding DNS?
[ ] Is the certificate valid for the resolved IP? (rules out split DNS returning wrong IP)

A Real Troubleshooting Session

Here's what an actual DNS debugging session looks like. The report: "Our API is returning connection refused from some clients but not others."

# Start with the obvious
$ dig api.example.com A
;; ANSWER SECTION:
api.example.com. 300 IN A 198.51.100.10

# Check from public resolver
$ dig @8.8.8.8 api.example.com A +short
198.51.100.10
198.51.100.11

# Wait — we got two IPs from Google but only one from our internal resolver
# Check authoritative directly
$ dig @ns1.example.com api.example.com A
;; ANSWER SECTION:
api.example.com. 300 IN A 198.51.100.10
api.example.com. 300 IN A 198.51.100.11

# Both IPs from authoritative. Check internal resolver again
$ dig @internal-resolver.corp api.example.com A
;; ANSWER SECTION:
api.example.com. 300 IN A 198.51.100.10

# Internal resolver only has one record. Check the other internal resolver
$ dig @internal-resolver2.corp api.example.com A
;; ANSWER SECTION:
api.example.com. 300 IN A 198.51.100.10
api.example.com. 300 IN A 198.51.100.11

# So resolver1 is only returning one record. TTL is 300 — probably stale cache.
# Check when this was added
$ dig @ns1.example.com api.example.com A | grep "SERIAL\|SOA"
# Actually check SOA serial
$ dig @ns1.example.com example.com SOA +short
ns1.example.com. admin.example.com. 2024031501 3600 900 604800 300

# Record was added in serial 2024031501. Check resolver1's cache age.
# Just wait for TTL expiry (300 seconds) or flush if you control the resolver.
$ unbound-control flush api.example.com
ok

# Verify
$ dig @internal-resolver.corp api.example.com A +short
198.51.100.10
198.51.100.11

Total time: 8 minutes. The "connection refused from some clients" was load balancing between two IPs where one had a firewall issue — but finding that required first eliminating the DNS inconsistency.

Failure Pattern: Lame Delegation

Lame delegation is worth calling out specifically because it's one of the most common root causes of SERVFAIL errors, and it's not obvious from the symptoms.

A lame delegation happens when the parent zone (the TLD registry) has NS records pointing to nameservers that don't actually serve the zone. The nameservers exist and are reachable, but when a resolver asks them for records, they return SERVFAIL or REFUSED because they have no data for the zone.

The most common causes:

Nameserver change at the registrar, zone data not migrated: You moved from provider A to provider B, updated the NS records at the registrar, but forgot to import the zone data into provider B.
Secondary nameserver not configured: You listed ns2.example.com in your registrar's NS records, but ns2 was never set up to serve the zone.
Zone expired on secondary: The secondary hasn't successfully transferred from the primary in longer than the SOA expire value, so it stopped serving the zone.

How to diagnose:

# Step 1: What nameservers does the TLD think you have?
dig example.com NS @a.gtld-servers.net +short

# Step 2: Query each of those nameservers directly
for ns in $(dig example.com NS @a.gtld-servers.net +short); do
    echo "=== $ns ==="
    dig @$ns example.com SOA +norecurse
done

If any nameserver returns SERVFAIL, REFUSED, or a NOERROR with an empty answer (no SOA in the authority section), it's lame for that zone.

A fully lame delegation — all nameservers lame — produces SERVFAIL from any recursive resolver. A partially lame delegation is worse: resolution works most of the time (when queries hit the non-lame servers) and fails intermittently (when queries hit the lame ones). Intermittent SERVFAIL is harder to diagnose than consistent SERVFAIL.

The fix depends on the cause: import the zone, configure the secondary, or remove the lame nameserver from your NS records and registrar delegation.

Key Takeaways

dig +trace is your first tool when investigating resolution failures. Start at the root.
dig +norecurse shows you what a resolver has cached without triggering a new lookup.
Always verify: root delegation, then authoritative answer, then resolver answer, then client resolution.
The gap between the authoritative answer and the resolver answer is where most DNS bugs live.
tcpdump port 53 shows you ground truth — what's actually happening on the wire.
Mismatched SOA serials across nameservers mean a zone transfer problem, not a record problem.

Up Next

DNS Scalability: Handling High-Volume Traffic — what happens when the query rate exceeds what your current architecture can handle.