Module 1 · Lesson 4

DNS Resolution Process: From Query to Answer

13 min read

dns · resolution · recursive · iterative · caching · TTL

When you type github.com into your browser and nothing is cached, somewhere between 3 and 10 DNS queries happen before your browser sends its first HTTP request. Here's every step.

The Cast

Before walking through the process, identify the players:

Stub resolver: The DNS client on your machine. It's part of the OS networking stack. When a process calls getaddrinfo(), the stub resolver handles it. The stub resolver does almost nothing — it reads /etc/resolv.conf, finds the configured recursive resolver, forwards the query there, and returns the response.

Recursive resolver (also called "full-service resolver"): This is where the actual work happens. It receives queries from stub resolvers, traverses the DNS hierarchy to find answers, caches results, and returns final answers. Your ISP runs one. So do Google (8.8.8.8), Cloudflare (1.1.1.1), Quad9 (9.9.9.9).

Authoritative nameservers: Hold zone data. They give definitive answers for their zones and nothing else.

The Full Query Journey

Let's trace github.com A from scratch, assuming no cache anywhere.

Step 1: Application to Stub Resolver

Your browser calls getaddrinfo("github.com"). The OS stub resolver checks its local cache first (yes, your OS caches DNS too). Nothing there. It forwards the query to the configured recursive resolver, say 1.1.1.1.
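As a rough illustration of how a stub resolver finds its recursive resolver, the sketch below pulls the nameserver entries out of resolv.conf-style text. The function name parse_nameservers and the sample file contents are assumptions made for this example; a real stub resolver also honors options such as search and timeout, which are ignored here.

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text, in order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop comments, which start with '#' or ';'
        for marker in ("#", ";"):
            line = line.split(marker, 1)[0]
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Hypothetical file contents, for illustration only
sample = """\
# managed by a hypothetical network manager
search example.internal
nameserver 1.1.1.1
nameserver 8.8.8.8
"""
print(parse_nameservers(sample))  # ['1.1.1.1', '8.8.8.8']
```

The first address in the list is the one the stub resolver would normally try first.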

Step 2: Recursive Resolver Checks Cache

The recursive resolver at 1.1.1.1 checks its cache. If another client recently queried github.com, it has a cached answer with some TTL remaining. It returns that immediately. Done.

If not, it starts the resolution process.

Step 3: Root Query

The resolver needs to start somewhere. It knows the root servers — either from its cache (root NS records have very long TTLs) or from its built-in root hints file. It picks one, say a.root-servers.net (198.41.0.4), and sends:

Query: github.com. A
To: 198.41.0.4 (root)

The root server doesn't know the answer. It knows who does: the .com TLD servers. It returns a referral:

Authority section:
com.    172800    IN    NS    a.gtld-servers.net.
com.    172800    IN    NS    b.gtld-servers.net.
# ... through m.

Additional section (glue):
a.gtld-servers.net.    172800    IN    A    192.5.6.30
b.gtld-servers.net.    172800    IN    A    192.33.14.30
# ...

This is a referral response. The resolver caches the .com NS records and proceeds.
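To make the role of glue concrete, here is a minimal sketch of how a resolver might sort the nameservers in a referral into those it can contact right away (glue present) and those whose names it would still have to resolve. The function name and the data layout (a list of NS names plus a dict of glue addresses) are assumptions for illustration, not how any particular resolver stores referrals.

```python
def next_server_addresses(authority_ns, glue):
    """Split referral nameservers into (name, address) pairs with glue and names without it."""
    ready, unresolved = [], []
    for ns_name in authority_ns:
        key = ns_name.lower()  # DNS names compare case-insensitively
        if key in glue:
            ready.append((ns_name, glue[key]))
        else:
            unresolved.append(ns_name)
    return ready, unresolved


# Data copied from the root referral shown above
referral_ns = ["a.gtld-servers.net.", "b.gtld-servers.net."]
referral_glue = {
    "a.gtld-servers.net.": "192.5.6.30",
    "b.gtld-servers.net.": "192.33.14.30",
}
ready, unresolved = next_server_addresses(referral_ns, referral_glue)
print(ready)       # both servers, each paired with its glue address
print(unresolved)  # []
```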

Step 4: TLD Query

The resolver picks a .com TLD nameserver and sends the same query:

Query: github.com. A
To: 192.5.6.30 (a.gtld-servers.net)

The TLD server knows about github.com's delegation. It returns another referral:

Authority section:
github.com.    172800    IN    NS    ns-1622.awsdns-10.co.uk.
github.com.    172800    IN    NS    ns-1283.awsdns-32.org.
github.com.    172800    IN    NS    ns-421.awsdns-52.com.
github.com.    172800    IN    NS    ns-692.awsdns-22.net.

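The resolver caches this delegation and proceeds. To contact a nameserver it needs that server's address: if the referral includes no glue for it, the resolver must first resolve the nameserver's name, which is one reason real lookups can take more than three queries.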

Step 5: Authoritative Query

The resolver picks one of GitHub's nameservers and sends the query:

Query: github.com. A
To: ns-421.awsdns-52.com

This time it gets an authoritative answer:

Answer section:
github.com.    60    IN    A    140.82.121.4

The aa (Authoritative Answer) flag is set in the response. This is the answer from the source.

Step 6: Cache and Return

The resolver caches github.com A 140.82.121.4 with a TTL of 60 seconds. It returns the answer to the stub resolver. The stub resolver returns it to getaddrinfo(). Your browser opens a TCP connection to 140.82.121.4.

Total elapsed time: 30-100ms, depending on geography. Three queries to three different servers. Barely perceptible.
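The six steps can be condensed into a small simulation. This is only a sketch under heavy assumptions: the "servers" are in-memory dictionaries, each delegation points at a single address, no packets are sent, and the authoritative server's address (203.0.113.53) is invented for the example. The root and TLD addresses and the final A record are the ones used in the walkthrough.

```python
# Hypothetical in-memory "servers": each holds referrals (delegations) and answers.
# 203.0.113.53 is a made-up address standing in for one of GitHub's nameservers.
ROOT = "198.41.0.4"
SERVERS = {
    "198.41.0.4": {"referrals": {"com.": "192.5.6.30"}, "answers": {}},
    "192.5.6.30": {"referrals": {"github.com.": "203.0.113.53"}, "answers": {}},
    "203.0.113.53": {
        "referrals": {},
        "answers": {("github.com.", "A"): ("140.82.121.4", 60)},
    },
}


def ask(server_ip, name, rtype):
    """Simulate one query: ('answer', value, ttl), ('referral', next_ip), or ('nxdomain',)."""
    server = SERVERS[server_ip]
    if (name, rtype) in server["answers"]:
        value, ttl = server["answers"][(name, rtype)]
        return ("answer", value, ttl)
    # Prefer the longest delegated suffix that covers the name
    for zone in sorted(server["referrals"], key=len, reverse=True):
        if name == zone or name.endswith("." + zone):
            return ("referral", server["referrals"][zone])
    return ("nxdomain",)


def resolve(name, rtype="A", max_hops=10):
    """Follow referrals from the root; return (value, ttl, servers_queried)."""
    server_ip, path = ROOT, []
    for _ in range(max_hops):
        path.append(server_ip)
        reply = ask(server_ip, name, rtype)
        if reply[0] == "answer":
            return reply[1], reply[2], path
        if reply[0] == "referral":
            server_ip = reply[1]
            continue
        return None, 0, path
    raise RuntimeError("too many referrals")


print(resolve("github.com."))
# ('140.82.121.4', 60, ['198.41.0.4', '192.5.6.30', '203.0.113.53'])
```

The returned path has three entries, matching the three queries counted above. A real resolver would also cache every referral along the way and pick among several nameservers per zone.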

Recursive vs Iterative

The difference is about who does the work.

Recursive: The client asks the resolver, the resolver does all the work, the client gets a final answer. This is what happens in the scenario above — your stub resolver makes one recursive query to 1.1.1.1, and 1.1.1.1 does the iteration.

Iterative: The client gets a referral instead of an answer, and the client itself follows the referral chain. When a recursive resolver queries root, the root responds iteratively — it says "I don't know, but try these TLD servers." The recursive resolver follows up.

In practice: stub resolvers query recursively. Recursive resolvers query iteratively against authoritative servers. Authoritative servers respond iteratively.

You can force dig to query iteratively:

# Normal recursive query (resolver does the work)
dig github.com A @1.1.1.1

# Non-recursive query: clears the RD (recursion desired) bit, so the server answers only from its cache or its own zones
dig github.com A @1.1.1.1 +norecurse

# Full trace (dig itself follows the iterative chain)
dig github.com A +trace
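On the wire, the difference between a recursive and a non-recursive query is a single header bit, RD (recursion desired). The sketch below hand-encodes a minimal one-question query to show where that bit lives. The function name build_query and the fixed query ID are assumptions for illustration; a real client would pick a random ID and normally use a DNS library instead of packing bytes by hand.

```python
import struct

RD_BIT = 0x0100  # "recursion desired" flag in the 16-bit header flags word


def build_query(name, qtype=1, recursion_desired=True, query_id=0x1234):
    """Encode a single-question DNS query (class IN) in wire format."""
    flags = RD_BIT if recursion_desired else 0
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QTYPE (1 = A), QCLASS (1 = IN)
    return header + question


recursive = build_query("github.com")
non_recursive = build_query("github.com", recursion_desired=False)
print(recursive[2:4].hex())      # 0100
print(non_recursive[2:4].hex())  # 0000
```

Everything else in the two messages is identical, which is why the same server can be asked either way.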

Caching at Every Level

Caching is what makes DNS scale. Without it, every query would hit the root servers. With it, most queries are answered from local cache.

Where caching happens:

  • The recursive resolver (most impactful — serves thousands or millions of clients)
  • The OS stub resolver / nscd (serves one machine, all processes)
  • Some applications (browsers, JVM, etc.) have their own DNS caches

What controls cache duration: TTL (Time To Live). Every DNS record has a TTL in seconds. The resolver decrements this as time passes. When TTL reaches 0, the cached entry is expired and must be re-fetched.

# Watch the TTL count down: run this twice, 5 seconds apart
dig github.com A +noall +answer
# github.com.    60    IN    A    140.82.121.4

dig github.com A +noall +answer
# github.com.    55    IN    A    140.82.121.4
# (5 seconds later, the cached TTL has dropped from 60 to 55; your numbers will differ)
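The countdown behavior can be sketched as a small cache that stores an expiry time per record and reports the remaining TTL on each read. The class name TTLCache and the injectable clock are assumptions made so the example runs without sleeping; production resolvers add size limits, eviction, and other policies not shown here.

```python
import time


class TTLCache:
    """Tiny DNS-style cache: entries expire once their TTL has elapsed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # (name, rtype) -> (value, expires_at)

    def put(self, name, rtype, value, ttl):
        self._entries[(name, rtype)] = (value, self._clock() + ttl)

    def get(self, name, rtype):
        """Return (value, remaining_ttl), or None if missing or expired."""
        entry = self._entries.get((name, rtype))
        if entry is None:
            return None
        value, expires_at = entry
        remaining = expires_at - self._clock()
        if remaining <= 0:
            del self._entries[(name, rtype)]
            return None
        return value, int(remaining)


# A fake clock makes the countdown visible without waiting.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.put("github.com.", "A", "140.82.121.4", ttl=60)
now[0] = 5.0
print(cache.get("github.com.", "A"))  # ('140.82.121.4', 55)
now[0] = 60.0
print(cache.get("github.com.", "A"))  # None
```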

TTLs and Why They Actually Matter

TTL is a trade-off between performance and propagation speed.

High TTL (hours to days):

  • Faster resolution (cache hits more often)
  • Less load on authoritative servers
  • Changes take longer to propagate

Low TTL (seconds to minutes):

  • Changes propagate quickly
  • More queries hit authoritative servers
  • Slightly higher latency for cache misses

The conventional wisdom: set a long TTL for stable records, drop it low before making changes, then restore it after.

Practical workflow for a migration:

  1. Current TTL: 3600 (1 hour). Change it to 300 (5 minutes) at least one full TTL period (one hour) before your migration window.
  2. Wait for the old 1-hour TTL to expire everywhere (wait at least one hour).
  3. Make your change.
  4. Propagation now takes at most 5 minutes for resolvers that honor the TTL.
  5. After confirming the migration, set TTL back to 3600.

If you don't do step 2, you get a mixed period where some resolvers serve the old address (from their hour-long cache) while others serve the new one. This is how migrations go wrong at 2am.
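The timing in that workflow is simple arithmetic, sketched below. The function name migration_plan and the example date are hypothetical, and the calculation assumes every resolver honors TTLs exactly.

```python
from datetime import datetime, timedelta


def migration_plan(ttl_lowered_at, old_ttl, new_ttl):
    """Return (earliest_safe_change_time, worst_case_propagation)."""
    # A resolver that cached the record just before the TTL was lowered
    # keeps the old answer for up to old_ttl seconds.
    earliest_change = ttl_lowered_at + timedelta(seconds=old_ttl)
    # After that point, no TTL-honoring cache holds the record longer than new_ttl.
    worst_case_propagation = timedelta(seconds=new_ttl)
    return earliest_change, worst_case_propagation


lowered = datetime(2024, 1, 1, 12, 0)  # hypothetical time the TTL was dropped to 300
change_at, propagation = migration_plan(lowered, old_ttl=3600, new_ttl=300)
print(change_at)    # 2024-01-01 13:00:00
print(propagation)  # 0:05:00
```

Changing the record before change_at is what produces the mixed period described above.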

Negative Caching

Successful lookups are not the only thing cached. NXDOMAIN responses (name doesn't exist) are cached too, for the lesser of the SOA record's own TTL and the SOA MINIMUM field (the last field in the SOA record), as specified in RFC 2308.

dig nonexistent.github.com A
# status: NXDOMAIN
# ;; AUTHORITY SECTION:
# github.com.    900    IN    SOA    ns-1622.awsdns-10.co.uk. ...

The SOA record appears in the authority section for NXDOMAIN responses. Resolvers cache the negative result for the lesser of the TTL on that SOA record (900 seconds here) and the SOA MINIMUM field. If you're adding new subdomains, negative caching can cause a brief period where the new name doesn't resolve even after you've added the record, but only on resolvers that looked the name up before it existed.
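RFC 2308 defines the negative-cache lifetime as the minimum of two numbers: the TTL of the SOA record in the response and the MINIMUM field inside that SOA. A one-line sketch, where the function name and the MINIMUM values are invented for the example (the dig output above elides the SOA fields, so the real value is not shown there):

```python
def negative_cache_ttl(soa_record_ttl, soa_minimum):
    """Seconds a resolver may cache an NXDOMAIN, following the RFC 2308 rule."""
    return min(soa_record_ttl, soa_minimum)


# 900 is the SOA TTL from the example; the MINIMUM values are hypothetical.
print(negative_cache_ttl(900, 86400))  # 900
print(negative_cache_ttl(900, 60))     # 60
```

Resolvers may also apply their own upper bound on negative caching, which this sketch leaves out.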

What Actually Happens When You Type google.com

Full picture, including layers often omitted:

  1. Browser checks its own DNS cache
  2. Browser calls getaddrinfo("google.com")
  3. OS checks /etc/hosts
  4. OS stub resolver checks its cache
  5. Stub resolver sends query to configured resolver (from /etc/resolv.conf)
  6. Resolver checks its cache → probably a hit (google.com is queried millions of times per minute globally)
  7. If cache miss: resolver queries root, gets .com referral, queries .com, gets google.com referral, queries google's authoritative nameservers, gets A record
  8. Response flows back to stub resolver, to getaddrinfo(), to browser
  9. Browser initiates TCP connection to the returned IP

Total elapsed time from step 1 to step 9: under a millisecond when a local cache answers, about one network round trip to the resolver when the resolver's cache answers, and 30-100ms for a full recursive resolution.
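The layering in that list amounts to "ask each source in order and stop at the first hit". The sketch below models it with plain dictionaries. All names, the hosts-file contents, and the 192.0.2.x documentation addresses are assumptions for illustration; on a real system the order of the OS-level sources depends on configuration.

```python
def parse_hosts(hosts_text):
    """Map each hostname in hosts-file text to its first listed address."""
    table = {}
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # strip comments, split on whitespace
        if len(fields) < 2:
            continue
        address, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name.lower(), address)
    return table


def lookup(name, layers):
    """Try each (label, lookup_function) in order; return (label, address) for the first hit."""
    for label, fn in layers:
        address = fn(name)
        if address is not None:
            return label, address
    return None, None


hosts = parse_hosts(
    "127.0.0.1 localhost\n"
    "192.0.2.10 intranet.test  # hypothetical entry\n"
)
browser_cache = {}                       # empty: cold start
os_cache = {}                            # empty: cold start
resolver = {"google.com": "192.0.2.20"}  # stand-in for the recursive resolver

layers = [
    ("browser cache", browser_cache.get),
    ("hosts file", hosts.get),
    ("OS cache", os_cache.get),
    ("recursive resolver", resolver.get),
]
print(lookup("localhost", layers))   # ('hosts file', '127.0.0.1')
print(lookup("google.com", layers))  # ('recursive resolver', '192.0.2.20')
```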

Debugging Resolution Problems

When DNS breaks, here's the quick diagnostic flow:

# Check what your stub resolver is doing
cat /etc/resolv.conf

# Query your configured resolver
dig github.com @$(awk '/^nameserver/ {print $2; exit}' /etc/resolv.conf)

# Query a known-good public resolver to compare
dig github.com @1.1.1.1
dig github.com @8.8.8.8

# Trace the full resolution chain from root
dig +trace github.com

# Check the authoritative nameservers directly
dig github.com NS +short | head -1 | xargs -I{} dig github.com A @{}

# Verify NXDOMAIN vs SERVFAIL (different failure modes) by reading the status field
dig nonexistent.github.com A | grep 'status:'
# NXDOMAIN = name doesn't exist (authoritative response)
# SERVFAIL = server failure (usually means delegation is broken or nameserver is unreachable)

The SERVFAIL vs NXDOMAIN distinction matters: NXDOMAIN means the zone is working, just the name doesn't exist. SERVFAIL usually means the zone itself has a problem — lame delegation, unreachable nameservers, DNSSEC validation failure.
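Both status values come from the low four bits of the flags word in the DNS header, with the AA flag one bit position higher up in the same word. The sketch below decodes them from raw header bytes. The function name, the hint strings, and the two sample headers are assumptions built for this example; the bit positions and the RCODE numbers (2 = SERVFAIL, 3 = NXDOMAIN) follow RFC 1035.

```python
import struct

AA_BIT = 0x0400      # authoritative answer flag
RCODE_MASK = 0x000F  # response code occupies the low 4 bits
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}
HINTS = {
    "NXDOMAIN": "zone answered; the name itself does not exist",
    "SERVFAIL": "check delegation, nameserver reachability, DNSSEC validation",
}


def describe_response(message):
    """Read the response code and AA flag from the first 4 bytes of a DNS message."""
    _query_id, flags = struct.unpack("!HH", message[:4])
    rcode = RCODES.get(flags & RCODE_MASK, "OTHER")
    return rcode, bool(flags & AA_BIT), HINTS.get(rcode, "")


# Hand-built sample headers (ID plus flags only), not captured traffic
nxdomain_header = struct.pack("!HH", 0x1234, 0x8583)  # QR, AA, RD, RA set; RCODE 3
servfail_header = struct.pack("!HH", 0x1234, 0x8182)  # QR, RD, RA set; RCODE 2
print(describe_response(nxdomain_header))
print(describe_response(servfail_header))
```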


Key Takeaways

  • Resolution is typically 3 queries: root, TLD, authoritative. Caching short-circuits most of them.
  • Stub resolvers query recursively. Recursive resolvers query iteratively. Authoritative servers respond iteratively.
  • TTL controls caching duration. Lower it before changes, not at the moment of change.
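  • Negative caching (NXDOMAIN) is real: a new record can stay invisible for up to the negative-cache TTL (the lesser of the SOA's TTL and its MINIMUM field) on resolvers that queried the name before it existed.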
  • SERVFAIL = zone problem. NXDOMAIN = name doesn't exist. Different causes, different fixes.

Up Next

You've seen what travels over the wire. Next: what the wire itself looks like — DNS protocols, packet formats, and why UDP seemed like a fine idea in 1987.