Module 9 · Lesson 3
Best Practices Synthesis
A consolidated reference of what matters most, organized by role: developer, ops/SRE, and domain manager. Pull this out when you're setting up something new or auditing something existing.
Eight modules of DNS knowledge compressed into one reference. Not every recommendation applies to every situation — these are defaults, the positions you start from. Deviate from them when you have a reason, but know what you're deviating from.
Organized by role. Most people wear more than one of these hats.
For Developers
Don't Trust System Resolvers in Containers
When your application runs in a container (Docker, Kubernetes), the DNS resolver it uses is not the same as your workstation's resolver. It's a virtual resolver managed by the container runtime.
In Kubernetes, the default ndots:5 setting means your application will append the cluster's search domains before trying the bare hostname. A query for api.stripe.com inside a pod becomes:
api.stripe.com.default.svc.cluster.local → NXDOMAIN
api.stripe.com.svc.cluster.local → NXDOMAIN
api.stripe.com.cluster.local → NXDOMAIN
api.stripe.com → answer
Three failed lookups before the real one. Multiply this by every external API call your application makes. At scale, this is measurable latency and unnecessary load on CoreDNS.
Fix it:
- Set ndots: 1 in your pod's dnsConfig
- Use fully qualified domain names (trailing dot) for external services in your service configuration
- Query CoreDNS metrics to see your NXDOMAIN rate before and after; the drop is usually significant
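The lookup order described above, and the effect of both fixes, can be sketched in a few lines of Python. This is a simplified model of a glibc-style stub resolver, not CoreDNS or kubelet code. The function name `candidate_names` and the `SEARCH` list (a pod in the `default` namespace) are assumptions for illustration, and a real resolver stops at the first candidate that returns an answer.

```python
def candidate_names(name: str, search: list[str], ndots: int) -> list[str]:
    """Return the names a stub resolver would try, in order."""
    # A trailing dot marks the name as fully qualified: no search list is applied.
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    # With fewer than `ndots` dots, the search domains are tried first.
    if name.count(".") < ndots:
        return expanded + [absolute]
    # Otherwise the name is tried as-is first, then the search domains.
    return [absolute] + expanded


# Assumed search list for a pod in the "default" namespace.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for candidate in candidate_names("api.stripe.com", SEARCH, ndots=5):
    print(candidate)
```

With `ndots=1`, or with the name written as `api.stripe.com.`, the bare hostname is the first candidate, so the three NXDOMAIN round trips disappear.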
In Docker standalone, the resolver at 127.0.0.11 is Docker's embedded DNS. It works correctly for service discovery by container name, but it doesn't cache — every lookup goes to the upstream resolver. For applications making high-frequency DNS lookups, add application-level DNS caching.
TTL-Aware Caching in Your Application
If your application resolves a hostname at startup and caches the IP address indefinitely, it will break during DNS-based traffic migrations. This is a common source of outages during blue/green deployments.
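The right approach: respect the TTL from the DNS response. Most HTTP client libraries can be configured to do this. In Java, the networkaddress.cache.ttl security property controls how long the JVM caches successful lookups. It caches forever when a security manager is installed and otherwise falls back to an implementation default (30 seconds in OpenJDK), so set it explicitly rather than relying on the default. In Node.js, dns.lookup() goes through the operating system's resolver (getaddrinfo) and never sees the TTL; the stale-address risk comes from whatever holds on to the result, such as a long-lived keep-alive connection or a hand-rolled cache. Use dns.resolve4() with the ttl option when you need the TTL, or use a caching library that honors it.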
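As a rough illustration of TTL-aware caching, here is a minimal Python sketch. The names `TTLCache` and `Resolver` are hypothetical, and the resolver is injected as a function returning addresses plus a TTL, because the standard library's `socket.getaddrinfo` does not expose TTLs. In practice that function would wrap a DNS library of your choice.

```python
import time
from typing import Callable

# Hypothetical resolver signature: hostname -> (addresses, ttl_seconds).
Resolver = Callable[[str], tuple[list[str], int]]


class TTLCache:
    """Cache DNS answers only for as long as their TTL allows."""

    def __init__(self, resolve: Resolver, clock: Callable[[], float] = time.monotonic):
        self._resolve = resolve
        self._clock = clock
        # hostname -> (addresses, expiry time on the injected clock)
        self._entries: dict[str, tuple[list[str], float]] = {}

    def lookup(self, hostname: str) -> list[str]:
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and now < cached[1]:
            return cached[0]  # still within the TTL
        # Missing or expired: resolve again and record the new expiry.
        addresses, ttl = self._resolve(hostname)
        self._entries[hostname] = (addresses, now + ttl)
        return addresses
```

Injecting the clock keeps the expiry logic testable without sleeping. A production version would also want negative caching and a cap on cache size, which this sketch leaves out.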
For services you own: when you're about to make a DNS change, lower the TTL to 60-300s at least one full period of the old TTL before the change, so that every cache still holding the long TTL has expired. Then make the change. Raise the TTL back to normal once you've confirmed the new records are serving correctly.
SRV Records for Service Discovery
Before you build a custom service discovery mechanism, check whether SRV records would solve the problem. SRV records encode hostname, port, priority, and weight in DNS:
_myservice._tcp.example.com. 300 IN SRV 10 5 8080 host1.example.com.
_myservice._tcp.example.com. 300 IN SRV 10 5 8080 host2.example.com.
_myservice._tcp.example.com. 300 IN SRV 20 5 8080 backup.example.com.
Clients that understand SRV records can discover the service, its port, and load balance across instances using the priority/weight values — without a custom service registry. This is how many protocols handle discovery natively (SIP, XMPP, some databases).
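To make the priority/weight semantics concrete, here is a simplified Python sketch of client-side selection over the three records above. It picks from the lowest-priority tier with a weighted random choice. It is an approximation of the selection described in RFC 2782, not a full implementation, and `SRV` and `pick_target` are names invented for this example.

```python
import random
from typing import NamedTuple


class SRV(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def pick_target(records: list[SRV], rng: random.Random) -> SRV:
    """Pick one record: lowest priority first, weighted-random within it."""
    if not records:
        raise ValueError("no SRV records to choose from")
    best = min(r.priority for r in records)
    tier = [r for r in records if r.priority == best]
    weights = [r.weight for r in tier]
    if sum(weights) == 0:
        return rng.choice(tier)  # all weights zero: treat targets as equal
    return rng.choices(tier, weights=weights, k=1)[0]


RECORDS = [
    SRV(10, 5, 8080, "host1.example.com."),
    SRV(10, 5, 8080, "host2.example.com."),
    SRV(20, 5, 8080, "backup.example.com."),
]

chosen = pick_target(RECORDS, random.Random())
print(f"connect to {chosen.target}:{chosen.port}")
```

The function does not model reachability. A client would drop targets that fail to connect and call it again, which is how the priority-20 backup eventually gets selected.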
Kubernetes's internal DNS already publishes SRV records for named ports on services. If you're discovering services within a cluster, the infrastructure is already there.
Email Auth Before Launch, Not After
Configure SPF, DKIM, and DMARC before your first production email goes out. Retroactively fixing email authentication means dealing with deliverability problems that have already affected real users.
The minimum required at launch:
# SPF — covers transactional email and bulk email providers
example.com. TXT "v=spf1 include:_spf.google.com include:sendgrid.net ~all"
# DKIM — CNAME to your email provider's selector (they give you this)
mail._domainkey.example.com. CNAME mail._domainkey.emailprovider.com.
# DMARC — start with p=none to collect data, move to p=quarantine after 2 weeks clean
_dmarc.example.com. TXT "v=DMARC1; p=none; rua=mailto:dmarc@example.com"
After two weeks of clean DMARC aggregate reports, move to p=quarantine. After another two weeks, p=reject. Don't skip steps.
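The staged rollout can be expressed as a small check against the published record. The Python below is a sketch under the assumption that you already have the TXT value as a string; `parse_dmarc`, `next_policy`, and `ROLLOUT` are illustrative names, and the parser handles only simple tag=value pairs.

```python
def parse_dmarc(record: str) -> dict[str, str]:
    """Split a DMARC TXT value into its tag=value pairs."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip()] = value.strip()
    if tags.get("v") != "DMARC1":
        raise ValueError("not a DMARC record")
    return tags


# The rollout order from the text: none, then quarantine, then reject.
ROLLOUT = ["none", "quarantine", "reject"]


def next_policy(record: str) -> str:
    """Return the next policy in the rollout, or 'reject' if already there."""
    current = parse_dmarc(record)["p"]
    step = ROLLOUT.index(current)
    return ROLLOUT[min(step + 1, len(ROLLOUT) - 1)]


record = "v=DMARC1; p=none; rua=mailto:dmarc@example.com"
print(parse_dmarc(record)["rua"], "->", next_policy(record))
```

Whether it is time to take the next step still depends on the aggregate reports arriving at the rua address; the code only encodes the order, not the decision.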
DNS-Prefetch and Preconnect
For frontend performance: dns-prefetch tells the browser to resolve a hostname in the background before it's needed. preconnect goes further — it resolves, opens the TCP connection, and completes TLS negotiation.
<!-- Resolve early for third-party resources -->
<link rel="dns-prefetch" href="//cdn.example.com">
<!-- Full preconnect for critical resources -->
<link rel="preconnect" href="https://fonts.googleapis.com" crossorigin>
Use preconnect sparingly — it consumes browser connection resources. Use it for the 2-3 most critical external domains. Use dns-prefetch for everything else.
For Ops/SRE
Multi-Provider Authoritative DNS
This is not optional for production. A single DNS provider is a single point of failure. When Cloudflare, Route 53, or any other provider has an outage or degraded service, your domain needs to remain resolvable.
The setup: two authoritative providers, zone transfers configured, both providers listed in your domain's NS records. Your domain registrar's NS records must point to nameservers at both providers.
Verify it works:
# Query each provider directly
dig @ns1.primaryprovider.com example.com A
dig @ns1.secondaryprovider.com example.com A
# Confirm both return the same answer
# Check SOA serial matches between providers
dig @ns1.primaryprovider.com example.com SOA
dig @ns1.secondaryprovider.com example.com SOA
The SOA serial must match or be within one increment between providers, depending on your zone transfer schedule. A serial mismatch means one provider is serving stale data.
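The serial comparison is easy to automate. The Python sketch below assumes you collect one line of `dig +short <zone> SOA` output per provider by whatever means you prefer; the function names and the sample lines are made up for illustration, and the comparison is plain integer arithmetic that ignores serial-number wraparound.

```python
def soa_serial(dig_short_line: str) -> int:
    """Extract the serial from one line of `dig +short <zone> SOA` output."""
    # Field order: mname rname serial refresh retry expire minimum
    fields = dig_short_line.split()
    if len(fields) != 7:
        raise ValueError(f"unexpected SOA line: {dig_short_line!r}")
    return int(fields[2])


def serials_in_sync(lines: dict[str, str], max_lag: int = 0) -> bool:
    """True if all providers' serials are within `max_lag` of each other."""
    serials = [soa_serial(line) for line in lines.values()]
    return max(serials) - min(serials) <= max_lag


# Made-up sample output for illustration; not captured from real servers.
SAMPLE = {
    "primary": "ns1.primaryprovider.com. hostmaster.example.com. 2024051501 7200 3600 1209600 300",
    "secondary": "ns1.secondaryprovider.com. hostmaster.example.com. 2024051500 7200 3600 1209600 300",
}

print(serials_in_sync(SAMPLE))             # False: the serials differ by one
print(serials_in_sync(SAMPLE, max_lag=1))  # True: one increment is tolerated
```

This only makes sense when both providers serve the same zone via zone transfer. If each provider is fed independently and manages its own serial, compare record contents instead.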
TTL Strategy
Different operational states require different TTL postures:
| State | TTL | Reason |
|---|---|---|
| Normal operations (stable records) | 3600–86400s | Low resolver load, caching benefits |
| Pre-change (any DNS change planned) | 300s | Lower the TTL at least one old-TTL period before the change |
| Active change window | 60–300s | Fast propagation of new answers |
| Incident / rollback | 60s | Maximum flexibility to change quickly |
| Post-incident recovery | 300s → 3600s | Gradually raise after stability confirmed |
The critical mistake: making a DNS change without pre-lowering the TTL. If your A record has a 3600s TTL and you change it, some resolvers will serve the old answer for up to one more hour.
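The timing arithmetic behind pre-lowering is worth spelling out. This Python sketch uses hypothetical function names and an arbitrary example date, and it assumes resolvers respect TTLs, which most but not all do.

```python
from datetime import datetime, timedelta


def earliest_safe_change(ttl_lowered_at: datetime, old_ttl: int) -> datetime:
    """Caches may keep the old record, with its old TTL, for `old_ttl`
    seconds after the TTL was lowered, so wait at least that long."""
    return ttl_lowered_at + timedelta(seconds=old_ttl)


def stale_until(changed_at: datetime, ttl_at_change: int) -> datetime:
    """Latest time a TTL-respecting resolver can still serve the old answer."""
    return changed_at + timedelta(seconds=ttl_at_change)


# Arbitrary example: TTL lowered from 3600s to 300s at 09:00.
lowered = datetime(2024, 5, 15, 9, 0)
change = earliest_safe_change(lowered, old_ttl=3600)
print("change no earlier than", change)
print("old answer gone by", stale_until(change, ttl_at_change=300))
```

With the pre-lowered TTL the stale window after the change is five minutes. Skipping the pre-lowering step makes `ttl_at_change` 3600, and the window stretches to a full hour.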
Registry Lock for Production Domains
EPP clientTransferProhibited is the basic transfer lock, set by default at most registrars. Not enough for production domains.
Registry lock (also called server-side lock) applies at the registry level, not just the registrar level. It typically requires a manual process with multi-factor verification to remove, which means an attacker who compromises your registrar credentials cannot initiate a transfer or nameserver change without the out-of-band confirmation.
For any domain that, if redirected or transferred, would cause a significant security incident: enable registry lock. This includes your primary brand domains, email-sending domains, and domains used for authentication flows (OAuth redirect URIs, verification domains).
Monitoring: What to Track
NXDOMAIN rate: Spike in NXDOMAIN responses from your internal resolver usually means a misconfigured service, a deployment that's querying names that don't exist yet, or (in a security context) malware attempting to resolve C2 domains.
SERVFAIL rate: Authoritative server problems. Can also indicate DNSSEC validation failures if you've made a signing error.
Resolution latency P95: Watch this at the resolver level, not just at the application level. DNS latency tends to be invisible until it isn't: the P95 rises slowly over weeks, then crosses a threshold, and your application suddenly starts timing out.
SOA serial consistency: If you run multi-provider, automate a check that compares SOA serials across your providers. A persistent mismatch means zone transfers are broken.
Certificate Transparency log monitoring: Not DNS monitoring directly, but closely related — CT logs will show you when TLS certificates are issued for your domain or its subdomains. Unexpected certificates are a security signal. Services like crt.sh provide a free lookup; commercial monitoring services provide alerting.
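If your resolver exposes raw samples rather than ready-made metrics, the response-code rates and latency P95 mentioned above can be derived with very little code. The Python below is a sketch with invented function names; it assumes you can export a list of response codes and a list of latencies in milliseconds, and it uses the nearest-rank definition of a percentile.

```python
from collections import Counter


def rcode_rates(rcodes: list[str]) -> dict[str, float]:
    """Fraction of responses per response code, e.g. {'NXDOMAIN': 0.02}."""
    total = len(rcodes)
    if total == 0:
        return {}
    return {code: n / total for code, n in Counter(rcodes).items()}


def p95(latencies_ms: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n gives the 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Alerting on the trend of these values over days or weeks, rather than on a single fixed threshold, is what catches the slow drift described above.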
Health Checks Are Not Monitoring
Route 53 health checks, Cloudflare health checks — these measure whether your origin is responding. They don't measure whether DNS resolution is working correctly globally. Run external DNS checks from multiple geographic locations using a monitoring service (Catchpoint, Datadog Synthetics, Pingdom) that specifically validates DNS resolution, not just HTTP response codes.
For Domain Managers
Tiered Registration Strategy
Not every domain variant needs to be registered. Trying to register every possible typosquat, every TLD variant, every plural form of your brand — this is expensive, unmanageable at scale, and ultimately ineffective.
A tiered strategy:
Tier 1 — Must own: Your primary brand TLDs (.com, .net, your country TLD), any TLD where you have significant user traffic or brand confusion risk. These get registry lock.
Tier 2 — Own defensively: High-traffic typosquats (e.g., if you're "yourcompany.com," probably own "youcompany.com" and "yourcampany.com"). Domains that would be problematic if a competitor or attacker owned them. These are worth the renewal fees.
Tier 3: Monitor, don't register. This is the long tail of possible variants. Recover these reactively (via UDRP if an attacker registers one) rather than registering them proactively. The cost of monitoring is lower than the cost of owning thousands of domains that add administrative overhead with minimal protective value.
Never register: Domains built on someone else's brand and held as leverage against that brand's owner. This creates legal exposure without proportionate protection value.
Audit Before You Renew
Set a calendar reminder 60 days before any significant renewal batch. For each domain, ask:
- Is this domain actively used (traffic, email, app hosting)?
- Is it part of a redirect chain that's still relevant?
- Does anyone know why it was registered?
- Is it in Tier 1 or Tier 2 by the current strategy?
If the answer to all of the above is "no" or "I don't know," the default should be to drop it, not renew it. Unused registered domains are administrative overhead with no protective value.
UDRP Over Defensive Registration for the Long Tail
For Tier 3 variants that an attacker actually registers, a UDRP complaint is often more effective than preemptive registration of thousands of speculative variants. UDRP is cheaper than owning thousands of domains indefinitely, and the success rate for clear brand violations is high.
The UDRP three-part test: (1) the domain is identical or confusingly similar to a trademark you own, (2) the registrant has no legitimate interest in the domain, (3) the domain was registered and is being used in bad faith. All three must be established.
For phishing domains, all three are typically easy to establish. The main variables are timeline (UDRP takes 45-60 days) and whether you have registered trademark protection (unregistered common-law trademark can work, but registered trademark is easier to prove).
Keep Auth Codes Current
An auth code (EPP authorization code) is the password required to transfer a domain to a new registrar. Some registrars rotate these; some keep them static until you request a new one. If you've had a domain for several years and never retrieved the auth code, assume it needs to be regenerated.
At minimum: document your auth codes in a secure password manager, regenerate them annually for critical domains, and verify they work by checking with your registrar that the current auth code is valid. You don't want to discover during a time-critical migration that the auth code is expired or incorrect.
Key Takeaways
- Container DNS is not the same as system DNS. Fix ndots early, before it becomes a latency problem.
- TTL strategy is not just a setting; it's an operational discipline. Pre-lower before changes; don't change during incidents.
- Multi-provider authoritative DNS is the single most impactful reliability improvement you can make, and it's not expensive.
- Domain portfolios without a documented strategy drift toward unmanageable. Tiered registration keeps the scope bounded.
- Email auth is maintenance, not a project. Deploy it at launch, monitor DMARC reports, escalate the policy to p=reject. Done.
Up Next
Lesson 4 — Emerging Trends Worth Watching: SVCB/HTTPS records, DNS over QUIC, the next gTLD round, and an honest take on what's actually worth your attention.