Module 9 · Lesson 1

Real-World Case Studies: DNS in Production

Three integrated case studies showing how DNS decisions play out in production: launching a SaaS from scratch, migrating a 500-domain portfolio, and responding to a DNS-based security incident.

Abstract DNS knowledge is easy to forget. Watching it break in production is harder to forget.

These three case studies draw on the full scope of the course. They're not sanitized success stories. Each one includes the mistake, the moment someone realized something was wrong, and what changed afterward. Read them as walkthroughs, not cautionary tales — the goal is pattern recognition, not schadenfreude.


Case 1: Launching a SaaS Product's DNS from Scratch

The Setup

A four-person developer team is building a B2B SaaS product — project management tooling for architecture firms. They have a domain, a cloud provider, and a launch date six weeks out. DNS is on the list somewhere after "finish the auth system" and "set up billing."

This is the wrong order. They're about to find out why.

The Decisions, and How They Made Them

Registrar choice. They went with the registrar they'd used for personal projects — a budget registrar that offered cheap renewals but no API, no registry lock, and support that runs through a ticket queue. For a personal blog, fine. For a production SaaS with paying customers, this is a liability they didn't price in.

The right move: a registrar with a proper API (for automation), EPP status codes they can actually see, and 2FA enforcement. Cloudflare Registrar, Gandi, or a managed DNS provider like EasyDNS would all have been defensible choices. The price difference over a year is less than one hour of downtime cost.

Nameserver setup. They used the cloud provider's default nameservers (AWS Route 53). One provider, four nameservers all in the same logical pool. This works until Route 53 has a degraded event — which has happened, and will happen again.

What they should have done: secondary DNS at a second provider. Route 53 as primary, Cloudflare or NS1 as secondary, with zone data kept in sync (Route 53 doesn't serve standard AXFR zone transfers, so in practice that means API-driven synchronization). This costs almost nothing and means a provider outage becomes a monitoring event rather than a customer-facing outage.

Record architecture. They set up app.saasproduct.com pointing to a load balancer, www.saasproduct.com as a CNAME to app, and the apex domain with an ALIAS record to the same load balancer. They created an api.saasproduct.com subdomain. So far, reasonable.

What they missed: no _dmarc TXT record, no DKIM selector records, no explicit SPF record. They planned to "add email later." Later was week two after launch when they enabled transactional email and their first batch of welcome emails landed in spam for 30% of recipients.
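The records they skipped amount to three TXT entries. A sketch with hypothetical values — the include: host, DKIM selector, and public key all come from whatever email provider you actually use:

```
; SPF: which servers may send mail as saasproduct.com
saasproduct.com.                       TXT "v=spf1 include:_spf.emailprovider.example ~all"
; DKIM: public key published under the provider's selector
selector1._domainkey.saasproduct.com.  TXT "v=DKIM1; k=rsa; p=MIGfMA0..."
; DMARC: start at p=none to collect reports, tighten later
_dmarc.saasproduct.com.                TXT "v=DMARC1; p=none; rua=mailto:dmarc@saasproduct.com"
```

Starting DMARC at p=none and actually reading the aggregate reports is the step this team skipped. The records take minutes to add at launch and weeks to retrofit.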

TTL strategy. Default 300s on everything. Not bad, but not thought through. A proper TTL strategy for a new launch means:

  • Keep TTLs at 300s during the first 30 days while the architecture is still shifting
  • Once stable, move static records (MX, NS, TXT) to 3600s or 86400s
  • Keep A/CNAME records for the app at 300s until you've done your first intentional failover test

Monitoring. They set up uptime monitoring on the app URL. No DNS-specific monitoring. No alerts on resolution failures. No check that their authoritative servers were responding correctly from multiple geographic locations.

What Bit Them Six Months Later

Three things, in order of severity:

Email deliverability. Two months in, a customer reported that all emails from the product were being marked as spam by their corporate email gateway. Investigation: SPF record was misconfigured (they'd added a bulk email provider without updating SPF), DMARC was set to p=none and they'd never looked at the reports. The DMARC aggregate reports had been showing alignment failures for weeks. Nobody was reading them.

Fix: Proper SPF with ~all, DKIM configured through the email provider, DMARC moved to p=quarantine after two weeks of clean reports, then p=reject. Deliverability recovered within a week of the DNS changes propagating, but they lost some customer trust in the interim.

Subdomain takeover. Six months in, they deprecated an old staging environment at staging.saasproduct.com. The CNAME record pointing to the cloud provider's load balancer stayed in DNS. The load balancer was deprovisioned. Someone else claimed that hostname on the same cloud provider. For three weeks, staging.saasproduct.com pointed to infrastructure they didn't control.

Nobody noticed because nobody was monitoring it. A security researcher noticed it instead and reported it. They got lucky.

Fix: Audit DNS records against live infrastructure on a schedule. When you decommission infrastructure, remove its DNS records the same day.
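The scheduled audit reduces to a comparison between the zone's CNAME targets and an inventory of live infrastructure. A minimal sketch, assuming both sides come from provider APIs in practice; all names here are hypothetical:

```python
def find_dangling_cnames(zone_records, live_hostnames):
    """Return CNAME records whose targets no longer exist in the
    live-infrastructure inventory — subdomain-takeover candidates."""
    dangling = []
    for name, rtype, target in zone_records:
        if rtype == "CNAME" and target not in live_hostnames:
            dangling.append((name, target))
    return dangling

# Hypothetical zone data: (record name, type, target)
zone = [
    ("app.saasproduct.com", "CNAME", "lb-1234.cloudprovider.example"),
    ("staging.saasproduct.com", "CNAME", "lb-9999.cloudprovider.example"),
]
# Hostnames the cloud provider API says still exist
live = {"lb-1234.cloudprovider.example"}

print(find_dangling_cnames(zone, live))
# → [('staging.saasproduct.com', 'lb-9999.cloudprovider.example')]
```

Run on a schedule, this turns the "nobody noticed for three weeks" failure mode into a same-day alert.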

Single-provider risk. Eight months in, Route 53 had a degraded event in their primary region. Their application was mostly unaffected (it stayed up via Route 53's health checks routing traffic to a secondary region), but resolution latency spiked globally for about 40 minutes. Customers in Europe noticed.

Fix: They finally set up Cloudflare as a secondary authoritative, with zone data synchronized via API tooling (Route 53 doesn't support outbound zone transfers). Tested. Monitoring confirmed both providers were serving correct answers. Two weeks of work they should have done at launch.

What They Got Right

The record architecture was sound. They used ALIAS records correctly for the apex domain. The TTL strategy was at least defensible. They had uptime monitoring.

The failure pattern is common: DNS security and email auth treated as optional extras rather than launch requirements. Both of those decisions cost them more time to fix than they would have cost to get right initially.


Case 2: Migrating a 500-Domain Portfolio Between Registrars

The Context

A regional media company had accumulated 500+ domains over fifteen years — a mix of primary brands, regional variants, defensive registrations, expired redirects, and "we registered this in 2009 and nobody knows why" domains. They were migrating from a legacy registrar that was shutting down their SMB product line to a new platform.

The migration was assigned to a three-person ops team with a six-week window before the legacy platform's shutdown date.

The Right Sequence

Most teams get this backwards. They think "domain transfer" means moving the domain registration, and they do it first. This is wrong.

The correct order:

  1. Audit the portfolio
  2. Migrate DNS (change nameservers, copy zone files)
  3. Validate DNS at the new provider
  4. Lower TTLs
  5. Transfer the registrar registration
  6. Validate post-transfer
  7. Restore TTLs

Steps 1 and 2 happen while every domain is still registered at the old registrar; steps 3-7 span the transition. The reason: if the registrar transfer goes wrong (and it will, for some domains), your DNS is already safe at the new provider. You haven't tied a DNS failure to a registrar failure.

Step 1: Audit first.

The team spent the first week doing nothing but auditing. They built a spreadsheet with:

  • Domain name
  • Registrar (some domains had already been moved to a third registrar years earlier — nobody had updated the internal inventory)
  • Expiry date
  • Nameservers (authoritative provider)
  • Primary use: active brand / redirect / defensive / unknown
  • EPP transfer lock status
  • Auth code availability

What they found: 47 domains had expired or were within 30 days of expiry. 23 domains were already at a different registrar than the one being migrated. 61 domains had no DNS records at all — just parked at the registrar. 12 domains had transfer locks that needed manual support tickets to unlock.

None of this was in the internal documentation because there was no internal documentation. The audit itself was valuable independent of the migration.

Step 2: DNS migration first.

Before transferring a single domain registration, they set up the zone files at the new DNS provider. They used OctoDNS to export zone data from the old provider's API and import it into the new one, with a dry-run first to catch any record type mismatches.
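A sketch of what such an OctoDNS configuration might look like. The provider classes, credential names, and zone are illustrative — they depend on which octodns provider modules you install:

```yaml
# config.yaml — old-provider → new-provider sync (names illustrative)
providers:
  old:
    class: octodns_route53.Route53Provider     # wherever the zones live today
  new:
    class: octodns_cloudflare.CloudflareProvider
    token: env/CLOUDFLARE_TOKEN

zones:
  example.com.:
    sources:
      - old
    targets:
      - new
```

octodns-sync runs as a plan (dry-run) by default and only applies changes when given --doit, which is what makes the "dry-run first to catch mismatches" step above cheap.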

Most records transferred cleanly. Exceptions:

  • CAA records: the old provider stored them in a non-standard format
  • ALIAS/ANAME records: the old provider used a proprietary record type; the new provider called it something different
  • Empty zones: domains with no DNS records needed placeholder SOA/NS records

Step 3: Staging validation.

Before lowering any TTLs, they validated the new nameservers by querying them directly:

# Query new nameservers directly, bypassing the active delegation
dig @ns1.newprovider.com example.com A
dig @ns1.newprovider.com example.com MX
dig @ns1.newprovider.com example.com TXT

They wrote a simple script that ran this check for all 500 domains against both old and new nameservers, diffing the output. 23 domains had discrepancies. Most were minor (TTL differences). Three had missing records that needed to be added manually.
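The core of that script is set arithmetic. A sketch with the dig call wrapped in a helper and the comparison kept pure — nameserver names and record data here are hypothetical:

```python
import subprocess

def dig_short(ns: str, domain: str, rtype: str) -> set:
    """One record set from one nameserver, via `dig +short @ns domain type`."""
    out = subprocess.run(
        ["dig", "+short", f"@{ns}", domain, rtype],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}

def diff_record_sets(old: dict, new: dict) -> dict:
    """Compare {rtype: answers} maps from old and new nameservers.
    Returns only the types that differ, with what each side is missing."""
    return {
        rtype: (old.get(rtype, set()) - new.get(rtype, set()),
                new.get(rtype, set()) - old.get(rtype, set()))
        for rtype in set(old) | set(new)
        if old.get(rtype, set()) != new.get(rtype, set())
    }

# Hypothetical answers pulled from the two providers:
old = {"A": {"203.0.113.10"}, "MX": {"10 mail.example.com."}}
new = {"A": {"203.0.113.10"}, "MX": set()}
print(diff_record_sets(old, new))
# → {'MX': ({'10 mail.example.com.'}, set())}
```

Keeping the comparison separate from the network call is what makes the check easy to run against all 500 domains and easy to reuse for post-transfer validation.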

Step 4: TTL lowering.

Three days before starting registrar transfers, they lowered TTLs on all active domains to 300s. This meant that once the NS delegation changed, propagation would complete quickly. For the 61 parked domains with no real traffic, they skipped this step.

Step 5: Transfer coordination.

Registrar transfers have mandatory windows — ICANN requires a 5-day period during which the losing registrar can deny the transfer. With 500 domains, this meant staggering transfers in batches of 50, validating each batch before starting the next.

The team transferred 50 domains per day. Each domain required:

  • Unlock EPP transfer lock at old registrar
  • Retrieve auth code
  • Initiate transfer at new registrar
  • Confirm transfer (some registrars send a confirmation email to the registrant address; if that address bounces, the transfer stalls)

The confirmation email problem was real. About 30 domains had registrant addresses pointing to abandoned email addresses or generic mailboxes nobody checked. Each one needed manual support tickets at the old registrar to push through.

What Takes Longer Than Expected

Everything involving human support queues. Unlocking domains with registry-level locks (serverTransferProhibited, set at the registry — as opposed to the registrar-level clientTransferProhibited) required support tickets at both registrar and registry level for some ccTLDs. For .fr domains, the process was different from .de domains, which was different from .com. Having someone on the team who knew which ccTLDs had quirks saved significant time.

Auth codes expire. If a domain sits in "waiting for transfer" state for more than a few days and the auth code expires, you start over. Some registrars generate new auth codes that expire in 24 hours. Coordinate batch sizes accordingly.

The One Domain That Always Breaks

In every large migration, there's one domain that fights you. In this case, it was a .es domain that had a registry-level transfer lock, an expired registrant email address, a WHOIS record showing incorrect contact information, and an auth code that had been manually overridden by the old registrar's system to a non-standard format.

The .es registry (Red.es) had specific requirements for transfer authorization that differed from the standard ICANN EPP flow. The team spent four working days on this one domain. The resolution required a phone call to the registry, a formal letter from the domain owner, and a 48-hour processing window.

Lesson: identify your unusual ccTLDs before the migration starts. Know their specific transfer rules. Budget extra time.

Post-Transfer Validation

After each batch transferred, the team ran the same validation script used in staging — querying authoritative servers, checking SOA serials, verifying MX and TXT records were intact. They also monitored NXDOMAIN rates from external monitoring during the transfer window to catch any domains that had gone dark.
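Checking SOA serials across providers reduces to comparing one field of the SOA answer. A sketch, assuming dig +short-style output; the server names and serial are hypothetical:

```python
def soa_serial(soa_answer: str) -> int:
    """Extract the serial from a `dig +short ... SOA` answer line:
    'mname rname serial refresh retry expire minimum'."""
    fields = soa_answer.split()
    if len(fields) != 7:
        raise ValueError(f"unexpected SOA answer: {soa_answer!r}")
    return int(fields[2])

# Hypothetical answers from the old and new authoritative servers:
old = "ns1.oldprovider.com. hostmaster.example.com. 2024061801 7200 900 1209600 300"
new = "ns1.newprovider.com. hostmaster.example.com. 2024061801 7200 900 1209600 300"
print(soa_serial(old) == soa_serial(new))  # → True
```

Matching serials don't guarantee matching zone contents across different providers, which is why the team also diffed the individual record sets.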

Three domains had DNS gaps during transfer — the new nameservers weren't serving records correctly for a short window. All three were caught by monitoring within 10 minutes and corrected.

Final State

The migration took nine weeks instead of six. The two extra weeks were absorbed by the ccTLD complications and the confirmation email backlog. All 500 domains migrated successfully. The 61 parked domains were reviewed: 22 were dropped at renewal, 39 were kept with a documented reason.

The OctoDNS configuration from the migration became the team's ongoing DNS management system. The audit spreadsheet became the portfolio registry they'd never had.


Case 3: Responding to a DNS-Based Security Incident

The Threat

A mid-sized financial services firm — not a bank, but an investment advisory business with around 40,000 active clients — discovered that a domain very similar to their brand had been registered and was hosting a phishing site. The phishing domain used a typosquat variant: firmname-secure.com where the legitimate domain was firmname.com.

The phishing site was a pixel-perfect copy of their client portal login page. It was indexed in Google. It had a valid TLS certificate (Let's Encrypt, automatically issued). Two clients had already submitted support tickets saying they couldn't log in — they'd been entering credentials on the phishing site.

This is a real scenario. The techniques are standard. The response is the part that varies.

How It Was Detected

The firm had passive DNS monitoring in place — a subscription to a service (in this case, DomainTools Iris) that alerts on newly registered domains matching brand terms and typosquat patterns. The alert fired within 18 hours of the phishing domain's registration.
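The matching behind such an alert can be approximated with a containment check plus a simple similarity ratio over newly registered domains. A sketch — the brand term, threshold, and feed entries are all hypothetical, and commercial services use far richer heuristics:

```python
from difflib import SequenceMatcher

BRAND = "firmname"

def looks_like_brand(domain: str, threshold: float = 0.8) -> bool:
    """Flag a new registration if its first label contains the brand
    or is a near-miss of it (e.g. one character swapped)."""
    label = domain.split(".")[0].replace("-", "")
    if BRAND in label:
        return True
    return SequenceMatcher(None, BRAND, label).ratio() >= threshold

# Hypothetical new-registration feed entries:
for d in ["firmname-secure.com", "f1rmname.com", "unrelated.com"]:
    print(d, looks_like_brand(d))
# the first two are flagged; unrelated.com is not
```

The point isn't the specific heuristic — it's that the check runs continuously against a registration feed, so the alert fires in hours rather than after the first client complaint.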

Without that monitoring, detection would have depended on clients reporting problems. Some phishing campaigns run for weeks before detection via client reports.

The firm also had DMARC reporting configured (p=reject). The phishing domain was not sending email under the firm's domain, so DMARC didn't catch it directly — but the monitoring infrastructure that DMARC reporting had justified was the same infrastructure that caught the passive DNS alert.

Key lesson from Module 8: passive DNS monitoring and DMARC reporting aren't separate concerns. They're both part of the same defensive posture. Teams that configure one tend to have the infrastructure to configure both.

The Response Timeline

Hour 0: Alert fires. Passive DNS monitoring flags firmname-secure.com as newly registered.

Hour 2: Security team verifies the domain is live and hosting a phishing copy. Screenshots taken. WHOIS data captured (privacy-protected, but registrar and registration date visible). Certificate transparency logs queried to confirm TLS cert issuance date.

Hour 4: Abuse report filed with the registrar. Most registrars have a dedicated abuse reporting path; some respond within hours for clearly evidenced phishing. This registrar's SLA was 24-48 hours.

Hour 6: Abuse report filed with hosting provider. IP address identified via DNS; hosting provider identified via IP WHOIS. Separate abuse report to the hosting provider, which in this case responded faster than the registrar.

Hour 8: Notification to clients via official channels: email from the legitimate domain, in-app notification, social media post. Message: "We are aware of a phishing site. Here is our legitimate URL. We will never ask you to click an email link to log in." Client support volume increased for 48 hours.

Hour 12: UDRP filing process started. The firm's outside counsel handled this. A UDRP complaint requires: evidence that you own a trademark in the name, evidence that the domain is identical or confusingly similar to that trademark, evidence that the registrant has no legitimate interest in the domain, and evidence of bad faith registration and use. All four were provable.

Hour 36: Hosting provider took the phishing site offline. The domain remained registered but the content was gone.

Day 5: Registrar suspended the domain following the abuse report.

Day 45: UDRP decision issued. Complaint upheld. Domain transferred to the firm.

What Changed Afterward

Monitoring expanded. The firm added broader coverage — more brand variant patterns, monitoring for newly issued TLS certificates mentioning their brand name (via Certificate Transparency log monitoring), and alerting on newly registered domains containing their trademark.
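Certificate Transparency monitoring is the same kind of matching applied to a different feed. A sketch against hypothetical CT entries — a real feed would come from a CT monitor or an aggregator API, and the allowlist from the firm's own inventory:

```python
def suspicious_certs(ct_entries, brand_terms, owned):
    """Return CT log entries whose certificate names mention a brand
    term but weren't issued for a domain we actually control."""
    hits = []
    for entry in ct_entries:
        for name in entry["dns_names"]:
            if any(term in name for term in brand_terms) and name not in owned:
                hits.append((entry["issued"], name))
                break
    return hits

# Hypothetical entries from a CT monitoring feed:
feed = [
    {"issued": "2024-06-18", "dns_names": ["www.firmname.com"]},
    {"issued": "2024-06-19", "dns_names": ["firmname-secure.com"]},
]
print(suspicious_certs(feed, ["firmname"], {"firmname.com", "www.firmname.com"}))
# → [('2024-06-19', 'firmname-secure.com')]
```

In this incident the phishing cert was issued within hours of registration, so CT monitoring would have fired on roughly the same timeline as the passive DNS alert.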

Client communication protocol written. The incident exposed that there was no pre-written template for client notifications about phishing. The six-hour gap between verifying the phishing site (hour 2) and notifying clients (hour 8) was partly because the team was writing the notification from scratch while also managing the technical response. Now there's a template, an approval chain, and a defined notification window.

UDRP preparedness improved. The firm's counsel prepared a standard evidence package that could be adapted quickly for future UDRP filings, rather than building from scratch each time. They also registered the most obvious typosquat variants defensively — not exhaustively, but the high-risk ones.

One thing that didn't change: the firm decided not to pursue criminal referral to law enforcement. The attacker was likely operating from a jurisdiction where prosecution would be impractical. The practical outcome — domain takedown and transfer — was achieved through the abuse reporting and UDRP path. Criminal referrals in these cases rarely produce outcomes within a timeline that helps the victim.

The Monitoring That Actually Caught This

The detection came from passive DNS monitoring that watched for new domain registrations matching brand patterns. This is the same capability discussed in Module 8 (Brand Protection) and Module 4 (Advanced Security). The point here is that the capability wasn't deployed reactively — it was already running before the incident happened.

Security tooling that requires an incident to justify its existence is security tooling that will often be deployed too late.


Key Takeaways

  • DNS launch configuration and email auth are launch requirements, not post-launch cleanup. The SaaS case study's email deliverability problem cost more to fix than it would have cost to prevent.
  • Domain migrations should separate DNS migration from registrar migration. Move the zone data first, validate it, then transfer the registration.
  • Passive DNS monitoring needs to be running before an incident, not in response to one. Detection speed is everything in phishing response.
  • Every one of these cases touched multiple modules simultaneously. DNS doesn't fail in isolation.

Up Next

Lesson 2 — DNS in DevOps Workflows: Infrastructure-as-code for DNS, GitOps patterns, CI/CD integration, and how to stop making manual record changes in production.