Module 6 · Lesson 8

Case Studies: Advanced DNS in Production

35 min

Four real examples of advanced DNS at work: a CDN failover that saved an outage, ML-based DNS detection catching an active attack, an IPv6 migration gone wrong, and a brand protection campaign.

Reading about DNS mechanisms is one thing. Watching them succeed or fail in production is another. These four cases are drawn from documented incidents, published post-mortems, and patterns I've observed across 20 years of DNS operations. They illustrate how the techniques from this module play out when the stakes are real.

Case 1: CDN DNS failover averting a major outage

The setup. A global e-commerce company runs its product catalog and checkout behind a CDN (Fastly in this case). The origin servers are a fleet of application servers in two data centers. The CDN handles traffic distribution, caching, and TLS termination. Health checks run from the CDN layer to the origin every 10 seconds.

What happened. A bad configuration push to the origin servers in the primary data center caused all origin servers to return 500 errors. The application was broken; it wasn't a network issue. CDN caching shielded users from the problem initially — cached responses continued serving successfully for several minutes.

As cache TTLs expired, the CDN began attempting to revalidate content at the origin. The origins returned 500s. Fastly's health check system detected the origin as unhealthy — when 100% of requests to an origin return 5xx over a rolling window, the health check fails.

The DNS part. Fastly uses DNS-based origin selection. When the primary origin's health check failed, Fastly's DNS layer stopped including the primary origin's IP in routing decisions and shifted 100% of traffic to the secondary data center within approximately 30 seconds.
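
The mechanism can be sketched as a rolling-window health check feeding a DNS-layer routing decision. This is a minimal illustration, not Fastly's implementation — the window size, threshold, and IP addresses are all made up:

```python
from collections import deque

class OriginHealth:
    """Rolling-window origin health check: the origin is marked unhealthy
    only when 100% of recent responses are 5xx (hypothetical thresholds)."""

    def __init__(self, window_size=10):
        self.window = deque(maxlen=window_size)

    def record(self, status_code):
        self.window.append(status_code)

    def healthy(self):
        if len(self.window) < self.window.maxlen:
            return True          # not enough data yet: assume healthy
        # Unhealthy only when every response in the window is a 5xx.
        return any(code < 500 for code in self.window)

def select_origin(primary_ip, secondary_ip, primary_health):
    """DNS-layer origin selection: which IP the routing layer hands out."""
    return primary_ip if primary_health.healthy() else secondary_ip

# Bad config push: every origin request returns 500, window fills with 5xx.
h = OriginHealth(window_size=10)
for _ in range(10):
    h.record(500)
print(select_origin("198.51.100.10", "203.0.113.20", h))  # prints 203.0.113.20
```

Once the window flips, the next routing decision goes to the secondary — which is why the observed failover time is bounded by the health-check window plus the routing TTL.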

The secondary data center didn't have the bad config push. Traffic flowed normally. Users experienced degraded performance for ~3 minutes (while Fastly's health check thresholds were being crossed) and then service resumed at the secondary origin.

The lesson. DNS-based failover isn't just about network failures — it covers any failure mode the health check can detect. In this case, the health check was simple: HTTP 200 vs non-200. That's enough. The failover TTL was 30 seconds, which meant stale routing persisted for at most 30 seconds after the health check flipped. The company's RTO (recovery time objective) for the CDN layer was under 60 seconds. They hit it.

What could have gone wrong. If the secondary data center had been pulling configuration from the same deployment system, it would have gotten the same bad config. The failover would have switched traffic to another broken origin. DNS failover doesn't protect you if the failure is in shared configuration.

Case 2: ML-based DNS detection catching active C2 communication

The setup. A financial services company runs Cisco Umbrella as their DNS security layer. All corporate DNS queries go through Umbrella's resolvers. The company has ~4,000 endpoints.

What happened. A workstation in the finance department was compromised via a phishing email with a macro-enabled attachment. The malware established persistence and began communicating with a command-and-control server using DNS tunneling — encoding instructions in DNS query subdomains to a domain registered the previous week.

The traditional perimeter tools didn't catch it. The phishing email passed spam filters. The initial payload was obfuscated enough to avoid signature-based AV. The DNS tunneling traffic looked like normal DNS at a glance.

The detection. Umbrella's ML layer flagged the workstation for two reasons simultaneously:

First, the query volume. The workstation sent 847 queries to subdomains of the C2 domain in a 15-minute window. Cisco Umbrella's behavioral model identifies endpoints sending unusually high query rates to a single parent domain — a standard DNS tunneling signal. The workstation's average query rate across all domains was typically around 30 queries per hour; 847 queries to a single domain in 15 minutes was a dramatic deviation.

Second, the domain itself. The C2 domain was 6 days old, had no prior query history in Umbrella's global network (meaning no other Umbrella customer had ever queried it), and its subdomain patterns had high entropy. Umbrella's domain reputation model flagged it as likely malicious.
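
The two signals that can be computed from query logs alone — volume to a single parent domain and subdomain entropy — fit in a few lines. This is a toy sketch, not Umbrella's model; the thresholds and the domain name are purely illustrative:

```python
import math
from collections import Counter

def shannon_entropy(label):
    """Bits per character of a DNS label; encoded tunneling payloads score high."""
    counts = Counter(label)
    n = len(label)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def tunneling_suspected(queries, parent,
                        volume_threshold=500, entropy_threshold=3.5):
    """Flag a client's query log when BOTH signals fire for one parent domain.
    Thresholds are illustrative, not any vendor's actual values."""
    to_parent = [q for q in queries if q.endswith("." + parent)]
    volume_anomaly = len(to_parent) > volume_threshold
    labels = [q[: -len(parent) - 1] for q in to_parent]
    avg_entropy = (sum(shannon_entropy(l) for l in labels) / len(labels)
                   if labels else 0.0)
    return volume_anomaly and avg_entropy > entropy_threshold
```

Requiring both signals together is what keeps the false-positive rate down: a busy but legitimate domain trips the volume check alone, and a single odd-looking subdomain trips the entropy check alone, but rarely both at once.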

Umbrella blocked further queries to the domain and generated an alert. The security team quarantined the workstation within 40 minutes of initial compromise.

The lesson. The detection worked because of a combination of signals: query behavioral anomaly on the client side plus domain reputation on the domain side. Either alone might have generated a false positive or been bypassed (an attacker can slow their query rate; a newly registered domain alone doesn't guarantee malice). Together, they were sufficient to trigger action.

The caveats. The 40-minute window is relevant. DNS-layer blocking stopped the tunneling, but the malware was already running for some time before detection. In more sophisticated attacks, initial exfiltration or reconnaissance may have already occurred by the time DNS-layer detection triggers. DNS security is one layer, not a complete defense.

Case 3: IPv6 migration with unexpected split-brain

The setup. A software company decides to enable IPv6 on their public-facing services as part of a broader infrastructure modernization. Their DNS is managed by a third-party provider. Their authoritative nameservers are already accessible over IPv6 (the DNS provider handles that). They need to add AAAA records for their application servers.

What went wrong. The network team added AAAA records pointing to the IPv6 addresses of the application load balancers. The application team tested from the office — everything worked. They pushed the records to production.

Within an hour, a monitoring alert fired: 12% of users were experiencing connection failures.

Investigation revealed the problem: the company's application load balancers were running in a cloud environment where the IPv6 addresses were public-facing, but the security group rules had not been updated to allow inbound IPv6 traffic. IPv4 traffic was explicitly allowed on ports 80 and 443. IPv6 traffic had no rules — it was dropped.

Clients on IPv6-only networks got the AAAA record, attempted an IPv6 connection, and failed outright. Dual-stack clients that preferred IPv6 (via Happy Eyeballs) had the IPv6 attempt fail silently, then waited out the Happy Eyeballs fallback timer before retrying over IPv4. For those clients the symptom was noticeable added latency (250 ms to 500 ms per connection) rather than complete failure.
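
The fallback cost can be sketched as follows. Real Happy Eyeballs (RFC 8305) races the IPv6 and IPv4 attempts with a short head start rather than trying them strictly in sequence; this simplified sequential version just shows where the delay comes from when IPv6 is silently dropped:

```python
import socket
import time

def connect_with_fallback(host, port, v6_timeout=0.3, v4_timeout=5.0):
    """Simplified, sequential sketch of Happy Eyeballs fallback.
    Try IPv6 with a short timeout; on failure, retry over IPv4.
    Returns (socket, family, seconds_elapsed)."""
    start = time.monotonic()
    attempts = ((socket.AF_INET6, v6_timeout), (socket.AF_INET, v4_timeout))
    for family, timeout in attempts:
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
            sock = socket.socket(family, socket.SOCK_STREAM)
            sock.settimeout(timeout)
            # If IPv6 packets are dropped by a firewall (no RST, no ICMP),
            # this blocks for the full v6_timeout before we fall back.
            sock.connect(infos[0][4])
            return sock, family, time.monotonic() - start
        except OSError:
            continue
    raise OSError(f"could not connect to {host}:{port} over IPv6 or IPv4")
```

The key detail matching the incident: because the firewall dropped the packets silently instead of rejecting them, the client learns nothing until the timer expires — every connection pays the full fallback delay.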

The split-brain component. When the team investigated further, they found a secondary problem: their internal monitoring ran DNS queries through an internal resolver that returned different results (internal IP addresses rather than the public-facing ones) for a subset of domains. The internal resolver had been updated with AAAA records, but the internal application servers — behind the internal load balancers — also had IPv6 addresses that weren't listening on port 443. So internal employees were hitting a different set of broken IPv6 endpoints than external users.

The resolution. First, firewall rules were updated to allow IPv6 traffic on ports 80 and 443 — a 5-minute fix once identified. Second, the internal resolver's AAAA records were fixed to point to addresses that were actually listening. The monitoring alert cleared within 15 minutes of the firewall fix.

The lesson. Publishing AAAA records without verifying end-to-end IPv6 connectivity — including firewall rules, security groups, and application-layer listening — is the most common IPv6 DNS mistake. The monitoring infrastructure using a different DNS path than external users masked the problem during testing.

Post-incident, the team added IPv6-specific health checks to monitoring: curl -6 https://example.com alongside the existing curl https://example.com. They also added a pre-deployment checklist for any DNS change: "is this address actually reachable over the new protocol/network path?"
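
That checklist item can be automated. A minimal pre-deployment check along those lines — the host and port here are placeholders — verifies that the service actually accepts connections over both address families before any AAAA record is published:

```python
import socket

def verify_dual_stack(host, port=443, timeout=5.0):
    """Pre-deployment check: confirm the service accepts TCP connections
    over BOTH address families. Returns e.g. {"IPv4": True, "IPv6": False}."""
    results = {}
    for name, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
            sock = socket.socket(family, socket.SOCK_STREAM)
            sock.settimeout(timeout)
            sock.connect(infos[0][4])   # fails on dropped packets or no listener
            sock.close()
            results[name] = True
        except OSError:
            results[name] = False
    return results
```

Note that a TCP connect alone doesn't prove the application layer is healthy (the curl checks also validate TLS and HTTP), but it would have caught both failure modes in this incident: the unpatched security group and the non-listening internal servers.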

Case 4: Domain monitoring for brand protection catching a phishing campaign

The setup. EBRAND's X-RAY platform monitors domain registrations globally for patterns matching client brands. A client in the financial services sector (a major European bank) had configured monitoring for their brand name, common typosquats, and variations in major TLDs.

What happened. On a Tuesday morning, the X-RAY system flagged three domain registrations made in the previous 12 hours:

  • secure-[bankname]-login.com
  • [bankname]-accountverify.net
  • [bankname]support.org
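
A registration-layer monitor needs some way to score newly registered domains against a protected brand. The sketch below is a toy version — nothing like X-RAY's actual scoring — using substring and fuzzy matching, with the hypothetical brand "examplebank" standing in for the redacted name:

```python
import difflib
import re

def lookalike_score(domain, brand):
    """Score a newly registered domain against a protected brand string.
    1.0 = brand embedded verbatim; lower values via fuzzy similarity.
    Toy heuristic only — real platforms combine many more signals."""
    label = domain.lower().split(".")[0]        # registrable label only
    stripped = re.sub(r"[^a-z0-9]", "", label)  # drop hyphens etc.
    if brand in stripped:                       # e.g. secure-examplebank-login
        return 1.0
    # Fuzzy similarity catches typosquats (transposed/substituted letters).
    return difflib.SequenceMatcher(None, stripped, brand).ratio()

new_registrations = ["secure-examplebank-login.com",
                     "examplebank-accountverify.net",
                     "weather-news.org"]
flagged = [d for d in new_registrations
           if lookalike_score(d, "examplebank") >= 0.8]
```

In practice the string score is only the first filter; as the case shows, the confidence assessment also weighs registration age, hosting co-location, and passive DNS history.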

All three were registered through a privacy-protected registrar. All three resolved to the same IP address — a VPS in Eastern Europe that was also hosting a fourth domain (a lookalike for a different financial institution in another country). The hosting IP had no prior history in passive DNS — the VPS appeared to have been freshly provisioned.

The DNS-layer evidence. The WHOIS data was minimal (privacy-protected). But passive DNS told a clearer story: the IP had appeared in DNS records within the past 48 hours and immediately started hosting multiple financial brand lookalikes. The pattern — rapid domain registration, immediate DNS activation, multiple targets simultaneously, privacy-protected registration — matched known phishing campaign profiles.

The X-RAY platform assigned these a high confidence score for phishing intent.

The response. The bank's security team was notified within hours of registration. They initiated takedown requests through the registrar for all three domains, submitted the domains to browser-based phishing blocklists (Google Safe Browsing, Microsoft SmartScreen), and added RPZ blocks to their corporate resolver to prevent any employees from resolving the lookalike domains.
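
For reference, the RPZ block is a small policy zone served to the corporate resolver. A minimal BIND-style sketch with placeholder domain names — in RPZ semantics, a CNAME to the root (".") means "answer NXDOMAIN":

```
$TTL 300
@   IN SOA  localhost. admin.localhost. ( 1 3600 600 86400 300 )
    IN NS   localhost.

; Force NXDOMAIN for each lookalike domain and all of its subdomains
secure-examplebank-login.com       CNAME .
*.secure-examplebank-login.com     CNAME .
examplebank-accountverify.net      CNAME .
*.examplebank-accountverify.net    CNAME .
examplebanksupport.org             CNAME .
*.examplebanksupport.org           CNAME .
```

The short TTL matters here: it lets the security team add or retract entries and have every corporate endpoint pick up the change within minutes.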

The VPS IP was shared with relevant ISAC threat intel sharing groups. Within 24 hours, the domains were suspended by the registrar.

The timing advantage. The phishing domains were registered but the phishing campaign hadn't fully launched — no phishing emails had been sent yet to customers (or if they had, in very small test volumes). By acting at the domain registration stage rather than waiting for customer reports, the bank was able to shut down the infrastructure before significant user harm occurred.

The lesson. DNS monitoring at the registration layer provides a different kind of early warning than monitoring at the query layer. Query-layer monitoring (RPZ blocks, DNS security products) catches attacks after they've started targeting your users. Registration-layer monitoring can catch them before. The overlap between the two — detecting a new domain at registration and immediately adding it to RPZ blocks — closes most of the window.

Themes across the cases

Looking at all four cases together, a few patterns emerge:

DNS-layer decisions have narrow time windows. CDN DNS failover, health-check-triggered routing changes, and RPZ updates all operate in seconds to minutes. That speed is the value. But it also means DNS-based defenses need to be automated — manual processes are too slow.

DNS is one layer. In each case, DNS-layer security or DNS-based failover was part of a system, not a complete solution. The CDN failover required healthy origin infrastructure at the secondary site. The ML detection required endpoint response to contain the threat. The phishing detection required registrar and browser vendor cooperation to execute the takedown.

Observability is the prerequisite. In the IPv6 case, the incident wouldn't have been detected quickly without the monitoring alert. In the ML detection case, the behavior anomaly was only visible because all DNS queries were logged and analyzed. You can't detect or respond to what you can't see.

Key takeaways

  • CDN DNS failover works when secondary infrastructure is independently healthy — it doesn't protect against shared failure modes
  • DNS-layer ML detection is most effective when combining behavioral (query volume/pattern) and reputational (domain age, prior history) signals
  • IPv6 AAAA records require end-to-end verification including firewall rules, security groups, and actual listening applications
  • Domain registration monitoring catches phishing campaigns earlier than query-layer detection alone
  • All DNS-based security operates in narrow time windows and requires automated response to be effective

That's Module 6 complete. You've covered IPv6 DNS, IoT, blockchain DNS, ML-based detection, CDN traffic steering, new protocol developments, DNS-layer security, and seen all of it in real production contexts. The underlying theme throughout: DNS is a protocol, but how it's used and secured is an operational discipline. The tools matter less than understanding the mechanisms and failure modes.