Module 3 · Lesson 3
DNS-Based Load Balancing and Traffic Management
⏱ 45 min
Round-robin DNS, weighted records, geo-DNS, Route 53 routing policies, and the CNAME-at-apex problem. When DNS load balancing works, when it fails, and when you need a real load balancer.
DNS load balancing is one of those ideas that sounds better than it is. Once you understand exactly where it breaks down, it becomes a precise tool you can apply correctly.
The concept is simple: publish multiple A records for a hostname, and resolvers return them in different orders to different clients. Each client connects to whichever IP is first in the response. Traffic distributes across your servers.
The reality is messier. Let's go through it properly.
Round-Robin DNS: The Basics and the Problem
api.example.com. 60 IN A 203.0.113.1
api.example.com. 60 IN A 203.0.113.2
api.example.com. 60 IN A 203.0.113.3
The authoritative DNS server rotates the order of these records in responses. Client A gets [1, 2, 3], client B gets [2, 3, 1], client C gets [3, 1, 2]. Each connects to the first IP. Roughly equal distribution.
The problem is client-side caching. When a browser or OS resolver caches these records, it caches the entire set for the TTL. For the duration of that TTL, the same client always connects to the same IP. One pinned client is harmless; the trouble is that caching happens at several layers between DNS and your servers:
- OS resolver caching: each machine caches the DNS response for the full TTL
- Shared recursive resolvers: one ISP or public resolver caches a single answer and hands it to thousands of clients behind it
- Application-level caching: HTTP clients and connection pools often pin established connections to a specific IP well past the TTL
- CDN origin fetches: a CDN PoP may cache your origin's DNS answer and funnel all of that PoP's traffic to one backend
The result: with a 300-second TTL, you might have some servers handling 3x the load of others for 5-minute windows as different caches expire at different times.
# Simulate what different clients see with round-robin
# This shows why round-robin doesn't guarantee even distribution
import dns.resolver

def query_multiple_times(hostname: str, count: int = 5):
    # A fresh Resolver has no cache, so every call hits the network
    # and we see the authoritative server's raw rotation
    resolver = dns.resolver.Resolver()
    resolver.cache = None
    for i in range(count):
        answer = resolver.resolve(hostname, 'A')
        ips = [rdata.address for rdata in answer]
        print(f"Query {i+1}: {ips[0]} (first in response)")

query_multiple_times('api.example.com')
# Query 1: 203.0.113.1
# Query 2: 203.0.113.2
# Query 3: 203.0.113.3
# Query 4: 203.0.113.1 (rotates back)
# Query 5: 203.0.113.2
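The rotation above is what the authoritative server emits; the skew comes from the caches in between. The toy model below is illustrative only (real resolver populations are far messier), but it shows how a handful of shared resolver caches turns an even rotation into uneven backend load:

```python
import random

def simulate_shared_resolver_skew(backends, clients=10_000, resolvers=5, seed=42):
    """Toy model: many clients share a few caching resolvers.

    Each resolver caches ONE answer for the whole TTL window, so every
    client behind it lands on the same backend until the cache expires.
    """
    rng = random.Random(seed)
    # Each resolver caches whichever backend the authoritative server
    # happened to rotate to at cache-fill time
    cached = {r: rng.choice(backends) for r in range(resolvers)}
    load = {b: 0 for b in backends}
    for _ in range(clients):
        resolver = rng.randrange(resolvers)  # client uses its ISP's shared resolver
        load[cached[resolver]] += 1
    return load

backends = ["203.0.113.1", "203.0.113.2", "203.0.113.3"]
# With only 5 resolver caches, the split lands far from 1/3 each
print(simulate_shared_resolver_skew(backends))
```

With a small number of caches the distribution is dominated by which backend each cache happened to pick, which is exactly the 5-minute skew window described above.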
Weighted DNS
Most DNS providers (Route 53, Cloudflare, NS1) support weighted records, where you assign a relative weight to each backend:
AWS Route 53 weighted routing (CLI setup):
# Create weighted records for three backends
# Weight 50 = 50% of traffic (relative to sum of all weights)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "api.example.com",
          "Type": "A",
          "SetIdentifier": "backend-1",
          "Weight": 50,
          "TTL": 60,
          "ResourceRecords": [{"Value": "203.0.113.1"}]
        }
      },
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "api.example.com",
          "Type": "A",
          "SetIdentifier": "backend-2",
          "Weight": 30,
          "TTL": 60,
          "ResourceRecords": [{"Value": "203.0.113.2"}]
        }
      },
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "api.example.com",
          "Type": "A",
          "SetIdentifier": "backend-3",
          "Weight": 20,
          "TTL": 60,
          "ResourceRecords": [{"Value": "203.0.113.3"}]
        }
      }
    ]
  }'
Traffic distribution: 50% to backend-1, 30% to backend-2, 20% to backend-3. This is useful for canary deployments: set a new backend to weight 5 while the stable backend stays at 95. Gradually shift traffic without changing your load balancer config.
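Under the hood, weighted selection is simple probability: each record is chosen with probability equal to its weight divided by the sum of all weights. A sketch of that selection logic (illustrative only, not Route 53's actual implementation):

```python
import random

def weighted_pick(records, rng=random):
    """Pick one record with probability weight / sum(weights),
    mirroring how weighted DNS chooses among record sets."""
    total = sum(w for _, w in records)
    roll = rng.uniform(0, total)
    upto = 0.0
    for value, weight in records:
        upto += weight
        if roll <= upto:
            return value
    return records[-1][0]  # guard against float edge cases

records = [("203.0.113.1", 50), ("203.0.113.2", 30), ("203.0.113.3", 20)]
counts = {ip: 0 for ip, _ in records}
rng = random.Random(7)
for _ in range(100_000):
    counts[weighted_pick(records, rng)] += 1
# counts approach the 50/30/20 split as the sample grows
print({ip: round(n / 1000) for ip, n in counts.items()})
```

This is also why a weight-5 canary only holds in aggregate: any individual cached resolver still sends 100% of its clients to whichever record it picked.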
Geo-DNS: Route by Client Location
Geo-DNS returns different answers based on where the DNS query comes from. A client in Europe gets a European IP; a client in the US gets a US IP.
Route 53 geolocation routing:
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "api.example.com",
          "Type": "A",
          "SetIdentifier": "europe",
          "GeoLocation": {"ContinentCode": "EU"},
          "TTL": 60,
          "ResourceRecords": [{"Value": "198.51.100.10"}]
        }
      },
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "api.example.com",
          "Type": "A",
          "SetIdentifier": "north-america",
          "GeoLocation": {"ContinentCode": "NA"},
          "TTL": 60,
          "ResourceRecords": [{"Value": "198.51.100.20"}]
        }
      },
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "api.example.com",
          "Type": "A",
          "SetIdentifier": "default",
          "GeoLocation": {"CountryCode": "*"},
          "TTL": 60,
          "ResourceRecords": [{"Value": "198.51.100.30"}]
        }
      }
    ]
  }'
The "default" record with CountryCode: "*" catches everything that doesn't match a specific rule. Always include one.
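The matching precedence can be sketched as a small function: a country match wins first, then the continent, then the default. This is an illustrative model of the documented precedence, not Route 53 code; the rule format and the `continent:` prefix are invented for the example:

```python
def geo_route(client_country, client_continent, rules):
    """Pick a record the way geolocation routing does: most specific
    match wins (country beats continent beats the '*' default).

    rules: list of (match, ip) where match is a country code,
    a continent code prefixed 'continent:', or '*' for the default.
    """
    by_country = {m: ip for m, ip in rules
                  if m != "*" and not m.startswith("continent:")}
    by_continent = {m.split(":", 1)[1]: ip for m, ip in rules
                    if m.startswith("continent:")}
    default = next((ip for m, ip in rules if m == "*"), None)
    if client_country in by_country:
        return by_country[client_country]
    if client_continent in by_continent:
        return by_continent[client_continent]
    return default  # None: no default record, the query gets no answer

rules = [
    ("continent:EU", "198.51.100.10"),
    ("continent:NA", "198.51.100.20"),
    ("*", "198.51.100.30"),
]
print(geo_route("FR", "EU", rules))  # 198.51.100.10 (European backend)
print(geo_route("JP", "AS", rules))  # 198.51.100.30 (falls through to default)
```

The last call is the case the default record exists for: a client in a region with no specific rule still gets an answer.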
The accuracy problem: Geo-DNS works by mapping the querying resolver's IP to a location, not the client's IP. When a user in France is using Google's public resolver (8.8.8.8, located in the US), Route 53 sees a US IP and routes them to your US backend. This is why EDNS Client Subnet (ECS) exists: it passes a truncated version of the client's IP to the authoritative server so geo-routing can be accurate. Covered in detail in Lesson 07.
Failover with Low TTL and Health Checks
Route 53 failover routing pairs DNS with health checks:
# Primary record with health check
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "HealthCheckId": "abc12345-...",
        "TTL": 30,
        "ResourceRecords": [{"Value": "203.0.113.1"}]
      }
    }, {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "secondary",
        "Failover": "SECONDARY",
        "TTL": 30,
        "ResourceRecords": [{"Value": "203.0.113.2"}]
      }
    }]
  }'
Route 53 health checkers (from ~15 global locations) poll your primary endpoint every 10-30 seconds. If it fails, Route 53 stops returning the primary record. Clients re-resolving within the 30-second TTL window get the secondary IP.
Failover time math: worst case is detection time (health check interval times the consecutive-failure threshold; 30s with 30-second checks and a threshold of 1), plus the TTL (30s), plus DNS propagation through resolvers (~5s). You're looking at roughly 65 seconds of degraded service, and more if you keep Route 53's default failure threshold of 3. That's acceptable for many applications and far better than manual intervention.
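That budget is worth parameterizing. A back-of-envelope helper (assumption: the 65-second figure uses a failure threshold of 1; Route 53's default threshold of 3 stretches detection time):

```python
def worst_case_failover_seconds(check_interval, failure_threshold, ttl,
                                propagation=5):
    """Back-of-envelope failover budget:
    detection (interval x consecutive failures) + client TTL + resolver lag."""
    detection = check_interval * failure_threshold
    return detection + ttl + propagation

# 30s checks, fail after 1 miss, 30s TTL: the ~65s figure above
print(worst_case_failover_seconds(30, 1, 30))   # 65
# Same setup with Route 53's default threshold of 3
print(worst_case_failover_seconds(30, 3, 30))   # 125
# "Fast" 10s checks claw most of that back
print(worst_case_failover_seconds(10, 3, 60))   # 95
```

Running the numbers like this before an incident tells you whether your SLA can absorb DNS failover at all, or whether you need an in-path load balancer.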
Latency vs Availability Trade-off
Low TTL (30s) = fast failover, more DNS queries, potential latency spikes on cold lookups. High TTL (300s+) = slower failover, fewer queries, faster resolution from cache.
The right answer depends on your SLA. For public-facing APIs where 5-minute outages are unacceptable, use low TTL. For internal services where eventual convergence is fine, use higher TTLs.
One thing many teams miss: pre-lowering TTLs before planned changes. If your current TTL is 300s and you lower it the moment you start your migration, you'll have stragglers cached for 5 minutes. If you lower the TTL to 30s 24 hours in advance, all resolvers will have expired the old cache before your change window.
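The required lead time is simple arithmetic: at least one old-TTL period before the change window, padded for resolvers that stretch TTLs. A small helper; the safety factor is an invented illustration, not a standard:

```python
from datetime import datetime, timedelta

def ttl_lowering_deadline(change_window, old_ttl_seconds, safety_factor=2):
    """Latest time to drop the TTL so every resolver cache holding the
    old (high-TTL) answer has expired before the change window opens.
    safety_factor pads for resolvers that stretch TTLs."""
    return change_window - timedelta(seconds=old_ttl_seconds * safety_factor)

window = datetime(2024, 6, 1, 2, 0)        # 02:00 maintenance window
print(ttl_lowering_deadline(window, 300))  # lower the TTL by 01:50 at the latest
```

In practice most teams just lower the TTL a day ahead, as described above; the helper only shows the hard minimum you must not cut into.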
The CNAME-at-Apex Problem
One DNS constraint trips up many developers: you cannot publish a CNAME record at your zone apex (the root of your domain). Under RFC 1034's rules a CNAME cannot coexist with any other records, and the apex must carry SOA and NS records, so a CNAME there is invalid. If your domain is example.com, you cannot do this:
; This is invalid — CNAME at apex breaks the zone
example.com. IN CNAME d1234.cloudfront.net.
This matters for load balancing because many managed load balancers and CDNs give you a hostname to point at, not an IP. If you want example.com (not www.example.com) to point to an ALB or CloudFront distribution, a standard CNAME won't work.
Route 53 ALIAS records solve this for AWS-hosted endpoints. An ALIAS record behaves like a CNAME from the operator's perspective, but on the wire it is an ordinary A/AAAA record: Route 53 resolves the target server-side and returns the resulting IPs to the client.
# Create an ALIAS record at the apex pointing to an ALB
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "example.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z35SXDOTRQ7X7K",
          "DNSName": "my-alb-1234567890.us-east-1.elb.amazonaws.com.",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'
Route 53 ALIAS records work for ALBs, CloudFront, S3 static websites, and other Route 53 routing targets. They don't work for arbitrary third-party hostnames (for that, use www. or a different subdomain with a real CNAME).
If you're not on Route 53, check your DNS provider's equivalent: Cloudflare has CNAME flattening, NS1 has ALIAS records, and many others have adopted similar mechanisms. Not all providers support this.
When to Use DNS LB vs a Real Load Balancer
| Scenario | Use DNS LB | Use Application LB (ALB/nginx/HAProxy) |
|---|---|---|
| Global geo-routing | Yes | Complex (needs Anycast or global LB) |
| Canary deployments | Yes (weighted DNS) | Yes, and more precise |
| Session stickiness | No | Yes |
| SSL termination | No | Yes |
| Path-based routing | No | Yes |
| Per-request load balancing | No | Yes |
| Health-check-based failover | Yes (with provider support) | Yes, and faster |
| Per-connection load balancing | No | Yes |
DNS load balancing works at resolution granularity, not request granularity. Once a client has resolved and cached an IP, every connection and every request goes to that IP until the TTL expires; DNS is out of the picture. An application load balancer can route each HTTP request independently, which is why it can do path-based routing, sticky sessions, and per-request health checking.
Use DNS load balancing for traffic distribution across regions and availability zones at the macro level. Use application load balancers for everything within a region.
Key Takeaways
- Round-robin DNS distributes load across lookups, not across requests; client caching makes actual distribution uneven
- Weighted DNS is the right tool for canary deployments and traffic shaping at the DNS layer
- Geo-DNS accuracy depends on EDNS Client Subnet support; without ECS, resolver location is used, not client location
- Failover via Route 53 health checks works well with TTLs of 30-60 seconds; budget 60-90 seconds for full failover
- Lower TTL at least one full TTL period before a planned DNS change to avoid stragglers
- DNS load balancing is for geographic/availability-zone distribution; use an application LB for per-request routing
Further Reading
- Route 53 routing policies documentation
- RFC 7871 — EDNS Client Subnet
- NS1 blog: Weighted DNS explained
Up Next
Lesson 04 covers DANE (DNS-based Authentication of Named Entities), which uses DNSSEC to cryptographically bind TLS certificates to DNS records. More complex, but genuinely useful in specific contexts.