Module 3 · Lesson 2

Integrating DNS into Application Architecture

40 min

SRV records, DNS-based service discovery, and using DNS as a configuration layer — with real architecture patterns and the caching mistake that will eventually bite you.

dns · service-discovery · srv-records · architecture · configuration


The typical way to wire services together is environment variables: DATABASE_HOST=postgres.internal, REDIS_URL=redis://cache.internal:6379. It works. But it's a static snapshot of your infrastructure baked into your configuration at deploy time.

DNS gives you something different: a live directory. Change a record, and every service that re-resolves picks up the change within TTL seconds — no redeploy, no restart, no config update.

This lesson is about how to use DNS as an active part of your architecture rather than just a hostname-to-IP translator.

SRV Records: Service Discovery Without a Service Mesh

SRV records let you publish service location — host, port, priority, and weight — in DNS itself. The format is:

_service._proto.name. TTL IN SRV priority weight port target.

A real example for an internal PostgreSQL cluster:

_postgresql._tcp.db.internal. 30 IN SRV 10 50 5432 primary.db.internal.
_postgresql._tcp.db.internal. 30 IN SRV 20 50 5432 replica1.db.internal.
_postgresql._tcp.db.internal. 30 IN SRV 20 50 5432 replica2.db.internal.

Priority 10 = preferred (the primary). Priority 20 = fallback (replicas). Weight 50 = equal weighting among same-priority records.
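One subtlety worth knowing: within a priority group, RFC 2782 prescribes weighted *random* selection rather than a fixed order, so traffic spreads across same-priority targets in proportion to weight. A minimal sketch of that selection rule (the `pick_endpoint` helper and `ServiceEndpoint` class are illustrative, not part of any library):

```python
import random
from dataclasses import dataclass

@dataclass
class ServiceEndpoint:
    host: str
    port: int
    priority: int
    weight: int

def pick_endpoint(endpoints: list[ServiceEndpoint]) -> ServiceEndpoint:
    # RFC 2782: use the lowest priority group; within it, select
    # randomly in proportion to weight (all-zero weights -> uniform)
    best = min(ep.priority for ep in endpoints)
    group = [ep for ep in endpoints if ep.priority == best]
    weights = [ep.weight for ep in group]
    if sum(weights) == 0:
        return random.choice(group)
    return random.choices(group, weights=weights, k=1)[0]
```

With the records above, this always returns the primary while it exists; if the priority-10 record is removed, traffic splits evenly across the two replicas.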

Your application queries _postgresql._tcp.db.internal for SRV records and gets back the full picture: where to connect, what port to use, what the failover order is. No hardcoded port 5432. No environment variable. No service registry to maintain separately.

Here's how to query SRV records in Python, using the third-party dnspython library:

import dns.resolver
from dataclasses import dataclass

@dataclass
class ServiceEndpoint:
    host: str
    port: int
    priority: int
    weight: int

def discover_service(service_name: str) -> list[ServiceEndpoint]:
    """
    Query SRV records for a service.
    Returns endpoints sorted by priority (lowest = most preferred).
    """
    resolver = dns.resolver.Resolver()
    resolver.timeout = 2.0

    try:
        answer = resolver.resolve(service_name, 'SRV')
    except dns.resolver.NXDOMAIN:
        raise RuntimeError(f"Service not found: {service_name}")
    except dns.resolver.NoAnswer:
        raise RuntimeError(f"No SRV records for: {service_name}")

    endpoints = []
    for rdata in answer:
        # rdata.target is a dns.name.Name — convert to string and strip trailing dot
        host = str(rdata.target).rstrip('.')
        endpoints.append(ServiceEndpoint(
            host=host,
            port=rdata.port,
            priority=rdata.priority,
            weight=rdata.weight,
        ))

    # Sort: lower priority number = higher preference
    endpoints.sort(key=lambda e: (e.priority, -e.weight))
    return endpoints

# Usage
endpoints = discover_service('_postgresql._tcp.db.internal')
primary = endpoints[0]
print(f"Connecting to {primary.host}:{primary.port}")

And in Go:

package main

import (
    "fmt"
    "net"
    "sort"
)

type ServiceEndpoint struct {
    Host     string
    Port     uint16
    Priority uint16
    Weight   uint16
}

func discoverService(service, proto, name string) ([]ServiceEndpoint, error) {
    // net.LookupSRV constructs the query: _service._proto.name
    _, addrs, err := net.LookupSRV(service, proto, name)
    if err != nil {
        return nil, fmt.Errorf("SRV lookup failed: %w", err)
    }

    endpoints := make([]ServiceEndpoint, 0, len(addrs))
    for _, addr := range addrs {
        endpoints = append(endpoints, ServiceEndpoint{
            Host:     addr.Target,
            Port:     addr.Port,
            Priority: addr.Priority,
            Weight:   addr.Weight,
        })
    }

    // net.LookupSRV already sorts by priority and randomizes order by
    // weight per RFC 2782; sort explicitly so the ordering is deterministic
    sort.Slice(endpoints, func(i, j int) bool {
        if endpoints[i].Priority != endpoints[j].Priority {
            return endpoints[i].Priority < endpoints[j].Priority
        }
        return endpoints[i].Weight > endpoints[j].Weight
    })

    return endpoints, nil
}

func main() {
    endpoints, err := discoverService("postgresql", "tcp", "db.internal")
    if err != nil {
        panic(err)
    }
    for _, ep := range endpoints {
        fmt.Printf("Priority %d: %s:%d\n", ep.Priority, ep.Host, ep.Port)
    }
}

DNS as a Configuration Layer

DNS can carry more than service locations. TXT records can hold arbitrary metadata:

_config.myapp.internal. 60 IN TXT "version=v2.3.1"
_config.myapp.internal. 60 IN TXT "feature_flag_dark_mode=true"
_config.myapp.internal. 60 IN TXT "rate_limit_per_sec=500"

This is a lightweight feature flag and configuration system with zero infrastructure — no Consul, no etcd, no database. Changes propagate within 60 seconds to all services that query this record. It's not a replacement for a proper config management system, but for small teams or simple flags it's remarkably effective.

def get_config(app_name: str) -> dict[str, str]:
    resolver = dns.resolver.Resolver()
    config = {}

    try:
        answer = resolver.resolve(f'_config.{app_name}.internal', 'TXT')
        for rdata in answer:
            for string in rdata.strings:
                key, _, value = string.decode().partition('=')
                config[key.strip()] = value.strip()
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        pass  # No config in DNS — use defaults

    return config

cfg = get_config('myapp')
if cfg.get('feature_flag_dark_mode') == 'true':
    enable_dark_mode()

Health Checks via TTL

A pattern used in production at scale: set your DNS TTL very low (10-30 seconds) on records that map to potentially unhealthy backends. Your health check system — a separate process watching your services — updates DNS when it detects a failure. Clients re-resolve within TTL seconds and stop hitting the dead backend.

This works because cached entries expire within the TTL window, so clients re-resolve quickly after a change. The trade-off is more DNS traffic, since resolvers re-query more frequently. At scale, that matters. But for internal services with a handful of backends, a 10-second TTL is fine.

The implementation is simple: your health checker uses your DNS provider's API to update or remove A records when backends fail. AWS Route 53 has built-in health checks that do this automatically.
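A sketch of that health checker, with the probe and the DNS update injected as callables, since the update call depends entirely on your provider's API (the function and parameter names here are illustrative):

```python
import socket
from typing import Callable

def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Crude liveness probe: can we complete a TCP handshake?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def reconcile_dns(
    backends: dict[str, tuple[str, int]],   # record name -> (host, port)
    is_healthy: Callable[[str, int], bool],
    remove_record: Callable[[str], None],   # wraps your DNS provider's API
) -> list[str]:
    """Remove DNS records for dead backends; return the names removed."""
    removed = []
    for record, (host, port) in backends.items():
        if not is_healthy(host, port):
            remove_record(record)
            removed.append(record)
    return removed
```

Run `reconcile_dns` on a short interval; injecting the probe and the update callback also makes the loop trivial to test without touching real DNS.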

The Caching Mistake

Here's the mistake almost everyone makes at some point: rolling your own DNS cache in application code.

It looks like this:

# DON'T DO THIS (at least not naively)
import socket

_dns_cache = {}

def resolve_cached(hostname: str) -> str:
    if hostname in _dns_cache:
        return _dns_cache[hostname]
    ip = socket.gethostbyname(hostname)
    _dns_cache[hostname] = ip  # Cached forever
    return ip

The problem: the entry never expires. DNS TTLs exist so that when you update your infrastructure, clients pick up the change within a bounded window. A permanent cache defeats that entirely. Your application will hold onto a stale IP until it gets a connection error — and then, if you're not careful, it'll fail for everyone simultaneously when the old server finally goes away.

The correct pattern is TTL-aware caching:

import time
import dns.resolver
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    addresses: list[str]
    expires_at: float

class TTLAwareDNSCache:
    def __init__(self):
        self._cache: dict[str, CacheEntry] = {}
        self._resolver = dns.resolver.Resolver()

    def resolve(self, hostname: str) -> list[str]:
        now = time.monotonic()
        entry = self._cache.get(hostname)

        if entry and entry.expires_at > now:
            return entry.addresses

        # Cache miss or expired — re-query
        answer = self._resolver.resolve(hostname, 'A')
        addresses = [rdata.address for rdata in answer]
        ttl = answer.rrset.ttl  # TTL of the returned record set

        self._cache[hostname] = CacheEntry(
            addresses=addresses,
            expires_at=now + ttl,
        )
        return addresses

This respects TTLs. When DNS says a record expires in 30 seconds, your cache expires it in 30 seconds.
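One hardening worth layering on top: if re-resolution fails while you hold an expired entry, serving the stale addresses usually beats failing outright ("stale-if-error"). A sketch with the resolver injected as a callable for testability (the class name and callback shape are illustrative, not a standard API):

```python
import time
from typing import Callable

class StaleOkDNSCache:
    """TTL-aware cache that serves a stale entry when re-resolution fails."""

    def __init__(self, resolve_fn: Callable[[str], tuple[list[str], float]]):
        # resolve_fn(hostname) -> (addresses, ttl_seconds)
        self._resolve = resolve_fn
        self._cache: dict[str, tuple[list[str], float]] = {}

    def resolve(self, hostname: str) -> list[str]:
        now = time.monotonic()
        cached = self._cache.get(hostname)
        if cached and cached[1] > now:
            return cached[0]          # fresh hit
        try:
            addresses, ttl = self._resolve(hostname)
        except Exception:
            if cached:                # expired but present: serve stale
                return cached[0]
            raise                     # nothing to fall back to
        self._cache[hostname] = (addresses, now + ttl)
        return addresses
```

Stale answers are still better than no answers for most internal services: the old IP either works (the backend is still up) or fails at connect time, which is the same failure mode you'd have had anyway.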

Architectures That Use DNS Well

GitLab's internal service routing: GitLab uses DNS SRV records internally to route between services. The registry at _registry._tcp.gitlab.internal tells clients which port the container registry is on, allowing them to move it without reconfiguring every client.

PostgreSQL with pgbouncer: The canonical pattern is a low-TTL CNAME that points to the current primary. When you promote a replica, you update the CNAME. PgBouncer (or your application's connection pool) re-resolves on the next connection attempt. Low TTL (15-30s) means the cutover happens quickly.

AWS service endpoints: AWS publishes all service endpoints via DNS. s3.us-east-1.amazonaws.com resolves to different IPs depending on which S3 servers are healthy. You never hardcode an S3 IP. This is DNS as a live directory in production, at massive scale.

Key Takeaways

  • SRV records encode service location (host + port + priority + weight) in DNS, giving you service discovery without a separate registry
  • DNS TXT records can carry arbitrary key=value configuration, propagating changes within TTL windows
  • Low TTL plus DNS updates is a simple, effective health check mechanism — but costs more DNS queries
  • Never cache DNS results without respecting TTLs; use a TTL-aware cache or let the system resolver handle it
  • Real architectures (GitLab, AWS internal routing, PostgreSQL failover patterns) use DNS actively, not just as a hostname translator

Up Next

Lesson 03 covers DNS-based load balancing: round-robin DNS, weighted records, geo-DNS, and when you should use a real load balancer instead.