Module 3 · Lesson 8
Hands-on: Building DNS-Aware Applications
⏱ 60 min
Build a working service discovery system using SRV records and a DNS TTL-based failover detector. Full code in Python and Go. Run it yourself.
Time to build something real. This lesson has two projects:
- A service registry and discovery system using DNS SRV records — services self-register via a DNS provider API, and clients discover them by querying DNS.
- A failover detector that watches DNS TTLs — monitors a service's DNS records, detects when they change or expire, and triggers callbacks.
Both projects are fully working. Run them locally with a test DNS server, or point them at a real DNS provider.
Project 1: DNS-Based Service Discovery (Python)
We'll use a local CoreDNS instance to simulate a real DNS environment, so you can run this without touching live DNS.
Setup: Local CoreDNS
# docker-compose.yml
cat > docker-compose.yml << 'EOF'
version: '3.8'
services:
coredns:
image: coredns/coredns:latest
ports:
- "5353:53/udp"
- "5353:53/tcp"
volumes:
- ./coredns:/etc/coredns
command: -conf /etc/coredns/Corefile
service-a:
image: python:3.11-slim
command: python /app/service.py service-a 8080
volumes:
- ./:/app
environment:
- DNS_SERVER=coredns
      - DNS_PORT=53
      - ZONE_FILE=/app/coredns/services.local.zone
depends_on:
- coredns
service-b:
image: python:3.11-slim
command: python /app/service.py service-b 8081
volumes:
- ./:/app
environment:
- DNS_SERVER=coredns
      - DNS_PORT=53
      - ZONE_FILE=/app/coredns/services.local.zone
depends_on:
- coredns
EOF
# CoreDNS configuration
mkdir -p coredns
cat > coredns/Corefile << 'EOF'
services.local:53 {
file /etc/coredns/services.local.zone
reload 5s
}
.:53 {
forward . 8.8.8.8
cache 30
}
EOF
# Initial zone file (empty — services will "register" by updating this file)
cat > coredns/services.local.zone << 'EOF'
$ORIGIN services.local.
$TTL 30
@ IN SOA ns1 admin (
2024010101 ; serial
3600 ; refresh
900 ; retry
86400 ; expire
30 ) ; minimum
@ IN NS ns1.
ns1 IN A 127.0.0.1
EOF
The Registry: Writing SRV Records
In a real system, your registry would write to Route 53 or another DNS provider. Here we write to CoreDNS's zone file to simulate it.
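For context, here's roughly what that swap might look like. This is a sketch, not a drop-in: the hosted zone ID is a placeholder, and it assumes boto3 is installed and AWS credentials are configured. The helper builds the ChangeBatch payload that Route 53's change_resource_record_sets call expects:

```python
def srv_change_batch(service: str, zone: str, host: str, port: int,
                     priority: int = 10, weight: int = 50, ttl: int = 30) -> dict:
    """Build a Route 53 ChangeBatch that UPSERTs one SRV record."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": f"_{service}._tcp.{zone}.",
                "Type": "SRV",
                "TTL": ttl,
                # SRV rdata is: priority weight port target
                "ResourceRecords": [
                    {"Value": f"{priority} {weight} {port} {host}."}
                ],
            },
        }]
    }

# With boto3 available (the zone ID below is a placeholder):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z0000000EXAMPLE",
#     ChangeBatch=srv_change_batch("service-a", "example.com", "a1.example.com", 8080),
# )
```

Deregistration is the same call with `"Action": "DELETE"` and the exact record values being removed.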
# registry.py
import os
import socket
import signal
import time
import threading
from dataclasses import dataclass
from pathlib import Path
ZONE_FILE = os.getenv('ZONE_FILE', './coredns/services.local.zone')
@dataclass
class ServiceRegistration:
name: str
host: str
port: int
priority: int = 10
weight: int = 50
ttl: int = 30
class DNSRegistry:
"""
Manages SRV record registration in CoreDNS zone file.
In production, replace write_zone_file with Route 53 API calls.
"""
def __init__(self, zone: str = "services.local"):
self.zone = zone
self._registrations: dict[str, ServiceRegistration] = {}
self._lock = threading.Lock()
def register(self, reg: ServiceRegistration) -> None:
with self._lock:
self._registrations[f"{reg.name}:{reg.host}:{reg.port}"] = reg
self._write_zone_file()
print(f"Registered: {reg.name} at {reg.host}:{reg.port}")
def deregister(self, name: str, host: str, port: int) -> None:
key = f"{name}:{host}:{port}"
with self._lock:
if key in self._registrations:
del self._registrations[key]
self._write_zone_file()
print(f"Deregistered: {name} at {host}:{port}")
def _write_zone_file(self) -> None:
serial = int(time.time())
lines = [
f"$ORIGIN {self.zone}.",
"$TTL 30",
f"@ IN SOA ns1 admin (",
f" {serial} ; serial",
" 3600 ; refresh",
" 900 ; retry",
" 86400 ; expire",
" 30 ) ; minimum",
"",
"@ IN NS ns1.",
"ns1 IN A 127.0.0.1",
"",
]
# Group registrations by service name
by_service: dict[str, list[ServiceRegistration]] = {}
for reg in self._registrations.values():
by_service.setdefault(reg.name, []).append(reg)
# Write SRV records
for service_name, regs in by_service.items():
            for reg in regs:
                target = f"{reg.host}."  # absolute target name by default
                try:
                    socket.inet_aton(reg.host)
                    # SRV targets must be hostnames, not IPs (RFC 2782),
                    # so publish an A record under a label and point the
                    # SRV target at that label (relative to $ORIGIN)
                    label = reg.host.replace('.', '-')
                    lines.append(f"{label} {reg.ttl} IN A {reg.host}")
                    target = label
                except OSError:
                    pass  # Already a hostname
                lines.append(
                    f"_{service_name}._tcp {reg.ttl} IN SRV "
                    f"{reg.priority} {reg.weight} {reg.port} {target}"
                )
Path(ZONE_FILE).write_text('\n'.join(lines) + '\n')
# Global registry
_registry = DNSRegistry()
def register_service(name: str, port: int, host: str | None = None) -> None:
"""Register this process as a service instance."""
if host is None:
host = socket.gethostbyname(socket.gethostname())
reg = ServiceRegistration(name=name, host=host, port=port)
_registry.register(reg)
# Auto-deregister on process exit
def cleanup(signum, frame):
print(f"\nDeregistering {name} at {host}:{port}")
_registry.deregister(name, host, port)
        raise SystemExit(0)
signal.signal(signal.SIGTERM, cleanup)
signal.signal(signal.SIGINT, cleanup)
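One caveat with the signal-only cleanup above: it misses exit paths that don't come through a signal (an unhandled exception, a plain sys.exit elsewhere). A belt-and-braces sketch that pairs atexit with the signal handlers and guarantees deregistration runs exactly once — deregister here is any zero-argument callable you supply:

```python
import atexit
import signal

def install_cleanup(deregister):
    """Run deregister() exactly once, whether the process exits
    normally, via SIGTERM, or via SIGINT."""
    done = {"ran": False}

    def cleanup():
        if not done["ran"]:
            done["ran"] = True
            deregister()

    # atexit catches normal interpreter shutdown; the signal
    # handlers catch docker stop (SIGTERM) and Ctrl-C (SIGINT)
    atexit.register(cleanup)

    def on_signal(signum, frame):
        cleanup()
        raise SystemExit(0)

    signal.signal(signal.SIGTERM, on_signal)
    signal.signal(signal.SIGINT, on_signal)
    return cleanup
```

Nothing catches SIGKILL, of course — which is exactly why the record's short TTL matters as a backstop.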
The Client: Discovering Services
# client.py
import os
import socket
import time
import dns.resolver
from dataclasses import dataclass

DNS_SERVER = os.getenv('DNS_SERVER', '127.0.0.1')
DNS_PORT = int(os.getenv('DNS_PORT', '5353'))

@dataclass
class Endpoint:
    host: str
    port: int
    priority: int
    weight: int

class ServiceDiscovery:
    """
    Resolves service endpoints via DNS SRV records.
    Caches results for TTL duration, re-queries when expired.
    """
    def __init__(self):
        self._resolver = dns.resolver.Resolver(configure=False)
        # dnspython expects IPs in nameservers, so resolve a hostname first
        self._resolver.nameservers = [socket.gethostbyname(DNS_SERVER)]
        self._resolver.port = DNS_PORT
        self._resolver.timeout = 2.0
        self._cache: dict[str, tuple[list[Endpoint], float]] = {}
def discover(self, service_name: str, zone: str = "services.local") -> list[Endpoint]:
cache_key = f"{service_name}.{zone}"
cached = self._cache.get(cache_key)
if cached:
endpoints, expires_at = cached
if time.monotonic() < expires_at:
return endpoints
# Cache miss or expired — query DNS
query_name = f"_{service_name}._tcp.{zone}"
try:
answer = self._resolver.resolve(query_name, 'SRV')
except dns.resolver.NXDOMAIN:
print(f"No service found: {service_name}")
return []
except dns.resolver.NoAnswer:
print(f"No SRV records for: {service_name}")
return []
endpoints = []
for rdata in answer:
host = str(rdata.target).rstrip('.')
endpoints.append(Endpoint(
host=host,
port=rdata.port,
priority=rdata.priority,
weight=rdata.weight,
))
# Sort: lowest priority first, highest weight first within same priority
endpoints.sort(key=lambda e: (e.priority, -e.weight))
# Cache for the TTL duration
self._cache[cache_key] = (endpoints, time.monotonic() + answer.ttl)
return endpoints
def get_primary(self, service_name: str, zone: str = "services.local") -> Endpoint | None:
endpoints = self.discover(service_name, zone)
return endpoints[0] if endpoints else None
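Note that get_primary always returns the top of the sorted list, which sends every request to one instance. RFC 2782 instead prescribes picking randomly within the lowest-priority group, with probability proportional to weight, so traffic spreads across instances. A sketch of that selection step (Endpoint is redeclared here so the snippet stands alone):

```python
import random
from dataclasses import dataclass

@dataclass
class Endpoint:
    host: str
    port: int
    priority: int
    weight: int

def pick_weighted(endpoints):
    """Pick one endpoint from the lowest-priority group, with
    probability proportional to weight (RFC 2782 style)."""
    if not endpoints:
        return None
    lowest = min(e.priority for e in endpoints)
    group = [e for e in endpoints if e.priority == lowest]
    total = sum(e.weight for e in group)
    if total == 0:
        return random.choice(group)  # all weights zero: uniform pick
    threshold = random.uniform(0, total)
    running = 0
    for e in group:
        running += e.weight
        if running >= threshold:
            return e
    return group[-1]  # float-rounding fallback
```

Dropping this in place of `endpoints[0]` gives you client-side load balancing for free, since the weights already live in the SRV records.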
# service.py — a simple service that registers itself and serves HTTP
import sys
import http.server
from registry import register_service
def run_service(name: str, port: int):
# Register in DNS
register_service(name, port, host="127.0.0.1")
# Serve simple HTTP
class Handler(http.server.BaseHTTPRequestHandler):
def do_GET(self):
self.send_response(200)
self.end_headers()
self.wfile.write(f"Hello from {name} on port {port}\n".encode())
def log_message(self, format, *args):
pass # Silence access logs
server = http.server.HTTPServer(('', port), Handler)
print(f"{name} listening on port {port}")
server.serve_forever()
if __name__ == '__main__':
name = sys.argv[1] if len(sys.argv) > 1 else 'my-service'
port = int(sys.argv[2]) if len(sys.argv) > 2 else 8080
run_service(name, port)
Running It
# Start everything
docker-compose up -d
# Watch DNS records update as services start
watch -n 1 'dig @127.0.0.1 -p 5353 _service-a._tcp.services.local SRV'
# Test discovery from within the network
# (client.py needs dnspython, which the slim image doesn't ship)
docker-compose exec service-a pip install dnspython
docker-compose exec -w /app service-a python3 -c "
from client import ServiceDiscovery
sd = ServiceDiscovery()
ep = sd.get_primary('service-b')
print(f'Found service-b at {ep.host}:{ep.port}' if ep else 'Not found')
"
# Stop a service and watch the record disappear
docker-compose stop service-b
# Within 30 seconds (the TTL), queries return NXDOMAIN
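That last point cuts both ways: after service-b stops, a client that queried recently keeps returning the cached endpoints until its own TTL timer runs out. The caching inside ServiceDiscovery reduces to this pattern — extracted here as a standalone toy class for clarity:

```python
import time

class TTLCache:
    """Minimal monotonic-clock TTL cache — the same
    (value, expires_at) tuple pattern ServiceDiscovery uses."""

    def __init__(self):
        self._store = {}

    def put(self, key, value, ttl: float) -> None:
        # Record when this entry stops being trustworthy
        self._store[key] = (value, time.monotonic() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: force a fresh DNS query
            return None
        return value
```

time.monotonic() rather than time.time() matters here: wall-clock adjustments (NTP steps, DST) would otherwise expire entries early or keep them alive too long.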
Project 2: DNS TTL Failover Detector (Go)
This monitors a hostname's DNS records, detects changes, and calls a handler when the set of IPs changes. Useful for building failover-aware clients that react when DNS-based failover triggers.
// failover_detector.go
package main
import (
"context"
"fmt"
"net"
"sort"
"strings"
"time"
)
// ChangeEvent describes a DNS change
type ChangeEvent struct {
Hostname string
Before []string
After []string
Timestamp time.Time
}
func (e ChangeEvent) String() string {
return fmt.Sprintf(
"[%s] %s: [%s] -> [%s]",
e.Timestamp.Format("15:04:05"),
e.Hostname,
strings.Join(e.Before, ", "),
strings.Join(e.After, ", "),
)
}
// ChangeHandler is called when DNS records change
type ChangeHandler func(event ChangeEvent)
// Monitor watches a hostname and calls handler when its A records change
type Monitor struct {
hostname string
interval time.Duration
handler ChangeHandler
current []string
resolver *net.Resolver
}
func NewMonitor(hostname string, interval time.Duration, handler ChangeHandler) *Monitor {
return &Monitor{
hostname: hostname,
interval: interval,
handler: handler,
resolver: net.DefaultResolver,
}
}
func (m *Monitor) resolve(ctx context.Context) ([]string, error) {
addrs, err := m.resolver.LookupHost(ctx, m.hostname)
if err != nil {
return nil, err
}
sort.Strings(addrs)
return addrs, nil
}
func strSliceEqual(a, b []string) bool {
if len(a) != len(b) {
return false
}
for i := range a {
if a[i] != b[i] {
return false
}
}
return true
}
func (m *Monitor) Run(ctx context.Context) error {
// Initial resolution
addrs, err := m.resolve(ctx)
if err != nil {
return fmt.Errorf("initial resolution failed for %s: %w", m.hostname, err)
}
m.current = addrs
fmt.Printf("[%s] Watching %s → %s\n",
time.Now().Format("15:04:05"),
m.hostname,
strings.Join(addrs, ", "),
)
ticker := time.NewTicker(m.interval)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return ctx.Err()
case <-ticker.C:
addrs, err := m.resolve(ctx)
if err != nil {
fmt.Printf("Resolution error for %s: %v\n", m.hostname, err)
continue
}
if !strSliceEqual(m.current, addrs) {
event := ChangeEvent{
Hostname: m.hostname,
Before: m.current,
After: addrs,
Timestamp: time.Now(),
}
m.current = addrs
m.handler(event)
}
}
}
}
// MultiMonitor watches multiple hostnames concurrently
type MultiMonitor struct {
monitors []*Monitor
}
func NewMultiMonitor(hostnames []string, interval time.Duration, handler ChangeHandler) *MultiMonitor {
monitors := make([]*Monitor, len(hostnames))
for i, h := range hostnames {
monitors[i] = NewMonitor(h, interval, handler)
}
return &MultiMonitor{monitors: monitors}
}
func (mm *MultiMonitor) Run(ctx context.Context) {
for _, m := range mm.monitors {
m := m // capture loop variable
go func() {
if err := m.Run(ctx); err != nil && err != context.Canceled {
fmt.Printf("Monitor error for %s: %v\n", m.hostname, err)
}
}()
}
<-ctx.Done()
}
func main() {
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
// Handle signals
go func() {
// In production: signal.NotifyContext or os/signal handling
time.Sleep(5 * time.Minute)
cancel()
}()
onchange := func(event ChangeEvent) {
fmt.Printf("DNS CHANGE DETECTED: %s\n", event)
// In production: update connection pool, alert on-call, log to metrics
// Example: trigger a reconnection to the new primary
fmt.Printf(" Action: updating connection pool to use %s\n",
strings.Join(event.After, ", "))
}
// Watch services for DNS changes, check every 10 seconds
// In production, set interval to match expected TTL
mm := NewMultiMonitor(
[]string{
"api.example.com",
"db.example.com",
},
10*time.Second,
onchange,
)
mm.Run(ctx)
}
Testing the Failover Detector
To see it react to a DNS change, you need something that actually changes. One caveat first: main() above watches api.example.com through net.DefaultResolver, which only queries the system's configured nameservers. To test against the CoreDNS setup from Project 1, point the Monitor at a name in the services.local zone and give it a custom net.Resolver whose Dial function targets 127.0.0.1:5353. With that in place:
# Terminal 1: Run the detector
go run failover_detector.go
# Terminal 2: Simulate a failover by updating the zone
# Change the A records from 127.0.0.1 to 127.0.0.2
sed -i 's/127\.0\.0\.1/127.0.0.2/' coredns/services.local.zone
# CoreDNS reloads the zone every 5s (per the Corefile)
# Within one polling interval (10 seconds), the detector fires:
# DNS CHANGE DETECTED: [15:04:23] service-a.services.local: [127.0.0.1] -> [127.0.0.2]
Making the Detector TTL-Aware
The polling interval above is fixed. A production version should respect the DNS TTL:
func (m *Monitor) RunWithTTL(ctx context.Context) error {
for {
ctx2, cancel := context.WithTimeout(ctx, 5*time.Second)
addrs, ttl, err := m.resolveWithTTL(ctx2)
cancel()
if err != nil {
// Back off and retry
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(5 * time.Second):
continue
}
}
if !strSliceEqual(m.current, addrs) && len(m.current) > 0 {
event := ChangeEvent{
Hostname: m.hostname,
Before: m.current,
After: addrs,
Timestamp: time.Now(),
}
m.handler(event)
}
m.current = addrs
// Wait until just before TTL expires, then re-query
waitDuration := time.Duration(ttl) * time.Second
if waitDuration < 5*time.Second {
waitDuration = 5 * time.Second // Don't hammer DNS
}
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(waitDuration):
}
}
}
func (m *Monitor) resolveWithTTL(ctx context.Context) ([]string, uint32, error) {
// Use miekg/dns for TTL visibility
// (simplified — see lesson 01 for full implementation)
addrs, err := m.resolver.LookupHost(ctx, m.hostname)
if err != nil {
return nil, 0, err
}
sort.Strings(addrs)
return addrs, 30, nil // Replace 30 with actual TTL from miekg/dns
}
What You've Built
Project 1 gives you:
- A service registry that writes SRV records to DNS
- Service instances that self-register on start and deregister on SIGTERM
- A discovery client that queries SRV records and respects TTL-based caching
- Everything running in Docker, testable locally
Project 2 gives you:
- A multi-host DNS change monitor
- Configurable change handlers for automated failover responses
- TTL-aware polling that minimizes unnecessary DNS queries
- A pattern you can adapt for database failover detection, CDN origin monitoring, or service health tracking
Both use real DNS protocols on real resolvers. No mocking, no stubs.
Key Takeaways
- SRV records are the right building block for DNS-based service discovery: they encode host, port, priority, and weight in a single query
- Self-registration (service writes its own DNS record on startup, removes it on shutdown) is simpler than a central registry for small systems
- TTL-based caching in your discovery client is the difference between one DNS query per minute and one per request
- The failover detector pattern is broadly useful: database primary detection, CDN origin health, multi-region failover, load balancer membership changes
- CoreDNS's reload directive (re-checking the zone file every N seconds) makes local development match production behavior closely
Wrapping Up
That's Module 3. You've gone from getaddrinfo() to a working service discovery system. Module 4 covers DNS security operations: incident response, monitoring for DNS hijacking, and operating DNSSEC at scale.