Module 9 · Lesson 2
DNS in DevOps Workflows
Infrastructure as code, GitOps for DNS, CI/CD integration, blue/green deployments, and monitoring. Real Terraform HCL and OctoDNS examples.
DNS in DevOps Workflows
DNS management doesn't stop being a manual, console-clicking operation because someone decided to call it "infrastructure." In most organizations, it stays manual long after everything else has been automated — because DNS feels different, feels risky, feels like the kind of thing you don't want to run through a pipeline.
That feeling is understandable. It's also wrong. Manual DNS changes are how you get typos in production, undocumented one-off records that nobody remembers creating, and drift between what's in your runbooks and what's actually deployed.
This lesson covers how DNS fits into modern infrastructure workflows: IaC, GitOps, CI/CD, and monitoring integration.
Infrastructure as Code: DNS with Terraform
Terraform has mature providers for all the major DNS platforms. The approach is straightforward: your DNS records live in .tf files, version-controlled alongside the infrastructure they support, reviewed in PRs, and applied via a pipeline.
Route 53 Example
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# Reference an existing hosted zone
data "aws_route53_zone" "primary" {
  name         = "example.com."
  private_zone = false
}

# A record for the apex domain
resource "aws_route53_record" "apex" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "example.com"
  type    = "A"

  alias {
    name                   = aws_lb.main.dns_name
    zone_id                = aws_lb.main.zone_id
    evaluate_target_health = true
  }
}

# CNAME for www
resource "aws_route53_record" "www" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "www.example.com"
  type    = "CNAME"
  ttl     = 300
  records = ["example.com"]
}

# MX records
resource "aws_route53_record" "mx" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "example.com"
  type    = "MX"
  ttl     = 3600
  records = [
    "10 mail1.example.com",
    "20 mail2.example.com"
  ]
}

# SPF
resource "aws_route53_record" "spf" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "example.com"
  type    = "TXT"
  ttl     = 3600
  records = ["v=spf1 include:_spf.google.com include:sendgrid.net ~all"]
}

# DMARC
resource "aws_route53_record" "dmarc" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "_dmarc.example.com"
  type    = "TXT"
  ttl     = 3600
  records = ["v=DMARC1; p=reject; rua=mailto:dmarc-reports@example.com; ruf=mailto:dmarc-failures@example.com; fo=1"]
}
Cloudflare Example
terraform {
  required_providers {
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~> 4.0"
    }
  }
}

provider "cloudflare" {
  api_token = var.cloudflare_api_token
}

data "cloudflare_zone" "example" {
  name = "example.com"
}

resource "cloudflare_record" "apex" {
  zone_id = data.cloudflare_zone.example.id
  name    = "@"
  value   = "203.0.113.10"
  type    = "A"
  ttl     = 1 # 1 = automatic (Cloudflare-managed)
  proxied = true
}

resource "cloudflare_record" "www" {
  zone_id = data.cloudflare_zone.example.id
  name    = "www"
  value   = "example.com"
  type    = "CNAME"
  ttl     = 1
  proxied = true
}

# API subdomain — not proxied, direct DNS
resource "cloudflare_record" "api" {
  zone_id = data.cloudflare_zone.example.id
  name    = "api"
  value   = "203.0.113.20"
  type    = "A"
  ttl     = 300
  proxied = false
}
What IaC for DNS Gives You
- Audit trail. Every change is a commit. Who changed what TTL, when, why — it's in git history.
- Review process. DNS changes go through the same PR review as application code. A teammate catches the typo in the SPF record before it deploys.
- Drift detection. terraform plan against your live DNS shows you if someone made a manual change through the console.
- Reproducibility. Spin up an identical staging environment, including DNS records, with the same configuration.
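Drift detection works best when it runs on a schedule rather than waiting for the next deploy. A minimal GitHub Actions sketch — the workflow name, schedule, and action versions are assumptions, not part of the lesson's repository:

```yaml
name: dns-drift-check
on:
  schedule:
    - cron: "0 6 * * *"   # daily; adjust to taste
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      # -detailed-exitcode: exit 0 = no drift, 2 = pending changes (fails the job), 1 = error
      - run: terraform plan -detailed-exitcode -input=false
```

With -detailed-exitcode, terraform plan exits 2 when live state has diverged from the code, so any manual console edit fails the job and surfaces in CI.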
The Practical Tradeoff
Terraform's DNS management works well for records that change infrequently. For highly dynamic records — health-check-driven routing, per-deployment feature flags via DNS — the Terraform apply cycle may be too slow. In those cases, you'd use the DNS provider's API directly or a more dynamic tool.
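For that dynamic case, a direct API call can be a few lines of shell around the AWS CLI. A sketch, assuming a hypothetical hosted zone ID and record values; the helper just builds the JSON change batch:

```shell
# Build a Route 53 UPSERT change batch for a single A record.
make_upsert() {
  name=$1; ip=$2; ttl=$3
  printf '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"A","TTL":%s,"ResourceRecords":[{"Value":"%s"}]}}]}' \
    "$name" "$ttl" "$ip"
}

# Apply it directly, skipping the Terraform plan/apply cycle
# (requires AWS credentials; Z0000EXAMPLE is a placeholder zone ID):
# aws route53 change-resource-record-sets --hosted-zone-id Z0000EXAMPLE \
#   --change-batch "$(make_upsert app.example.com 203.0.113.30 60)"
```

The tradeoff: changes made this way are invisible to Terraform until the next plan, which is exactly the drift that scheduled plan runs are there to catch.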
GitOps for DNS: OctoDNS
OctoDNS treats DNS zone data as source-of-truth configuration files, with support for syncing to multiple providers simultaneously. This is the right tool when you want multi-provider DNS managed from a single source.
Basic OctoDNS Configuration
Directory structure:
dns/
  config/
    octodns.yaml
  zones/
    example.com.yaml
octodns.yaml:
providers:
  config:
    class: octodns.provider.yaml.YamlProvider
    directory: ./zones
    default_ttl: 300
    enforce_order: true
  route53:
    class: octodns_route53.Route53Provider
    access_key_id: env/AWS_ACCESS_KEY_ID
    secret_access_key: env/AWS_SECRET_ACCESS_KEY
  cloudflare:
    class: octodns_cloudflare.CloudflareProvider
    token: env/CLOUDFLARE_TOKEN

zones:
  example.com.:
    sources:
      - config
    targets:
      - route53
      - cloudflare # secondary provider, receives same records
zones/example.com.yaml:
---
# Apex records: YAML forbids duplicate keys, so the A, MX, and TXT
# records all live under a single '' key as a list
'':
  - ttl: 300
    type: A
    values:
      - 203.0.113.10
  - ttl: 3600
    type: MX
    values:
      - priority: 10
        value: mail1.example.com.
      - priority: 20
        value: mail2.example.com.
  # SPF
  - ttl: 3600
    type: TXT
    value: "v=spf1 include:_spf.google.com ~all"

# DMARC (note: OctoDNS requires semicolons in TXT values escaped as \;)
_dmarc:
  ttl: 3600
  type: TXT
  value: "v=DMARC1\\; p=reject\\; rua=mailto:dmarc-reports@example.com"

# www CNAME
www:
  ttl: 300
  type: CNAME
  value: example.com.
Run a dry-run sync:
octodns-sync --config-file config/octodns.yaml
Without --doit, OctoDNS prints what it would change without applying anything. Add --doit to actually apply. Always run the dry run first.
Workflow Integration
In practice, the OctoDNS repository becomes the source of truth. The CI pipeline:
- PR opens with DNS change
- CI runs octodns-sync in dry-run mode (no --doit) and posts the planned changes as a PR comment
- Reviewer sees exactly which records will change before approving
- Merge triggers octodns-sync --doit against both providers
- Post-sync validation script queries both providers to confirm records are correct
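That pipeline sketched as a GitHub Actions workflow — the job layout, paths, and secret names are assumptions for illustration, not fixed conventions:

```yaml
name: dns-sync
on:
  pull_request:
    paths: ["dns/**"]
  push:
    branches: [main]
    paths: ["dns/**"]
jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install octodns octodns-route53 octodns-cloudflare
      # Dry run: prints the plan, applies nothing
      - run: cd dns && octodns-sync --config-file config/octodns.yaml
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          CLOUDFLARE_TOKEN: ${{ secrets.CLOUDFLARE_TOKEN }}
  apply:
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install octodns octodns-route53 octodns-cloudflare
      # Merge to main applies against both providers
      - run: cd dns && octodns-sync --config-file config/octodns.yaml --doit
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          CLOUDFLARE_TOKEN: ${{ secrets.CLOUDFLARE_TOKEN }}
```

Posting the dry-run output as a PR comment takes one more step with a comment action, omitted here for brevity.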
External-DNS for Kubernetes
If you're running Kubernetes, external-dns bridges the gap between Kubernetes service objects and your DNS provider. It watches for Services and Ingress resources with annotations and creates the corresponding DNS records automatically.
Deployment Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
spec:
  replicas: 1
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      serviceAccountName: external-dns
      containers:
        - name: external-dns
          image: registry.k8s.io/external-dns/external-dns:v0.14.0
          args:
            - --source=service
            - --source=ingress
            - --domain-filter=example.com
            - --provider=cloudflare
            - --cloudflare-proxied
            - --policy=upsert-only # never delete records automatically
            - --txt-owner-id=my-cluster
          env:
            - name: CF_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: cloudflare-credentials
                  key: api-token
An Ingress resource that external-dns will pick up:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    external-dns.alpha.kubernetes.io/hostname: app.example.com
    external-dns.alpha.kubernetes.io/ttl: "300"
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
When this Ingress is created, external-dns creates the corresponding DNS record in Cloudflare. When the Ingress is deleted, external-dns removes it. The --policy=upsert-only flag is important in early adoption — it prevents automated deletions until you trust the system.
The Container DNS Problem
One thing external-dns doesn't solve: how your containers resolve DNS internally. This is a different problem, covered in Module 3.
The short version: Kubernetes's default ndots:5 resolver configuration means that a query for api.example.com from inside a pod will attempt:
- api.example.com.default.svc.cluster.local
- api.example.com.svc.cluster.local
- api.example.com.cluster.local
- api.example.com (the actual hostname, tried last)
The first three lookups fail before the real name is even tried. This adds latency and load to CoreDNS. Fix it:
# In your pod spec
dnsConfig:
  options:
    - name: ndots
      value: "1"
Or use fully qualified domain names (trailing dot) in your service URLs when calling external services.
CI/CD: DNS Changes in Deployment Pipelines
Blue/Green Deployments via DNS
The classic blue/green deployment uses DNS TTL manipulation to control traffic shifting. The DNS layer is the switch:
- Two environments running: blue (current production) and green (new version)
- app.example.com A record pointing to blue, TTL 300s
- Pre-deployment: lower TTL to 60s, wait for current TTL to expire across resolvers
- Deploy to green, validate
- Change A record to point to green
- Wait 60s for propagation
- Monitor error rates on green
- If healthy: raise TTL back to 300s; blue environment can be updated
- If unhealthy: revert the A record to blue; recovery is 60 seconds
The critical step is waiting for the old TTL to expire before making the switch. If you lower TTL from 3600s to 60s and immediately change the A record, some resolvers will cache the old A record for up to 59 more minutes.
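You can verify the old TTL has drained by watching it count down on a public resolver rather than trusting a fixed sleep. A small sketch; the parsing assumes the standard dig answer-line format, and the resolver address is an example:

```shell
# Pull the remaining cached TTL (second column) from a dig answer line, e.g.:
# "app.example.com.  42  IN  A  203.0.113.10"
remaining_ttl() {
  echo "$1" | awk '$4 == "A" {print $2; exit}'
}

# Usage against a live resolver (network call, commented out):
# ttl=$(remaining_ttl "$(dig +noall +answer app.example.com @8.8.8.8)")
# echo "resolver still caching the old record for ${ttl}s"
```

A low number across several resolvers means the cache has mostly turned over and the switch is safe to make.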
A GitHub Actions workflow fragment for this:
- name: Lower TTL pre-deployment
  run: |
    # Build the change batch in a file so $BLUE_IP actually expands
    # (inside a single-quoted JSON string it would be passed literally)
    cat > change-batch.json <<EOF
    {
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "app.example.com",
          "Type": "A",
          "TTL": 60,
          "ResourceRecords": [{"Value": "$BLUE_IP"}]
        }
      }]
    }
    EOF
    aws route53 change-resource-record-sets \
      --hosted-zone-id "$ZONE_ID" \
      --change-batch file://change-batch.json

    # Wait for old TTL (300s) to expire
    echo "Waiting 300 seconds for TTL propagation..."
    sleep 300

- name: Switch to green
  run: |
    cat > change-batch.json <<EOF
    {
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "app.example.com",
          "Type": "A",
          "TTL": 60,
          "ResourceRecords": [{"Value": "$GREEN_IP"}]
        }
      }]
    }
    EOF
    aws route53 change-resource-record-sets \
      --hosted-zone-id "$ZONE_ID" \
      --change-batch file://change-batch.json
Secrets and DNS: Don't Leak Your Internal Topology
DNS can leak infrastructure information that you might not want public. When you set up Terraform or OctoDNS, think about what you're publishing:
- Internal service names in public DNS (db.example.com pointing to a private IP)
- Subdomain names that reveal your technology stack (wordpress.example.com, jenkins.example.com)
- Development/staging endpoints with non-trivial data (staging-with-real-data.example.com)
The pattern to use is split-horizon DNS: different views of the DNS namespace for internal vs external resolvers.
In Terraform with Route 53:
# Public zone — what the internet sees
resource "aws_route53_zone" "public" {
  name = "example.com"
}

# Private zone — what your VPC sees
resource "aws_route53_zone" "private" {
  name = "example.com"

  vpc {
    vpc_id = aws_vpc.main.id
  }
}

# Public: only the app-facing records. The ALB endpoint is a DNS name,
# not an IP, so this must be an alias record rather than a plain A record.
resource "aws_route53_record" "app_public" {
  zone_id = aws_route53_zone.public.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.public.dns_name
    zone_id                = aws_lb.public.zone_id
    evaluate_target_health = true
  }
}

# Private: internal service discovery. RDS and ElastiCache endpoints
# are hostnames, so these are CNAMEs, not A records.
resource "aws_route53_record" "db_private" {
  zone_id = aws_route53_zone.private.zone_id
  name    = "db.example.com"
  type    = "CNAME"
  ttl     = 60
  records = [aws_db_instance.main.address]
}

resource "aws_route53_record" "cache_private" {
  zone_id = aws_route53_zone.private.zone_id
  name    = "cache.example.com"
  type    = "CNAME"
  ttl     = 60
  records = [aws_elasticache_cluster.main.cache_nodes[0].address]
}
This way, db.example.com resolves correctly inside your VPC but doesn't exist in public DNS. The internal topology stays internal.
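It's worth verifying the split from both sides rather than assuming it. A sketch — the resolver addresses are placeholders (the VPC resolver is typically the VPC CIDR base plus two):

```shell
# Succeeds when a dig response header reports NXDOMAIN.
is_nxdomain() {
  echo "$1" | grep -q 'status: NXDOMAIN'
}

# From inside the VPC (hypothetical resolver 10.0.0.2), db.example.com
# should resolve; from a public resolver it should not exist at all:
# dig +short db.example.com @10.0.0.2
# is_nxdomain "$(dig db.example.com @8.8.8.8 +noall +comments)" \
#   && echo "internal name not leaked"
```

Running the public-side check in CI after DNS changes catches the day someone accidentally adds an internal record to the public zone.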
Monitoring Integration
Prometheus + DNS
DNS servers expose Prometheus metrics either natively (CoreDNS, dnsdist) or through exporters (bind_exporter for BIND). But for most teams, the more useful monitoring is at the application/infrastructure level:
- NXDOMAIN rate spike = something is querying for names that don't exist (could be a misconfigured service, could be a sign of malware)
- SERVFAIL rate spike = authoritative server issues
- Resolution latency P95 above threshold = resolver degradation
- Query volume drop = traffic loss (if your app is querying DNS and suddenly stops, something upstream broke)
A Prometheus alert rule for NXDOMAIN rate:
groups:
  - name: dns_alerts
    rules:
      - alert: HighNXDOMAINRate
        expr: rate(coredns_dns_responses_total{rcode="NXDOMAIN"}[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High NXDOMAIN rate on CoreDNS"
          description: "NXDOMAIN responses are above 10/s for 2 minutes. Check for misconfigured services."
      - alert: DNSResolutionLatencyHigh
        expr: histogram_quantile(0.95, rate(coredns_dns_request_duration_seconds_bucket[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DNS resolution P95 latency above 100ms"
External DNS Health Checks
For authoritative DNS, monitoring from inside your infrastructure isn't enough — you need external validation that your domains are resolving correctly from the internet. Services like Catchpoint, Pingdom, or Uptime Robot can send DNS queries to your authoritative servers from multiple global locations and alert on failures or answer mismatches.
At minimum, configure external health checks for:
- Your primary domain's A record
- MX records (email delivery depends on this)
- SOA serial checks to verify all providers are serving the same zone version, if you run multi-provider DNS
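The multi-provider serial check is a one-liner per provider. A sketch — the nameserver hostnames are placeholders, and dig +short SOA prints mname, rname, then the serial:

```shell
# Extract the serial (third field) from a dig +short SOA answer, e.g.:
# "ns1.example.com. hostmaster.example.com. 2024010101 7200 900 1209600 86400"
soa_serial() {
  echo "$1" | awk '{print $3}'
}

# Compare across providers (network calls, hypothetical nameservers):
# s1=$(soa_serial "$(dig +short SOA example.com @ns-1.awsdns-00.com)")
# s2=$(soa_serial "$(dig +short SOA example.com @ada.ns.cloudflare.com)")
# [ "$s1" = "$s2" ] || echo "ALERT: providers out of sync ($s1 vs $s2)"
```

Note that API-synced multi-provider setups (like the OctoDNS configuration above) may legitimately show different serials per provider; in that case compare record contents rather than serials.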
Key Takeaways
- DNS-as-code with Terraform or OctoDNS gives you the same audit trail, review process, and reproducibility you expect from application deployments. Manual DNS changes are tech debt.
- OctoDNS is the right tool for multi-provider authoritative DNS managed from a single source of truth.
- External-dns handles the Kubernetes integration problem, but doesn't fix the container resolver configuration problem — those need separate attention.
- Blue/green via DNS TTL works reliably if you respect the TTL expiry window before switching.
- Split-horizon DNS in IaC keeps your internal topology out of public DNS. This is a security decision, not just an architecture preference.
- Monitor NXDOMAIN and SERVFAIL rates. Both are early signals for problems that will become customer-visible if left unaddressed.
Up Next
Lesson 3 — Best Practices Synthesis: Every key recommendation from the course, organized by role — developer, ops/SRE, and domain manager.