Module 9 · Lesson 2
DNS in DevOps Workflows
Infrastructure as code, GitOps for DNS, CI/CD integration, blue/green deployments, and monitoring. Real Terraform HCL and OctoDNS examples.
DNS in DevOps Workflows
DNS management doesn't stop being a manual, console-clicking operation because someone decided to call it "infrastructure." In most organizations, it stays manual long after everything else has been automated — because DNS feels different, feels risky, feels like the kind of thing you don't want to run through a pipeline.
That feeling is understandable. It's also wrong. Manual DNS changes are how you get typos in production, undocumented one-off records that nobody remembers creating, and drift between what's in your runbooks and what's actually deployed.
This lesson covers how DNS fits into modern infrastructure workflows: IaC, GitOps, CI/CD, and monitoring integration.
Infrastructure as Code: DNS with Terraform
Terraform has mature providers for all the major DNS platforms. The approach is straightforward: your DNS records live in .tf files, version-controlled alongside the infrastructure they support, reviewed in PRs, and applied via a pipeline.
Route 53 Example
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# Reference an existing hosted zone
data "aws_route53_zone" "primary" {
  name         = "example.com."
  private_zone = false
}

# A record for the apex domain
resource "aws_route53_record" "apex" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "example.com"
  type    = "A"

  alias {
    name                   = aws_lb.main.dns_name
    zone_id                = aws_lb.main.zone_id
    evaluate_target_health = true
  }
}

# CNAME for www
resource "aws_route53_record" "www" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "www.example.com"
  type    = "CNAME"
  ttl     = 300
  records = ["example.com"]
}

# MX records
resource "aws_route53_record" "mx" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "example.com"
  type    = "MX"
  ttl     = 3600
  records = [
    "10 mail1.example.com",
    "20 mail2.example.com"
  ]
}

# SPF
resource "aws_route53_record" "spf" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "example.com"
  type    = "TXT"
  ttl     = 3600
  records = ["v=spf1 include:_spf.google.com include:sendgrid.net ~all"]
}

# DMARC
resource "aws_route53_record" "dmarc" {
  zone_id = data.aws_route53_zone.primary.zone_id
  name    = "_dmarc.example.com"
  type    = "TXT"
  ttl     = 3600
  records = ["v=DMARC1; p=reject; rua=mailto:dmarc-reports@example.com; ruf=mailto:dmarc-failures@example.com; fo=1"]
}
Cloudflare Example
terraform {
  required_providers {
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~> 4.0"
    }
  }
}

provider "cloudflare" {
  api_token = var.cloudflare_api_token
}

data "cloudflare_zone" "example" {
  name = "example.com"
}

resource "cloudflare_record" "apex" {
  zone_id = data.cloudflare_zone.example.id
  name    = "@"
  value   = "203.0.113.10"
  type    = "A"
  ttl     = 1 # 1 = automatic (Cloudflare-managed)
  proxied = true
}

resource "cloudflare_record" "www" {
  zone_id = data.cloudflare_zone.example.id
  name    = "www"
  value   = "example.com"
  type    = "CNAME"
  ttl     = 1
  proxied = true
}

# API subdomain — not proxied, direct DNS
resource "cloudflare_record" "api" {
  zone_id = data.cloudflare_zone.example.id
  name    = "api"
  value   = "203.0.113.20"
  type    = "A"
  ttl     = 300
  proxied = false
}
What IaC for DNS Gives You
- Audit trail. Every change is a commit. Who changed what TTL, when, why — it's in git history.
- Review process. DNS changes go through the same PR review as application code. A teammate catches the typo in the SPF record before it deploys.
- Drift detection. terraform plan against your live DNS shows you if someone made a manual change through the console.
- Reproducibility. Spin up an identical staging environment, including DNS records, with the same configuration.
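Drift detection works best when it runs on a schedule rather than waiting for the next deploy. A minimal GitHub Actions sketch — the workflow name, schedule, and action versions are assumptions, not part of the lesson's repository:

```yaml
name: dns-drift-check
on:
  schedule:
    - cron: "0 6 * * *"   # daily; adjust to taste
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      # -detailed-exitcode: exit 0 = no drift, 2 = pending changes (fails the job), 1 = error
      - run: terraform plan -detailed-exitcode -input=false
```

With -detailed-exitcode, terraform plan exits 2 when live state has diverged from the code, so any manual console edit fails the job and surfaces in CI.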
The Practical Tradeoff
Terraform's DNS management works well for records that change infrequently. For highly dynamic records — health-check-driven routing, per-deployment feature flags via DNS — the Terraform apply cycle may be too slow. In those cases, you'd use the DNS provider's API directly or a more dynamic tool.
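For that dynamic case, a direct API call can be a few lines of shell around the AWS CLI. A sketch, assuming a hypothetical hosted zone ID and record values; the helper just builds the JSON change batch:

```shell
# Build a Route 53 UPSERT change batch for a single A record.
make_upsert() {
  name=$1; ip=$2; ttl=$3
  printf '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"A","TTL":%s,"ResourceRecords":[{"Value":"%s"}]}}]}' \
    "$name" "$ttl" "$ip"
}

# Apply it directly, skipping the Terraform plan/apply cycle
# (requires AWS credentials; Z0000EXAMPLE is a placeholder zone ID):
# aws route53 change-resource-record-sets --hosted-zone-id Z0000EXAMPLE \
#   --change-batch "$(make_upsert app.example.com 203.0.113.30 60)"
```

The tradeoff: changes made this way are invisible to Terraform until the next plan, which is exactly the drift that scheduled plan runs are there to catch.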
GitOps for DNS: OctoDNS
OctoDNS treats DNS zone data as source-of-truth configuration files, with support for syncing to multiple providers simultaneously. This is the right tool when you want multi-provider DNS managed from a single source.
Basic OctoDNS Configuration
Directory structure:
dns/
  config/
    octodns.yaml
  zones/
    example.com.yaml
octodns.yaml:
providers:
  config:
    class: octodns.provider.yaml.YamlProvider
    directory: ./zones
    default_ttl: 300
    enforce_order: true
  route53:
    class: octodns_route53.Route53Provider
    access_key_id: env/AWS_ACCESS_KEY_ID
    secret_access_key: env/AWS_SECRET_ACCESS_KEY
  cloudflare:
    class: octodns_cloudflare.CloudflareProvider
    token: env/CLOUDFLARE_TOKEN

zones:
  example.com.:
    sources:
      - config
    targets:
      - route53
      - cloudflare # secondary provider, receives same records
zones/example.com.yaml:
---
# Apex records: YAML forbids duplicate keys, so the A, MX, and TXT
# records all live under a single '' key as a list
'':
  - ttl: 300
    type: A
    values:
      - 203.0.113.10
  - ttl: 3600
    type: MX
    values:
      - priority: 10
        value: mail1.example.com.
      - priority: 20
        value: mail2.example.com.
  # SPF
  - ttl: 3600
    type: TXT
    value: "v=spf1 include:_spf.google.com ~all"

# DMARC (note: OctoDNS requires semicolons in TXT values escaped as \;)
_dmarc:
  ttl: 3600
  type: TXT
  value: "v=DMARC1\\; p=reject\\; rua=mailto:dmarc-reports@example.com"

# www CNAME
www:
  ttl: 300
  type: CNAME
  value: example.com.
Run a dry-run sync:
octodns-sync --config-file config/octodns.yaml
Without --doit, OctoDNS prints what it would change without applying anything. Add --doit to actually apply. Always run the dry run first.
Workflow Integration
In practice, the OctoDNS repository becomes the source of truth. The CI pipeline:
- PR opens with DNS change
- CI runs octodns-sync in dry-run mode (no --doit) and posts the planned changes as a PR comment
- Reviewer sees exactly which records will change before approving
- Merge triggers octodns-sync --doit against both providers
- Post-sync validation script queries both providers to confirm records are correct
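That pipeline sketched as a GitHub Actions workflow — the job layout, paths, and secret names are assumptions for illustration, not fixed conventions:

```yaml
name: dns-sync
on:
  pull_request:
    paths: ["dns/**"]
  push:
    branches: [main]
    paths: ["dns/**"]
jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install octodns octodns-route53 octodns-cloudflare
      # Dry run: prints the plan, applies nothing
      - run: cd dns && octodns-sync --config-file config/octodns.yaml
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          CLOUDFLARE_TOKEN: ${{ secrets.CLOUDFLARE_TOKEN }}
  apply:
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install octodns octodns-route53 octodns-cloudflare
      # Merge to main applies against both providers
      - run: cd dns && octodns-sync --config-file config/octodns.yaml --doit
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          CLOUDFLARE_TOKEN: ${{ secrets.CLOUDFLARE_TOKEN }}
```

Posting the dry-run output as a PR comment takes one more step with a comment action, omitted here for brevity.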
External-DNS for Kubernetes
If you're running Kubernetes, external-dns bridges the gap between Kubernetes service objects and your DNS provider. It watches for Services and Ingress resources with annotations and creates the corresponding DNS records automatically.
Deployment Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
spec:
  replicas: 1
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      serviceAccountName: external-dns
      containers:
        - name: external-dns
          image: registry.k8s.io/external-dns/external-dns:v0.14.0
          args:
            - --source=service
            - --source=ingress
            - --domain-filter=example.com
            - --provider=cloudflare
            - --cloudflare-proxied
            - --policy=upsert-only # never delete records automatically
            - --txt-owner-id=my-cluster
          env:
            - name: CF_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: cloudflare-credentials
                  key: api-token
An Ingress resource that external-dns will pick up:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    external-dns.alpha.kubernetes.io/hostname: app.example.com
    external-dns.alpha.kubernetes.io/ttl: "300"
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
When this Ingress is created, external-dns creates the corresponding DNS record in Cloudflare. When the Ingress is deleted, external-dns removes it. The --policy=upsert-only flag is important in early adoption — it prevents automated deletions until you trust the system.
The Container DNS Problem
One thing external-dns doesn't solve: how your containers resolve DNS internally. This is a different problem, covered in Module 3.
The short version: Kubernetes's default ndots:5 resolver configuration means that a query for api.example.com from inside a pod will attempt:
- api.example.com.default.svc.cluster.local
- api.example.com.svc.cluster.local
- api.example.com.cluster.local
- api.example.com (the actual hostname, tried last)
The first three lookups fail before the real name is even tried. This adds latency and load to CoreDNS. Fix it:
# In your pod spec
dnsConfig:
  options:
    - name: ndots
      value: "1"
Or use fully qualified domain names (trailing dot) in your service URLs when calling external services.
CI/CD: DNS Changes in Deployment Pipelines
Blue/Green Deployments via DNS
The classic blue/green deployment uses DNS TTL manipulation to control traffic shifting. The DNS layer is the switch:
- Two environments running: blue (current production) and green (new version)
- app.example.com A record pointing to blue, TTL 300s
- Pre-deployment: lower TTL to 60s, wait for current TTL to expire across resolvers
- Deploy to green, validate
- Change A record to point to green
- Wait 60s for propagation
- Monitor error rates on green
- If healthy: raise TTL back to 300s; blue environment can be updated
- If unhealthy: revert the A record to blue; recovery is 60 seconds
The critical step is waiting for the old TTL to expire before making the switch. If you lower TTL from 3600s to 60s and immediately change the A record, some resolvers will cache the old A record for up to 59 more minutes.
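You can verify the old TTL has drained by watching it count down on a public resolver rather than trusting a fixed sleep. A small sketch; the parsing assumes the standard dig answer-line format, and the resolver address is an example:

```shell
# Pull the remaining cached TTL (second column) from a dig answer line, e.g.:
# "app.example.com.  42  IN  A  203.0.113.10"
remaining_ttl() {
  echo "$1" | awk '$4 == "A" {print $2; exit}'
}

# Usage against a live resolver (network call, commented out):
# ttl=$(remaining_ttl "$(dig +noall +answer app.example.com @8.8.8.8)")
# echo "resolver still caching the old record for ${ttl}s"
```

A low number across several resolvers means the cache has mostly turned over and the switch is safe to make.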
A GitHub Actions workflow fragment for this:
- name: Lower TTL pre-deployment
  run: |
    # Build the change batch in a file so $BLUE_IP actually expands
    # (inside a single-quoted JSON string it would be passed literally)
    cat > change-batch.json <<EOF
    {
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "app.example.com",
          "Type": "A",
          "TTL": 60,
          "ResourceRecords": [{"Value": "$BLUE_IP"}]
        }
      }]
    }
    EOF
    aws route53 change-resource-record-sets \
      --hosted-zone-id "$ZONE_ID" \
      --change-batch file://change-batch.json

    # Wait for old TTL (300s) to expire
    echo "Waiting 300 seconds for TTL propagation..."
    sleep 300

- name: Switch to green
  run: |
    cat > change-batch.json <<EOF
    {
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "app.example.com",
          "Type": "A",
          "TTL": 60,
          "ResourceRecords": [{"Value": "$GREEN_IP"}]
        }
      }]
    }
    EOF
    aws route53 change-resource-record-sets \
      --hosted-zone-id "$ZONE_ID" \
      --change-batch file://change-batch.json
Secrets and DNS: Don't Leak Your Internal Topology
DNS can leak infrastructure information that you might not want public. When you set up Terraform or OctoDNS, think about what you're publishing:
- Internal service names in public DNS (db.example.com pointing to a private IP)
- Subdomain names that reveal your technology stack (wordpress.example.com, jenkins.example.com)
- Development/staging endpoints with non-trivial data (staging-with-real-data.example.com)
The pattern to use is split-horizon DNS: different views of the DNS namespace for internal vs external resolvers.
In Terraform with Route 53:
# Public zone — what the internet sees
resource "aws_route53_zone" "public" {
  name = "example.com"
}

# Private zone — what your VPC sees
resource "aws_route53_zone" "private" {
  name = "example.com"

  vpc {
    vpc_id = aws_vpc.main.id
  }
}

# Public: only the app-facing records. The ALB endpoint is a DNS name,
# not an IP, so this must be an alias record rather than a plain A record.
resource "aws_route53_record" "app_public" {
  zone_id = aws_route53_zone.public.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.public.dns_name
    zone_id                = aws_lb.public.zone_id
    evaluate_target_health = true
  }
}

# Private: internal service discovery. RDS and ElastiCache endpoints
# are hostnames, so these are CNAMEs, not A records.
resource "aws_route53_record" "db_private" {
  zone_id = aws_route53_zone.private.zone_id
  name    = "db.example.com"
  type    = "CNAME"
  ttl     = 60
  records = [aws_db_instance.main.address]
}

resource "aws_route53_record" "cache_private" {
  zone_id = aws_route53_zone.private.zone_id
  name    = "cache.example.com"
  type    = "CNAME"
  ttl     = 60
  records = [aws_elasticache_cluster.main.cache_nodes[0].address]
}
This way, db.example.com resolves correctly inside your VPC but doesn't exist in public DNS. The internal topology stays internal.
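It's worth verifying the split from both sides rather than assuming it. A sketch — the resolver addresses are placeholders (the VPC resolver is typically the VPC CIDR base plus two):

```shell
# Succeeds when a dig response header reports NXDOMAIN.
is_nxdomain() {
  echo "$1" | grep -q 'status: NXDOMAIN'
}

# From inside the VPC (hypothetical resolver 10.0.0.2), db.example.com
# should resolve; from a public resolver it should not exist at all:
# dig +short db.example.com @10.0.0.2
# is_nxdomain "$(dig db.example.com @8.8.8.8 +noall +comments)" \
#   && echo "internal name not leaked"
```

Running the public-side check in CI after DNS changes catches the day someone accidentally adds an internal record to the public zone.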
Monitoring Integration
Prometheus + DNS
DNS servers expose Prometheus metrics either natively (CoreDNS, dnsdist) or through exporters (bind_exporter for BIND). But for most teams, the more useful monitoring is at the application/infrastructure level:
- NXDOMAIN rate spike = something is querying for names that don't exist (could be a misconfigured service, could be a sign of malware)
- SERVFAIL rate spike = authoritative server issues
- Resolution latency P95 above threshold = resolver degradation
- Query volume drop = traffic loss (if your app is querying DNS and suddenly stops, something upstream broke)
A Prometheus alert rule for NXDOMAIN rate:
groups:
  - name: dns_alerts
    rules:
      - alert: HighNXDOMAINRate
        expr: rate(coredns_dns_responses_total{rcode="NXDOMAIN"}[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High NXDOMAIN rate on CoreDNS"
          description: "NXDOMAIN responses are above 10/s for 2 minutes. Check for misconfigured services."
      - alert: DNSResolutionLatencyHigh
        expr: histogram_quantile(0.95, rate(coredns_dns_request_duration_seconds_bucket[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DNS resolution P95 latency above 100ms"
External DNS Health Checks
For authoritative DNS, monitoring from inside your infrastructure isn't enough — you need external validation that your domains are resolving correctly from the internet. Services like Catchpoint, Pingdom, or Uptime Robot can send DNS queries to your authoritative servers from multiple global locations and alert on failures or answer mismatches.
At minimum, configure external health checks for:
- Your primary domain's A record
- MX records (email delivery depends on this)
- SOA serial checks to verify all providers are serving the same zone version, if you run multi-provider DNS
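The multi-provider serial check is a one-liner per provider. A sketch — the nameserver hostnames are placeholders, and dig +short SOA prints mname, rname, then the serial:

```shell
# Extract the serial (third field) from a dig +short SOA answer, e.g.:
# "ns1.example.com. hostmaster.example.com. 2024010101 7200 900 1209600 86400"
soa_serial() {
  echo "$1" | awk '{print $3}'
}

# Compare across providers (network calls, hypothetical nameservers):
# s1=$(soa_serial "$(dig +short SOA example.com @ns-1.awsdns-00.com)")
# s2=$(soa_serial "$(dig +short SOA example.com @ada.ns.cloudflare.com)")
# [ "$s1" = "$s2" ] || echo "ALERT: providers out of sync ($s1 vs $s2)"
```

Note that API-synced multi-provider setups (like the OctoDNS configuration above) may legitimately show different serials per provider; in that case compare record contents rather than serials.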
Key Takeaways
- DNS-as-code with Terraform or OctoDNS gives you the same audit trail, review process, and reproducibility you expect from application deployments. Manual DNS changes are tech debt.
- OctoDNS is the right tool for multi-provider authoritative DNS managed from a single source of truth.
- External-dns handles the Kubernetes integration problem, but doesn't fix the container resolver configuration problem — those need separate attention.
- Blue/green via DNS TTL works reliably if you respect the TTL expiry window before switching.
- Split-horizon DNS in IaC keeps your internal topology out of public DNS. This is a security decision, not just an architecture preference.
- Monitor NXDOMAIN and SERVFAIL rates. Both are early signals for problems that will become customer-visible if left unaddressed.
Up Next
Lesson 3 — Best Practices Synthesis: Every key recommendation from the course, organized by role — developer, ops/SRE, and domain manager.