Building a DNS Failover System

By Admin · Mar 15, 2026 · Updated Apr 24, 2026

DNS failover redirects traffic from a failed server to a healthy backup by updating DNS records when health checks fail. This guide builds a failover system two ways: a self-hosted script that drives dynamic DNS updates, and managed cloud DNS services with built-in health checks.

DNS Failover Approaches

  • Active health checks — monitor server health and update DNS records automatically
  • Multiple A records — let clients retry failed servers (basic, unreliable)
  • Cloud DNS failover — AWS Route 53, Cloudflare, etc. with built-in health checks

Self-Hosted DNS Failover with Health Checks

#!/bin/bash
# /usr/local/bin/dns-failover.sh
# Monitor primary server and update DNS on failure

PRIMARY_IP="203.0.113.1"
BACKUP_IP="203.0.113.2"
HOSTNAME="www.example.com"
DNS_SERVER="ns1.example.com"
TSIG_KEY="/etc/dns/failover-key.conf"
# The health URL is built per host inside check_health (http://<ip>/health)
LOG="/var/log/dns-failover.log"
STATE_FILE="/var/run/dns-failover-state"

# Current state
CURRENT_STATE=$(cat "$STATE_FILE" 2>/dev/null || echo "primary")

# Health check
check_health() {
    local ip=$1
    local http_code
    # Request the HTTP health endpoint; only a 200 response counts as healthy
    http_code=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 5 --max-time 10 "http://${ip}/health" 2>/dev/null)
    [ "$http_code" = "200" ]
}

# Consecutive failure tracking
FAIL_FILE="/var/run/dns-failover-fails"
MAX_FAILS=3

PRIMARY_HEALTHY=true
if ! check_health "$PRIMARY_IP"; then
    FAILS=$(cat "$FAIL_FILE" 2>/dev/null || echo 0)
    FAILS=$((FAILS + 1))
    echo "$FAILS" > "$FAIL_FILE"

    if [ "$FAILS" -ge "$MAX_FAILS" ]; then
        PRIMARY_HEALTHY=false
    fi
else
    echo 0 > "$FAIL_FILE"
fi

# Failover logic
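if [ "$PRIMARY_HEALTHY" = false ] && [ "$CURRENT_STATE" = "primary" ]; then
    echo "[$(date)] PRIMARY DOWN - Failing over to backup ($BACKUP_IP)" >> "$LOG"
    # Replace the A record via dynamic update; the 60-second TTL is an assumed value.
    # The state file is written only if nsupdate succeeds.
    printf 'server %s\nupdate delete %s A\nupdate add %s 60 A %s\nsend\n' \
        "$DNS_SERVER" "$HOSTNAME" "$HOSTNAME" "$BACKUP_IP" | nsupdate -k "$TSIG_KEY" \
        && echo "backup" > "$STATE_FILE"

elif [ "$PRIMARY_HEALTHY" = true ] && [ "$CURRENT_STATE" = "backup" ]; then
    echo "[$(date)] PRIMARY RECOVERED - Failing back to primary ($PRIMARY_IP)" >> "$LOG"
    # Point the A record back at the primary (same assumed 60-second TTL)
    printf 'server %s\nupdate delete %s A\nupdate add %s 60 A %s\nsend\n' \
        "$DNS_SERVER" "$HOSTNAME" "$HOSTNAME" "$PRIMARY_IP" | nsupdate -k "$TSIG_KEY" \
        && echo "primary" > "$STATE_FILE"
fi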
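
# Crontab entries (separate from the script above).
# cron's finest granularity is one minute, so a second entry delayed by
# 30 seconds gives an effective 30-second interval.
* * * * * /usr/local/bin/dns-failover.sh
* * * * * sleep 30 && /usr/local/bin/dns-failover.sh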
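
After a failover it can be worth confirming that the authoritative server really answers with the new address. The following is a minimal sketch, assuming `dig` is installed; the function name `verify_record` and its argument order are hypothetical and not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper: check that an authoritative server returns the expected A record.
# Usage: verify_record <dns-server> <record-name> <expected-ip>
verify_record() {
    local server=$1 name=$2 expected_ip=$3
    local answer
    # Query the authoritative server directly so resolver caches do not hide the change
    answer=$(dig +short "@${server}" "$name" A 2>/dev/null)
    # Succeed only if one returned line is exactly the expected address
    printf '%s\n' "$answer" | grep -qxF "$expected_ip"
}
```

For example, `verify_record ns1.example.com www.example.com 203.0.113.2 || echo "update not visible yet"` could run right after the update and feed an alert if the record did not change.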

AWS Route 53 Health Checks

# Create health check
aws route53 create-health-check --caller-reference "$(date +%s)" --health-check-config '{
    "IPAddress": "203.0.113.1",
    "Port": 443,
    "Type": "HTTPS",
    "ResourcePath": "/health",
    "RequestInterval": 10,
    "FailureThreshold": 3
}'

# Create failover records
aws route53 change-resource-record-sets --hosted-zone-id Z1234 --change-batch '{
    "Changes": [
        {
            "Action": "CREATE",
            "ResourceRecordSet": {
                "Name": "www.example.com",
                "Type": "A",
                "SetIdentifier": "primary",
                "Failover": "PRIMARY",
                "HealthCheckId": "health-check-id",
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.1"}]
            }
        },
        {
            "Action": "CREATE",
            "ResourceRecordSet": {
                "Name": "www.example.com",
                "Type": "A",
                "SetIdentifier": "backup",
                "Failover": "SECONDARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.2"}]
            }
        }
    ]
}'
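
The `HealthCheckId` value in the primary record has to be the ID that Route 53 returns when the health check is created. One way to wire the two steps together is sketched below. The function name `primary_change_batch` and the file name `health-check.json` are hypothetical, and the `aws` calls are left as comments so that the block can be read and tried without credentials.

```shell
#!/bin/bash
# Hypothetical helper: print the change batch for the PRIMARY failover record,
# with the given health check ID filled in.
primary_change_batch() {
    local health_check_id=$1
    cat <<EOF
{
    "Changes": [
        {
            "Action": "CREATE",
            "ResourceRecordSet": {
                "Name": "www.example.com",
                "Type": "A",
                "SetIdentifier": "primary",
                "Failover": "PRIMARY",
                "HealthCheckId": "${health_check_id}",
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.1"}]
            }
        }
    ]
}
EOF
}

# Possible usage (assumes the health check config is saved in health-check.json):
# HC_ID=$(aws route53 create-health-check --caller-reference "$(date +%s)" \
#     --health-check-config file://health-check.json \
#     --query 'HealthCheck.Id' --output text)
# aws route53 change-resource-record-sets --hosted-zone-id Z1234 \
#     --change-batch "$(primary_change_batch "$HC_ID")"
```

The secondary record needs no health check ID, so it can be created exactly as shown above.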

Cloudflare Load Balancing with Failover

# Cloudflare provides DNS-based load balancing with health checks
# Configure via dashboard or API:

# 1. Create a health monitor
# Type: HTTPS, Path: /health, Interval: 60s, Retries: 2

# 2. Create two pools, one origin each
# Primary pool: 203.0.113.1
# Backup pool: 203.0.113.2 (receives traffic only on failover)

# 3. Create a load balancer
# Record: www.example.com
# Pools, in priority order: primary, backup
# Steering: Off (failover), which routes to the first healthy pool in that order
# Fallback pool: backup
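
For the API route, the load balancer step might look like the sketch below. It assumes the primary and backup origins live in separate pools that already exist, and that `CF_API_TOKEN`, `ZONE_ID`, and the two pool IDs are supplied by you. The function name `lb_payload` is hypothetical, and the request itself is left as a comment.

```shell
#!/bin/bash
# Hypothetical helper: print a load balancer payload that tries the primary pool
# first and the backup pool second, with steering turned off (plain failover order).
lb_payload() {
    local name=$1 primary_pool_id=$2 backup_pool_id=$3
    cat <<EOF
{
    "name": "${name}",
    "default_pools": ["${primary_pool_id}", "${backup_pool_id}"],
    "fallback_pool": "${backup_pool_id}",
    "steering_policy": "off"
}
EOF
}

# Possible usage (all IDs and the token are placeholders you must provide):
# curl -s -X POST "https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/load_balancers" \
#     -H "Authorization: Bearer ${CF_API_TOKEN}" \
#     -H "Content-Type: application/json" \
#     --data "$(lb_payload www.example.com "$PRIMARY_POOL_ID" "$BACKUP_POOL_ID")"
```

Check the field names against the current Cloudflare API reference before relying on this, since the payload here is written from memory of that API rather than taken from this guide.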

Failover Considerations

  • TTL impact — DNS failover is limited by TTL caching. Use 30-60 second TTL for failover records
  • Health check frequency — check every 10-30 seconds; require 3 consecutive failures before failover
  • Failback — decide whether to automatically fail back or require manual intervention
  • Stale caches — clients and resolvers that cached the old record may keep connecting to the failed server until the TTL expires
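
The TTL and health check settings combine into a rough worst-case failover window: the time to accumulate enough failed checks, plus the time for cached records to expire. The sketch below does that arithmetic. The function name `failover_window` is hypothetical, and the estimate ignores check timeouts and the time the DNS update itself takes.

```shell
#!/bin/bash
# Rough worst-case seconds between the primary failing and clients reaching the backup:
# (check interval x consecutive failures required) + record TTL
failover_window() {
    local interval=$1 max_fails=$2 ttl=$3
    echo $(( interval * max_fails + ttl ))
}

# 30-second checks, 3 consecutive failures, 60-second TTL
failover_window 30 3 60   # prints 150
```

With 10-second checks and a 30-second TTL the same formula gives 60 seconds, which shows why both settings need to be lowered together.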

Best Practices

  • Use low TTL (30-60 seconds) for records under failover management
  • Require multiple consecutive failures before triggering failover to avoid flapping
  • Monitor and alert on failover events — they indicate infrastructure problems
  • Test failover regularly by simulating primary server failure
  • Consider whether automatic failback is safe or if manual verification is needed
  • Use cloud DNS services for production failover — they have global health check infrastructure
  • DNS failover has inherent delay (TTL + detection time) — for faster failover, use IP anycast or load balancers
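
One way to make automatic failback safer is to demand a longer streak of successful checks before returning to the primary than the failure streak that triggers failover. The sketch below shows only the decision step. The function name `next_state` and the `MIN_SUCCESSES` value are assumptions for illustration, while `MAX_FAILS=3` matches the script earlier in this guide.

```shell
#!/bin/bash
MAX_FAILS=3        # consecutive failures before failing over
MIN_SUCCESSES=5    # assumed value: consecutive successes before failing back

# Print the state to move to, given the current state and the current
# streaks of consecutive failed and successful health checks.
next_state() {
    local state=$1 fail_streak=$2 ok_streak=$3
    if [ "$state" = "primary" ] && [ "$fail_streak" -ge "$MAX_FAILS" ]; then
        echo "backup"
    elif [ "$state" = "backup" ] && [ "$ok_streak" -ge "$MIN_SUCCESSES" ]; then
        echo "primary"
    else
        echo "$state"
    fi
}
```

Because the two thresholds differ, a primary that recovers briefly and fails again does not bounce the record back and forth. Setting `MIN_SUCCESSES` very high approximates manual failback.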
