DNS failover automatically redirects traffic from a failed server to a healthy backup by updating DNS records in response to health check failures. This guide covers building a DNS failover system with minimal downtime, using both self-hosted tooling and cloud DNS services.
DNS Failover Approaches
- Active health checks — monitor server health and update DNS records automatically
- Multiple A records — let clients retry failed servers (basic, unreliable)
- Cloud DNS failover — AWS Route 53, Cloudflare, etc. with built-in health checks
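The multiple-A-records approach relies entirely on client behavior, which is why it is unreliable: some clients retry the next address, many do not. The retry it depends on looks roughly like this sketch, where `try_connect` is a stand-in for a real TCP/HTTP connection attempt and we pretend the first server is down:

```shell
# Client-side retry across the A records returned for a name.
# try_connect is a placeholder for a real connection attempt;
# here it simulates 203.0.113.1 being down and 203.0.113.2 up.
try_connect() {
    [ "$1" = "203.0.113.2" ]
}

CONNECTED=""
for ip in 203.0.113.1 203.0.113.2; do
    if try_connect "$ip"; then
        CONNECTED="$ip"
        echo "connected to $ip"
        break
    fi
    echo "$ip failed, trying next A record"
done
```

Whether this loop happens at all is up to each client's resolver and application stack, so there is no guaranteed failover time.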
Self-Hosted DNS Failover with Health Checks
#!/bin/bash
# /usr/local/bin/dns-failover.sh
# Monitor the primary server and update DNS on failure
PRIMARY_IP="203.0.113.1"
BACKUP_IP="203.0.113.2"
HOSTNAME="www.example.com"
DNS_SERVER="ns1.example.com"
TSIG_KEY="/etc/dns/failover-key.conf"
TTL=60
LOG="/var/log/dns-failover.log"
STATE_FILE="/var/run/dns-failover-state"

# Current state: which server the record points to
CURRENT_STATE=$(cat "$STATE_FILE" 2>/dev/null || echo "primary")

# Health check: expect HTTP 200 from the /health endpoint
check_health() {
    local ip=$1
    local http_code
    http_code=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 5 --max-time 10 "http://${ip}/health" 2>/dev/null)
    [ "$http_code" = "200" ]
}

# Point $HOSTNAME at the given IP via RFC 2136 dynamic update
update_dns() {
    local ip=$1
    nsupdate -k "$TSIG_KEY" <<EOF
server $DNS_SERVER
update delete $HOSTNAME A
update add $HOSTNAME $TTL A $ip
send
EOF
}

# Consecutive-failure tracking to avoid flapping on a single blip
FAIL_FILE="/var/run/dns-failover-fails"
MAX_FAILS=3
PRIMARY_HEALTHY=true
if ! check_health "$PRIMARY_IP"; then
    FAILS=$(cat "$FAIL_FILE" 2>/dev/null || echo 0)
    FAILS=$((FAILS + 1))
    echo "$FAILS" > "$FAIL_FILE"
    if [ "$FAILS" -ge "$MAX_FAILS" ]; then
        PRIMARY_HEALTHY=false
    fi
else
    echo 0 > "$FAIL_FILE"
fi

# Failover / failback logic
if [ "$PRIMARY_HEALTHY" = false ] && [ "$CURRENT_STATE" = "primary" ]; then
    echo "[$(date)] PRIMARY DOWN - Failing over to backup ($BACKUP_IP)" >> "$LOG"
    update_dns "$BACKUP_IP" && echo "backup" > "$STATE_FILE"
elif [ "$PRIMARY_HEALTHY" = true ] && [ "$CURRENT_STATE" = "backup" ]; then
    echo "[$(date)] PRIMARY RECOVERED - Failing back to primary ($PRIMARY_IP)" >> "$LOG"
    update_dns "$PRIMARY_IP" && echo "primary" > "$STATE_FILE"
fi
# Run every 30 seconds via cron or a systemd timer. Cron's finest
# granularity is one minute, so schedule two staggered entries:
* * * * * /usr/local/bin/dns-failover.sh
* * * * * sleep 30 && /usr/local/bin/dns-failover.sh
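The systemd-timer alternative might look like the following pair of units (file and unit names are illustrative):

```ini
# /etc/systemd/system/dns-failover.service
[Unit]
Description=DNS failover health check

[Service]
Type=oneshot
ExecStart=/usr/local/bin/dns-failover.sh

# /etc/systemd/system/dns-failover.timer
[Unit]
Description=Run DNS failover check every 30 seconds

[Timer]
OnBootSec=30s
OnUnitActiveSec=30s
AccuracySec=1s

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now dns-failover.timer`. Note `AccuracySec=1s`: without it, systemd coalesces timer firings within a default one-minute window, defeating the 30-second cadence.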
AWS Route 53 Health Checks
# Create health check
aws route53 create-health-check --caller-reference $(date +%s) --health-check-config '{
"IPAddress": "203.0.113.1",
"Port": 443,
"Type": "HTTPS",
"ResourcePath": "/health",
"RequestInterval": 10,
"FailureThreshold": 3
}'
# Create failover records (substitute the Id returned by create-health-check for health-check-id)
aws route53 change-resource-record-sets --hosted-zone-id Z1234 --change-batch '{
"Changes": [
{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "www.example.com",
"Type": "A",
"SetIdentifier": "primary",
"Failover": "PRIMARY",
"HealthCheckId": "health-check-id",
"TTL": 60,
"ResourceRecords": [{"Value": "203.0.113.1"}]
}
},
{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "www.example.com",
"Type": "A",
"SetIdentifier": "backup",
"Failover": "SECONDARY",
"TTL": 60,
"ResourceRecords": [{"Value": "203.0.113.2"}]
}
}
]
}'
Cloudflare Load Balancing with Failover
# Cloudflare provides DNS-based load balancing with health checks
# Configure via dashboard or API:
# 1. Create a health monitor
# Type: HTTPS, Path: /health, Interval: 60s, Retries: 2
# 2. Create a pool with two origins
# Primary: 203.0.113.1 (weight: 1)
# Backup: 203.0.113.2 (weight: 0, only used on failover)
# 3. Create a load balancer
# Record: www.example.com
# Steering: failover (route to first healthy pool)
# Fallback pool: backup
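If you script this setup instead of using the dashboard, the three objects correspond roughly to the following JSON payloads. This is a sketch: field names follow Cloudflare's v4 load-balancing API, and the MONITOR_ID/POOL_ID placeholders would be filled in from the responses to the earlier calls.

```
# POST accounts/{account_id}/load_balancers/monitors
{"type": "https", "path": "/health", "interval": 60, "retries": 2, "expected_codes": "200"}

# POST accounts/{account_id}/load_balancers/pools  (one pool per origin group)
{"name": "www-primary", "origins": [{"name": "primary", "address": "203.0.113.1", "enabled": true}], "monitor": "MONITOR_ID"}
{"name": "www-backup", "origins": [{"name": "backup", "address": "203.0.113.2", "enabled": true}], "monitor": "MONITOR_ID"}

# POST zones/{zone_id}/load_balancers
{"name": "www.example.com", "default_pools": ["PRIMARY_POOL_ID"], "fallback_pool": "BACKUP_POOL_ID", "steering_policy": "off"}
```

With `steering_policy` set to `off`, traffic goes to the first healthy pool in `default_pools`, falling through to `fallback_pool` when none is healthy.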
Failover Considerations
- TTL impact — DNS failover is limited by TTL caching. Use 30-60 second TTL for failover records
- Health check frequency — check every 10-30 seconds; require 3 consecutive failures before failover
- Failback — decide whether to automatically fail back or require manual intervention
- Session persistence — users with cached DNS may still connect to the failed server during TTL window
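To make the TTL point concrete: worst-case time-to-recovery is roughly the detection time (check interval × failure threshold) plus the TTL a resolver may still be caching. A quick back-of-envelope using the Route 53 numbers above:

```shell
# Worst-case failover delay = detection time + cached-TTL window
INTERVAL=10    # seconds between health checks
THRESHOLD=3    # consecutive failures before failover triggers
TTL=60         # record TTL in seconds
DETECTION=$((INTERVAL * THRESHOLD))
WORST_CASE=$((DETECTION + TTL))
echo "detection: ${DETECTION}s, worst case until clients move: ${WORST_CASE}s"
```

So even with aggressive settings, some clients may keep hitting the dead server for a minute or more; resolvers that ignore low TTLs can stretch this further.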
Best Practices
- Use low TTL (30-60 seconds) for records under failover management
- Require multiple consecutive failures before triggering failover to avoid flapping
- Monitor and alert on failover events — they indicate infrastructure problems
- Test failover regularly by simulating primary server failure
- Consider whether automatic failback is safe or if manual verification is needed
- Use cloud DNS services for production failover — they have global health check infrastructure
- DNS failover has inherent delay (TTL + detection time) — for faster failover, use IP anycast or load balancers