DNS failover automatically redirects traffic from a failed server to a healthy backup by updating DNS records in response to health check failures. This guide covers building a DNS failover system with minimal downtime, using both self-hosted tooling and cloud DNS services.
DNS Failover Approaches
- Active health checks — monitor server health and update DNS records automatically
- Multiple A records — let clients retry failed servers (basic, unreliable)
- Cloud DNS failover — AWS Route 53, Cloudflare, etc. with built-in health checks
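The multiple-A-records approach relies entirely on client behavior, which is why it is unreliable: some clients retry the next address, many do not. The retry it depends on looks roughly like this sketch, where `try_connect` is a stand-in for a real TCP/HTTP connection attempt and we pretend the first server is down:

```shell
# Client-side retry across the A records returned for a name.
# try_connect is a placeholder for a real connection attempt;
# here it simulates 203.0.113.1 being down and 203.0.113.2 up.
try_connect() {
    [ "$1" = "203.0.113.2" ]
}

CONNECTED=""
for ip in 203.0.113.1 203.0.113.2; do
    if try_connect "$ip"; then
        CONNECTED="$ip"
        echo "connected to $ip"
        break
    fi
    echo "$ip failed, trying next A record"
done
```

Whether this loop happens at all is up to each client's resolver and application stack, so there is no guaranteed failover time.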
Self-Hosted DNS Failover with Health Checks
#!/bin/bash
# /usr/local/bin/dns-failover.sh
# Monitor the primary server and update DNS on failure
PRIMARY_IP="203.0.113.1"
BACKUP_IP="203.0.113.2"
HOSTNAME="www.example.com"
DNS_SERVER="ns1.example.com"
TSIG_KEY="/etc/dns/failover-key.conf"
TTL=60
LOG="/var/log/dns-failover.log"
STATE_FILE="/var/run/dns-failover-state"

# Current state: which server the record points to
CURRENT_STATE=$(cat "$STATE_FILE" 2>/dev/null || echo "primary")

# Health check: expect HTTP 200 from the /health endpoint
check_health() {
    local ip=$1
    local http_code
    http_code=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 5 --max-time 10 "http://${ip}/health" 2>/dev/null)
    [ "$http_code" = "200" ]
}

# Point $HOSTNAME at the given IP via RFC 2136 dynamic update
update_dns() {
    local ip=$1
    nsupdate -k "$TSIG_KEY" <<EOF
server $DNS_SERVER
update delete $HOSTNAME A
update add $HOSTNAME $TTL A $ip
send
EOF
}

# Consecutive-failure tracking to avoid flapping on a single blip
FAIL_FILE="/var/run/dns-failover-fails"
MAX_FAILS=3
PRIMARY_HEALTHY=true
if ! check_health "$PRIMARY_IP"; then
    FAILS=$(cat "$FAIL_FILE" 2>/dev/null || echo 0)
    FAILS=$((FAILS + 1))
    echo "$FAILS" > "$FAIL_FILE"
    if [ "$FAILS" -ge "$MAX_FAILS" ]; then
        PRIMARY_HEALTHY=false
    fi
else
    echo 0 > "$FAIL_FILE"
fi

# Failover / failback logic
if [ "$PRIMARY_HEALTHY" = false ] && [ "$CURRENT_STATE" = "primary" ]; then
    echo "[$(date)] PRIMARY DOWN - Failing over to backup ($BACKUP_IP)" >> "$LOG"
    update_dns "$BACKUP_IP" && echo "backup" > "$STATE_FILE"
elif [ "$PRIMARY_HEALTHY" = true ] && [ "$CURRENT_STATE" = "backup" ]; then
    echo "[$(date)] PRIMARY RECOVERED - Failing back to primary ($PRIMARY_IP)" >> "$LOG"
    update_dns "$PRIMARY_IP" && echo "primary" > "$STATE_FILE"
fi
# Run every 30 seconds via cron or a systemd timer. Cron's finest
# granularity is one minute, so schedule two staggered entries:
* * * * * /usr/local/bin/dns-failover.sh
* * * * * sleep 30 && /usr/local/bin/dns-failover.sh
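The systemd-timer alternative might look like the following pair of units (file and unit names are illustrative):

```ini
# /etc/systemd/system/dns-failover.service
[Unit]
Description=DNS failover health check

[Service]
Type=oneshot
ExecStart=/usr/local/bin/dns-failover.sh

# /etc/systemd/system/dns-failover.timer
[Unit]
Description=Run DNS failover check every 30 seconds

[Timer]
OnBootSec=30s
OnUnitActiveSec=30s
AccuracySec=1s

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now dns-failover.timer`. Note `AccuracySec=1s`: without it, systemd coalesces timer firings within a default one-minute window, defeating the 30-second cadence.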
AWS Route 53 Health Checks
# Create health check
aws route53 create-health-check --caller-reference $(date +%s) --health-check-config '{
"IPAddress": "203.0.113.1",
"Port": 443,
"Type": "HTTPS",
"ResourcePath": "/health",
"RequestInterval": 10,
"FailureThreshold": 3
}'
# Create failover records (substitute the Id returned by create-health-check for health-check-id)
aws route53 change-resource-record-sets --hosted-zone-id Z1234 --change-batch '{
"Changes": [
{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "www.example.com",
"Type": "A",
"SetIdentifier": "primary",
"Failover": "PRIMARY",
"HealthCheckId": "health-check-id",
"TTL": 60,
"ResourceRecords": [{"Value": "203.0.113.1"}]
}
},
{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "www.example.com",
"Type": "A",
"SetIdentifier": "backup",
"Failover": "SECONDARY",
"TTL": 60,
"ResourceRecords": [{"Value": "203.0.113.2"}]
}
}
]
}'
Cloudflare Load Balancing with Failover
# Cloudflare provides DNS-based load balancing with health checks
# Configure via dashboard or API:
# 1. Create a health monitor
# Type: HTTPS, Path: /health, Interval: 60s, Retries: 2
# 2. Create a pool with two origins
# Primary: 203.0.113.1 (weight: 1)
# Backup: 203.0.113.2 (weight: 0, only used on failover)
# 3. Create a load balancer
# Record: www.example.com
# Steering: failover (route to first healthy pool)
# Fallback pool: backup
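If you script this setup instead of using the dashboard, the three objects correspond roughly to the following JSON payloads. This is a sketch: field names follow Cloudflare's v4 load-balancing API, and the MONITOR_ID/POOL_ID placeholders would be filled in from the responses to the earlier calls.

```
# POST accounts/{account_id}/load_balancers/monitors
{"type": "https", "path": "/health", "interval": 60, "retries": 2, "expected_codes": "200"}

# POST accounts/{account_id}/load_balancers/pools  (one pool per origin group)
{"name": "www-primary", "origins": [{"name": "primary", "address": "203.0.113.1", "enabled": true}], "monitor": "MONITOR_ID"}
{"name": "www-backup", "origins": [{"name": "backup", "address": "203.0.113.2", "enabled": true}], "monitor": "MONITOR_ID"}

# POST zones/{zone_id}/load_balancers
{"name": "www.example.com", "default_pools": ["PRIMARY_POOL_ID"], "fallback_pool": "BACKUP_POOL_ID", "steering_policy": "off"}
```

With `steering_policy` set to `off`, traffic goes to the first healthy pool in `default_pools`, falling through to `fallback_pool` when none is healthy.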
Failover Considerations
- TTL impact — DNS failover is limited by TTL caching. Use 30-60 second TTL for failover records
- Health check frequency — check every 10-30 seconds; require 3 consecutive failures before failover
- Failback — decide whether to automatically fail back or require manual intervention
- Session persistence — users with cached DNS may still connect to the failed server during TTL window
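To make the TTL point concrete: worst-case time-to-recovery is roughly the detection time (check interval × failure threshold) plus the TTL a resolver may still be caching. A quick back-of-envelope using the Route 53 numbers above:

```shell
# Worst-case failover delay = detection time + cached-TTL window
INTERVAL=10    # seconds between health checks
THRESHOLD=3    # consecutive failures before failover triggers
TTL=60         # record TTL in seconds
DETECTION=$((INTERVAL * THRESHOLD))
WORST_CASE=$((DETECTION + TTL))
echo "detection: ${DETECTION}s, worst case until clients move: ${WORST_CASE}s"
```

So even with aggressive settings, some clients may keep hitting the dead server for a minute or more; resolvers that ignore low TTLs can stretch this further.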
Best Practices
- Use low TTL (30-60 seconds) for records under failover management
- Require multiple consecutive failures before triggering failover to avoid flapping
- Monitor and alert on failover events — they indicate infrastructure problems
- Test failover regularly by simulating primary server failure
- Consider whether automatic failback is safe or if manual verification is needed
- Use cloud DNS services for production failover — they have global health check infrastructure
- DNS failover has inherent delay (TTL + detection time) — for faster failover, use IP anycast or load balancers