Disaster Recovery with Failover Servers

By Admin · Mar 15, 2026 · Updated Jun 25, 2026

Understanding Disaster Recovery

Disaster recovery (DR) restores infrastructure and data after catastrophic events. A well-planned strategy with failover servers can cut actual recovery time from hours to minutes, which makes a much tighter Recovery Time Objective (RTO) realistic.

Key DR Metrics

RTO (Recovery Time Objective): maximum acceptable downtime
RPO (Recovery Point Objective): maximum acceptable data loss, measured in time
MTTR (Mean Time To Recovery): average time actually taken to restore service

Database Replication Setup

# PRIMARY - /etc/mysql/mysql.conf.d/mysqld.cnf
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
bind-address = 0.0.0.0

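# Create replication user (run in the mysql client on the primary)
CREATE USER 'repl'@'DR_IP' IDENTIFIED BY 'strong_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'DR_IP';

# DR SERVER - /etc/mysql/mysql.conf.d/mysqld.cnf
[mysqld]
server-id = 2
relay_log = /var/log/mysql/mysql-relay-bin.log
read_only = 1

# Run in the mysql client on the DR server (syntax requires MySQL 8.0.23+).
# SOURCE_LOG_FILE and SOURCE_LOG_POS below are placeholders: use the values
# reported by SHOW MASTER STATUS on the primary (SHOW BINARY LOG STATUS on MySQL 8.4+).
# If 'repl' uses caching_sha2_password (the MySQL 8 default), also add
# SOURCE_SSL=1 or GET_SOURCE_PUBLIC_KEY=1, or the replica cannot authenticate.
CHANGE REPLICATION SOURCE TO
    SOURCE_HOST='PRIMARY_IP', SOURCE_USER='repl',
    SOURCE_PASSWORD='strong_password',
    SOURCE_LOG_FILE='mysql-bin.000001', SOURCE_LOG_POS=154;
START REPLICA;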
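
Once replication is started, it is worth confirming that both replica threads are running and that lag stays low. A minimal sketch, assuming MySQL 8.0.22+ field names in SHOW REPLICA STATUS output; the function name replica_health and the 30-second MAX_LAG_SECONDS threshold are illustrative assumptions, not values from this article:

```shell
#!/bin/bash
# Hypothetical lag threshold; pick one that matches your RPO.
MAX_LAG_SECONDS=30

# Reads "SHOW REPLICA STATUS\G" text on stdin.
# Prints a one-line verdict; returns 2 if replication is stopped,
# 1 if it is lagging beyond MAX_LAG_SECONDS, 0 otherwise.
replica_health() {
    local status io sql lag
    status=$(cat)
    io=$(printf '%s\n' "$status" | awk -F': ' '/Replica_IO_Running:/ {print $2}')
    sql=$(printf '%s\n' "$status" | awk -F': ' '/Replica_SQL_Running:/ {print $2}')
    lag=$(printf '%s\n' "$status" | awk -F': ' '/Seconds_Behind_Source:/ {print $2}')

    if [ "$io" != "Yes" ] || [ "$sql" != "Yes" ]; then
        echo "CRITICAL: replication threads not running (IO=$io SQL=$sql)"
        return 2
    fi
    # Seconds_Behind_Source is NULL when the lag cannot be determined.
    case "$lag" in
        ''|*[!0-9]*)
            echo "CRITICAL: lag unknown (Seconds_Behind_Source=$lag)"
            return 2 ;;
    esac
    if [ "$lag" -gt "$MAX_LAG_SECONDS" ]; then
        echo "WARNING: replica is ${lag}s behind the source"
        return 1
    fi
    echo "OK: replica is ${lag}s behind the source"
}
```

On the DR server this could be run as: mysql -e 'SHOW REPLICA STATUS\G' | replica_health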

File Sync with lsyncd

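sudo apt install -y lsyncd
sudo mkdir -p /etc/lsyncd   # the package may not create this directory

# Quoted 'EOF' keeps the shell from expanding anything inside the Lua config
sudo tee /etc/lsyncd/lsyncd.conf.lua > /dev/null << 'EOF'
settings { logfile="/var/log/lsyncd.log" }
sync {
    default.rsync,
    source="/var/www", target="DR_IP:/var/www",
    delay=5,
    rsync={ archive=true, compress=true,
        rsh="/usr/bin/ssh -i /root/.ssh/dr_key" }
}
EOF
sudo systemctl enable --now lsyncd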
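
To check that lsyncd is keeping the DR copy current, one option is an rsync dry run that lists what would still need to be transferred. A minimal sketch, assuming the same paths and SSH key as the config above; count_pending_changes is a hypothetical helper name:

```shell
#!/bin/bash
# Counts the itemized changes that "rsync --dry-run --itemize-changes"
# prints on stdin (one line per file that differs or would be deleted).
count_pending_changes() {
    # grep -c exits non-zero when the count is 0, so ignore its status.
    grep -c . || true
}
```

Run on the primary, a result of 0 means the two trees match:

rsync -a --dry-run --itemize-changes --delete -e "ssh -i /root/.ssh/dr_key" /var/www/ DR_IP:/var/www/ | count_pending_changes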

Automated Failover

#!/bin/bash
# /usr/local/bin/dr-failover.sh
PRIMARY_IP="203.0.113.10"
DR_IP="203.0.113.20"
DOMAIN="myapp.example.com"
CF_API_TOKEN="your-token"
CF_ZONE_ID="your-zone"
MAX_FAILURES=3

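check_primary() {
    # curl already prints 000 for %{http_code} when the request fails,
    # so no fallback echo is needed (it would produce "000000").
    local status
    status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
        "http://${PRIMARY_IP}/health" 2>/dev/null)
    [ "$status" = "200" ]
}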

perform_failover() {
    echo "FAILOVER: Promoting DR server"
    # Stop replication and make the DR database writable
    ssh "root@${DR_IP}" "mysql -e 'STOP REPLICA; RESET REPLICA ALL; SET GLOBAL read_only=0;'"

    # Start the application stack before traffic is redirected to it
    ssh "root@${DR_IP}" "systemctl start nginx php8.2-fpm redis-server"

    # Update Cloudflare DNS (look up the A record, then point it at the DR server)
    RECORD_ID=$(curl -s "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records?type=A&name=$DOMAIN" \
        -H "Authorization: Bearer $CF_API_TOKEN" | jq -r '.result[0].id')
    if [ -z "$RECORD_ID" ] || [ "$RECORD_ID" = "null" ]; then
        echo "FAILOVER: no A record found for $DOMAIN, DNS not updated" >&2
        return 1
    fi
    curl -s -X PUT "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records/$RECORD_ID" \
        -H "Authorization: Bearer $CF_API_TOKEN" -H "Content-Type: application/json" \
        --data "{\"type\":\"A\",\"name\":\"$DOMAIN\",\"content\":\"$DR_IP\",\"ttl\":60,\"proxied\":true}"
}

FAILURES=0
while true; do
    if check_primary; then FAILURES=0
    else
        FAILURES=$((FAILURES + 1))
        [ "$FAILURES" -ge "$MAX_FAILURES" ] && { perform_failover; exit $?; }
    fi
    sleep 30
done
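
The loop settings above determine how long an outage can go unnoticed before failover starts, which counts against the RTO. A rough worked estimate, assuming every failing check runs into the full curl timeout; worst_case_detection is a hypothetical helper, and the three arguments are MAX_FAILURES, the sleep interval, and the --max-time value from the script:

```shell
#!/bin/bash
# Worst-case seconds between the primary failing and perform_failover starting.
# Arguments: max_failures, sleep_seconds, curl_timeout_seconds.
worst_case_detection() {
    local max_failures=$1 interval=$2 timeout=$3
    # Up to one full sleep can pass before the first failing check starts.
    # Each failing check may then take the full timeout, with one sleep
    # between consecutive checks (max_failures - 1 sleeps in total).
    echo $(( interval + max_failures * timeout + (max_failures - 1) * interval ))
}

worst_case_detection 3 30 10   # prints 120
```

With the values used in dr-failover.sh that is up to two minutes before promotion begins, plus whatever time clients keep serving cached DNS answers.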

Failback Checklist

Enable maintenance mode on DR server
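Sync data from the DR server back to the primary (mysqldump for the database, rsync for files)
Update DNS back to the primary IP
Re-establish the DR server as a read-only replica (read_only=1, then CHANGE REPLICATION SOURCE TO and START REPLICA)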
Remove maintenance mode

Testing Your DR Plan

Schedule quarterly DR drills with actual failover during maintenance windows
Document every step with exact commands and expected outcomes
Measure actual RTO and RPO during drills
Keep the DR runbook accessible from outside the primary infrastructure
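
Measuring RTO during a drill comes down to recording two timestamps and comparing the difference with the target. A minimal sketch; check_objective and the 300-second RTO_TARGET_SECONDS are illustrative assumptions, so substitute the targets from your own DR plan:

```shell
#!/bin/bash
# Hypothetical target for illustration only.
RTO_TARGET_SECONDS=300

# Compares a measured outage against a target.
# Arguments: outage start (epoch seconds), service restored (epoch seconds),
# target in seconds. Returns non-zero when the target was missed.
check_objective() {
    local started=$1 restored=$2 target=$3
    local elapsed=$(( restored - started ))
    if [ "$elapsed" -le "$target" ]; then
        echo "PASS: ${elapsed}s (target ${target}s)"
    else
        echo "FAIL: ${elapsed}s (target ${target}s)"
        return 1
    fi
}
```

During a drill, capture date +%s when the primary is taken down and again when the health endpoint on the DR server first returns 200, then call check_objective "$started" "$restored" "$RTO_TARGET_SECONDS". The same comparison works for RPO if the two timestamps are the last replicated write and the moment of failure.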

Best Practices

Use a geographically separate DR location
Automate failover but require human confirmation for ambiguous cases
Monitor replication lag continuously
Keep DR server software versions and configuration in sync with the primary
Use low-TTL DNS (60-300s) for fast propagation during failover
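
One way to separate clear outages from ambiguous ones is to compare the monitor's own view with a second, independent vantage point and only automate when both agree. A minimal sketch of that decision; decide_action and the "up"/"down" inputs are hypothetical, and how the second observation is obtained is left open:

```shell
#!/bin/bash
# Decides what to do from two independent health observations.
#   $1: "up" or "down" as seen by this monitor
#   $2: "up" or "down" as seen by a second vantage point
# Prints "none", "failover", or "confirm" (ask a human before acting).
decide_action() {
    case "$1:$2" in
        up:up)     echo "none" ;;
        down:down) echo "failover" ;;
        *)         echo "confirm" ;;   # observers disagree: possible network partition
    esac
}
```

A monitoring loop could call perform_failover only when decide_action prints "failover", and page an operator when it prints "confirm".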