
Disaster Recovery with Failover Servers

By Admin · Mar 15, 2026 · Updated Apr 24, 2026

Understanding Disaster Recovery

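Disaster recovery (DR) is the process of restoring infrastructure and data after a catastrophic event such as a data center outage, hardware failure, or data corruption. A well-planned strategy built around a standby failover server can bring the achievable Recovery Time Objective (RTO) down from hours to minutes.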

Key DR Metrics

  • RTO (Recovery Time Objective): maximum acceptable downtime
  • RPO (Recovery Point Objective): maximum acceptable data loss, measured in time
  • MTTR (Mean Time To Recovery): average time taken to restore service
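
To make the metrics concrete, here is a minimal sketch of the arithmetic behind them. The function names and the drill numbers in the comments are illustrative assumptions, not part of any tool; results use integer division, so fractions of a minute are dropped.

```shell
#!/bin/bash
# Mean time to recovery: average of per-incident recovery durations (minutes).
mttr_minutes() {
    local total=0 count=0 d
    for d in "$@"; do
        total=$((total + d))
        count=$((count + 1))
    done
    [ "$count" -gt 0 ] || { echo "usage: mttr_minutes MIN..." >&2; return 1; }
    echo $((total / count))
}

# Succeeds when a measured value (minutes) is within its objective (minutes).
meets_objective() {
    local measured=$1 objective=$2
    [ "$measured" -le "$objective" ]
}

# Example with hypothetical drill numbers:
#   mttr_minutes 12 18 30     # average recovery time over three incidents
#   meets_objective 12 15     # measured RTO of 12 min against a 15 min objective
```

The same comparison works for RPO: pass the age of the newest data available on the DR server as the measured value.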

Database Replication Setup

# PRIMARY - /etc/mysql/mysql.conf.d/mysqld.cnf
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
# Listens on all interfaces; restrict port 3306 to DR_IP at the firewall
bind-address = 0.0.0.0

# Create replication user (run in the mysql client on the primary)
CREATE USER 'repl'@'DR_IP' IDENTIFIED BY 'strong_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'DR_IP';

# DR SERVER
[mysqld]
server-id = 2
relay_log = /var/log/mysql/mysql-relay-bin.log
read_only = 1

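# Run in the mysql client on the DR server after restarting MySQL.
# SOURCE_LOG_FILE and SOURCE_LOG_POS below are example values: take the real
# ones from SHOW MASTER STATUS on the primary (SHOW BINARY LOG STATUS in MySQL 8.4+).
CHANGE REPLICATION SOURCE TO
    SOURCE_HOST='PRIMARY_IP', SOURCE_USER='repl',
    SOURCE_PASSWORD='strong_password',
    SOURCE_LOG_FILE='mysql-bin.000001', SOURCE_LOG_POS=154;
START REPLICA;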
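
After START REPLICA it is worth confirming that replication is actually running. The sketch below parses the output of SHOW REPLICA STATUS\G and succeeds only when both replication threads report "Yes". The function name is an assumption for illustration, and the field names are those used by MySQL 8.0.22 and later (older versions report Slave_IO_Running and Slave_SQL_Running instead).

```shell
#!/bin/bash
# Reads `SHOW REPLICA STATUS\G` output on stdin.
# Exit status 0 only if both the I/O and SQL threads report "Yes".
replica_threads_ok() {
    awk -F': ' '
        /Replica_IO_Running:/  { io  = $2 }
        /Replica_SQL_Running:/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Usage on the DR server (assumes client credentials in ~/.my.cnf):
#   mysql -e 'SHOW REPLICA STATUS\G' | replica_threads_ok && echo "replication OK"
```

An empty result (replication never configured) is treated as a failure, because neither field is found.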

File Sync with lsyncd

sudo apt install -y lsyncd

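# Create the config directory in case the package did not
sudo mkdir -p /etc/lsyncd
sudo tee /etc/lsyncd/lsyncd.conf.lua > /dev/null << 'EOF'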
settings { logfile="/var/log/lsyncd.log" }
sync {
    default.rsync,
    source="/var/www", target="DR_IP:/var/www",
    delay=5,
    rsync={ archive=true, compress=true,
        rsh="/usr/bin/ssh -i /root/.ssh/dr_key" }
}
EOF
sudo systemctl enable --now lsyncd
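
To check that the DR copy really matches the primary, one option is to compare a digest of each tree. This is a sketch rather than an lsyncd feature: tree_digest is a hypothetical helper, and it relies on GNU find, sort, xargs, and sha256sum as shipped on Debian and Ubuntu.

```shell
#!/bin/bash
# Prints a single SHA-256 digest covering every file path and file content
# under the given directory. Identical trees produce identical digests.
tree_digest() {
    local dir=$1
    ( cd "$dir" && find . -type f -print0 | LC_ALL=C sort -z \
        | xargs -0 -r sha256sum | sha256sum | cut -d' ' -f1 )
}

# Usage (run from the primary; DR_IP and the key path are the placeholders used above):
#   local_sum=$(tree_digest /var/www)
#   remote_sum=$(ssh -i /root/.ssh/dr_key root@DR_IP \
#       "$(declare -f tree_digest); tree_digest /var/www")
#   [ "$local_sum" = "$remote_sum" ] && echo "in sync" || echo "DIFFERENT"
```

Files written during the comparison will show up as a mismatch, so run it during a quiet period or allow for the 5-second lsyncd delay.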

Automated Failover

#!/bin/bash
# /usr/local/bin/dr-failover.sh
PRIMARY_IP="203.0.113.10"
DR_IP="203.0.113.20"
DOMAIN="myapp.example.com"
CF_API_TOKEN="your-token"
CF_ZONE_ID="your-zone"
MAX_FAILURES=3

check_primary() {
    # curl already prints 000 when the request fails, so no fallback echo is needed
    local status
    status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
        "http://${PRIMARY_IP}/health" 2>/dev/null)
    [ "$status" = "200" ]
}

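perform_failover() {
    echo "FAILOVER: Promoting DR server"
    ssh "root@$DR_IP" "mysql -e 'STOP REPLICA; RESET REPLICA ALL; SET GLOBAL read_only=0;'" \
        || { echo "FAILOVER: promotion failed, aborting" >&2; exit 1; }

    # Start services before DNS begins sending traffic to the DR server
    ssh "root@$DR_IP" "systemctl start nginx php8.2-fpm redis-server"

    # Update Cloudflare DNS (look up the A record, then point it at the DR server)
    RECORD_ID=$(curl -s "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records?name=$DOMAIN&type=A" \
        -H "Authorization: Bearer $CF_API_TOKEN" | jq -r '.result[0].id')
    if [ -z "$RECORD_ID" ] || [ "$RECORD_ID" = "null" ]; then
        echo "FAILOVER: no A record found for $DOMAIN" >&2; exit 1
    fi
    # Note: Cloudflare manages the TTL automatically for proxied records
    curl -s -X PUT "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records/$RECORD_ID" \
        -H "Authorization: Bearer $CF_API_TOKEN" -H "Content-Type: application/json" \
        --data "{\"type\":\"A\",\"name\":\"$DOMAIN\",\"content\":\"$DR_IP\",\"ttl\":60,\"proxied\":true}"
}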

FAILURES=0
while true; do
    if check_primary; then FAILURES=0
    else
        FAILURES=$((FAILURES + 1))
        [ $FAILURES -ge $MAX_FAILURES ] && { perform_failover; exit 0; }
    fi
    sleep 30
done
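
The monitoring loop fails over only after MAX_FAILURES consecutive failed checks, and any successful check resets the counter. The sketch below replays that counting logic against a recorded sequence of results, so the threshold can be tuned without touching production. The function name and the "ok"/"fail" input format are assumptions made for this example.

```shell
#!/bin/bash
# Reads one health result per line ("ok" or anything else = failure) and prints
# the 1-based number of the check at which failover would trigger, or "none".
simulate_failover() {
    local max_failures=$1 failures=0 n=0 result
    while read -r result; do
        n=$((n + 1))
        if [ "$result" = "ok" ]; then
            failures=0                      # a healthy check resets the counter
        else
            failures=$((failures + 1))
            if [ "$failures" -ge "$max_failures" ]; then
                echo "$n"
                return 0
            fi
        fi
    done
    echo "none"
}

# Usage:
#   printf 'fail\nfail\nok\nfail\nfail\nfail\n' | simulate_failover 3
```

With a 30-second sleep between checks and a threshold of 3, the third consecutive failure arrives at least a minute after the first, plus any time spent waiting on curl timeouts.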

Failback Checklist

  1. Enable maintenance mode on DR server
  2. Sync data from DR back to primary (mysqldump + rsync)
  3. Update DNS back to primary IP
  4. Re-establish DR as replica
  5. Remove maintenance mode
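
The checklist can be turned into a reviewable command plan before anyone runs it. The sketch below only prints commands and executes nothing. The maintenance flag file, the database name argument, and the assumption that the DR server can reach the primary over SSH are all hypothetical and need adapting to your application.

```shell
#!/bin/bash
# Prints the failback commands in checklist order; nothing is executed.
# Arguments: primary host, DR host, application database name.
print_failback_plan() {
    local primary=$1 dr=$2 db=$3
    cat <<EOF
# 1. Maintenance mode on DR (hypothetical flag file your app must honor)
ssh root@$dr 'touch /var/www/maintenance.flag'
# 2. Copy database and files from DR back to primary
ssh root@$dr 'mysqldump --single-transaction --routines --databases $db' | ssh root@$primary 'mysql'
ssh root@$dr 'rsync -a --delete --exclude=maintenance.flag /var/www/ root@$primary:/var/www/'
# 3. Point DNS back at $primary (same Cloudflare API call as in the failover script)
# 4. On DR: SET GLOBAL read_only=1, then CHANGE REPLICATION SOURCE TO ... and START REPLICA
# 5. Remove maintenance mode
ssh root@$dr 'rm -f /var/www/maintenance.flag'
EOF
}

# Usage:
#   print_failback_plan 203.0.113.10 203.0.113.20 myapp_db > failback-plan.txt
```

Reviewing the printed plan during a drill is a cheap way to catch wrong hostnames or paths before a real failback.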

Testing Your DR Plan

  • Schedule quarterly DR drills with actual failover during maintenance windows
  • Document every step with exact commands and expected outcomes
  • Measure actual RTO and RPO during drills
  • Keep the DR runbook somewhere that stays reachable when the primary infrastructure is down
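
Measuring actual RTO during a drill can be as simple as timing how long the public endpoint stays unhealthy. The following sketch polls a check command until it succeeds and prints the elapsed seconds. The function name is an assumption, and the URL in the usage comment reuses the example domain from the failover script.

```shell
#!/bin/bash
# Polls a check command until it succeeds, then prints the elapsed seconds.
# Arguments: polling interval in seconds, followed by the check command.
measure_recovery() {
    local interval=$1; shift
    local start now
    start=$(date +%s)
    until "$@"; do
        sleep "$interval"
    done
    now=$(date +%s)
    echo $((now - start))
}

# Usage: start this at the moment the primary is taken down in the drill.
#   measure_recovery 5 curl -sf -o /dev/null --max-time 10 \
#       "https://myapp.example.com/health"
```

The result is accurate only to within one polling interval, so use a short interval when the target RTO is measured in minutes.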

Best Practices

  • Use a geographically separate DR location
  • Automate failover but require human confirmation for ambiguous cases
  • Monitor replication lag continuously
  • Keep DR server software synchronized with primary
  • Use low-TTL DNS (60-300s) for fast propagation during failover
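
Continuous lag monitoring matters because replication lag is effectively your current RPO. The sketch below reads SHOW REPLICA STATUS\G output and compares Seconds_Behind_Source with a threshold. The function name, exit codes, and threshold are assumptions for illustration, and the field name applies to MySQL 8.0.22 and later.

```shell
#!/bin/bash
# Reads `SHOW REPLICA STATUS\G` output on stdin.
# Exit 0 = lag within threshold, 1 = lagging, 2 = replication not running.
check_replica_lag() {
    local max_lag=$1 lag
    lag=$(awk -F': ' '/Seconds_Behind_Source:/ { print $2 }')
    case "$lag" in
        ''|NULL)  echo "replication not running" >&2; return 2 ;;
        *[!0-9]*) echo "unexpected value: $lag" >&2; return 2 ;;
    esac
    if [ "$lag" -gt "$max_lag" ]; then
        echo "lag ${lag}s exceeds ${max_lag}s" >&2
        return 1
    fi
}

# Usage from cron or a monitoring agent on the DR server:
#   mysql -e 'SHOW REPLICA STATUS\G' | check_replica_lag 30 || echo "alert"
```

A NULL value is reported separately from high lag because it means the replica has stopped applying changes, which usually needs immediate attention.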
