Understanding Disaster Recovery
Disaster recovery (DR) restores infrastructure and data after catastrophic events. A well-planned strategy with failover servers can cut actual recovery time from hours to minutes, which makes a Recovery Time Objective (RTO) of minutes realistic.
Key DR Metrics
- RTO (Recovery Time Objective): maximum acceptable downtime
- RPO (Recovery Point Objective): maximum acceptable data loss, measured in time
- MTTR (Mean Time to Recovery): average time actually taken to restore service
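As a rough illustration of how these metrics are used, the sketch below averages a few outage durations into an MTTR and checks a measured value against its objective. The function names and the sample figures are made up for this example.

```shell
#!/bin/bash
# Hypothetical helpers; names and sample figures are illustrative only.

# mttr_minutes: average of the outage durations (in minutes) given as arguments.
mttr_minutes() {
  local total=0 n=0 d
  for d in "$@"; do
    total=$((total + d))
    n=$((n + 1))
  done
  [ "$n" -gt 0 ] || { echo "no incidents given" >&2; return 1; }
  echo $((total / n))   # integer division, rounds down
}

# meets_target: succeeds when a measured value is within its objective.
# Works for RTO (downtime) and RPO (data-loss window) alike, both in minutes.
meets_target() {
  local measured=$1 objective=$2
  [ "$measured" -le "$objective" ]
}

mttr_minutes 12 45 18                      # prints 25
meets_target 12 15 && echo "within RTO"    # prints: within RTO
```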
Database Replication Setup
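# PRIMARY: /etc/mysql/mysql.conf.d/mysqld.cnf (restart MySQL after editing)
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
# Listens on all interfaces; restrict port 3306 to DR_IP with a firewall
bind-address = 0.0.0.0
-- PRIMARY: create the replication user (run in the mysql client)
CREATE USER 'repl'@'DR_IP' IDENTIFIED BY 'strong_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'DR_IP';
-- Note the File and Position values for the DR server
-- (SHOW BINARY LOG STATUS on MySQL 8.4 and later)
SHOW MASTER STATUS;
# DR SERVER: /etc/mysql/mysql.conf.d/mysqld.cnf (restart MySQL after editing)
[mysqld]
server-id = 2
relay_log = /var/log/mysql/mysql-relay-bin.log
read_only = 1
-- DR SERVER: point the replica at the primary (run in the mysql client).
-- Requires MySQL 8.0.23 or later. Replace the log file and position below
-- with the values the primary reported; the ones shown are only examples.
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST='PRIMARY_IP', SOURCE_USER='repl',
  SOURCE_PASSWORD='strong_password',
  SOURCE_LOG_FILE='mysql-bin.000001', SOURCE_LOG_POS=154;
START REPLICA;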
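The log file name and position in the statement above depend on the current state of the primary, so they are best read from the primary rather than typed by hand. A minimal sketch, assuming the tab-separated output of SHOW MASTER STATUS is piped in; the helper name source_coordinates is hypothetical:

```shell
#!/bin/bash
# source_coordinates: reads one tab-separated line of SHOW MASTER STATUS
# output (file name, position, ...) on stdin and prints the two matching
# options for CHANGE REPLICATION SOURCE TO. Hypothetical helper.
source_coordinates() {
  awk -F'\t' -v q="'" '
    NR == 1 && $1 != "" && $2 != "" {
      printf "SOURCE_LOG_FILE=%s%s%s, SOURCE_LOG_POS=%s\n", q, $1, q, $2
      found = 1
    }
    END { exit !found }   # non-zero exit when no coordinates were read
  '
}

# Usage on the primary (-B: tab-separated batch output, -N: no header row):
#   mysql -B -N -e 'SHOW MASTER STATUS' | source_coordinates
```

The printed options replace the example SOURCE_LOG_FILE and SOURCE_LOG_POS values. Writes that reach the primary after the coordinates are read are replayed from that position once the replica starts.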
File Sync with lsyncd
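sudo apt install -y lsyncd
# The package may not create the config directory
sudo mkdir -p /etc/lsyncd
# tee runs as root, so the file can be written from a non-root shell
sudo tee /etc/lsyncd/lsyncd.conf.lua > /dev/null << 'EOF'
settings { logfile="/var/log/lsyncd.log" }
sync {
  default.rsync,
  source="/var/www", target="DR_IP:/var/www",
  delay=5,  -- batch changes for 5 seconds before syncing
  rsync={ archive=true, compress=true,
          rsh="/usr/bin/ssh -i /root/.ssh/dr_key" }
}
EOF
sudo systemctl enable --now lsyncd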
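One way to confirm that lsyncd is keeping up is to compare both sides with an rsync dry run, which lists what would still be transferred without copying anything. The sketch below only counts the listed items; the helper name pending_changes is hypothetical, and the key path and directories are the ones from the configuration above.

```shell
#!/bin/bash
# pending_changes: counts the items listed by an rsync dry run with
# --itemize-changes. Reads rsync's output on stdin; 0 means in sync.
pending_changes() {
  awk 'NF { n++ } END { print n + 0 }'
}

# Usage on the primary (the trailing slashes compare directory contents):
#   sudo rsync -a --delete --dry-run --itemize-changes \
#     -e "/usr/bin/ssh -i /root/.ssh/dr_key" /var/www/ DR_IP:/var/www/ \
#     | pending_changes
```

A small non-zero count right after a change is expected, because lsyncd batches events for the configured delay.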
Automated Failover
#!/bin/bash
# /usr/local/bin/dr-failover.sh
PRIMARY_IP="203.0.113.10"
DR_IP="203.0.113.20"
DOMAIN="myapp.example.com"
# Placeholders: supply real values from the environment or a root-only file
CF_API_TOKEN="your-token"
CF_ZONE_ID="your-zone"
MAX_FAILURES=3
check_primary() {
  # curl already prints 000 when the request fails, so no fallback is needed
  local status
  status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
    "http://${PRIMARY_IP}/health" 2>/dev/null)
  [ "$status" = "200" ]
}
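perform_failover() {
  local RECORD_ID
  echo "FAILOVER: Promoting DR server"
  # Stop replication and make the DR database writable
  ssh "root@${DR_IP}" "mysql -e 'STOP REPLICA; RESET REPLICA ALL; SET GLOBAL read_only=0;'" || return 1
  # Start services before DNS sends traffic to the DR server
  ssh "root@${DR_IP}" "systemctl start nginx php8.2-fpm redis-server" || return 1
  # Look up the Cloudflare DNS record, then point it at the DR server
  RECORD_ID=$(curl -s "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records?name=${DOMAIN}" \
    -H "Authorization: Bearer ${CF_API_TOKEN}" | jq -r '.result[0].id')
  if [ -z "$RECORD_ID" ] || [ "$RECORD_ID" = "null" ]; then
    echo "FAILOVER: no DNS record found for ${DOMAIN}" >&2
    return 1
  fi
  curl -sf -X PUT "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records/${RECORD_ID}" \
    -H "Authorization: Bearer ${CF_API_TOKEN}" -H "Content-Type: application/json" \
    --data "{\"type\":\"A\",\"name\":\"${DOMAIN}\",\"content\":\"${DR_IP}\",\"ttl\":60,\"proxied\":true}"
}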
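# Fail over only after MAX_FAILURES consecutive failed checks, 30 seconds apart
FAILURES=0
while true; do
  if check_primary; then
    FAILURES=0
  else
    FAILURES=$((FAILURES + 1))
    if [ "$FAILURES" -ge "$MAX_FAILURES" ]; then
      perform_failover
      exit $?   # report whether the failover itself succeeded
    fi
  fi
  sleep 30
done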
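The loop above fails over as soon as the failure threshold is reached. When the monitoring host might simply have lost its own network path to the primary, an operator check before promotion can prevent an unnecessary failover. A minimal sketch of such a gate, assuming the script runs in a session where someone can answer; the function name confirm_failover is hypothetical:

```shell
#!/bin/bash
# confirm_failover: hypothetical gate to call before perform_failover.
# Reads one line from stdin; only the exact answer "yes" approves.
confirm_failover() {
  local answer=""
  printf 'Primary failed its health checks. Type "yes" to fail over: ' >&2
  # Accept a final line without a trailing newline, reject empty input
  read -r answer || [ -n "$answer" ] || return 1
  [ "$answer" = "yes" ]
}

# Possible use inside the monitoring loop:
#   confirm_failover && perform_failover
```

Any answer other than yes, including closed or empty input, leaves the primary in place.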
Failback Checklist
- Enable maintenance mode on the DR server so no new writes arrive
- Sync data from the DR server back to the primary (mysqldump + rsync)
- Point DNS back to the primary IP
- Reconfigure the DR server as a replica and set read_only back to 1
- Remove maintenance mode
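The checklist above can be kept as a reviewable command list. The sketch below only prints commands and runs nothing. It assumes the commands will be run as root on the primary and that the SSH key from the lsyncd setup also grants access to the DR server; the function name failback_plan is hypothetical, and the application-specific steps are left as comments.

```shell
#!/bin/bash
# failback_plan: prints the failback commands for review; it runs nothing.
# Arguments: primary IP, DR IP.
failback_plan() {
  local primary=$1 dr=$2
  cat <<EOF
# 1. Enable maintenance mode on the DR server (application-specific)
# 2. Copy the database and files from DR back to the primary
ssh -i /root/.ssh/dr_key root@${dr} "mysqldump --all-databases --single-transaction" | mysql
rsync -a --delete -e "ssh -i /root/.ssh/dr_key" root@${dr}:/var/www/ /var/www/
# 3. Point DNS back to ${primary}
# 4. Reconfigure ${dr} as a replica and set read_only back to 1
# 5. Remove maintenance mode
EOF
}

failback_plan 203.0.113.10 203.0.113.20
```

Printing the plan first gives the operator a chance to review each step before running it.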
Testing Your DR Plan
- Schedule quarterly DR drills that include an actual failover during a maintenance window
- Document every step with exact commands and expected outcomes
- Measure the actual recovery time and data loss during drills and compare them with the RTO and RPO
- Keep the DR runbook accessible outside the primary infrastructure
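To put a number on recovery time during a drill, a stopwatch can be started at the moment the primary is taken down and stopped once the service answers again. A minimal sketch, assuming the /health endpoint used by the failover script is also reachable through the public domain; the function name wait_for_recovery is hypothetical:

```shell
#!/bin/bash
# wait_for_recovery: polls a check command until it succeeds, then prints
# the elapsed seconds. Arguments: poll interval in seconds, then the command.
# It loops until the check passes, so interrupt it if the drill is aborted.
wait_for_recovery() {
  local interval=$1 start now
  shift
  start=$(date +%s)
  until "$@"; do
    sleep "$interval"
  done
  now=$(date +%s)
  echo $((now - start))
}

# Usage, started at the moment the primary is shut down:
#   wait_for_recovery 5 curl -sf -o /dev/null --max-time 10 \
#     http://myapp.example.com/health
```

The printed value is accurate to about one poll interval and can be compared directly with the RTO.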
Best Practices
- Use a geographically separate DR location
- Automate failover, but require human confirmation for ambiguous cases
- Monitor replication lag continuously
- Keep the DR server's software versions in sync with the primary
- Use a low DNS TTL (60-300 s) so clients pick up the new address quickly after failover
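Replication lag bounds how much data a failover can lose, so it is worth alerting on. The sketch below checks the replica's status output against a lag threshold. It assumes the field names printed by SHOW REPLICA STATUS\G on MySQL 8.0.22 or later; the function name replica_lag_ok and the 60-second default are illustrative.

```shell
#!/bin/bash
# replica_lag_ok: reads SHOW REPLICA STATUS\G output on stdin and succeeds
# only if both replication threads run and the lag is at most max seconds
# (first argument, default 60). A NULL lag counts as a failure.
replica_lag_ok() {
  awk -F': *' -v max="${1:-60}" '
    { sub(/^[ \t]+/, "", $1); sub(/[ \t\r]+$/, "", $2) }
    $1 == "Replica_IO_Running"    { io = $2 }
    $1 == "Replica_SQL_Running"   { sql = $2 }
    $1 == "Seconds_Behind_Source" { lag = $2 }
    END {
      ok = (io == "Yes" && sql == "Yes" && lag ~ /^[0-9]+$/ && lag + 0 <= max + 0)
      exit !ok
    }'
}

# Usage from the monitoring host:
#   ssh root@DR_IP "mysql -e 'SHOW REPLICA STATUS\G'" | replica_lag_ok 30 \
#     || echo "ALERT: replication unhealthy or lagging"
```

A threshold at or below the RPO keeps the alert tied to the data-loss objective.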