Why Monitor Backups?
Backups that silently fail are worse than no backups because they create a false sense of security. Comprehensive monitoring ensures every job completes on time with expected data and alerts you immediately when something goes wrong.
Key Metrics
- Completion status — Success or failure
- Freshness — When was the last successful backup?
- Size — Sudden drops may indicate missing data
- Duration — Is the job taking longer than usual?
- Storage utilization — Is the destination running out of space?
Healthchecks.io Dead Man Switch
#!/bin/bash
HC="https://hc-ping.com/YOUR-UUID"
curl -fsS -m 10 "${HC}/start" > /dev/null
if /usr/local/bin/backup.sh; then
curl -fsS -m 10 "$HC" > /dev/null
else
curl -fsS -m 10 "${HC}/fail" > /dev/null
fi
tail -c 10000 /var/log/backup.log | \
curl -fsS -m 30 "${HC}/log" --data-binary @- > /dev/null
Prometheus Metrics
#!/bin/bash
METRICS="/var/lib/prometheus/node-exporter/backup.prom"
START=$(date +%s)
if restic -r /backup/repo backup /data; then
STATUS=1
else
STATUS=0
fi
DURATION=$(($(date +%s) - START))
SNAPS=$(restic -r /backup/repo snapshots --json 2>/dev/null | jq length)
cat > "$METRICS" << EOF
# HELP backup_success Whether last backup succeeded
# TYPE backup_success gauge
backup_success{type="daily"} $STATUS
# HELP backup_duration_seconds Backup duration
# TYPE backup_duration_seconds gauge
backup_duration_seconds{type="daily"} $DURATION
# HELP backup_snapshot_count Total snapshots
# TYPE backup_snapshot_count gauge
backup_snapshot_count{type="daily"} ${SNAPS:-0}
EOF
Alert Rules
groups:
- name: backup_alerts
rules:
- alert: BackupFailed
expr: backup_success == 0
for: 5m
labels: { severity: critical }
annotations:
summary: "Backup failed on {{ $labels.instance }}"
- alert: BackupStale
expr: (time() - backup_last_timestamp) > 90000
for: 5m
labels: { severity: critical }
annotations:
summary: "No successful backup in 25 hours"
- alert: BackupSizeDrop
expr: backup_size < (backup_size offset 1d) * 0.5
labels: { severity: warning }
Email Alerting Script
#!/bin/bash
check_freshness() {
local name="$1" path="$2" max_hours="$3"
[ ! -e "$path" ] && { echo "MISSING: $path" | mail -s "Backup Missing: $name" admin@example.com; return 1; }
local age=$(( ($(date +%s) - $(stat -c %Y "$path")) / 3600 ))
[ "$age" -gt "$max_hours" ] && { echo "Stale: ${age}h old" | mail -s "Backup Stale: $name" admin@example.com; return 1; }
}
check_freshness "MySQL" "/backup/mysql/latest.sql.gz" 25
check_freshness "Files" "/backup/files/latest/" 25
Grafana Dashboard Panels
- Status Matrix — Table with each job, last success, status (green/red)
- Duration Trends — Line graph of backup duration over time
- Storage Usage — Gauge showing utilization percentage
- Size History — Line graph to spot anomalies
Best Practices
- Monitor both success AND absence with dead-man-switch monitoring
- Set freshness alerts at 1.5x backup interval (25h for daily)
- Track sizes over time to spot anomalies
- Use multiple alert channels (email + Slack + PagerDuty)
- Include backup monitoring in incident response procedures
- Review dashboards weekly
- Automate restore testing and include results in monitoring