Backup Monitoring and Alerting

By Admin · Mar 15, 2026 · Updated Jun 23, 2026

Why Monitor Backups?

A backup that fails silently is worse than no backup at all, because it creates a false sense of security. Monitoring verifies that every job completes on time and produces the expected amount of data, and it alerts you promptly when something goes wrong.

Key Metrics

Completion status — Success or failure
Freshness — When was the last successful backup?
Size — Sudden drops may indicate missing data
Duration — Is the job taking longer than usual?
Storage utilization — Is the destination running out of space?
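
The size metric is only useful with a comparison against a previous run. A minimal sketch of that comparison, assuming you already have the previous and current backup sizes in bytes; the function name size_dropped and the default 50 percent threshold are illustrative, not part of any tool:

```shell
#!/bin/bash
# size_dropped PREVIOUS_BYTES CURRENT_BYTES [MIN_PERCENT]
# Succeeds (exit 0) when CURRENT is below MIN_PERCENT of PREVIOUS.
# MIN_PERCENT defaults to 50, an illustrative threshold (hypothetical default).
size_dropped() {
    local previous="$1" current="$2" min_percent="${3:-50}"
    # No baseline yet: nothing to compare against, so do not flag.
    [ "$previous" -gt 0 ] || return 1
    # Integer form of: current / previous < min_percent / 100
    [ $(( current * 100 )) -lt $(( previous * min_percent )) ]
}
```

A caller could run something like `size_dropped "$prev" "$cur" && echo "size drop detected"` after each job and store the current size as the next run's baseline.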

Healthchecks.io Dead Man Switch

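#!/bin/bash
# Replace YOUR-UUID with the UUID of your Healthchecks.io check.
HC="https://hc-ping.com/YOUR-UUID"

# Signal the start so Healthchecks.io can measure run time.
curl -fsS -m 10 --retry 3 "${HC}/start" > /dev/null

# Ping success or failure depending on the backup's exit status.
if /usr/local/bin/backup.sh; then
    curl -fsS -m 10 --retry 3 "$HC" > /dev/null
else
    curl -fsS -m 10 --retry 3 "${HC}/fail" > /dev/null
fi

# Attach the last 10 kB of the backup log for diagnosis.
tail -c 10000 /var/log/backup.log | \
    curl -fsS -m 30 --retry 3 "${HC}/log" --data-binary @- > /dev/null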

Prometheus Metrics

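#!/bin/bash
REPO="/backup/repo"
METRICS="/var/lib/prometheus/node-exporter/backup.prom"
# State file that keeps the last success time across failed runs
STAMP="/var/lib/prometheus/node-exporter/backup.last_success"
START=$(date +%s)

if restic -r "$REPO" backup /data; then
    STATUS=1
    date +%s > "$STAMP"
else
    STATUS=0
fi

DURATION=$(($(date +%s) - START))
LAST=$(cat "$STAMP" 2>/dev/null)
SNAPS=$(restic -r "$REPO" snapshots --json 2>/dev/null | jq length)
SIZE=$(restic -r "$REPO" stats latest --json 2>/dev/null | jq .total_size)

# Write to a temp file and rename it so node_exporter never reads a partial file
cat > "$METRICS.tmp" << EOF
# HELP backup_success Whether last backup succeeded
# TYPE backup_success gauge
backup_success{type="daily"} $STATUS
# HELP backup_duration_seconds Backup duration
# TYPE backup_duration_seconds gauge
backup_duration_seconds{type="daily"} $DURATION
# HELP backup_snapshot_count Total snapshots
# TYPE backup_snapshot_count gauge
backup_snapshot_count{type="daily"} ${SNAPS:-0}
# HELP backup_last_timestamp Unix time of last successful backup
# TYPE backup_last_timestamp gauge
backup_last_timestamp{type="daily"} ${LAST:-0}
# HELP backup_size Restore size of the latest snapshot in bytes
# TYPE backup_size gauge
backup_size{type="daily"} ${SIZE:-0}
EOF
mv "$METRICS.tmp" "$METRICS"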

Alert Rules

groups:
  - name: backup_alerts
    rules:
      - alert: BackupFailed
        expr: backup_success == 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Backup failed on {{ $labels.instance }}"

      - alert: BackupStale
        expr: (time() - backup_last_timestamp) > 90000
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "No successful backup in 25 hours"

      - alert: BackupSizeDrop
        expr: backup_size < (backup_size offset 1d) * 0.5
        labels: { severity: warning }
        annotations:
          summary: "Backup size fell by more than half in 24 hours on {{ $labels.instance }}"
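
The 90000 in the BackupStale expression is 25 hours in seconds: a 24-hour interval plus one hour of grace. A small sketch of that arithmetic for other schedules; the helper name stale_threshold_seconds is hypothetical:

```shell
#!/bin/bash
# stale_threshold_seconds INTERVAL_HOURS GRACE_HOURS
# Prints the staleness threshold in seconds for a Prometheus expression.
stale_threshold_seconds() {
    echo $(( ($1 + $2) * 3600 ))
}
```

For a daily job with one hour of grace, `stale_threshold_seconds 24 1` prints 90000; a weekly job with the same grace (`stale_threshold_seconds 168 1`) gives 608400.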

Email Alerting Script

#!/bin/bash
# Mail an alert when a backup artifact is missing or older than max_hours.
# Requires GNU stat (-c %Y) and a working mail command.
check_freshness() {
    local name="$1" path="$2" max_hours="$3" age
    if [ ! -e "$path" ]; then
        echo "MISSING: $path" | mail -s "Backup Missing: $name" admin@example.com
        return 1
    fi
    age=$(( ($(date +%s) - $(stat -c %Y "$path")) / 3600 ))
    if [ "$age" -gt "$max_hours" ]; then
        echo "Stale: ${age}h old" | mail -s "Backup Stale: $name" admin@example.com
        return 1
    fi
    return 0  # explicit success; a trailing `[ ... ] && ...` would return 1 for a fresh backup
}

check_freshness "MySQL" "/backup/mysql/latest.sql.gz" 25
check_freshness "Files" "/backup/files/latest/" 25

Grafana Dashboard Panels

Status Matrix — Table with each job, last success, status (green/red)
Duration Trends — Line graph of backup duration over time
Storage Usage — Gauge showing utilization percentage
Size History — Line graph to spot anomalies
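
The Storage Usage panel needs a utilization series to plot. One possible way to produce it is a sample in the Prometheus text format built from df output. This is a sketch under assumptions: the metric name backup_storage_used_percent and both function names are hypothetical, and the output would still need to be written to a .prom file that node_exporter reads.

```shell
#!/bin/bash
# storage_used_percent PATH
# Prints the used-space percentage (integer, no % sign) of the filesystem holding PATH.
storage_used_percent() {
    # -P forces the POSIX one-line-per-filesystem format, so column 5 is the capacity.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# storage_metric PATH
# Prints one sample in the Prometheus text format; returns non-zero if df gave no value.
# backup_storage_used_percent is a hypothetical metric name.
storage_metric() {
    local path="$1" pct
    pct=$(storage_used_percent "$path")
    [ -n "$pct" ] || return 1
    printf 'backup_storage_used_percent{path="%s"} %s\n' "$path" "$pct"
}
```

Calling `storage_metric /backup` would print a single line that can be appended to the metrics file alongside the other backup gauges.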

Best Practices

Monitor for both failures and missing runs; a dead man's switch catches jobs that never start
Set freshness alerts at the backup interval plus a grace period (25h for a daily job)
Track sizes over time to spot anomalies
Use multiple alert channels (email + Slack + PagerDuty)
Include backup monitoring in incident response procedures
Review dashboards weekly
Automate restore testing and include results in monitoring
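
Automated restore testing can be as simple as restoring into a scratch directory and comparing checksums against a known-good copy. A minimal sketch, assuming GNU coreutils and findutils (sha256sum, sort -z, xargs -0 -r); the function names and the metric name backup_restore_test_success are hypothetical:

```shell
#!/bin/bash
# tree_checksum DIR
# Prints one SHA-256 digest covering every regular file under DIR, keyed by
# relative path, so two trees match only if file names and contents match.
tree_checksum() {
    ( cd "$1" && find . -type f -print0 | LC_ALL=C sort -z \
        | xargs -0 -r sha256sum | sha256sum | cut -d' ' -f1 )
}

# verify_restore REFERENCE_DIR RESTORED_DIR
# Prints a 0/1 gauge sample and returns non-zero on mismatch or error.
# backup_restore_test_success is a hypothetical metric name.
verify_restore() {
    local ref="$1" restored="$2" a b ok=0
    a=$(tree_checksum "$ref") && b=$(tree_checksum "$restored") \
        && [ "$a" = "$b" ] && ok=1
    printf 'backup_restore_test_success %s\n' "$ok"
    [ "$ok" -eq 1 ]
}
```

The restored directory would come from the backup tool itself, for example a restic restore into a temporary target. Comparing against live data can report false mismatches if files changed after the snapshot, so a small fixed test dataset is a safer reference.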