Backup Monitoring and Alerting

By Admin · Mar 15, 2026 · Updated Apr 24, 2026

Why Monitor Backups?

Backups that fail silently are worse than no backups, because they create a false sense of security. Monitoring cannot make a job succeed, but it verifies that every job finished on time with the expected amount of data, and it alerts you promptly when one did not.

Key Metrics

  • Completion status — Success or failure
  • Freshness — When was the last successful backup?
  • Size — Sudden drops may indicate missing data
  • Duration — Is the job taking longer than usual?
  • Storage utilization — Is the destination running out of space?
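
Two of these metrics, size and storage utilization, come down to simple arithmetic. The sketch below is only an illustration: the function names `size_dropped` and `storage_used_pct` and the 50% default threshold are assumptions made for this example, not part of any tool.

```bash
#!/bin/bash
# Hypothetical helpers; the names and the default threshold are illustrative.

# Succeeds (exit 0) when the current size is below pct percent of the previous size.
size_dropped() {
    local previous="$1" current="$2" pct="${3:-50}"
    [ "$previous" -gt 0 ] || return 1    # no baseline yet, nothing to compare
    [ $(( current * 100 )) -lt $(( previous * pct )) ]
}

# Prints the used percentage (without the % sign) of the filesystem holding a path.
storage_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example, `size_dropped 1000 400 && echo "size warning"` prints the warning, because 400 is less than half of 1000. Sizes are plain integers (bytes, for instance), since shell arithmetic has no floating point.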

Healthchecks.io Dead Man Switch

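#!/bin/bash
# Replace YOUR-UUID with the UUID of your check on Healthchecks.io.
HC="https://hc-ping.com/YOUR-UUID"

# Signal the start so that the run duration is measured.
curl -fsS -m 10 "${HC}/start" > /dev/null

# Report success or failure based on the backup's exit status.
if /usr/local/bin/backup.sh; then
    curl -fsS -m 10 "$HC" > /dev/null
else
    curl -fsS -m 10 "${HC}/fail" > /dev/null
fi

# Attach the last 10,000 bytes of the log to the check for troubleshooting.
tail -c 10000 /var/log/backup.log | \
    curl -fsS -m 30 "${HC}/log" --data-binary @- > /dev/null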
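
The same start, success, and failure pattern can be wrapped around any command. The following is a minimal sketch, assuming hypothetical helper names `send_ping` and `run_monitored`. It uses the Healthchecks.io exit-status endpoint, where a ping to `/0` is recorded as success and any other status as failure, and it keeps the `YOUR-UUID` placeholder from the script above.

```bash
#!/bin/bash
# HC_URL keeps the placeholder UUID; set it to your own check's ping URL.
HC_URL="${HC_URL:-https://hc-ping.com/YOUR-UUID}"

# Send one ping. A monitoring outage must never fail the backup itself,
# so errors from curl are deliberately ignored.
send_ping() {
    curl -fsS -m 10 --retry 3 "$1" > /dev/null 2>&1 || true
}

# Run any command between a start ping and an exit-status ping.
run_monitored() {
    send_ping "${HC_URL}/start"
    "$@"
    local rc=$?
    send_ping "${HC_URL}/${rc}"    # 0 is recorded as success, non-zero as failure
    return "$rc"
}
```

A job would then be started as `run_monitored /usr/local/bin/backup.sh`. The wrapper returns the command's own exit status, so cron or systemd still see the real result.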

Prometheus Metrics

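#!/bin/bash
# Writes backup metrics for the node_exporter textfile collector.
# Assumes restic credentials come from the environment (for example RESTIC_PASSWORD_FILE).
REPO="/backup/repo"
METRICS="/var/lib/prometheus/node-exporter/backup.prom"
START=$(date +%s)

# Carry the previous success timestamp forward so a failed run does not reset it.
LAST=$(awk '$1 ~ /^backup_last_timestamp/ {print $2}' "$METRICS" 2>/dev/null)

if restic -r "$REPO" backup /data; then
    STATUS=1
    LAST=$(date +%s)
else
    STATUS=0
fi

DURATION=$(($(date +%s) - START))
SNAPS=$(restic -r "$REPO" snapshots --json 2>/dev/null | jq length)
SIZE=$(restic -r "$REPO" stats latest --json 2>/dev/null | jq .total_size)

# Write to a temporary file and rename it, so the collector never reads a partial file.
cat > "$METRICS.tmp" << EOF
# HELP backup_success Whether last backup succeeded
# TYPE backup_success gauge
backup_success{type="daily"} $STATUS
# HELP backup_duration_seconds Backup duration
# TYPE backup_duration_seconds gauge
backup_duration_seconds{type="daily"} $DURATION
# HELP backup_snapshot_count Total snapshots
# TYPE backup_snapshot_count gauge
backup_snapshot_count{type="daily"} ${SNAPS:-0}
# HELP backup_last_timestamp Unix time of the last successful backup
# TYPE backup_last_timestamp gauge
backup_last_timestamp{type="daily"} ${LAST:-0}
# HELP backup_size Size of the latest snapshot in bytes
# TYPE backup_size gauge
backup_size{type="daily"} ${SIZE:-0}
EOF
mv "$METRICS.tmp" "$METRICS"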

Alert Rules

groups:
  - name: backup_alerts
    rules:
      - alert: BackupFailed
        expr: backup_success == 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Backup failed on {{ $labels.instance }}"

      - alert: BackupStale
        expr: (time() - backup_last_timestamp) > 90000
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "No successful backup in 25 hours"

      - alert: BackupSizeDrop
        expr: backup_size < (backup_size offset 1d) * 0.5
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Backup size dropped by more than 50% on {{ $labels.instance }}"

Email Alerting Script

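#!/bin/bash
# Emails an alert when a backup is missing or at least max_hours old.
# stat -c %Y is GNU syntax; a directory's mtime changes only when entries are added or removed.
check_freshness() {
    local name="$1" path="$2" max_hours="$3" age
    if [ ! -e "$path" ]; then
        echo "MISSING: $path" | mail -s "Backup Missing: $name" admin@example.com
        return 1
    fi
    # Age is truncated to whole hours, so compare with -ge to fire at max_hours exactly.
    age=$(( ($(date +%s) - $(stat -c %Y "$path")) / 3600 ))
    if [ "$age" -ge "$max_hours" ]; then
        echo "Stale: ${age}h old" | mail -s "Backup Stale: $name" admin@example.com
        return 1
    fi
    return 0    # fresh backup; the original one-liners returned 1 here
}

check_freshness "MySQL" "/backup/mysql/latest.sql.gz" 25
check_freshness "Files" "/backup/files/latest/" 25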

Grafana Dashboard Panels

  • Status Matrix — Table with each job, last success, status (green/red)
  • Duration Trends — Line graph of backup duration over time
  • Storage Usage — Gauge showing utilization percentage
  • Size History — Line graph to spot anomalies

Best Practices

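  • Monitor for both reported failures and missing runs; a dead man's switch catches jobs that never start
  • Set freshness alerts slightly above the backup interval (25h for a daily job) so normal variation in run time does not trigger them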
  • Track sizes over time to spot anomalies
  • Use multiple alert channels (email + Slack + PagerDuty)
  • Include backup monitoring in incident response procedures
  • Review dashboards weekly
  • Automate restore testing and include results in monitoring
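
Automated restore testing, the last item above, can be as simple as restoring to a scratch directory and comparing the result with a known source. Here is a minimal sketch, assuming a hypothetical `verify_restore` helper and a metric name, `backup_restore_test_success`, that appears nowhere else in this article. The restore step itself is left to the caller.

```bash
#!/bin/bash
# Hypothetical helper: compare a restored tree with its source and record the result.
# Usage: verify_restore SOURCE_DIR RESTORED_DIR METRICS_FILE
verify_restore() {
    local source_dir="$1" restored_dir="$2" metrics="$3" ok=0
    # diff -r exits 0 only when both trees hold the same files with the same contents.
    if diff -r "$source_dir" "$restored_dir" > /dev/null 2>&1; then
        ok=1
    fi
    # Write via a temporary file and rename it so readers never see a partial file.
    cat > "${metrics}.tmp" << EOF
# HELP backup_restore_test_success Whether the last restore test matched the source
# TYPE backup_restore_test_success gauge
backup_restore_test_success $ok
EOF
    mv "${metrics}.tmp" "$metrics"
    [ "$ok" -eq 1 ]
}
```

With restic, the restored tree could come from something like `restic -r /backup/repo restore latest --target /tmp/restore-test`. Comparing against live data will report differences for any file changed since the backup ran, so a small static sample directory is a safer reference.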
