PagerDuty integration with Prometheus Alertmanager creates a robust on-call alerting pipeline — Prometheus detects issues, Alertmanager routes and deduplicates alerts, and PagerDuty handles escalation, on-call scheduling, and notification delivery. This guide covers configuring the complete alerting pipeline.
Alertmanager Configuration
# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

route:
  receiver: default
  group_by: [alertname, instance]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts → PagerDuty (immediate)
    - match:
        severity: critical
      receiver: pagerduty-critical
      repeat_interval: 1h
    # Warning alerts → PagerDuty (low urgency)
    - match:
        severity: warning
      receiver: pagerduty-warning
      repeat_interval: 4h
    # Info alerts → Slack only
    - match:
        severity: info
      receiver: slack-info

receivers:
  - name: default
    slack_configs:
      - api_url: "https://hooks.slack.com/services/xxx"
        channel: "#alerts"

  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "your-pagerduty-integration-key"
        severity: critical
        description: '{{ template "pagerduty.default.description" . }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
          resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
    slack_configs:
      - api_url: "https://hooks.slack.com/services/xxx"
        channel: "#alerts-critical"

  - name: pagerduty-warning
    pagerduty_configs:
      - routing_key: "your-pagerduty-integration-key"
        severity: warning

  - name: slack-info
    slack_configs:
      - api_url: "https://hooks.slack.com/services/xxx"
        channel: "#alerts-info"

inhibit_rules:
  # If critical is firing, suppress warning for same alertname
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]
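Before reloading, it is worth validating the file. A minimal check-and-reload sequence, assuming amtool is installed alongside Alertmanager on the default port:

# Validate the configuration file
amtool check-config /etc/alertmanager/alertmanager.yml

# Reload Alertmanager without a restart (sending SIGHUP to the process also works)
curl -X POST http://localhost:9093/-/reload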
PagerDuty Setup
- In PagerDuty: Services → New Service → "Production Monitoring"
- Integration: select "Prometheus" or "Events API v2"
- Copy the Integration Key (routing_key); the key can be verified directly against the Events API, as shown after this list
- Configure escalation policy and on-call schedule
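Before wiring the key into Alertmanager, you can exercise it directly against the PagerDuty Events API v2. The payload below is a minimal trigger event; the routing_key placeholder and source name are yours to substitute:

curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "your-pagerduty-integration-key",
    "event_action": "trigger",
    "payload": {
      "summary": "Manual test of the PagerDuty integration key",
      "source": "test-server",
      "severity": "critical"
    }
  }'

A 202 response and a new incident on the service confirm the key and escalation policy are working.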
Alert Rules
# /etc/prometheus/rules/alerts.yml
groups:
  - name: infrastructure
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been unreachable for 2 minutes"

      - alert: HighCPU
        expr: instance:node_cpu_utilization:ratio > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanizePercentage }}"

      - alert: DiskAlmostFull
        expr: instance:node_filesystem_utilization:ratio > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk almost full on {{ $labels.instance }}"
          description: "Disk usage is {{ $value | humanizePercentage }}"
      - alert: HighMemory
        expr: instance:node_memory_utilization:ratio > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"
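The HighCPU, DiskAlmostFull, and HighMemory expressions rely on recording rules that are not shown in this guide. A sketch of what they could look like, assuming standard node_exporter metrics (the rule names come from the alert expressions; the rate window and filesystem filters are illustrative):

# /etc/prometheus/rules/recording.yml (assumed path)
groups:
  - name: node_recording
    rules:
      - record: instance:node_cpu_utilization:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      - record: instance:node_memory_utilization:ratio
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
      - record: instance:node_filesystem_utilization:ratio
        expr: >
          1 - min by (instance) (
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
          )

Prometheus also has to load the rule files and know where Alertmanager lives; the relevant prometheus.yml stanzas (paths and target assumed) look roughly like this:

# /etc/prometheus/prometheus.yml (excerpt)
rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]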
Custom PagerDuty Templates
# /etc/alertmanager/templates/pagerduty.tmpl
{{ define "pagerduty.default.description" }}
{{ range .Alerts.Firing }}
[FIRING] {{ .Labels.alertname }}: {{ .Annotations.summary }}
{{ end }}
{{ range .Alerts.Resolved }}
[RESOLVED] {{ .Labels.alertname }}: {{ .Annotations.summary }}
{{ end }}
{{ end }}
{{ define "pagerduty.default.instances" }}
{{ range . }}
- {{ .Labels.instance }}: {{ .Annotations.description }}
{{ end }}
{{ end }}
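Alertmanager only loads this file if the main configuration references it; a templates entry along these lines (glob assumed to match the path above) takes care of that:

# add to /etc/alertmanager/alertmanager.yml
templates:
  - "/etc/alertmanager/templates/*.tmpl"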
Testing the Pipeline
# Send a test alert to Alertmanager (api/v2 is the current alerts endpoint;
# the older api/v1 has been removed in recent Alertmanager releases)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "critical",
      "instance": "test-server:9100"
    },
    "annotations": {
      "summary": "Test alert for PagerDuty integration",
      "description": "This is a test alert to verify the alerting pipeline"
    }
  }]'

# Verify in PagerDuty that an incident was created.
# Then resolve by re-sending the same labels with endsAt set to now (or any past time):
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "critical", "instance": "test-server:9100"},
    "endsAt": "2025-01-15T00:00:00Z"
  }]'
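If amtool is installed, the same test can be driven from the command line instead of raw curl; something along these lines, assuming a local Alertmanager on the default port:

# Fire a test alert with labels and a summary annotation
amtool alert add alertname=TestAlert severity=critical instance=test-server:9100 \
  --annotation=summary="Test alert for PagerDuty integration" \
  --alertmanager.url=http://localhost:9093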
Best Practices
- Use severity labels consistently — critical for immediate action, warning for investigation
- Set appropriate `for` durations so you do not page on transient issues
- Use inhibition rules to suppress related lower-severity alerts
- Send critical alerts to PagerDuty AND Slack for visibility
- Set repeat_interval based on severity — critical every 1h, warning every 4h
- Test the alerting pipeline monthly to ensure it works end-to-end
- Include runbook links in alert annotations for faster incident response (see the example below)
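For instance, a runbook_url annotation (URL hypothetical) can be attached to any rule; the custom templates above could then reference {{ .Annotations.runbook_url }} to surface it in the PagerDuty incident:

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been unreachable for 2 minutes"
          runbook_url: "https://runbooks.example.com/InstanceDown"  # hypothetical runbook location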