How to Set Up Infrastructure Monitoring with Prometheus and Alertmanager
Prometheus is an open-source monitoring and alerting toolkit designed for reliability. Paired with Alertmanager, it provides a complete monitoring solution for your Breeze infrastructure, collecting metrics, evaluating alert rules, and sending notifications when things go wrong.
Installing Prometheus
Download and install Prometheus on a dedicated Breeze instance reserved for monitoring:
# Create a system user
sudo useradd --no-create-home --shell /bin/false prometheus
# Create directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
# Download and extract
PROM_VERSION="2.51.0"
wget "https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz"
tar xzf "prometheus-${PROM_VERSION}.linux-amd64.tar.gz"
cd "prometheus-${PROM_VERSION}.linux-amd64"
# Install binaries
sudo cp prometheus promtool /usr/local/bin/
sudo cp -r consoles console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus
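The binaries are in place, but nothing starts Prometheus yet. A minimal systemd unit along these lines (a sketch; paths match the directories created above, and the flag values are illustrative) gets it running:

```shell
# Create a systemd unit for Prometheus
sudo tee /etc/systemd/system/prometheus.service > /dev/null <<'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
```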
Configuring Prometheus
Create the main configuration file:
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'breeze-web-01:9100'
          - 'breeze-web-02:9100'
          - 'breeze-db-01:9100'
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):\d+'
        target_label: instance

  - job_name: 'nginx'
    static_configs:
      - targets: ['breeze-web-01:9113']
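Before starting (or reloading) Prometheus, it is worth validating the file with promtool, which was installed alongside the prometheus binary above:

```shell
# Syntax-check the main configuration; exits non-zero on error
promtool check config /etc/prometheus/prometheus.yml
```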
Installing Node Exporter on Targets
Install Node Exporter on each Breeze instance you want to monitor:
NODE_VERSION="1.7.0"
wget "https://github.com/prometheus/node_exporter/releases/download/v${NODE_VERSION}/node_exporter-${NODE_VERSION}.linux-amd64.tar.gz"
tar xzf "node_exporter-${NODE_VERSION}.linux-amd64.tar.gz"
sudo cp "node_exporter-${NODE_VERSION}.linux-amd64/node_exporter" /usr/local/bin/
# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=always
[Install]
WantedBy=multi-user.target
EOF
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
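You can confirm each exporter is serving metrics before wiring it into Prometheus, for example:

```shell
# Service should report "active"
systemctl is-active node_exporter

# Node Exporter listens on :9100 by default; a quick scrape by hand
curl -s http://localhost:9100/metrics | head -n 5
```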
Defining Alert Rules
Create alert rules to detect problems:
# /etc/prometheus/alert_rules.yml
groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 5 minutes (current: {{ $value }}%)"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Root filesystem has less than 15% free space"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
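As with the main configuration, promtool can validate rule files, and Prometheus re-reads its rules on a reload signal:

```shell
# Validate rule file syntax and expressions
promtool check rules /etc/prometheus/alert_rules.yml

# Prometheus reloads its configuration and rule files on SIGHUP
sudo pkill -HUP prometheus
```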
Setting Up Alertmanager
Install and configure Alertmanager to route notifications:
# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'smtp-password'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email-team'
  routes:
    - match:
        severity: critical
      receiver: 'email-critical'
      repeat_interval: 1h

receivers:
  - name: 'email-team'
    email_configs:
      - to: 'team@example.com'

  - name: 'email-critical'
    email_configs:
      - to: 'oncall@example.com'
    # Slack incoming webhooks need slack_configs (not webhook_configs),
    # so Alertmanager sends the payload format Slack expects
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#oncall'
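The Alertmanager installation mirrors the Prometheus steps above. This sketch assumes the same layout conventions (the version is pinned for illustration); amtool, shipped in the same tarball, can validate the configuration before you start the service:

```shell
AM_VERSION="0.27.0"
wget "https://github.com/prometheus/alertmanager/releases/download/v${AM_VERSION}/alertmanager-${AM_VERSION}.linux-amd64.tar.gz"
tar xzf "alertmanager-${AM_VERSION}.linux-amd64.tar.gz"
sudo cp "alertmanager-${AM_VERSION}.linux-amd64/alertmanager" \
        "alertmanager-${AM_VERSION}.linux-amd64/amtool" /usr/local/bin/
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager

# Validate the configuration; exits non-zero on error
amtool check-config /etc/alertmanager/alertmanager.yml
```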
Best Practices
- Start with Node Exporter — get CPU, memory, disk, and network metrics first
- Use recording rules — pre-compute expensive queries for dashboard performance
- Set meaningful alert thresholds — avoid alert fatigue with well-tuned thresholds and 'for' durations
- Add labels — use labels to identify environments, teams, and services for routing
- Retain data wisely — set --storage.tsdb.retention.time=30d based on your disk capacity
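To illustrate the recording-rules point, an entry like this (the rule name is a hypothetical example following Prometheus naming conventions) pre-computes the CPU expression used by the HighCPUUsage alert, so dashboards can query the cheap pre-aggregated series instead:

```yaml
groups:
  - name: recording
    rules:
      # Per-instance CPU usage, refreshed each evaluation_interval
      - record: instance:cpu_usage:percent
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```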
Prometheus and Alertmanager give your Breeze infrastructure proactive monitoring, so you catch problems before your users do.