Prometheus recording rules pre-compute frequently used or computationally expensive queries, storing the results as new time series. This dramatically improves dashboard loading times and enables complex calculations that would be too slow to run in real time. This guide covers creating effective recording rules for production monitoring.
Why Recording Rules?
- Performance — pre-computed queries load instantly in Grafana dashboards
- Consistency — complex calculations are defined once, used everywhere
- Alerting — alerting rules can reference recording rules for complex conditions
- Federation — aggregate metrics before federating to reduce data transfer
Recording Rule Configuration
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: node_metrics
    interval: 30s
    rules:
      # CPU utilization (ratio, 0-1)
      - record: instance:node_cpu_utilization:ratio
        expr: |
          1 - avg without(cpu, mode) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )

      # Memory utilization (ratio, 0-1)
      - record: instance:node_memory_utilization:ratio
        expr: |
          1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

      # Root filesystem utilization (ratio, 0-1)
      - record: instance:node_filesystem_utilization:ratio
        expr: |
          1 - (
            node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}
          )

      # Network receive rate (bytes/sec), excluding loopback and virtual interfaces
      - record: instance:node_network_receive_bytes:rate5m
        expr: |
          sum without(device) (
            rate(node_network_receive_bytes_total{device!~"lo|veth.*|docker.*|br-.*"}[5m])
          )

      # Network transmit rate (bytes/sec), excluding loopback and virtual interfaces
      - record: instance:node_network_transmit_bytes:rate5m
        expr: |
          sum without(device) (
            rate(node_network_transmit_bytes_total{device!~"lo|veth.*|docker.*|br-.*"}[5m])
          )
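As noted above, alerting rules can be layered directly on these recorded series. A minimal sketch; the alert name, threshold, duration, and labels are illustrative choices, not part of the original config:

```yaml
# rules/alerting_rules.yml
groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUtilization
        # References the pre-computed series instead of re-running the raw query
        expr: instance:node_cpu_utilization:ratio > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 90% on {{ $labels.instance }} for 10 minutes"
```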
Application Recording Rules
groups:
  - name: http_metrics
    interval: 30s
    rules:
      # Request rate per service (requests/sec)
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)

      # Error ratio (5xx responses as a fraction of all requests, 0-1)
      - record: service:http_errors:ratio5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)

      # P99 latency
      - record: service:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )

      # P95 latency
      - record: service:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )

      # P50 (median) latency
      - record: service:http_request_duration_seconds:p50
        expr: |
          histogram_quantile(0.50,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )
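The three quantile rules above each re-aggregate the same bucket series. A common refinement, sketched here, is to record the aggregated bucket rate once and derive the quantiles from it (the intermediate rule name follows the same naming convention but is an assumption):

```yaml
      # Aggregate the bucket rate once...
      - record: service:http_request_duration_seconds_bucket:rate5m
        expr: sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)

      # ...then derive each quantile from the pre-aggregated series
      - record: service:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, service:http_request_duration_seconds_bucket:rate5m)
```

This trades one extra set of stored series for evaluating the expensive aggregation once instead of three times.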
SLO Recording Rules
groups:
  - name: slo_metrics
    interval: 30s
    rules:
      # Availability (requests succeeding / total requests)
      - record: service:availability:ratio5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)

      # Error budget remaining (for a 99.9% SLO)
      - record: service:error_budget_remaining:ratio
        expr: |
          1 - (
            (1 - service:availability:ratio5m)
            / (1 - 0.999)
          )
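To make the error-budget expression concrete, a worked example with illustrative numbers:

```yaml
# Total budget for a 99.9% SLO: 1 - 0.999 = 0.001 (0.1% of requests may fail).
# If service:availability:ratio5m = 0.9995, the observed error rate is 0.0005:
#   error_budget_remaining = 1 - (0.0005 / 0.001) = 0.5
# Half the budget remains; a negative value means the SLO has been breached.
```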
Naming Convention
# Follow the Prometheus naming convention:
# level:metric_name:operations
#
# level: aggregation level (instance, job, service, cluster)
# metric_name: the base metric name
# operations: what operations were applied (rate5m, ratio, p99)
#
# Examples:
# instance:node_cpu_utilization:ratio
# service:http_requests:rate5m
# job:process_open_fds:max
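The level prefix also makes it natural to build higher-level rules on top of lower-level ones. A sketch, assuming the usual job label from your scrape config:

```yaml
      # Roll the instance-level CPU rule up to a job-level average
      - record: job:node_cpu_utilization:avg
        expr: avg by (job) (instance:node_cpu_utilization:ratio)
```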
Using Recording Rules in Grafana
# Instead of computing in the dashboard:
# OLD (slow): 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
# NEW (instant): instance:node_cpu_utilization:ratio
# Reference recording rules directly in Grafana panels
# They appear as regular metrics in the metric browser
Prometheus Configuration
# prometheus.yml
rule_files:
  - "rules/recording_rules.yml"
  - "rules/alerting_rules.yml"

# Validate rules before applying:
promtool check rules rules/recording_rules.yml
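Once validation passes, Prometheus must re-read the rule files. Two standard options; the process name, port, and lifecycle flag depend on your deployment:

```shell
# Option 1: send SIGHUP to the Prometheus process
kill -HUP "$(pidof prometheus)"

# Option 2: use the HTTP reload endpoint
# (only available when Prometheus runs with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
```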
Best Practices
- Create a recording rule for any query used in more than one dashboard or alert
- Follow the level:metric:operations naming convention
- Set the recording rule evaluation interval equal to or longer than the scrape interval
- Use recording rules for histogram_quantile calculations, which are computationally expensive
- Validate rules with promtool check rules before reloading Prometheus
- Use recording rules as the basis for alerting rules; alerting should never evaluate expensive raw queries
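Beyond syntax checking, promtool test rules can unit-test recorded series against synthetic input. A minimal sketch; the file layout, label values, and input series are illustrative:

```yaml
# tests/recording_rules_test.yml
# Run with: promtool test rules tests/recording_rules_test.yml
rule_files:
  - ../rules/recording_rules.yml
evaluation_interval: 30s
tests:
  - interval: 30s
    input_series:
      # Counter growing by 30 every 30s => a steady 1 request/sec
      - series: 'http_requests_total{service="api", status="200"}'
        values: '0+30x20'
    promql_expr_test:
      - expr: service:http_requests:rate5m
        eval_time: 5m
        exp_samples:
          - labels: 'service:http_requests:rate5m{service="api"}'
            value: 1
```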