Prometheus Recording Rules for Performance

By Admin · Mar 15, 2026 · Updated Apr 24, 2026

Prometheus recording rules precompute frequently used or computationally expensive queries and store the results as new time series. Dashboards then read one stored series instead of re-evaluating the full expression on every refresh, and calculations that would be too slow to run at query time become practical. This guide covers writing effective recording rules for production monitoring.

Why Recording Rules?

  • Performance: Grafana panels read a stored series instead of re-running an expensive query on each refresh
  • Consistency: complex calculations are defined once and reused everywhere
  • Alerting: alerting rules can reference recorded series instead of repeating complex expressions
  • Federation: metrics are aggregated before federating, which reduces data transfer
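
For the federation case, here is a minimal sketch of a scrape job on a global Prometheus that pulls only recorded series from a downstream server. The job name and target address are hypothetical, and the matcher assumes recorded series use level prefixes such as instance: and service:.

```yaml
# prometheus.yml on the global (federating) server
scrape_configs:
  - job_name: federate            # hypothetical job name
    honor_labels: true            # keep the labels set by the downstream server
    metrics_path: /federate
    params:
      'match[]':
        # Pull only pre-aggregated series, not raw per-target metrics
        - '{__name__=~"instance:.*|service:.*"}'
    static_configs:
      - targets:
          - 'prometheus-edge.example.internal:9090'   # hypothetical address
```

Matching on the name prefix is one reason to keep recorded series names consistent: the federation job does not need updating when new rules are added.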

Recording Rule Configuration

# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: node_metrics
    interval: 30s
    rules:
      # CPU utilization as a ratio (0-1); multiply by 100 for a percentage
      - record: instance:node_cpu_utilization:ratio
        expr: |
          1 - avg without(cpu, mode) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )

      # Memory utilization as a ratio (0-1)
      - record: instance:node_memory_utilization:ratio
        expr: |
          1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

      # Root filesystem utilization as a ratio (0-1)
      - record: instance:node_filesystem_utilization:ratio
        expr: |
          1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})

      # Network receive rate (bytes/sec)
      - record: instance:node_network_receive_bytes:rate5m
        expr: |
          sum without(device) (
            rate(node_network_receive_bytes_total{device!~"lo|veth.*|docker.*|br-.*"}[5m])
          )

      # Network transmit rate (bytes/sec)
      - record: instance:node_network_transmit_bytes:rate5m
        expr: |
          sum without(device) (
            rate(node_network_transmit_bytes_total{device!~"lo|veth.*|docker.*|br-.*"}[5m])
          )

Application Recording Rules

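# If this group shares a file with other groups, list it under that file's single top-level groups: key
groups: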
  - name: http_metrics
    interval: 30s
    rules:
      # Request rate per service
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)

      # Error ratio (share of requests returning 5xx, 0-1)
      - record: service:http_errors:ratio5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)

      # P99 latency
      - record: service:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )

      # P95 latency
      - record: service:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )

      # P50 latency
      - record: service:http_request_duration_seconds:p50
        expr: |
          histogram_quantile(0.50,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )
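
The three quantile rules above each recompute the same aggregated bucket rate. One possible refinement is to record that rate once and derive the quantiles from it. Rules within a group are evaluated in order, so later rules can read what earlier ones recorded. The group name and the series name service_le:http_request_duration_seconds_bucket:rate5m are assumptions made for this sketch.

```yaml
groups:
  - name: http_latency            # hypothetical group name
    interval: 30s
    rules:
      # Aggregate bucket rates once, keeping le for histogram_quantile
      - record: service_le:http_request_duration_seconds_bucket:rate5m
        expr: sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))

      # Quantiles read the recorded series instead of the raw buckets
      - record: service:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, service_le:http_request_duration_seconds_bucket:rate5m)

      - record: service:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, service_le:http_request_duration_seconds_bucket:rate5m)

      - record: service:http_request_duration_seconds:p50
        expr: histogram_quantile(0.50, service_le:http_request_duration_seconds_bucket:rate5m)
```

If adopted, these would replace the quantile rules in http_metrics rather than sit alongside them, since two rules recording the same series name would conflict.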

SLO Recording Rules

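# If this group shares a file with other groups, list it under that file's single top-level groups: key
groups: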
  - name: slo_metrics
    interval: 30s
    rules:
      # Availability (non-5xx requests / total requests)
      - record: service:availability:ratio5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)

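      # Error budget remaining for a 99.9% SLO, measured over the same 5m window:
      # 1 = untouched, 0 = exhausted, negative = overspent
      # (e.g. availability 0.9995 gives 1 - 0.0005 / 0.001 = 0.5)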
      - record: service:error_budget_remaining:ratio
        expr: |
          1 - (
            (1 - service:availability:ratio5m)
            / (1 - 0.999)
          )

Naming Convention

# Follow the Prometheus naming convention:
# level:metric_name:operations
#
# level: aggregation level and labels of the output (instance, job, service, cluster)
# metric_name: the base metric name (drop the _total suffix when rate() is applied)
# operations: operations applied, newest first (rate5m, ratio, p99)
#
# Examples:
# instance:node_cpu_utilization:ratio
# service:http_requests:rate5m
# job:process_open_fds:max

Using Recording Rules in Grafana

# Instead of computing in the dashboard:
# OLD (computed on every refresh): 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
# NEW (reads the stored series): instance:node_cpu_utilization:ratio

# Reference recording rules directly in Grafana panels
# They appear as regular metrics in the metric browser

Prometheus Configuration

# prometheus.yml
rule_files:
  - "rules/recording_rules.yml"
  - "rules/alerting_rules.yml"

# Validate rules before applying (run from the directory that contains prometheus.yml)
promtool check rules rules/recording_rules.yml
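
promtool check rules validates only that the file is well-formed. To also check what a rule computes, promtool supports unit tests through promtool test rules. Below is a minimal sketch for the memory utilization rule, assuming the test file is saved next to recording_rules.yml and run from that directory. The test file name and the host1 instance label are made up for illustration.

```yaml
# rules/recording_rules_test.yml (hypothetical file name)
rule_files:
  - recording_rules.yml

evaluation_interval: 30s

tests:
  - interval: 30s
    input_series:
      # 25 of 100 bytes available on every sample
      - series: 'node_memory_MemAvailable_bytes{instance="host1"}'
        values: '25+0x10'
      - series: 'node_memory_MemTotal_bytes{instance="host1"}'
        values: '100+0x10'
    promql_expr_test:
      # The recorded series should equal 1 - 25/100
      - expr: instance:node_memory_utilization:ratio
        eval_time: 5m
        exp_samples:
          - labels: 'instance:node_memory_utilization:ratio{instance="host1"}'
            value: 0.75
```

Run it with promtool test rules recording_rules_test.yml. The constant input values keep the expected result exact, which avoids floating-point mismatches in the comparison.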

Best Practices

  • Create recording rules for any query used in more than one dashboard or alert
  • Follow the level:metric:operations naming convention
  • Set the rule evaluation interval equal to or longer than the scrape interval; evaluating more often only recomputes the same samples
  • Use recording rules for histogram_quantile calculations, which are computationally expensive
  • Validate rules with promtool check rules before reloading Prometheus
  • Base alerting rules on recording rules so that alert evaluation avoids expensive raw queries
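
As an illustration of the last point, an alerting rule can compare a recorded series against a threshold directly. This is a sketch only: the group name, alert name, 5% threshold, 10m hold time, and severity label are assumptions, not values from this guide.

```yaml
# rules/alerting_rules.yml
groups:
  - name: http_alerts               # hypothetical group name
    rules:
      - alert: HighErrorRatio       # hypothetical alert name
        # Reads the recorded series; no rate() or sum() at alert time
        expr: service:http_errors:ratio5m > 0.05   # 5% threshold is an assumption
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "5xx ratio above 5% for {{ $labels.service }}"
```

Because the expression is a single series lookup, the same recorded value drives both the dashboard panel and the alert, so the two cannot drift apart.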
