
Building SLO Dashboards in Grafana

By Admin · Mar 15, 2026 · Updated Apr 23, 2026

Service Level Objectives (SLOs) define the reliability targets for your services — the percentage of requests that should succeed, the maximum acceptable latency, and the allowed downtime. SLO dashboards translate these targets into actionable metrics, showing whether you are meeting your commitments and how much error budget remains. This guide covers building effective SLO dashboards in Grafana with Prometheus.

SLO Fundamentals

  • SLI (Service Level Indicator) — the measured metric (e.g., success rate, latency)
  • SLO (Service Level Objective) — the target value (e.g., 99.9% availability)
  • Error Budget — the allowed amount of failure (e.g., 0.1% ≈ 43 minutes per 30 days)
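The error-budget figure above is simple arithmetic; a minimal Python check (30-day window assumed):

```python
def error_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Allowed downtime (in minutes) for a given SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

print(error_budget_minutes(0.999))  # ≈ 43.2 minutes per 30 days
print(error_budget_minutes(0.99))   # ≈ 432 minutes
```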

Availability SLO

# SLI: Ratio of successful requests
# Prometheus recording rule
- record: service:http_requests:availability30d
  expr: |
    sum(rate(http_requests_total{status!~"5.."}[30d]))
    / sum(rate(http_requests_total[30d]))

# Error budget remaining (for 99.9% SLO)
- record: service:error_budget:remaining
  expr: |
    1 - (
      (1 - service:http_requests:availability30d)
      / (1 - 0.999)
    )

# Error budget consumption rate
- record: service:error_budget:burn_rate1h
  expr: |
    (1 - (
      sum(rate(http_requests_total{status!~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
    )) / (1 - 0.999)
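The budget-remaining and burn-rate rules are plain ratios, so their behavior is easy to sanity-check outside Prometheus. A minimal Python sketch with hypothetical availability numbers:

```python
SLO = 0.999

def budget_remaining(availability: float, slo: float = SLO) -> float:
    # Mirrors service:error_budget:remaining
    return 1 - (1 - availability) / (1 - slo)

def burn_rate(window_availability: float, slo: float = SLO) -> float:
    # Mirrors service:error_budget:burn_rate1h
    return (1 - window_availability) / (1 - slo)

# 99.95% measured availability against a 99.9% SLO → half the budget spent
print(budget_remaining(0.9995))  # ≈ 0.5 (50% remaining)
# 99.8% availability in the window → burning 2x the allowed rate
print(burn_rate(0.998))          # ≈ 2.0
```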

Grafana Dashboard Panels

Availability Gauge

# Panel: Gauge
# Query: service:http_requests:availability30d * 100
# Unit: percent
# Thresholds: 99.9=green (SLO met), 99.5=yellow, 99.0=red
# Title: "30-Day Availability"

Error Budget Remaining

# Panel: Stat
# Query: service:error_budget:remaining * 100
# Unit: percent
# Thresholds: 50=green, 20=yellow, 0=red
# Title: "Error Budget Remaining"
# When this reaches 0%, you have exhausted your error budget

Error Budget Burn Rate

# Panel: Time Series
# Query: service:error_budget:burn_rate1h
# Y-axis: Burn rate (1.0 = consuming at exactly the allowed rate)
# Thresholds: horizontal line at 1.0 (above = burning too fast)
# Title: "Error Budget Burn Rate"

# Interpretation:
# burn_rate = 1.0 → consuming budget at exactly the allowed rate
# burn_rate = 2.0 → consuming budget 2x too fast (will exhaust in 15 days)
# burn_rate = 14.4 → will exhaust 30-day budget in 50 hours
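The exhaustion times in the interpretation above follow from dividing the window by the burn rate; a quick check:

```python
def hours_to_exhaustion(burn_rate: float, window_days: float = 30) -> float:
    """At a constant burn rate, a full budget lasts window / burn_rate."""
    return window_days * 24 / burn_rate

print(hours_to_exhaustion(1.0))   # 720 hours (the full 30 days)
print(hours_to_exhaustion(2.0))   # 360 hours (15 days)
print(hours_to_exhaustion(14.4))  # ≈ 50 hours
```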

Error Budget Timeline

# Panel: Time Series (area chart)
# Query: service:error_budget:remaining * 100
# Fill below to show remaining budget as area
# Title: "Error Budget Over Time"

Latency SLO

# SLI: fraction of requests served under a 500 ms latency threshold
# Recording rules
- record: service:http_latency_slo:ratio5m
  expr: |
    sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
    / sum(rate(http_request_duration_seconds_count[5m]))

# Grafana panel
# Query: service:http_latency_slo:ratio5m * 100
# Shows: percentage of requests under 500ms
# SLO target line: 99% of requests under 500ms
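With hypothetical histogram counts, the latency SLI above reduces to a single division:

```python
# Hypothetical 5m-window increases (values are illustrative, not from any real service)
under_500ms = 9950   # increase of http_request_duration_seconds_bucket{le="0.5"}
total = 10000        # increase of http_request_duration_seconds_count

ratio = under_500ms / total
print(f"{ratio:.1%} of requests under 500 ms")        # 99.5%
print("SLO met" if ratio >= 0.99 else "SLO missed")   # SLO met
```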

Multi-Window Burn Rate Alerting

# Fast burn: at 14.4x, the 30-day budget is gone in ~50 hours
# (2% of the budget is consumed per hour)
# Note: burn_rate5m must be defined analogously to burn_rate1h
- alert: SLOHighBurnRate
  expr: |
    (
      service:error_budget:burn_rate1h > 14.4
      and
      service:error_budget:burn_rate5m > 14.4
    )
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "SLO burn rate critical — budget will be exhausted within ~50 hours"

# Slow burn: at 5x, the 30-day budget is gone in ~6 days
# Note: burn_rate6h and burn_rate30m need their own recording rules
- alert: SLOSlowBurnRate
  expr: |
    (
      service:error_budget:burn_rate6h > 5
      and
      service:error_budget:burn_rate30m > 5
    )
  for: 15m
  labels:
    severity: warning
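The two-window AND condition can be sketched outside Prometheus to see why it suppresses noise: both the long and short windows must exceed the threshold, so a spike that has already subsided does not page (thresholds taken from the alerts above):

```python
def fast_burn_alert(rate_1h: float, rate_5m: float) -> bool:
    # Both windows must agree, so a brief spike alone does not page
    return rate_1h > 14.4 and rate_5m > 14.4

def slow_burn_alert(rate_6h: float, rate_30m: float) -> bool:
    return rate_6h > 5 and rate_30m > 5

print(fast_burn_alert(20, 3))    # False: the incident is already over
print(fast_burn_alert(20, 18))   # True: still burning fast right now
print(slow_burn_alert(6, 7))     # True: sustained slow burn
```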

Complete Dashboard Layout

# Row 1: SLO Summary (4 stat panels)
# - Current Availability (99.95%)
# - Error Budget Remaining (72%)
# - SLO Target (99.9%)
# - Time Until Budget Exhaustion (18 days)

# Row 2: Availability Over Time (2 time series)
# - Availability trend (30d rolling window)
# - Error budget burn rate

# Row 3: Request Metrics (3 time series)
# - Request rate (successful vs failed)
# - Error rate percentage
# - P99 latency vs SLO threshold

# Row 4: Error Budget Detail (2 panels)
# - Error budget remaining over time
# - Budget consumption by error type (table)
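The "Time Until Budget Exhaustion" stat in Row 1 can be derived from the two recorded series; a hypothetical worked example matching the Row 1 numbers (72% of budget remaining, a modest sustained burn):

```python
def days_until_exhaustion(remaining: float, burn_rate: float,
                          window_days: float = 30) -> float:
    """Days left if the current burn rate holds (remaining is a 0..1 fraction)."""
    return remaining * window_days / burn_rate

# 72% of the budget left, burning at 1.2x the allowed rate (illustrative values)
print(days_until_exhaustion(0.72, 1.2))  # ≈ 18 days
```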

Best Practices

  • Define SLOs before building dashboards — the dashboard should reflect business commitments
  • Use 30-day rolling windows for SLO measurement (the most widely used convention)
  • Display error budget prominently — it is the actionable metric teams can manage
  • Use multi-window burn rate alerting instead of threshold-based alerts
  • Show SLO dashboards on team monitors and in sprint reviews
  • Start with 2-3 SLOs (availability, latency) — do not try to measure everything
  • Review and adjust SLOs quarterly based on business needs and actual performance
