Service Level Objectives (SLOs) define the reliability targets for your services — the percentage of requests that should succeed, the maximum acceptable latency, and the allowed downtime. SLO dashboards translate these targets into actionable metrics, showing whether you are meeting your commitments and how much error budget remains. This guide covers building effective SLO dashboards in Grafana with Prometheus.
SLO Fundamentals
- SLI (Service Level Indicator) — the measured metric (e.g., success rate, latency)
- SLO (Service Level Objective) — the target value (e.g., 99.9% availability)
- Error Budget — the allowed amount of failure (e.g., 0.1% = 43 minutes/month)
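The error budget arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function name `error_budget_minutes` and the 30-day default window are illustrative assumptions, not part of Prometheus or Grafana.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the minutes of full downtime an SLO allows over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    # Compare a few common targets over a 30-day window.
    for target in (0.99, 0.999, 0.9999):
        minutes = error_budget_minutes(target)
        print(f"{target:.2%} SLO allows {minutes:.1f} minutes of downtime")
```

For a 99.9% target this yields roughly 43.2 minutes per 30 days, which matches the figure in the list above. Each additional nine shrinks the budget by a factor of ten.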
Availability SLO
# SLI: Ratio of successful requests
# Prometheus recording rule
- record: service:http_requests:availability30d
  expr: |
    sum(rate(http_requests_total{status!~"5.."}[30d]))
    / sum(rate(http_requests_total[30d]))
# Error budget remaining (for 99.9% SLO)
- record: service:error_budget:remaining
  expr: |
    1 - (
      (1 - service:http_requests:availability30d)
      / (1 - 0.999)
    )
# Error budget consumption rate
- record: service:error_budget:burn_rate1h
  expr: |
    (1 - (
      sum(rate(http_requests_total{status!~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
    )) / (1 - 0.999)
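The three recording rules above reduce to simple ratios. The sketch below mirrors them in Python using plain request counts, which may help when sanity-checking dashboard values by hand. The function names are hypothetical, and treating an idle window as fully available is an assumption of this sketch (PromQL would return NaN for a zero denominator).

```python
SLO_TARGET = 0.999  # same target as the recording rules above


def availability(good: float, total: float) -> float:
    """Ratio of successful requests; an idle window counts as fully available."""
    return 1.0 if total == 0 else good / total


def burn_rate(avail: float, slo: float = SLO_TARGET) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return (1.0 - avail) / (1.0 - slo)


def budget_remaining(avail: float, slo: float = SLO_TARGET) -> float:
    """Fraction of the error budget left; negative once the SLO is breached."""
    return 1.0 - burn_rate(avail, slo)


if __name__ == "__main__":
    # Hypothetical counts: 500 failures out of one million requests.
    avail = availability(good=999_500, total=1_000_000)
    print(f"availability     = {avail:.4%}")
    print(f"budget remaining = {budget_remaining(avail):.1%}")
    print(f"burn rate        = {burn_rate(avail):.2f}")
```

With these sample counts, half of the budget is spent: availability is 99.95%, so the error rate is half of the 0.1% that the SLO allows.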
Grafana Dashboard Panels
Availability Gauge
# Panel: Gauge
# Query: service:http_requests:availability30d * 100
# Unit: percent
# Thresholds (lower bounds): 99.9=green (SLO met), 99.5=yellow, base=red
# Title: "30-Day Availability"
Error Budget Remaining
# Panel: Stat
# Query: service:error_budget:remaining * 100
# Unit: percent
# Thresholds (lower bounds): 50=green, 20=yellow, base=red (covers negative values)
# Title: "Error Budget Remaining"
# When this reaches 0%, you have exhausted your error budget
Error Budget Burn Rate
# Panel: Time Series
# Query: service:error_budget:burn_rate1h
# Y-axis: Burn rate (1.0 = consuming at exactly the allowed rate)
# Thresholds: horizontal line at 1.0 (above = burning too fast)
# Title: "Error Budget Burn Rate"
# Interpretation:
# burn_rate = 1.0 → consuming budget at exactly the allowed rate
# burn_rate = 2.0 → consuming budget 2x too fast (will exhaust in 15 days)
# burn_rate = 14.4 → will exhaust 30-day budget in 50 hours
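The interpretation figures above follow from dividing the SLO window by the burn rate. A minimal sketch, assuming a 30-day window and a burn rate that stays constant (the helper name `hours_to_exhaustion` is illustrative):

```python
def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Hours until a full error budget is spent at a constant burn rate."""
    if burn_rate <= 0:
        return float("inf")  # no errors, so the budget never runs out
    return window_days * 24 / burn_rate


if __name__ == "__main__":
    for rate in (1.0, 2.0, 14.4):
        hours = hours_to_exhaustion(rate)
        print(f"burn rate {rate:>4}: {hours:.0f} hours ({hours / 24:.1f} days)")
```

A burn rate of 2.0 gives 360 hours (15 days) and 14.4 gives about 50 hours, in line with the notes above.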
Error Budget Timeline
# Panel: Time Series (area chart)
# Query: service:error_budget:remaining * 100
# Fill below to show remaining budget as area
# Title: "Error Budget Over Time"
Latency SLO
# SLI: fraction of requests served within 500 ms
# (the histogram must define a bucket boundary at le="0.5")
# Recording rules
- record: service:http_latency_slo:ratio5m
  expr: |
    sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
    / sum(rate(http_request_duration_seconds_count[5m]))
# Grafana panel
# Query: service:http_latency_slo:ratio5m * 100
# Shows: percentage of requests under 500ms
# SLO target line: 99% of requests under 500ms
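The latency ratio relies on Prometheus histogram buckets being cumulative: the `le="0.5"` bucket already counts every request that finished in 0.5 seconds or less. The following sketch reproduces the calculation on a hypothetical set of bucket counts; the function name and the sample numbers are assumptions made for illustration.

```python
def latency_sli(cumulative_buckets: dict, total_count: float,
                threshold_le: str = "0.5") -> float:
    """Fraction of requests at or under the threshold bucket boundary."""
    if threshold_le not in cumulative_buckets:
        # Without an exact bucket boundary the ratio cannot be computed directly.
        raise KeyError(f"no histogram bucket with le={threshold_le!r}")
    if total_count == 0:
        return 1.0  # assumption: no traffic means no latency violations
    return cumulative_buckets[threshold_le] / total_count


if __name__ == "__main__":
    # Hypothetical cumulative counts keyed by the `le` label value.
    buckets = {"0.1": 600, "0.5": 985, "1": 998, "+Inf": 1000}
    ratio = latency_sli(buckets, total_count=buckets["+Inf"])
    print(f"{ratio:.1%} of requests under 500ms, target met: {ratio >= 0.99}")
```

In this sample 98.5% of requests finish within 500ms, so a 99% latency target would be missed.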
Multi-Window Burn Rate Alerting
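# Fast burn: at 14.4x, 2% of the 30-day budget is consumed per hour (exhausted in about 50 hours)
# Note: burn_rate5m, burn_rate6h, and burn_rate30m are assumed to be recorded
# the same way as burn_rate1h, with the range selector changed accordingly
- alert: SLOHighBurnRate
  expr: |
    (
      service:error_budget:burn_rate1h > 14.4
      and
      service:error_budget:burn_rate5m > 14.4
    )
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "SLO burn rate critical: at this rate the 30-day budget will be exhausted within about 50 hours"
# Slow burn: at 5x, the 30-day budget is exhausted in about 6 days
- alert: SLOSlowBurnRate
  expr: |
    (
      service:error_budget:burn_rate6h > 5
      and
      service:error_budget:burn_rate30m > 5
    )
  for: 15m
  labels:
    severity: warning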
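To see why each alert pairs a long window with a short one, it can help to model the condition outside Prometheus. The sketch below is an illustrative Python model, not how Prometheus evaluates rules; the function names are hypothetical and the 30-day window is taken from this guide.

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window used throughout this guide


def budget_consumed(burn_rate: float, alert_window_hours: float) -> float:
    """Fraction of the total error budget spent during one alert window."""
    return burn_rate * alert_window_hours / WINDOW_HOURS


def should_alert(long_window_rate: float, short_window_rate: float,
                 threshold: float) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    return long_window_rate > threshold and short_window_rate > threshold


if __name__ == "__main__":
    print(f"fast burn spends {budget_consumed(14.4, 1):.1%} of the budget in 1h")
    print(f"slow burn spends {budget_consumed(5, 6):.1%} of the budget in 6h")
    # An incident that has already ended: the 1h rate is still high,
    # but the 5m rate has recovered, so the alert does not fire.
    print(should_alert(long_window_rate=20.0, short_window_rate=0.5, threshold=14.4))
```

The long window shows that a meaningful share of the budget has been spent, while the short window confirms the problem is still happening, so the alert clears soon after errors stop.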
Complete Dashboard Layout
# Row 1: SLO Summary (4 stat panels)
# - Current Availability (99.95%)
# - Error Budget Remaining (72%)
# - SLO Target (99.9%)
# - Time Until Budget Exhaustion (18 days)
# Row 2: Availability Over Time (2 time series)
# - Availability trend (30d rolling window)
# - Error budget burn rate
# Row 3: Request Metrics (3 time series)
# - Request rate (successful vs failed)
# - Error rate percentage
# - P99 latency vs SLO threshold
# Row 4: Error Budget Detail (2 panels)
# - Error budget remaining over time
# - Budget consumption by error type (table)
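The "Time Until Budget Exhaustion" stat in Row 1 can be derived from the remaining budget and the current burn rate. The sketch below shows one way to compute it, assuming the burn rate stays constant and ignoring the effect of old errors ageing out of the rolling window. The function name and the sample burn rate of 1.2 are hypothetical, chosen so the result lines up with the sample values in the layout.

```python
def days_until_exhaustion(remaining_fraction: float, burn_rate: float,
                          window_days: int = 30) -> float:
    """Days until the remaining error budget is spent at a constant burn rate."""
    if remaining_fraction <= 0:
        return 0.0  # budget already exhausted
    if burn_rate <= 0:
        return float("inf")  # nothing is being consumed
    return remaining_fraction * window_days / burn_rate


if __name__ == "__main__":
    # 72% of the budget left, burning 1.2x faster than the allowed rate.
    days = days_until_exhaustion(remaining_fraction=0.72, burn_rate=1.2)
    print(f"about {days:.0f} days until the budget is exhausted")
```

With 72% of the budget remaining and a burn rate of 1.2, the estimate comes out at about 18 days.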
Best Practices
- Define SLOs before building dashboards — the dashboard should reflect business commitments
- Use 30-day rolling windows for SLO measurement (a widely used default)
- Display error budget prominently — it is the actionable metric teams can manage
- Use multi-window burn rate alerting instead of static error-rate thresholds
- Show SLO dashboards on team monitors and in sprint reviews
- Start with 2-3 SLOs (availability, latency) — do not try to measure everything
- Review and adjust SLOs quarterly based on business needs and actual performance