[Live demo widget: 30-day error budget, 4 indicators: p99 query latency < 200ms (64% burnt), availability ≥ 99.9% (18% burnt), connection error rate < 0.1% (92% burnt), replica lag < 5s (42% burnt)]

SRE

DB SLOs That Aren't Useless: An Operational Definition

Most DB SLOs boil down to "CPU under 80%". That's a budget alert, not an objective. Here's how to define an SLO that an executive will sign and an engineer can actually track.

Published · 9 min read

“CPU under 80%” isn’t an SLO. It’s a budget alert. A real database SLO has three parts — Service Level Indicator, target, and error budget — and exists to drive decisions, not to sit on a Confluence page.

On this page
  1. What an SLO actually is
  2. Concrete examples by workload
  3. Using the error budget
  4. Anti-patterns
  5. FAQ

SLI, SLO, error budget

  • SLI — a metric users notice, expressed as a fraction. “Of all reads against the orders DB in the last hour, what fraction returned in < 100ms?”
  • SLO — a target for the SLI over a window. “99.5% of reads complete in < 100ms over a 30-day window.”
  • Error budget — what’s left to burn. If your SLO is 99.5%, your budget is 0.5% of the 720 hours in a 30-day window = 3.6 hours of failure budget per month (the sketch below walks through the arithmetic).
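
A minimal sketch of that arithmetic, assuming the SLI is tracked as a good/total event ratio over the window; the function and variable names are illustrative, not from any particular SDK:

def error_budget_hours(slo_target: float, window_days: int = 30) -> float:
    # Hours of allowed failure for a given target over the window.
    window_hours = window_days * 24
    return (1.0 - slo_target) * window_hours

def budget_remaining(good: int, total: int, slo_target: float) -> float:
    # Fraction of the error budget still unspent, given event counts in the window.
    allowed_bad = (1.0 - slo_target) * total
    if allowed_bad == 0:
        return 1.0  # no traffic yet, budget intact
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad

# 99.5% over 30 days: 0.5% of 720 hours = 3.6 hours of failure budget.
assert abs(error_budget_hours(0.995) - 3.6) < 1e-9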

Concrete examples by workload

# OLTP read-heavy
99.5% of SELECTs against the orders DB return in < 100ms,
   over a rolling 30-day window.

# OLTP write
99.0% of INSERTs into orders complete in < 250ms,
   over a rolling 30-day window.

# Analytics
95% of dashboard queries return in < 5s,
   measured per-query, rolled up over 7 days.

# Queue-style
99.9% of rows enqueued are picked up within 5s,
   measured by enqueue → first-fetch latency.

# Replication
99.5% of read-replica queries see data committed within 100ms of master,
   measured as per-statement replication lag.
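
As a sketch of how the first example might be measured, assuming per-query latencies are already being collected somewhere; the collector call in the comment is hypothetical:

from typing import Iterable

def read_latency_sli(latencies_ms: Iterable[float], threshold_ms: float = 100.0) -> float:
    # The SLI behind "99.5% of SELECTs against the orders DB return in < 100ms":
    # the fraction of reads that completed under the threshold.
    latencies = list(latencies_ms)
    if not latencies:
        return 1.0  # no traffic means nothing violated
    good = sum(1 for latency in latencies if latency < threshold_ms)
    return good / len(latencies)

# e.g. read_latency_sli(fetch_orders_read_latencies(hours=1)) >= 0.995,
# where fetch_orders_read_latencies is a hypothetical query against your metrics store.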

Using the error budget

The error budget is what makes an SLO operational. Three concrete uses:

  • Decision lever in incident reviews — if you’ve burned 80% of your budget mid-month, the next risky migration waits.
  • Change-management gate — releases pause when the budget is exhausted and ship freely when it’s healthy (a minimal gate is sketched after this list).
  • Capacity prioritization — adjacent SLOs all healthy + this one bleeding = scale or tune this one first.
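
A minimal sketch of that change-management gate; the 20% floor is an illustrative policy, not a standard:

def deploy_allowed(budget_remaining_fraction: float, floor: float = 0.2) -> bool:
    # Block risky releases once most of the 30-day error budget is spent.
    return budget_remaining_fraction > floor

# Burned 80% of the budget mid-month: 0.2 remaining, so the next risky migration waits.
assert deploy_allowed(0.2) is False
assert deploy_allowed(0.9) is True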

Anti-patterns

  • The 100% SLO — math says you can’t hit it. Every minute of natural variance is an SLO violation. Pick 99.5 or 99.9, not 100.
  • The infrastructure SLI — “CPU < 80%” isn’t an SLI. Users don’t feel CPU; they feel latency or errors.
  • The single-window SLO — only measuring monthly hides bad weeks. Roll up at multiple windows (1h, 24h, 7d, 30d) and alert on multi-window burn rates (see the sketch after this list).
  • The SLO nobody owns — if it doesn’t name a team and a dashboard, it doesn’t exist.
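
A rough sketch of the burn-rate idea behind the multi-window point above, using the "exhausted in under two hours" rule from the FAQ below; the window choices and thresholds are assumptions to tune, not canon:

def burn_rate(bad_fraction: float, slo_target: float) -> float:
    # 1.0 means burning exactly on budget; higher means this window is
    # eating budget faster than the SLO allows.
    return bad_fraction / (1.0 - slo_target)

def hours_to_exhaustion(budget_remaining_fraction: float, rate: float,
                        window_hours: float = 720.0) -> float:
    # Rough time until the 30-day budget is gone at the current burn rate.
    if rate <= 0:
        return float("inf")
    return budget_remaining_fraction * window_hours / rate

def should_page(bad_fraction_1h: float, slo_target: float,
                budget_remaining_fraction: float) -> bool:
    # Page when the last hour's burn would exhaust the remaining budget
    # in under two hours. A production policy pairs short and long windows
    # (1h, 6h, 24h, ...) so a single noisy minute doesn't page on its own.
    rate = burn_rate(bad_fraction_1h, slo_target)
    return hours_to_exhaustion(budget_remaining_fraction, rate) < 2.0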

FAQ

How tight should the SLO target be?
Tight enough that the team would actually pause work to defend it; loose enough that natural variance doesn't trigger it. Start at 99.5% and adjust on review.
Multi-window burn-rate alerting?
Yes — Google's SRE workbook chapter on burn rate is the canonical reference. Alert when the budget will be exhausted in <2h based on current burn.
Per-tenant SLOs?
Useful for B2B platforms with named-account SLAs. Keep them as derivatives of fleet SLOs, not parallel definitions.


· · ·

Monitor your databases the way you monitor your services.

Book a 30-minute demo. We'll spec your fleet together and price your first 30-day deal.
