SRE
Database SLOs that aren't useless: an operational definition
Most database SLOs are 'CPU under 80%'. That's a budget alert, not an objective. Here's how to define an SLO an executive will sign off on and an engineer can actually track.
“CPU under 80%” isn’t an SLO. It’s a budget alert. A real database SLO has three parts — Service Level Indicator, target, and error budget — and exists to drive decisions, not to sit on a Confluence page.
SLI, SLO, error budget
- SLI — a metric users notice, expressed as a fraction. “Of all reads against the orders DB in the last hour, what fraction returned in < 100ms?”
- SLO — a target for the SLI over a window. “99.5% of reads complete in < 100ms over a 30-day window.”
- Error budget — what’s left to burn. If your SLO is 99.5%, your budget is 0.5% of a 720-hour (30-day) month ≈ 3.6 hours of allowed failure per month.
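To make the arithmetic concrete, here is a minimal sketch in Python. It assumes you already have per-query latencies (in milliseconds) for the window; the function and variable names are illustrative, not from any particular monitoring stack.

```python
def sli_fast_fraction(latencies_ms, threshold_ms=100):
    """SLI: fraction of queries that completed under the latency threshold."""
    if not latencies_ms:
        return 1.0  # no traffic in the window, nothing violated
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)

SLO_TARGET = 0.995        # 99.5% of reads under 100 ms
WINDOW_HOURS = 30 * 24    # rolling 30-day window = 720 hours

# Error budget: the slice of the window you are allowed to spend on failure.
budget_hours = (1 - SLO_TARGET) * WINDOW_HOURS   # 0.5% of 720h = 3.6h

latencies_ms = [12, 45, 80, 130, 60, 95, 210, 70]  # stand-in sample
sli = sli_fast_fraction(latencies_ms)
print(f"SLI = {sli:.2%}, target = {SLO_TARGET:.1%}, budget = {budget_hours:.1f}h/month")
```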
Concrete examples by workload
# OLTP read-heavy
99.5% of SELECTs against the orders DB return in < 100ms, over a rolling 30-day window.

# OLTP write
99.0% of INSERTs into orders complete in < 250ms, over a rolling 30-day window.

# Analytics
95% of dashboard queries return in < 5s, measured per-query, rolled up over 7 days.

# Queue-style
99.9% of rows enqueued are picked up within 5s, measured by enqueue → first-fetch latency.

# Replication
99.5% of read-replica queries see data committed within 100ms of the master, measured as per-statement replication lag.
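One way to keep definitions like these off the Confluence graveyard is to encode them as data your alerting pipeline reads. A minimal sketch, with purely illustrative field names and team names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatabaseSLO:
    name: str            # which workload this covers
    sli: str             # human-readable definition of the indicator
    target: float        # fraction of good events, e.g. 0.995
    threshold_ms: float  # latency bound that defines a "good" event
    window_days: int     # rolling evaluation window
    owner: str           # team on the hook (see "The SLO nobody owns" below)

SLOS = [
    DatabaseSLO("orders-reads",  "SELECTs against orders < 100ms", 0.995, 100,  30, "storage-team"),
    DatabaseSLO("orders-writes", "INSERTs into orders < 250ms",    0.990, 250,  30, "storage-team"),
    DatabaseSLO("dashboards",    "dashboard queries < 5s",         0.950, 5000,  7, "analytics-team"),
]
```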
Using the error budget
The error budget is what makes an SLO operational. Three concrete uses:
- Decision lever in incident reviews — if you’ve burned 80% of your budget mid-month, the next risky migration waits.
- Change-management gate — releases pause when budget is exhausted, ship freely when it’s healthy (a minimal gate sketch follows this list).
- Capacity prioritization — adjacent SLOs all healthy + this one bleeding = scale or tune this one first.
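Here is a sketch of that change-management gate. The budget accounting is simplified to "bad hours observed so far this window", and the 20% reserve below which risky changes wait is a hypothetical policy, not a standard; all names are illustrative.

```python
def budget_remaining(slo_target, window_hours, bad_hours_so_far):
    """Hours of failure still available this window, given what has been burned."""
    total_budget = (1 - slo_target) * window_hours
    return total_budget - bad_hours_so_far

def risky_change_allowed(slo_target=0.995, window_hours=720, bad_hours_so_far=0.0,
                         reserve_fraction=0.2):
    """Gate: block risky migrations once less than `reserve_fraction` of the budget is left."""
    total_budget = (1 - slo_target) * window_hours
    remaining = budget_remaining(slo_target, window_hours, bad_hours_so_far)
    return remaining > reserve_fraction * total_budget

# Mid-month, ~80% of the 3.6h budget already burned -> the gate says wait.
print(risky_change_allowed(bad_hours_so_far=2.9))   # False: 0.7h left < 20% reserve (0.72h)
```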
Anti-patterns
- The 100% SLO — math says you can’t hit it. Every minute of natural variance is an SLO violation. Pick 99.5 or 99.9, not 100.
- The infrastructure SLI — “CPU < 80%” isn’t an SLI. Users don’t feel CPU; they feel latency or errors.
- The single-window SLO — only measuring monthly hides bad weeks. Roll up at multiple windows (1h, 24h, 7d, 30d) and alert on multi-window burn rates (see the sketch after this list).
- The SLO nobody owns — if it doesn’t name a team and a dashboard, it doesn’t exist.
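A minimal sketch of multi-window burn-rate alerting: pair a long window (proves the burn is significant) with a short one (proves it is still happening) and page only when both exceed the threshold. The 1h/5m pairing and the ≈14.4× threshold are commonly cited starting points, not requirements; tune them to your own budget.

```python
def burn_rate(bad_fraction, slo_target):
    """How fast the budget is burning: 1.0 means exactly on budget for the window."""
    allowed_bad = 1 - slo_target            # e.g. 0.005 for a 99.5% SLO
    return bad_fraction / allowed_bad

def should_page(bad_1h, bad_5m, slo_target=0.995,
                long_threshold=14.4, short_threshold=14.4):
    """Page only if both the 1h and the 5m windows show a fast burn.
    The long window filters noise; the short window keeps us from
    paging on an incident that already ended."""
    return (burn_rate(bad_1h, slo_target) >= long_threshold and
            burn_rate(bad_5m, slo_target) >= short_threshold)

# 2% of requests failing the SLI over the last hour and the last 5 minutes:
# burn rate = 0.02 / 0.005 = 4x, significant but below the 14.4x paging bar.
print(should_page(bad_1h=0.02, bad_5m=0.02))   # False
print(should_page(bad_1h=0.08, bad_5m=0.09))   # True: ~16-18x burn on both windows
```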
FAQ
How tight should the SLO target be?
Multi-window burn rate alerting?
Per-tenant SLOs?
Keep reading
Postgres
Why your Postgres p99 latency lies — and what to track instead
p99 over 1m windows is the most-displayed and most-misleading number on every DBM dashboard. Here's the histogram math, the seasonality math, and a saner default.
AI
Anomaly detection on database metrics: why thresholds fail and what works
A walk through forecast bands, change-point detection, multi-variate anomaly, and the seasonality math that makes 'p99 over 200ms' the wrong alert by default — with the Postgres example that broke our last threshold.