Why your Postgres p99 lies, and what to track instead
The p99 over a 1-minute window is the most frequently shown, and most misleading, number on any DBM dashboard. Here is the histogram math, the seasonality math, and a saner default.
Open any DBM dashboard. The first chart is “p99 query latency.” A pretty line, mostly flat, sometimes spikes. Engineers stare at it for hours. And most of the time, it’s lying to them.
The problem isn’t Postgres. It’s how the percentile is computed: a histogram bucketed per signature, aggregated over a 1-minute window, then re-aggregated across hosts. By the time it reaches your eye, three lossy operations have happened — and each one drops the tail.
The math, briefly
A p99 over a 1m window represents the latency below which 99% of executions in that window completed. If you ran 10,000 queries that minute, p99 is the 100th-slowest. If you ran 100, it’s the slowest one. The stability of the number scales with sqrt(N), but the importance of the number scales with how often you’re hit by tail latency — which is workload-dependent, not sample-dependent.
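To see how N drives stability, here is a minimal sketch with synthetic latencies (exponential distribution chosen purely for illustration; all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

# 10,000 queries in the window: p99 is roughly the 100th-slowest
# sample, so it is fairly stable from window to window.
big = rng.exponential(scale=50, size=10_000)

# 100 queries in the window: with linear interpolation, p99 sits
# between the two slowest samples, so a single outlier moves it.
small = rng.exponential(scale=50, size=100)

p99_big = np.percentile(big, 99)
p99_small = np.percentile(small, 99)
```

With N=100 the reported p99 is effectively the maximum, which is exactly the regime where a flat-looking p99 line hides everything interesting.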
Worse: arithmetic mean of per-host p99s is not p99 across hosts. If host A’s p99 is 100ms and host B’s p99 is 500ms, the global p99 isn’t 300ms — it’s closer to 500ms because the global tail is the union of the host tails.
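The gap is easy to demonstrate. A small sketch comparing the mean of per-host p99s with the true pooled p99 (two hypothetical hosts with equal traffic, host B 5x slower):

```python
import numpy as np

rng = np.random.default_rng(7)

# Equal traffic per host; latencies in ms (illustrative only).
host_a = rng.exponential(scale=20, size=100_000)
host_b = rng.exponential(scale=100, size=100_000)

# The dashboard anti-pattern: average the per-host p99s.
mean_of_p99s = (np.percentile(host_a, 99) + np.percentile(host_b, 99)) / 2

# The correct number: p99 over the pooled samples.
true_p99 = np.percentile(np.concatenate([host_a, host_b]), 99)

# true_p99 lands near host B's tail, well above the averaged figure.
```

The averaged figure understates the global tail because host B contributes far more than half of the slowest 1% of executions.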
What to track instead
- p99.9 — the actual tail. If 0.1% of your queries take 5s, that’s pages, not noise.
- The maximum execution time in the window. Useful when N is small, dangerous when N is huge, but it should never be hidden.
- Percentile rank of a fixed threshold (“% of queries over 200ms”). More stable than the percentile itself.
- HDR-histogram-merged percentiles across hosts (not arithmetic-mean of per-host p99s).
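The last two bullets combine naturally: keep per-host bucket counts, sum the counts, then read both percentiles and threshold ranks off the merged distribution. A rough sketch with log-spaced buckets standing in for real HDR buckets (all names and edges are hypothetical):

```python
import numpy as np

# Shared log-spaced bucket edges in ms: 0.1 ms .. 10 s.
# Real HDR histograms bound relative error per bucket; this is a toy stand-in.
edges = np.logspace(-1, 4, 60)

def to_hist(samples):
    counts, _ = np.histogram(samples, bins=edges)
    return counts

def merged_percentile(hists, q):
    # Merge by summing counts, then walk the CDF to quantile q.
    total = np.sum(hists, axis=0)
    cdf = np.cumsum(total) / total.sum()
    idx = np.searchsorted(cdf, q)
    return edges[idx + 1]  # upper edge of the bucket containing q

rng = np.random.default_rng(0)
hists = [to_hist(rng.exponential(scale=s, size=50_000)) for s in (20, 100)]

# Merged p99 across both hosts, no mean-of-p99s anywhere.
p99 = merged_percentile(hists, 0.99)

# Percentile rank of a fixed threshold: fraction of queries over 200 ms.
total = np.sum(hists, axis=0)
over_200ms = total[edges[:-1] >= 200].sum() / total.sum()
```

Summing counts is lossless with respect to the buckets, which is why histogram merge commutes with aggregation while per-host percentiles do not.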
How Obsfly computes this
We store per-signature, per-host HDR / t-digest histograms in ClickHouse and compute percentiles at query time across whatever set of hosts and signatures the user picks. No pre-aggregation, no merge-of-merges, no information loss.
```sql
SELECT
    quantilesTDigestMerge(0.5, 0.95, 0.99, 0.999)(td)
FROM obsfly.metric_tdigest
WHERE name = 'postgres.queries.time_ms'
  AND ts >= now() - INTERVAL 1 HOUR
  AND signature = 'a1b2c3d4...'
```
Take-away
If your DBM only shows you p99, complain. If it shows you per-host p99 averaged into a global p99, switch. If you build your own — store t-digests, not buckets, and merge correctly. The math has been solved since 2012 and there’s no excuse to ship anything else.
Related reading on the actual cost of poor percentile math: "Postgres slow queries: 12 causes" and "Anomaly detection on database metrics".