Why your Postgres p99 lies, and what to track instead
The p99 over a 1-minute window is the most frequently shown, and most misleading, number on any DBM dashboard. Here is the histogram math, the seasonality math, and a saner default.
Open any DBM dashboard. The first chart is “p99 query latency.” A pretty line, mostly flat, sometimes spikes. Engineers stare at it for hours. And most of the time, it’s lying to them.
The problem isn’t Postgres. It’s how the percentile is computed: a histogram bucketed per signature, aggregated over a 1-minute window, then re-aggregated across hosts. By the time it reaches your eye, three lossy operations have happened — and each one drops the tail.
The math, briefly
A p99 over a 1m window represents the latency below which 99% of executions in that window completed. If you ran 10,000 queries that minute, p99 is the 100th-slowest. If you ran 100, it’s the slowest one. The stability of the number scales with sqrt(N), but the importance of the number scales with how often you’re hit by tail latency — which is workload-dependent, not sample-dependent.
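To see how N drives stability, here is a minimal sketch with synthetic latencies (exponential distribution chosen purely for illustration; all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

# 10,000 queries in the window: p99 is roughly the 100th-slowest
# sample, so it is fairly stable from window to window.
big = rng.exponential(scale=50, size=10_000)

# 100 queries in the window: with linear interpolation, p99 sits
# between the two slowest samples, so a single outlier moves it.
small = rng.exponential(scale=50, size=100)

p99_big = np.percentile(big, 99)
p99_small = np.percentile(small, 99)
```

With N=100 the reported p99 is effectively the maximum, which is exactly the regime where a flat-looking p99 line hides everything interesting.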
Worse: arithmetic mean of per-host p99s is not p99 across hosts. If host A’s p99 is 100ms and host B’s p99 is 500ms, the global p99 isn’t 300ms — it’s closer to 500ms because the global tail is the union of the host tails.
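The gap is easy to demonstrate. A small sketch comparing the mean of per-host p99s with the true pooled p99 (two hypothetical hosts with equal traffic, host B 5x slower):

```python
import numpy as np

rng = np.random.default_rng(7)

# Equal traffic per host; latencies in ms (illustrative only).
host_a = rng.exponential(scale=20, size=100_000)
host_b = rng.exponential(scale=100, size=100_000)

# The dashboard anti-pattern: average the per-host p99s.
mean_of_p99s = (np.percentile(host_a, 99) + np.percentile(host_b, 99)) / 2

# The correct number: p99 over the pooled samples.
true_p99 = np.percentile(np.concatenate([host_a, host_b]), 99)

# true_p99 lands near host B's tail, well above the averaged figure.
```

The averaged figure understates the global tail because host B contributes far more than half of the slowest 1% of executions.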
What to track instead
- p99.9 — the actual tail. If 0.1% of your queries take 5s, that’s pages, not noise.
- The maximum execution time in the window. Useful when N is small, dangerous when N is huge, but it should never be hidden.
- Percentile rank of a fixed threshold (“% of queries over 200ms”). More stable than the percentile itself.
- HDR-histogram-merged percentiles across hosts (not arithmetic-mean of per-host p99s).
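The last two bullets combine naturally: keep per-host bucket counts, sum the counts, then read both percentiles and threshold ranks off the merged distribution. A rough sketch with log-spaced buckets standing in for real HDR buckets (all names and edges are hypothetical):

```python
import numpy as np

# Shared log-spaced bucket edges in ms: 0.1 ms .. 10 s.
# Real HDR histograms bound relative error per bucket; this is a toy stand-in.
edges = np.logspace(-1, 4, 60)

def to_hist(samples):
    counts, _ = np.histogram(samples, bins=edges)
    return counts

def merged_percentile(hists, q):
    # Merge by summing counts, then walk the CDF to quantile q.
    total = np.sum(hists, axis=0)
    cdf = np.cumsum(total) / total.sum()
    idx = np.searchsorted(cdf, q)
    return edges[idx + 1]  # upper edge of the bucket containing q

rng = np.random.default_rng(0)
hists = [to_hist(rng.exponential(scale=s, size=50_000)) for s in (20, 100)]

# Merged p99 across both hosts, no mean-of-p99s anywhere.
p99 = merged_percentile(hists, 0.99)

# Percentile rank of a fixed threshold: fraction of queries over 200 ms.
total = np.sum(hists, axis=0)
over_200ms = total[edges[:-1] >= 200].sum() / total.sum()
```

Summing counts is lossless with respect to the buckets, which is why histogram merge commutes with aggregation while per-host percentiles do not.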
How Obsfly computes this
We store per-signature, per-host HDR / t-digest histograms in ClickHouse and compute percentiles at query time across whatever set of hosts and signatures the user picks. No pre-aggregation, no merge-of-merges, no information loss.
```sql
SELECT
    quantilesTDigestMerge(0.5, 0.95, 0.99, 0.999)(td)
FROM obsfly.metric_tdigest
WHERE name = 'postgres.queries.time_ms'
  AND ts >= now() - INTERVAL 1 HOUR
  AND signature = 'a1b2c3d4...'
```
Take-away
If your DBM only shows you p99, complain. If it shows you per-host p99 averaged into a global p99, switch. If you build your own — store t-digests, not buckets, and merge correctly. The math has been solved since 2012 and there’s no excuse to ship anything else.
Related reading on the actual cost of poor percentile math: "Postgres slow queries: 12 causes" and "Anomaly detection on database metrics".