Anomaly detection for database metrics: why thresholds fail and what actually works
Forecast bands, change-point detection, multivariate anomalies, and the seasonality math that makes 'p99 > 200ms' a wrong alert by default.
Threshold alerts are a 2010 idea that observability teams keep paying for in 2026. p99 over 200ms, connections over 80% of max, error rate over 1%. They generate noise when nothing is wrong, and they’re silent when something is. This post is about why thresholds fail on database metrics specifically, and what to use instead.
The threshold problem
Pick any DB metric: query latency, IOPS, connection count, replication lag. Plot it for a week. You see daily and weekly seasonality, regime changes when releases ship, and tail spikes that aren’t actually problems. A threshold draws one horizontal line through all of that.
Two failure modes:
- False positives. Latency at 14:00 daily is naturally 2× the 03:00 baseline. Your 200ms threshold pages every weekday at lunch.
- False negatives. A regression doubles latency from 30ms to 60ms — both still under 200ms. The threshold says everything is fine.
Why naive forecasting also fails
The first thing observability teams reach for is “a moving average ± 3σ.” That fails too. Three reasons:
- Seasonality. A 1-hour moving average can’t tell you that 14:00 is normally high. It just averages and gets confused.
- Regime changes. Push a release that doubles QPS — the moving average catches up in an hour and stops alerting on the new regime, even if the new regime is broken.
- Long-tail metrics. Latency has a power-law tail; σ from a Gaussian model doesn’t fit. You get spurious 5σ alerts when reality was just ordinary heavy-tail behavior.
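You can see all three failure modes in a few lines of numpy. The sketch below (synthetic data, illustrative numbers) simulates a week of minute-level latency with a daily cycle and lognormal noise, then runs the naive "60-minute moving average ± 3σ" detector over it. Everything in the series is normal, and the detector still fires over and over:

```python
import numpy as np

rng = np.random.default_rng(0)

# One week of minute-level latency: 30ms base, roughly doubling mid-day,
# plus heavy-tailed (lognormal) noise -- all of it "normal" behavior.
t = np.arange(7 * 1440)
daily = 30 * (1 + np.clip(np.sin(2 * np.pi * (t % 1440 - 480) / 1440), 0, None))
latency = daily + rng.lognormal(mean=0.0, sigma=1.0, size=t.size)

# Naive detector: 60-minute moving average ± 3 standard deviations.
window = 60
alerts = 0
for i in range(window, t.size):
    hist = latency[i - window:i]
    mu, sigma = hist.mean(), hist.std()
    if abs(latency[i] - mu) > 3 * sigma:
        alerts += 1

print(alerts)  # fires repeatedly on data that is entirely normal
```

The Gaussian ±3σ band simply doesn't describe a heavy-tailed, seasonal series, so ordinary points keep landing outside it.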
What actually works
Three modeling families, used together:
1. Forecasts with seasonality
Prophet (Facebook), STL decomposition, ETS, or ETSformer. They fit a model that’s aware of daily / weekly / business-day seasonality. The output isn’t a number — it’s an interval [lower, upper] at a given quantile. Reality outside the interval is the anomaly.
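The simplest seasonality-aware "forecast" needs no library at all: slice history into whole periods and take per-slot quantiles. The sketch below is a stand-in for Prophet/STL, not their API; the function names `seasonal_band` and `is_anomalous` are my own:

```python
import numpy as np

def seasonal_band(history, period, lo_q=0.01, hi_q=0.99):
    """Per-slot quantile band: the simplest seasonality-aware interval.

    history: 1-D array covering several whole periods (e.g. weeks of
    hourly points with period=168). Returns (lower, upper) arrays of
    length `period` -- the expected envelope for each slot of the cycle.
    """
    slots = history[: len(history) // period * period].reshape(-1, period)
    lower = np.quantile(slots, lo_q, axis=0)
    upper = np.quantile(slots, hi_q, axis=0)
    return lower, upper

def is_anomalous(value, slot, lower, upper):
    """A point is anomalous when it leaves the band *for its slot*."""
    return value < lower[slot] or value > upper[slot]
```

A value that is high in absolute terms but normal for its hour of day stays inside the band; the same value at 03:00 falls outside it. That is exactly the behavior a flat threshold can't express.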
2. Change-point detection (BOCPD)
Bayesian Online Change-Point Detection finds the moments your time series shifted regimes. Useful because anomalies often start at a change point. CPD answers “when did this start being weird?” rather than “is it weird right now?”
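Full BOCPD maintains a posterior over run lengths; a minimal online cousin that captures the same "when did this start?" idea is a two-sided CUSUM. The sketch below is a CUSUM detector, not BOCPD itself, and `cusum_changepoints` is an illustrative name:

```python
import numpy as np

def cusum_changepoints(x, mu0, sigma, drift=0.5, threshold=5.0):
    """Two-sided CUSUM: flag indices where accumulated evidence of a
    mean shift (in units of sigma) exceeds `threshold`.

    mu0/sigma describe the pre-change regime; `drift` is the shift size
    (in sigmas) you want to stay insensitive to.
    """
    hi = lo = 0.0
    points = []
    for i, v in enumerate(x):
        z = (v - mu0) / sigma
        hi = max(0.0, hi + z - drift)   # evidence of an upward shift
        lo = max(0.0, lo - z - drift)   # evidence of a downward shift
        if hi > threshold or lo > threshold:
            points.append(i)
            hi = lo = 0.0               # reset after a detection
    return points
```

On a series that jumps from mean 0 to mean 3, this flags an index just after the jump and stays quiet before it, which is the answer you want from change-point detection: a timestamp, not a per-point verdict.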
3. Robust EWMA fallback
For low-volume metrics where you don’t have weeks of training data, an EWMA + MAD (median absolute deviation) is robust to outliers and works from day one. Use it as the floor while the seasonal model is still warming up.
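A minimal version of that fallback, with illustrative names and parameters: an EWMA tracks the level, a trailing MAD provides an outlier-resistant scale, and anomalous points are excluded from the level update so a spike can't drag the baseline with it.

```python
import numpy as np

def robust_ewma_detector(x, alpha=0.1, k=6.0, warmup=10):
    """EWMA level + trailing MAD scale: works from the first few points.

    MAD (median absolute deviation) is robust: one spike in the window
    barely moves the median, so the scale doesn't inflate the way a
    standard deviation would.
    """
    level = x[0]
    anomalies = []
    for i in range(1, len(x)):
        window = x[max(0, i - 60):i]
        mad = np.median(np.abs(window - np.median(window)))
        scale = 1.4826 * mad + 1e-9   # 1.4826 ~ MAD-to-sigma for a Gaussian
        if i >= warmup and abs(x[i] - level) > k * scale:
            anomalies.append(i)       # flag, and don't learn from the outlier
        else:
            level = alpha * x[i] + (1 - alpha) * level
    return anomalies
```

Because both the level and the scale are robust, this is usable from day one and degrades gracefully; swap it out once the seasonal model has enough history.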
Forecast bands vs threshold lines
A threshold is a flat line. A forecast band is a tube around the predicted future, computed at every timestamp from the model’s posterior. When reality leaves the tube, you alert. When the tube widens (the model is uncertain), you alert less aggressively.
```python
# pseudo-code made concrete: the math you actually want
def band_alerts(forecast_model, observed, horizon, z_alpha=2.58):
    """Alert when reality leaves the forecast band (z_alpha=2.58 ~ 99%)."""
    anomalies = []
    for t in range(horizon):
        mu, sigma = forecast_model.predict(t)      # per-timestamp posterior
        lower = mu - z_alpha * sigma
        upper = mu + z_alpha * sigma
        if observed[t] < lower or observed[t] > upper:
            severity = abs(observed[t] - mu) / sigma   # z-score as severity
            anomalies.append((t, severity))
    return anomalies
```
This trivially handles the "14:00 is naturally high" case — the band rises with it.
Multi-variate anomalies
Some incidents are visible only in the joint distribution. CPU is fine. IOPS is fine. The ratio of the two — IOPS / CPU — is broken because the workload mix changed.
Modeling per-metric and OR-ing the alerts misses these. You need a model that ingests multiple metrics together — even a simple linear regression of one on the others is enough to flag joint shifts.
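Even that simple linear regression catches the IOPS/CPU example. The sketch below (the function name is illustrative) fits IOPS as a linear function of CPU, then flags points whose residual is far outside the robust spread of residuals — i.e. points where the *relationship* broke even though each metric looks fine on its own:

```python
import numpy as np

def joint_residual_anomalies(cpu, iops, k=4.0):
    """Regress iops on cpu; flag points whose residual exceeds k robust sigmas."""
    X = np.column_stack([np.ones_like(cpu), cpu])
    beta, *_ = np.linalg.lstsq(X, iops, rcond=None)   # ordinary least squares
    resid = iops - X @ beta
    mad = np.median(np.abs(resid - np.median(resid)))
    scale = 1.4826 * mad + 1e-9
    return np.flatnonzero(np.abs(resid) > k * scale)
```

A point where CPU is mid-range but IOPS is half of what the fitted relationship predicts gets flagged, even though both values are individually inside their own bands.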
Seasonality is most of the data
On any DB metric you care about, daily and weekly cycles explain > 80% of variance. A model that ignores them is mostly modeling noise. The math is well-understood — Fourier basis, STL, dummy variables for weekend/holidays — and there’s no excuse to ship without it in 2026.
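The Fourier-basis approach mentioned above fits in a dozen lines: build sin/cos columns at the seasonal period, solve with least squares, and check how much variance the fit explains. The sketch uses synthetic hourly data with a daily cycle; the exact variance fraction depends on the noise level:

```python
import numpy as np

def fourier_design(t, period, order=3):
    """Design matrix [1, sin(kwt), cos(kwt), ...] for one seasonal cycle."""
    cols = [np.ones_like(t, dtype=float)]
    for k in range(1, order + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

# Fit a daily cycle to 4 weeks of hourly data with plain least squares.
rng = np.random.default_rng(5)
t = np.arange(24 * 28)
y = 50 + 20 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 2, t.size)
X = fourier_design(t, period=24)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
explained = 1 - np.var(y - fitted) / np.var(y)
print(round(explained, 3))  # the daily cycle dominates the variance
```

Weekly seasonality is the same trick with period=168, and weekend/holiday effects are one dummy column each — none of it requires a heavyweight model.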
The feedback loop nobody builds
Every anomaly alert should have a thumbs-up / thumbs-down button. Operators know whether a given alert was real or noise. Feed that back into a per-tenant fine-tuning step and your model gets sharper every week.
It’s the difference between a generic anomaly model and a model that’s right about your fleet. We do this in Obsfly Anomaly Detection: every alert ships with a feedback control, and the per-tenant model retrains nightly.
FAQ
How long until a forecast model is useful?
What about deploys — they look like anomalies but aren't.
Can I bring my own anomaly model?
What about pure ML models — do you use them?