Build vs Buy

Grafana DBM build-vs-buy: what the 'we'll just use Prometheus' plan actually costs

postgres_exporter ships in an afternoon. Per-query digests, plan-flip detection, lock-chain graphs, and anomaly bands each cost 1–3 engineer-weeks. We measure the real build cost against Obsfly's $39 per database per month and tell you when each side wins.

Published 2026-05-26 · 13 min read

The conversation goes like this. At standup someone says “we’ll just stand up Grafana with postgres_exporter and Prometheus, we already have the stack.” Nobody objects. Six months later you have a homemade DBM, you missed two plan flips that caused outages, and the platform engineer who built it is now its full-time owner.

This isn’t an attack on Grafana — Grafana is a great visualization tool and we run it ourselves. It’s a measurement of what “just” actually means when the goal is DBM-grade tooling.

On this page

What Grafana + postgres_exporter actually gives you
The 11 things you have to build
Build cost in engineer-weeks
Side-by-side with Obsfly
When DIY actually wins
When DIY is a trap
FAQ

What Grafana + postgres_exporter actually gives you

Cluster-level counters: connections, transactions, buffer hit ratio, replication lag, dead tuple ratio, WAL position. Genuinely useful.
A dashboard editor that is the best in the industry. Free to use, easy to share, plays well with multiple data sources.
Prometheus alerting on threshold rules. Adequate for the “disk full” class of alerts.
A starter dashboard you can ship in an afternoon. This is the honeypot — it’s real, and it’s the cheapest part of the journey.

The 11 things you have to build

This is what the postgres_exporter + Prometheus + Grafana stack does not ship:

Capability	Build effort	Why it matters
1. Per-query top-N with percentiles	1–2 engineer-weeks	pg_stat_statements polling, digest normalization, p50/p95/p99 computation, dashboard.
2. Plan capture	1–2 weeks	auto_explain configuration, log scraping or pg_stat_plans extension, storage backend.
3. Plan-flip / regression detection	1 week	structure-hash diff per signature over a window, alert when hash flips.
4. Lock-chain / blocking-session graph	1 week	pg_blocking_pids() traversal, graph layout, surfacing in dashboard.
5. Anomaly detection per metric	2–3 weeks	Pick algorithm (BOCPD / Prophet / STL), per-metric training, persistent state.
6. Forecast bands (30/90/365 day)	1–2 weeks	Seasonal decomposition, percentile bands, breach detection.
7. Multi-database fan-out	2 weeks per engine	MySQL Performance Schema, MongoDB profiler, Redis slowlog have different shapes.
8. Alerting beyond thresholds	1 week	Multi-variate, change-point, derived signals (cache hit + lock wait + qps).
9. AI rewrite / index suggestion	2–4 weeks	LLM integration, prompt engineering, plan-aware context assembly.
10. Retention strategy	1 week ongoing	Prometheus is not for 15-month retention. You’ll add Thanos or VictoriaMetrics.
11. Maintenance + upgrade cycles	0.25–0.5 FTE forever	Exporter upgrades, breaking schema changes, dashboard drift, alert tuning.
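
Row 3 is the easiest one to picture in code. The sketch below is one possible shape for the structure-hash diff, not a description of any shipping implementation. It assumes you already capture plans as EXPLAIN (FORMAT JSON) output and pass in the inner "Plan" object. The function names (plan_shape, structure_hash, detect_flips) and the choice of which keys count as structural are illustrative assumptions.

```python
import hashlib
import json

# ASSUMPTION: these EXPLAIN (FORMAT JSON) keys are treated as the plan's "shape".
# Costs and row estimates are left out on purpose, because they move with every
# statistics refresh even when the plan itself has not changed.
STRUCTURAL_KEYS = ("Node Type", "Relation Name", "Index Name", "Join Type")


def plan_shape(node: dict) -> dict:
    """Reduce one plan node (and its children) to a structural skeleton."""
    shape = {key: node[key] for key in STRUCTURAL_KEYS if key in node}
    shape["Plans"] = [plan_shape(child) for child in node.get("Plans", [])]
    return shape


def structure_hash(plan: dict) -> str:
    """Hash the skeleton; sort_keys makes the result independent of dict order."""
    canonical = json.dumps(plan_shape(plan), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def detect_flips(observations):
    """observations: iterable of (query_signature, plan) in time order.

    Returns (query_signature, old_hash, new_hash) for each change of shape.
    """
    last_seen = {}
    flips = []
    for signature, plan in observations:
        current = structure_hash(plan)
        previous = last_seen.get(signature)
        if previous is not None and previous != current:
            flips.append((signature, previous, current))
        last_seen[signature] = current
    return flips
```

Feed it an index scan, the same index scan with a different cost, and then a sequential scan for one signature, and it reports a single flip on the third observation. What this leaves out is most of the real work: capturing plans reliably, storing their history, windowing, and deciding which flips deserve a page.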

Build cost in engineer-weeks

Total greenfield build for a Postgres-only, single-engine DBM stack: 16–28 engineer-weeks. Adding MySQL and MongoDB roughly doubles it. Steady-state maintenance: 0.25–0.5 FTE, indefinitely.

At a $180k loaded engineer cost	Annual
Build (one-time, amortized over 2 years)	$36k–$63k / yr
Maintenance (steady-state)	$45k–$90k / yr
Total (DIY, Postgres-only)	$81k–$153k / yr
Total (DIY, +MySQL +MongoDB)	$150k–$280k / yr
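
The Postgres-only rows can be reproduced with a few lines of arithmetic. One input is an assumption on our part here rather than a number from the table: the build figures work out to about $4,500 per engineer-week, which is what $180k looks like spread over roughly 40 productive weeks a year. The function name annual_diy_cost is likewise just for illustration.

```python
LOADED_COST_PER_YEAR = 180_000   # from the table header
COST_PER_ENGINEER_WEEK = 4_500   # ASSUMPTION: inferred, about 40 productive weeks/yr
AMORTIZATION_YEARS = 2           # from the "Build" row


def annual_diy_cost(build_weeks: float, maintenance_fte: float) -> float:
    """Yearly cost: amortized one-time build plus steady-state maintenance."""
    build = build_weeks * COST_PER_ENGINEER_WEEK / AMORTIZATION_YEARS
    maintenance = maintenance_fte * LOADED_COST_PER_YEAR
    return build + maintenance


low = annual_diy_cost(build_weeks=16, maintenance_fte=0.25)
high = annual_diy_cost(build_weeks=28, maintenance_fte=0.5)
print(f"${low:,.0f} to ${high:,.0f} per year")  # $81,000 to $153,000 per year
```

Swap in your own loaded cost and productive-week count; the maintenance term dominates at either end of the range.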

Side-by-side with Obsfly

Capability	Grafana + Prometheus + exporters	Obsfly
Top-N queries with p99	Build (1–2 wk)	Out of box
Plan history / diff	Build (2–3 wk)	Out of box
Lock chains	Build (1 wk)	Out of box
Multi-variate anomaly	Build (2–3 wk)	Out of box
Forecast bands	Build (1–2 wk)	Out of box
Multi-DB fan-out	Build per engine	9 engines covered
AI query rewrite	Build (2–4 wk) or skip	Built in (Claude)
BYOC / Sovereign	Self-host the OSS stack	First-class
Time to first useful insight	Weekend → quarter	5 minutes
Cost (50-DB fleet, 2 yr)	$240k–$420k engineer time	$1,950/mo × 24 = $46,800

When DIY actually wins

Single Postgres database, no plan history needed, no anomaly detection needed. The afternoon dashboard is genuinely enough.
Strong infra team with slack capacity and an existing observability org that builds these things as a competence. The maintenance cost is internalized.
Specific compliance requirements that no commercial vendor can meet — though check if BYOC or Sovereign solves it first, since we built both for exactly this case.

When DIY is a trap

More than two database engines in the fleet.
Series A–C team where engineering time is the constraint, not budget.
You’ve already had one production incident that better tooling would have caught (plan flip, hot lock chain, slow regression).
The platform engineer who’d build it is also your only Kubernetes person, only CI/CD person, and only secrets-management person. They will not maintain four things well at once.

FAQ

Can I keep Grafana and use Obsfly as the data source?

Yes. Obsfly exposes a Prometheus-compatible /metrics endpoint. Your existing Grafana dashboards keep working; you get richer series (per-query digests, plan-flip events, forecast bands) added to them.

What about Grafana Cloud Database Observability?

It's a thin layer on top of the same exporter approach plus k6 traces. Useful for AWS-native shops already deep in Grafana Cloud, but the depth gap to a real DBM (plan history, lock chains, AI) is similar.

Doesn't postgres_exporter already capture pg_stat_statements?

Some forks expose subsets of pg_stat_statements as Prometheus metrics. None do it well — the dimensionality blows up Prometheus storage, and you lose the actual query text in the digest. You end up writing a separate digest collector either way.
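
To make “separate digest collector” concrete, here is a minimal sketch of its core step: diffing two polls of pg_stat_statements and ranking by time spent in the interval. It assumes you have already read each poll into a dict keyed by queryid, with query text stored once alongside rather than as a metric label. The Snapshot type and interval_top_n function are hypothetical names. It does not attempt percentiles, which pg_stat_statements does not expose directly.

```python
from typing import NamedTuple


class Snapshot(NamedTuple):
    """Cumulative counters for one queryid at poll time (hypothetical container)."""
    calls: int
    total_exec_time_ms: float


def interval_top_n(before, after, n=5):
    """before/after: {queryid: Snapshot} from two consecutive polls.

    Returns up to n tuples of (queryid, calls_in_interval, mean_ms_in_interval),
    ranked by total execution time spent inside the interval.
    """
    rows = []
    for queryid, now in after.items():
        prev = before.get(queryid, Snapshot(0, 0.0))
        d_calls = now.calls - prev.calls
        d_time = now.total_exec_time_ms - prev.total_exec_time_ms
        # A negative delta means the counters restarted (a stats reset, or the
        # entry was evicted and re-created); skip it rather than report nonsense.
        if d_calls <= 0 or d_time < 0:
            continue
        rows.append((queryid, d_calls, d_time / d_calls, d_time))
    rows.sort(key=lambda row: row[3], reverse=True)
    return [(queryid, calls, mean) for queryid, calls, mean, _ in rows[:n]]
```

Keeping only the top N per interval is one way to bound the cardinality that makes the exporter approach expensive; whether N should be 5 or 500 is a judgment call for your fleet.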

What's the cheapest way to validate the build estimate?

Ask your most senior platform engineer to scope a one-pager for capability #5 only (anomaly detection per metric). If they come back with less than 2 weeks, the estimate is wrong; if more than 4 weeks, it's wrong the other way. The honest scope is in there.
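
For calibration, the smallest thing that could be called capability #5 fits in about twenty lines. The sketch below swaps in a rolling median and median-absolute-deviation band in place of the BOCPD, Prophet, or STL options named in the table, purely because it needs no dependencies. The names mad_band and find_anomalies, the window of 20 points, and the multiplier k=3 are all assumptions for illustration.

```python
from statistics import median


def mad_band(history, k=3.0):
    """Return (low, high) built from the median and median absolute deviation."""
    center = median(history)
    mad = median(abs(value - center) for value in history)
    # 1.4826 scales MAD so it is comparable to a standard deviation on normal data.
    # Note: a perfectly flat history gives a zero-width band.
    spread = 1.4826 * mad
    return center - k * spread, center + k * spread


def find_anomalies(series, window=20, k=3.0):
    """Return indexes whose value falls outside the band of the prior window."""
    flagged = []
    for i in range(window, len(series)):
        low, high = mad_band(series[i - window:i], k)
        if not (low <= series[i] <= high):
            flagged.append(i)
    return flagged
```

Everything this skips (seasonality, per-metric tuning, state that survives a restart, alert routing) is what a realistic scope has to account for.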
