Postgres slow queries: 12 causes and how to find each one
A field-tested playbook for diagnosing a slow Postgres query in production — from missing indexes to plan flips to bloated tables — with the SQL for each cause and the fix.
Most “Postgres is slow” tickets are not really about Postgres. They’re about one query that’s slow, sometimes, on some path, and you have an hour to figure out which of a dozen possible causes is the real one. This is the playbook we use — twelve causes, the SQL to diagnose each, and the fix.
The order matters: the cheap-to-rule-out causes come first.
Step 1 — confirm it’s actually slow
Before you spend an hour, make sure the symptom is real. p99 charts lie at low call counts; mean latency hides bimodal queries. Pull the actual time per call:
SELECT queryid, calls, mean_exec_time, stddev_exec_time, max_exec_time
FROM pg_stat_statements
WHERE query ILIKE '%table_name%'
ORDER BY total_exec_time DESC;
If stddev_exec_time is 5–10× mean_exec_time, the query is bimodal — usually fast, sometimes catastrophic. Diagnose the catastrophic path, not the mean.
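A quick way to surface those bimodal statements is to filter on the variance directly; the 5× threshold below is an illustrative cut-off, not a rule:

SELECT queryid, calls, mean_exec_time, stddev_exec_time
FROM pg_stat_statements
WHERE stddev_exec_time > 5 * mean_exec_time
ORDER BY stddev_exec_time DESC
LIMIT 20;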
Step 2 — get a real plan
Run EXPLAIN (ANALYZE, BUFFERS, VERBOSE) on the slow path with the actual parameters the production code uses. Not with literal values you guessed.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT … FROM orders WHERE customer_id = 4128;
Or, if you can’t reproduce: enable auto_explain with auto_explain.log_min_duration = '500ms' and wait for it to fire. The plan ends up in the Postgres log, untouched by you.
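A minimal session-level sketch of that setup, assuming superuser (the auto_explain GUCs are superuser-settable); in production the module normally goes into shared_preload_libraries instead:

LOAD 'auto_explain';                           -- session-level load; use shared_preload_libraries in production
SET auto_explain.log_min_duration = '500ms';   -- only log plans for statements slower than this
SET auto_explain.log_analyze = on;             -- include actual row counts and timings, not just estimates
SET auto_explain.log_buffers = on;             -- include buffer hit/read counts (requires log_analyze)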
The 12 causes (in order of likelihood)
1. Missing index
Telltale: a Seq Scan with a filter that removes >90% of rows. EXPLAIN ANALYZE shows actual rows=N ≪ Rows Removed by Filter=M, with M ≫ N.
Seq Scan on orders  (cost=0..50000 rows=1000000 width=...) (actual time=120..2400 rows=4 loops=1)
  Filter: (customer_id = 4128)
  Rows Removed by Filter: 999996
Fix: CREATE INDEX CONCURRENTLY ON orders (customer_id); Insist on CONCURRENTLY in production — without it, CREATE INDEX holds a SHARE lock that blocks every write to the table for the duration of the build. (See Postgres lock chains.)
2. Stale statistics, wrong row estimates
Telltale: EXPLAIN ANALYZE shows rows=10 in the estimate but actual rows=2,000,000. The planner picked the wrong join order.
ANALYZE orders;                                                   -- one table
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;  -- finer histograms
For correlated columns, create extended statistics: CREATE STATISTICS s_orders (dependencies, ndistinct) ON customer_id, region FROM orders;
3. Sequential scan on a large table
Sometimes intentional (small table, planner is right). Sometimes a sign that an index exists but the planner ignores it. Run EXPLAIN with set enable_seqscan = off; in your session — if the index plan is faster, you have a cost-model mismatch (often random_page_cost too high on SSD).
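A session-only sketch of that test; the random_page_cost value is illustrative, not a recommendation for every workload:

SET enable_seqscan = off;    -- session-only nudge; never set this globally
EXPLAIN (ANALYZE, BUFFERS) SELECT … FROM orders WHERE customer_id = 4128;
RESET enable_seqscan;
-- If the index plan wins, lower random_page_cost for SSD storage, e.g.:
-- ALTER SYSTEM SET random_page_cost = 1.1;   -- default is 4.0, tuned for spinning disks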
4. Lock waits
Telltale: query is “sometimes fast, sometimes 30s.” pg_stat_activity shows wait_event_type = Lock during the slow path.
Diagnosis: capture pg_blocking_pids() at 1 Hz and reconstruct the lock chain. Full recipe in Postgres lock chains.
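If you just need a one-off snapshot rather than the full 1 Hz capture, this sketch shows who is blocked and by whom:

SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;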
5. TOAST and oversized columns
Telltale: a SELECT * on a wide table is slow even when the WHERE clause uses a primary key. The cost is in detoasting large compressed columns (often jsonb or text bodies).
Fix: don’t SELECT *. Project only what you need.
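A rough check on whether TOAST is the culprit: compare the table's size with and without its out-of-line data (table name reused from the running example):

SELECT pg_size_pretty(pg_table_size('orders'))    AS heap_plus_toast,
       pg_size_pretty(pg_relation_size('orders')) AS heap_only;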
6. Bloat
Telltale: cache hit ratio drops, shared_blks_read climbs, the same query gets slower week-over-week despite no change in data volume. pg_stat_user_tables.n_dead_tup is large relative to n_live_tup.
SELECT relname, n_live_tup, n_dead_tup,
round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 20;
Fix: tune autovacuum per-table. VACUUM (FULL, VERBOSE) rewrites the table — only when you can take an exclusive lock.
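A per-table tuning sketch; the thresholds are illustrative and depend on the table's write rate:

ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.02,   -- vacuum after ~2% dead tuples instead of the 20% default
  autovacuum_vacuum_threshold    = 1000
);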
7. Connection pool saturation
Telltale: query latency spikes correlate with pg_stat_activity count. Postgres works fine, the application is starving for connections.
Fix: PgBouncer in transaction mode. Cap session-mode pool at <100; transaction mode handles 10x more workload at the same backend count.
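To confirm the correlation, sample the backend count against the configured ceiling and graph it next to query latency:

SELECT count(*) AS backends,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;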
8. Sort spilling to disk
Telltale: EXPLAIN ANALYZE shows Sort Method: external merge Disk: 1234kB.
Fix: bump work_mem for the role/session. Be careful — work_mem applies per sort/hash node, not per query, so a complex query can multiply it.
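A sketch of scoping the bump instead of raising it globally; the role name and the value are placeholders:

SET work_mem = '256MB';                        -- session-only, while testing the plan
ALTER ROLE reporting SET work_mem = '256MB';   -- scoped to one role ('reporting' is a placeholder name)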
9. Hash join with too-small work_mem
Telltale: Hash Batches: 32 when a single batch would fit. Same fix as #8.
10. Plan flip
Telltale: the query was fast, nothing was deployed, and now it's slow. Same SQL, different plan. auto_explain proves it: yesterday’s plan used an index, today’s does a hash join.
Causes: stats updated, parameter values cross a histogram boundary, table-row count crosses a planner threshold. Fix: extended statistics, or pg_hint_plan as a pin-of-last-resort. Better: have a tool that tells you the plan flipped — see Obsfly Explain Plan.
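If you do reach for the pin, pg_hint_plan reads a hint comment at the head of the statement; a sketch with a made-up index name:

/*+ IndexScan(orders idx_orders_customer_id) */
SELECT … FROM orders WHERE customer_id = 4128;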
11. Slow IO (network, disk, replica lag)
Telltale: blk_read_time dominates total_exec_time.
On RDS, hitting the IOPS ceiling can triple your latency. On Aurora, replica lag pushes reads back to the primary. Fix: raise provisioned IOPS or enable storage autoscaling on the cloud side, or rebalance the read workload across replicas.
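To see which statements spend their time waiting on reads, rank by blk_read_time (requires track_io_timing; on PG 17 the column is named shared_blk_read_time):

SELECT queryid,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(blk_read_time::numeric, 1)   AS read_ms,
       round((100 * blk_read_time / NULLIF(total_exec_time, 0))::numeric, 1) AS read_pct
FROM pg_stat_statements
ORDER BY blk_read_time DESC
LIMIT 20;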
12. PL/pgSQL or trigger overhead
Telltale: EXPLAIN ANALYZE on the SQL itself is fast, but the calls column in pg_stat_statements shows it ran many more times than expected. A trigger or function is multiplying executions.
Fix: set pg_stat_statements.track = 'all' to see the nested statements.
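A sketch of both checks; 'orders' stands in for whatever table the query hits:

ALTER SYSTEM SET pg_stat_statements.track = 'all';   -- also record statements nested in functions and triggers
SELECT pg_reload_conf();
SELECT tgname, tgenabled
FROM pg_trigger
WHERE tgrelid = 'orders'::regclass AND NOT tgisinternal;   -- user-defined triggers on the table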
What to set up so this isn’t manual next time
- Enable pg_stat_statements with track = all (see guide)
- Enable auto_explain with log_min_duration = 500ms
- Enable track_io_timing for blk_read_time data
- Enable track_planning for plan-time visibility (PG 13+)
- Capture pg_stat_activity at 1 Hz for lock-chain reconstruction
- Capture EXPLAIN plans on slow signatures and store them — plan-flips need a baseline to compare against
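Taken together, a minimal setup sketch, assuming superuser access and a restart window for shared_preload_libraries; the threshold values are the ones suggested above:

ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements, auto_explain';   -- needs a restart
ALTER SYSTEM SET pg_stat_statements.track = 'all';
ALTER SYSTEM SET pg_stat_statements.track_planning = on;
ALTER SYSTEM SET auto_explain.log_min_duration = '500ms';
ALTER SYSTEM SET track_io_timing = on;
SELECT pg_reload_conf();   -- picks up everything except shared_preload_libraries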
Keep reading
Postgres
pg_stat_statements: the complete 2026 guide
Every column, every gotcha, the queries you should run today, and why pg_stat_statements is still the most useful 80 lines of telemetry in Postgres — even with five new alternatives in 2026.
Postgres
Postgres lock chains: how to find the session blocking yours
A practical walkthrough of pg_locks, pg_blocking_pids, and the recursive CTE that gives you the full chain — including the AccessExclusiveLocks that quietly take your DB down.
Postgres
Why your Postgres p99 latency lies — and what to track instead
p99 over 1m windows is the most-displayed and most-misleading number on every DBM dashboard. Here's the histogram math, the seasonality math, and a saner default.