Obsfly
postgres / xid-wraparound · freeze risklivepg-prod-011.85B / 2B XIDs12d92% to wraparoundpg-prod-021.56B / 2B XIDs31d78% to wraparoundpg-stage0.44B / 2B XIDs180d+22% to wraparoundpg-analytics1.28B / 2B XIDs47d64% to wraparound

Postgres

PostgresトランザクションIDラップアラウンド:最悪の土曜まで4時間

pg_stat_activityが「autovacuum (to prevent wraparound)」を表示し書き込みが止まり始めたら、正しく対処するための4時間。間違えればDBはread-onlyに。本物のランブック。

Published ·10 min read

Saturday morning. pg_stat_activity shows autovacuum: VACUUM (to prevent wraparound) on your largest table. Application writes slow down. Then the throughput craters. If you do nothing, in a few hours your DB stops accepting writes entirely. If you cancel the vacuum, transaction-ID wraparound happens and you have a corruption risk. This is the runbook.

On this page
  1. What's actually happening
  2. Early warning that buys you weeks
  3. Already in aggressive mode
  4. After the fire
  5. FAQ

What’s actually happening

Postgres uses a 32-bit transaction ID. Every row stores the txid that wrote it. To know whether a row is visible to your transaction, Postgres compares txids. Once txids approach 2^31, the math wraps and old rows could appear newer than they are — a data-corruption hazard.

Postgres prevents this by “freezing” old rows: marking them with a special sentinel that means “older than any current txid.” Vacuum freezes pages as it goes. When a table’s oldest unfrozen txid hits autovacuum_freeze_max_age (default 200M), Postgres triggers an aggressive autovacuum that holds an exclusive lock more often and runs faster — but it’s still rate-limited by autovacuum_vacuum_cost_limit.

Early warning that buys you weeks

-- Tables nearest to wraparound — alert when any value > 800M
SELECT
  c.relnamespace::regnamespace || '.' || c.relname AS table,
  age(c.relfrozenxid) AS xid_age,
  pg_size_pretty(pg_total_relation_size(c.oid)) AS size
FROM pg_class c
WHERE c.relkind IN ('r', 't', 'm')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;

-- Cluster-wide datfrozenxid
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

Already in aggressive mode

  • Don’t cancel the vacuum. If you do, autovacuum will restart it on the same table; you’ve made no progress and burned IO.
  • Raise autovacuum_vacuum_cost_limit aggressively. Default 200 → 5000+. The vacuum will use more IO but finish faster.
  • Reduce autovacuum_vacuum_cost_delay to 0 if you’re comfortable with the IO load.
  • Pause non-essential writes if possible — vacuum competes with writers for buffer locks.
  • Don’t restart the server. Restarts force a recovery scan that doesn’t advance frozen state.
-- Apply temporarily; revert after the fire is out
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 10000;
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = 0;
SELECT pg_reload_conf();

-- Watch progress
SELECT * FROM pg_stat_progress_vacuum;

After the fire is out

  • Lower autovacuum_freeze_max_age on hot tables (per-table override) so freeze runs more often, less aggressively.
  • Raise autovacuum_max_workers if you have CPU/IO budget — multiple tables can freeze in parallel.
  • Add the relfrozenxid alert above to your monitoring stack permanently.
  • Audit any process that resets statistics (pg_stat_reset) — aggressive resets can hide vacuum progress.

FAQ

Does pg_repack help here?+
Indirectly. pg_repack creates a new physical relation, which means new pages with current frozen txids. Useful as a follow-up after the immediate fire is out, not during.
Can I just upgrade Postgres?+
Pg 14+ has 64-bit txids in some operations and improved vacuum, but 32-bit txid for visibility checks remains. Upgrade for many reasons; wraparound prevention isn't fully solved.
How does this differ on Aurora / managed Postgres?+
Same mechanism. RDS / Aurora exposes the same parameters; tune them the same way. The main difference: cancel scope is more limited on managed services.

Keep reading

· · ·

サービスと同じようにデータベースを監視する。

30 分のデモをご予約ください。フリートを一緒に設計し、最初の 30 日間のお見積りをお出しします。

PostgresトランザクションIDラップアラウンド:最悪の土曜まで4時間 · Obsfly