Monitoring schema migrations: how to ship without taking the database down
ALTER TABLE on a billion-row table is the most-feared 30-line PR in any backend repo. Here's the monitoring you need before, during, and after — for Postgres, MySQL, and MongoDB.
The migration worked in staging because staging is 100k rows. Production has 1B rows, an index that no longer fits in cache, and the next thing you know you’ve held an exclusive lock for 47 minutes. The good news: the monitoring you need to catch this is cheap and lives in places you already have.
Pre-flight checks (catch 90% of bad migrations)
- Estimate the row count of affected tables; if it's over ~1M and the migration rewrites the table, plan for downtime or use online tooling (see the sketches after this list).
- Identify the lock level. The Postgres docs list the lock each ALTER variant takes; MySQL's online DDL reference does the same. If it's ACCESS EXCLUSIVE (Postgres) or a full table lock (MySQL), schedule a maintenance window.
- Check disk free. Many ALTERs duplicate the table; you need 2-3× the table size in free space.
- Run the migration against a recent prod snapshot, not staging.
- For MySQL: prefer ALGORITHM=INSTANT when supported, then INPLACE, then COPY or pt-online-schema-change.
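A minimal Postgres sketch for the first and third checks above; orders is a placeholder table name, and the row count is the planner's estimate, not an exact count:

```sql
-- Rough row count (no full scan) and total on-disk size of the target table.
-- reltuples is maintained by VACUUM/ANALYZE, so ANALYZE first if it looks stale.
SELECT c.reltuples::bigint                           AS estimated_rows,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
WHERE c.relname = 'orders';
```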
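For the MySQL bullet, you can make the fallback order explicit instead of trusting the server to pick: request the cheapest algorithm and let the statement fail fast if it can't be honored. A sketch, with orders and note as placeholder names:

```sql
-- Fails immediately if INSTANT isn't possible for this ALTER,
-- instead of silently degrading to a full table copy.
ALTER TABLE orders ADD COLUMN note VARCHAR(255), ALGORITHM=INSTANT;

-- If INSTANT is rejected, retry allowing concurrent reads and writes:
ALTER TABLE orders ADD COLUMN note VARCHAR(255), ALGORITHM=INPLACE, LOCK=NONE;
```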
Live monitoring during the migration
```sql
-- Postgres: are we waiting on a lock?
SELECT pid, query, wait_event_type, wait_event, state
FROM pg_stat_activity
WHERE datname = current_database()
  AND state != 'idle'
  AND query NOT ILIKE '%pg_stat_activity%';

-- Postgres: live progress (Pg 12+)
SELECT * FROM pg_stat_progress_create_index;
SELECT * FROM pg_stat_progress_cluster;

-- MySQL: ALTER stage progress (5.7+)
SELECT * FROM performance_schema.events_stages_current
WHERE EVENT_NAME LIKE 'stage/innodb/%';
```
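If a session shows up as waiting, the next question is who it's waiting on. One way to answer it, assuming Pg 9.6+ for pg_blocking_pids:

```sql
-- Postgres: map each blocked session to the pids holding it up.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```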
Post-deploy bake-off
Migration finished. Now compare query telemetry against pre-migration (a snapshot-and-diff sketch follows this list):
- Top 100 normalized queries — same signatures, same plan structures? A plan-flip detector catches these regressions.
- p99 per signature — within ±20% of yesterday? If not, your new index is colder than the old one or it’s not being used.
- Lock-wait time — should be flat. A spike means a session is holding something open after migration.
- Connection pool depth — should be stable. App-side connection storms are a side effect of restarts after migration.
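One low-tech way to take that baseline, assuming pg_stat_statements is enabled. It tracks means rather than true p99s, but it catches the same regressions; mean_exec_time is the Pg 13+ column name (older versions call it mean_time), and migration_baseline is a throwaway table invented for this sketch:

```sql
-- Before the migration: snapshot per-signature latency.
CREATE TABLE migration_baseline AS
SELECT queryid, query, calls, mean_exec_time
FROM pg_stat_statements;

-- After: signatures whose mean latency regressed by more than 20%.
SELECT queryid,
       cur.query,
       base.mean_exec_time AS before_ms,
       cur.mean_exec_time  AS after_ms
FROM pg_stat_statements cur
JOIN migration_baseline base USING (queryid)
WHERE cur.mean_exec_time > base.mean_exec_time * 1.2;
```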
DB-specific gotchas
- Postgres CREATE INDEX CONCURRENTLY takes roughly 2× as long but doesn’t block writes. Prefer it on any hot table, with one caveat: if the build fails partway, it leaves an INVALID index behind that you must drop yourself (see the cleanup sketch after this list).
- Postgres ALTER TABLE … ADD COLUMN with default: in Pg 11+ this is INSTANT for non-volatile defaults. Pre-11, it rewrites the whole table.
- MySQL DROP INDEX: an in-place metadata change for secondary indexes — effectively instant. Primary key changes still rebuild the table: INPLACE at best, and dropping a PK without adding one in the same statement falls back to COPY.
- MongoDB collMod: most are non-blocking. Adding validation rules is instant; collection-wide schema changes show up in currentOp.
- ClickHouse ALTER … ADD COLUMN: instant on metadata. Backfilling values is a mutation that runs async; check system.mutations.
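Cleaning up after a failed CREATE INDEX CONCURRENTLY, as promised above — a sketch where orders_status_idx is a placeholder index name:

```sql
-- Indexes left INVALID by a failed CREATE INDEX CONCURRENTLY.
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Drop and rebuild; DROP INDEX CONCURRENTLY avoids an exclusive lock.
DROP INDEX CONCURRENTLY IF EXISTS orders_status_idx;
```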
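And for the ClickHouse gotcha, a quick check that the async backfill is actually making progress:

```sql
-- ClickHouse: mutations still in flight, and why any of them are stuck.
SELECT mutation_id, command, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE is_done = 0;
```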
FAQ
Should I always use online schema change tools?
How do I cancel a running migration safely?
Can I roll back?