Sharded MongoDB monitoring: the metrics that predict an imbalance
Chunk distribution, jumbo chunks, balancer round time, hot shards. The handful of metrics that distinguish a healthy sharded cluster from one that's about to need a rebalance party.
Sharded MongoDB is a different beast from a replica set. The dashboards you have show per-shard health; the failures usually come from between the shards. Imbalance, jumbo chunks, balancer livelock, and shard-key skew don’t show up on a per-shard graph. Here’s the cluster-level monitoring set you actually need.
Cluster-level metrics
// On any mongos:
db.adminCommand({ balancerStatus: 1 })
sh.status()
db.getSiblingDB("config").chunks.aggregate([
{ $group: { _id: "$shard", chunks: { $sum: 1 } } }
])That last one is the one that surprises teams: the chunk count by shard. Healthy clusters are within ±5%; problematic ones are 30%+ apart and the balancer has given up.
Balancer health
- Round time — balancer rounds typically take seconds. Rounds taking minutes mean a single migration is stuck.
- Migrations failed (last hour) — should be 0; sustained failures = chunk that won’t move (jumbo, write conflict, lock contention).
- Active migrations — only 1 per source shard at a time; if 0 for 24h on an imbalanced cluster, balancer is paused or stuck.
Jumbo chunks
A chunk that grows past chunkSize (128MB by default since MongoDB 6.0; 64MB before) and can’t be split is marked jumbo. The balancer refuses to migrate it. Find them:
db.getSiblingDB("config").chunks.find({ jumbo: true })
.sort({ "min": 1 })
.toArray()

The fix depends on the cause: a bad shard key (the most common — can’t be fixed without resharding), or a single document larger than the chunk size (rare; consider splitting the document). On 4.4+, refineCollectionShardKey can help by adding suffix fields that make ranges splittable.
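To see which collection is producing jumbo chunks, the find() output can be grouped client-side. A minimal sketch, assuming the pre-5.0 chunk layout where each document carries an `ns` field (5.0+ stores a collection `uuid` instead, which has to be joined against config.collections):

```javascript
// Count jumbo chunks per collection from config.chunks documents.
function jumboByCollection(jumboChunks) {
  const counts = {};
  for (const chunk of jumboChunks) {
    const ns = chunk.ns ?? String(chunk.uuid); // fall back to uuid on 5.0+
    counts[ns] = (counts[ns] || 0) + 1;
  }
  return counts;
}

jumboByCollection([
  { ns: "app.events", jumbo: true },
  { ns: "app.events", jumbo: true },
  { ns: "app.users", jumbo: true },
]); // { "app.events": 2, "app.users": 1 }
```

If all the jumbo chunks sit in one collection, it's that collection's shard key that needs the refine-or-reshard decision.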
Shard-key skew
Hot shard means your shard key concentrates writes onto one range. Test:
// Top 10 shard keys by write rate over the last hour:
// per-shard db.serverStatus().opcounters, computed via mongotop
// or collected via a metrics scrape.
// Compare ops/sec across shards: > 2× spread = hot shard.
Recovery options, in increasing intrusiveness: refine the shard key with added suffix fields (4.4+), or reshard — for example onto a hashed version of the key (online resharding is GA from 5.0). All require planning.
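The 2× rule is easy to evaluate on scraped per-shard counters. A sketch, with a hypothetical input shape of `{ shardName: opsPerSec }`:

```javascript
// Flag shards whose ops/sec exceed 2× the median across all shards.
function hotShards(opsByShard) {
  const vals = Object.values(opsByShard).slice().sort((a, b) => a - b);
  const mid = Math.floor(vals.length / 2);
  const median =
    vals.length % 2 ? vals[mid] : (vals[mid - 1] + vals[mid]) / 2;
  return Object.entries(opsByShard)
    .filter(([, ops]) => ops > 2 * median)
    .map(([shard]) => shard);
}

hotShards({ shard0: 4200, shard1: 950, shard2: 1100 }); // ["shard0"]
```

Using the median rather than the mean keeps the hot shard itself from dragging the baseline up and hiding the skew.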
Alerts that matter
- Chunk count spread > 30% across shards
- Balancer failed migrations > 0 sustained 1h
- Any chunk marked jumbo (immediate)
- Per-shard ops/sec spread > 2× median
- Per-shard storage spread > 2× median
- Balancer round time > 5 min sustained