Obsfly
[Figure: mongodb / sharding, live view of 4 shards and 525 chunks behind a mongos router. shard-01 (rs0:27017): 142 chunks; shard-02 (rs1:27017): 138 chunks; shard-03 (rs2:27017): 156 chunks; shard-04 (rs3:27017): 89 chunks.]

MongoDB

Sharded MongoDB monitoring: the metrics that predict an imbalance

Chunk distribution, jumbo chunks, balancer round time, hot shards. The few metrics that separate a healthy sharded cluster from one that is about to need rebalancing.

Published · 12 min read

Sharded MongoDB is a different beast from a replica set. The dashboards you have show per-shard health; the failures usually come from between the shards. Imbalance, jumbo chunks, balancer livelock, and shard-key skew don’t show up on a per-shard graph. Here’s the cluster-level monitoring set you actually need.

On this page
  1. Cluster-level metrics
  2. Balancer health
  3. Jumbo chunks
  4. Shard-key skew
  5. Alerts
  6. FAQ

Cluster-level metrics

// On any mongos:
db.adminCommand({ balancerStatus: 1 })
sh.status()
db.getSiblingDB("config").chunks.aggregate([
  { $group: { _id: "$shard", chunks: { $sum: 1 } } }
])

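The last query, the chunk count per shard, is the one that surprises teams. Healthy clusters stay within ±5% of each other; problematic ones are 30% or more apart, which usually means the balancer has given up. From MongoDB 6.0.3 the balancer evens out data size rather than chunk count, so read this number alongside per-shard storage.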
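To turn the per-shard counts into a single number you can alert on, a minimal sketch follows. It assumes input shaped like the $group output above and defines "apart" as (max − min) / mean, which is one reasonable reading of the thresholds rather than an official formula. The function name and the sample counts are illustrative.

```javascript
// Input: documents shaped like the $group output, e.g. { _id: "shard-01", chunks: 142 }.
function chunkSpread(perShard) {
  if (perShard.length === 0) throw new Error("no shards in input");
  const counts = perShard.map((s) => s.chunks);
  const mean = counts.reduce((a, b) => a + b, 0) / counts.length;
  if (mean === 0) return { mean: 0, deviation: {}, spread: 0 };

  // Signed deviation of each shard from the mean, as a fraction (0.05 = +5%).
  const deviation = {};
  for (const s of perShard) deviation[s._id] = (s.chunks - mean) / mean;

  // Assumed definition of "apart": (max - min) / mean.
  const spread = (Math.max(...counts) - Math.min(...counts)) / mean;
  return { mean, deviation, spread };
}

// Illustrative counts for a four-shard cluster.
const sampleCounts = [
  { _id: "shard-01", chunks: 142 },
  { _id: "shard-02", chunks: 138 },
  { _id: "shard-03", chunks: 156 },
  { _id: "shard-04", chunks: 89 },
];
const spreadResult = chunkSpread(sampleCounts);
console.log(spreadResult.spread.toFixed(3)); // prints "0.510"
```

With these counts, shard-04 sits about 32% below the mean, so the cluster would already cross a 30% threshold.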

Balancer health

  • Round time — balancer rounds typically take seconds. Rounds taking minutes mean a single migration is stuck.
  • Migrations failed (last hour) — should be 0; sustained failures = chunk that won’t move (jumbo, write conflict, lock contention).
  • Active migrations — only 1 per source shard at a time; if 0 for 24h on an imbalanced cluster, balancer is paused or stuck.
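The three signals above can be sketched as one check. This is only an illustration: every field name on the input object is hypothetical and has to be mapped from whatever your metrics pipeline actually collects, and the 60-second cutoff is an assumed reading of "rounds taking minutes".

```javascript
// All field names on `m` are hypothetical; map them from your own metrics.
function balancerFindings(m) {
  const findings = [];
  // Rounds normally take seconds; a minute or more suggests a stuck migration.
  if (m.lastRoundSeconds >= 60) {
    findings.push("round time in minutes: a single migration may be stuck");
  }
  // Any failed migration in the last hour is worth a look.
  if (m.failedMigrationsLastHour > 0) {
    findings.push("failed migrations: check jumbo chunks, write conflicts, lock contention");
  }
  // No migration for a day is only a problem if the cluster is imbalanced.
  if (m.imbalanced && m.hoursSinceLastMigration >= 24) {
    findings.push("no migrations for 24h on an imbalanced cluster: balancer paused or stuck");
  }
  return findings;
}

console.log(
  balancerFindings({
    lastRoundSeconds: 4,
    failedMigrationsLastHour: 0,
    hoursSinceLastMigration: 30,
    imbalanced: true,
  })
);
```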

Jumbo chunks

A chunk that exceeds chunkSize (default 128 MB since MongoDB 6.0, 64 MB before that) and can’t be split is marked jumbo. The balancer refuses to migrate it. Find them:

db.getSiblingDB("config").chunks.find({ jumbo: true })
  .sort({ "min": 1 })
  .toArray()
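Once you have that array, it helps to see which shards hold the jumbo chunks. The sketch below is a plain helper, not part of any MongoDB API; it relies only on the shard and jumbo fields of config.chunks documents, and the function name is made up for illustration.

```javascript
// Input: config.chunks documents, e.g. the array returned by the query above.
function jumboByShard(chunks) {
  const counts = new Map();
  for (const c of chunks) {
    if (c.jumbo !== true) continue; // ignore anything not flagged jumbo
    counts.set(c.shard, (counts.get(c.shard) || 0) + 1);
  }
  // Worst offender first.
  return [...counts.entries()]
    .map(([shard, jumbo]) => ({ shard, jumbo }))
    .sort((a, b) => b.jumbo - a.jumbo);
}

// Illustrative input with only the fields the helper reads.
console.log(
  jumboByShard([
    { shard: "shard-03", jumbo: true },
    { shard: "shard-01", jumbo: true },
    { shard: "shard-03", jumbo: true },
  ])
);
```

Jumbo chunks piled onto one shard usually point at the same skewed key range, which is a useful hint before deciding on a fix.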

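The fix depends on the cause. The most common is a bad shard key, where too many documents share one key value; that can’t be fixed without changing the key. Much rarer is a single document larger than the chunk size, which can only happen with a very small chunkSize because documents are capped at 16 MB; consider splitting the document. On 4.4+, refineCollectionShardKey can help by adding suffix fields to make ranges splittable.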

Shard-key skew

A hot shard means your shard key concentrates writes onto one range. Test:

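// Cumulative op counters since startup. Run on each shard's primary, not on mongos:
db.serverStatus().opcounters
// Sample twice and divide the difference by the interval to get ops/sec per shard,
// or collect the same counters via a metrics scrape (mongotop adds per-collection read/write time).

// Compare ops/sec across shards. > 2× spread = hot shard.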
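The comparison can be sketched in a few lines. This assumes you already have two opcounters samples per shard as plain numbers, taken a known number of seconds apart, and it uses "more than 2× the median" as the hot-shard test. The helper names and the sample rates are illustrative.

```javascript
const WRITE_OPS = ["insert", "update", "delete"];

// Write ops/sec between two opcounters samples taken `seconds` apart.
// Returns null if the counters went backwards (the node restarted).
function writeRate(before, after, seconds) {
  if (seconds <= 0) return null;
  let delta = 0;
  for (const op of WRITE_OPS) delta += after[op] - before[op];
  return delta < 0 ? null : delta / seconds;
}

function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// rates: { shardName: opsPerSec }. Returns shards above factor × median.
function hotShards(rates, factor = 2) {
  const med = median(Object.values(rates));
  return Object.keys(rates).filter((shard) => rates[shard] > factor * med);
}

// Illustrative rates: the median is 410, so anything above 820 is flagged.
console.log(
  hotShards({ "shard-01": 400, "shard-02": 380, "shard-03": 1900, "shard-04": 420 })
); // prints [ 'shard-03' ]
```

The median is used instead of the mean so that the hot shard itself does not drag the baseline upward.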

Recovery options, in increasing intrusiveness: hash existing key, refine the shard key, reshard (online resharding is GA from 5.0). All require planning.

Alerts that matter

  • Chunk count spread > 30% across shards
  • Balancer failed migrations > 0 sustained 1h
  • Any chunk marked jumbo (immediate)
  • Per-shard ops/sec spread > 2× median
  • Per-shard storage spread > 2× median
  • Balancer round time > 5 min sustained
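Several of these alerts are "sustained" rather than instantaneous. One way to express that, assuming your collector gives you time-ordered samples, is to measure how long the current breach streak has lasted. The sample shape ({ t, value } with t in milliseconds) and the function name are hypothetical.

```javascript
// samples: [{ t: epochMillis, value: number }], sorted by t ascending.
// breaches: predicate on a sample value. Returns true if every sample since
// some point at least `windowMs` before `now` has breached the threshold.
function sustained(samples, breaches, windowMs, now) {
  let streakStart = null;
  // Walk back from the newest sample to where the current breach streak began.
  for (let i = samples.length - 1; i >= 0; i--) {
    if (!breaches(samples[i].value)) break;
    streakStart = samples[i].t;
  }
  return streakStart !== null && now - streakStart >= windowMs;
}

const MIN = 60 * 1000;
// Failed-migration counts sampled every 15 minutes (illustrative data).
const failedMigrations = [
  { t: 0 * MIN, value: 0 },
  { t: 15 * MIN, value: 2 },
  { t: 30 * MIN, value: 1 },
  { t: 45 * MIN, value: 3 },
  { t: 60 * MIN, value: 1 },
  { t: 75 * MIN, value: 2 },
];
// "Balancer failed migrations > 0 sustained 1h"
console.log(sustained(failedMigrations, (v) => v > 0, 60 * MIN, 75 * MIN)); // prints true
```

A single clean sample resets the streak, so a brief recovery suppresses the alert; whether that is the behavior you want depends on how noisy the metric is.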

FAQ

How big should chunkSize be?
128 MB is the default since MongoDB 6.0 (64 MB before that). Raising it to 256 MB reduces migration churn at the cost of bigger jumps when migrations do happen. Don't drop below 64 MB without good reason.
Does Atlas Performance Advisor catch jumbo chunks?
No. PA is shard-local. Cluster-wide signals require explicit instrumentation.
Can I monitor sharding from the application?
Indirectly: track per-shard latency on a representative read. Drift on one shard signals balancer or capacity issues before the cluster-level metrics catch up.


