Obsfly
[Figure: the four MongoDB monitoring surfaces: serverStatus() (host metrics), db.stats() (storage / index), currentOp() (live activity), system.profile (slow op log)]

MongoDB

MongoDB Performance Monitoring in Production: A 2026 Guide

Four data surfaces (serverStatus, db.stats, currentOp, profiler), sensible default settings, and how to think about replica lag, oplog window, and aggregation pipeline cost.

Published ·14 min read

MongoDB monitoring is split across four surfaces — serverStatus(), db.stats(), currentOp(), and the profiler. Each tells a different story; none alone is enough. This is what we scrape from each in production, and how to reason about replica lag, oplog window, and aggregation pipeline cost.

On this page
  1. The four surfaces
  2. serverStatus — host metrics
  3. db.stats — per-database storage
  4. currentOp — live activity
  5. Profiler — slow query log
  6. Replica set metrics
  7. Sharded cluster metrics
  8. Aggregation pipeline performance
  9. FAQ

The four surfaces

Surface | Cardinality | Cost | Use it for
serverStatus() | 1 doc / call | Cheap | Host-level rollups: connections, opcounters, WiredTiger cache, network
db.stats() | 1 doc / database | Cheap | Storage size, index size, collection count
currentOp() | N docs / call | Medium | Live in-flight ops, lock waits, slow op detection
Profiler | Continuous | Medium-high | Persisted slow-op log per database (system.profile)

serverStatus — host metrics

Run against the admin database. The full document is huge, but the set of fields worth scraping every 15s is small.

db.adminCommand({ serverStatus: 1 })

// extract:
opcounters.{insert,query,update,delete,getmore,command}
opcountersRepl.*                                // for replicas
connections.{current,available,totalCreated}
network.{bytesIn,bytesOut,numRequests}
wiredTiger.cache.{
  "bytes currently in the cache",
  "tracked dirty bytes in the cache",
  "pages evicted by application threads",
  "unmodified pages evicted",
  "modified pages evicted"
}
locks.Global.acquireCount.{r,w,R,W}
asserts.{regular,warning,msg,user,rollovers}
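metrics.queryExecutor.scanned                   // index keys scanned
metrics.queryExecutor.scannedObjects            // documents scanned
metrics.queryExecutor.collectionScans.total     // collection scans (MongoDB 4.4+)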
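
The opcounters and network byte counts are cumulative since the mongod process started, so a scraper has to turn two consecutive samples into per-second rates. A minimal sketch in plain JavaScript, assuming each sample is the opcounters sub-document already converted to plain numbers; the function name opcounterRates and the reset handling are illustrative choices, not part of any MongoDB API.

```javascript
// Convert two cumulative opcounters samples into per-second rates.
// prev/curr: plain objects such as { insert: 10, query: 25, ... }
// intervalSec: seconds between the two scrapes (15 in the text above).
function opcounterRates(prev, curr, intervalSec) {
  if (!(intervalSec > 0)) {
    throw new RangeError("intervalSec must be positive");
  }
  const rates = {};
  for (const key of Object.keys(curr)) {
    const before = prev[key] ?? 0;
    const delta = curr[key] - before;
    // A negative delta means the counter reset (for example, a restart);
    // report null rather than a misleading negative rate.
    rates[key] = delta < 0 ? null : delta / intervalSec;
  }
  return rates;
}

const rates = opcounterRates(
  { insert: 100, query: 4000, update: 50 },
  { insert: 160, query: 4300, update: 20 },
  15
);
console.log(rates); // { insert: 4, query: 20, update: null }
```

Returning null on a reset keeps a restart from showing up as a negative spike on a dashboard; a real scraper might instead drop the sample.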

db.stats — per-database storage

db.stats()

// fields worth tracking:
collections, indexes, dataSize, storageSize, indexSize, totalSize

Track these per database, daily. Sudden growth in indexSize usually means someone added an index, so check that the indexes still fit in cache; sudden growth in storageSize without matching growth in dataSize usually points to fragmentation.
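
The daily comparison can be sketched as a pure function over two db.stats() snapshots. This assumes the byte fields have already been converted to plain numbers; the name storageDrift and the 20% / 5% thresholds are hypothetical values chosen for illustration, not MongoDB defaults.

```javascript
// Compare two db.stats() snapshots (plain numbers, bytes) taken a day apart.
// The 20% and 5% thresholds are illustrative assumptions.
function storageDrift(prev, curr) {
  const growth = (field) =>
    prev[field] > 0 ? (curr[field] - prev[field]) / prev[field] : 0;
  const g = {
    dataSize: growth("dataSize"),
    storageSize: growth("storageSize"),
    indexSize: growth("indexSize"),
  };
  return {
    growth: g,
    // Index footprint jumped: likely a new index was built.
    indexJump: g.indexSize > 0.2,
    // Storage grew while the data itself barely changed: possible fragmentation.
    possibleFragmentation: g.storageSize > 0.2 && g.dataSize < 0.05,
  };
}

const drift = storageDrift(
  { dataSize: 1000, storageSize: 400, indexSize: 100 },
  { dataSize: 1010, storageSize: 520, indexSize: 101 }
);
console.log(drift.possibleFragmentation, drift.indexJump); // true false
```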

currentOp — live activity

The 1 Hz poll for live ops. Filter aggressively or you’ll DoS your own monitoring.

db.currentOp({
  active: true,
  $or: [
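    { secs_running: { $gt: 1 } },               // running longer than 1s
    { waitingForLock: true },                   // currently blocked on a lock
    { "lockStats.Global.acquireWaitCount.r": { $gt: 0 } },   // lockStats is nested by scope
    { "lockStats.Global.acquireWaitCount.w": { $gt: 0 } }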
  ]
})

For each op, scrape:

  • opid, op, ns (namespace = db.collection)
  • secs_running: wall-clock duration so far
  • command: the BSON of the actual operation
  • planSummary: the plan in brief, for example COLLSCAN or IXSCAN { field: 1 }
  • waitingForLock, lockStats: lock waits per scope (Global / Database / Collection)
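
The fields above can be flattened into one record per op before shipping. A minimal sketch, assuming the currentOp document has been converted to a plain object with numeric values; summarizeOp and the output field names (secsRunning, lockWaits) are this sketch's own, not MongoDB's.

```javascript
// Flatten one currentOp document into the fields listed above.
function summarizeOp(op) {
  const lockWaits = {};
  for (const [scope, stats] of Object.entries(op.lockStats ?? {})) {
    const waits = stats.acquireWaitCount ?? {};
    // Sum waits across lock modes (r, w, R, W) for this scope.
    const total = Object.values(waits).reduce((sum, n) => sum + Number(n), 0);
    if (total > 0) lockWaits[scope] = total;
  }
  return {
    opid: op.opid,
    op: op.op,
    ns: op.ns,
    secsRunning: op.secs_running ?? 0,
    planSummary: op.planSummary ?? null,
    waitingForLock: Boolean(op.waitingForLock),
    lockWaits,
  };
}

const sample = {
  opid: 4123, op: "update", ns: "shop.orders", secs_running: 3,
  planSummary: "COLLSCAN", waitingForLock: true,
  lockStats: {
    Global: { acquireCount: { w: 2 } },
    Collection: { acquireCount: { w: 1 }, acquireWaitCount: { w: 1 } },
  },
};
console.log(summarizeOp(sample).lockWaits); // { Collection: 1 }
```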

Profiler — slow query log

Set per-database. The profiler writes to system.profile in that database. Level 1 means “log slow ops only.”

db.setProfilingLevel(1, { slowms: 100, sampleRate: 1.0 })

// then read:
db.system.profile
  .find({ millis: { $gt: 200 } })
  .sort({ ts: -1 }).limit(50)

For each profile entry, the meaningful fields are ts, op, ns, command, planSummary, keysExamined, docsExamined, nreturned, millis, and writeConflicts.

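High docsExamined with low nreturned usually means a missing index. High writeConflicts means concurrent writers are colliding on the same documents and being retried, the closest MongoDB equivalent to deadlock churn on a write-heavy collection.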
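
Both heuristics are easy to apply mechanically to each profile entry. A rough sketch, assuming entries have been read from system.profile and converted to plain objects; flagProfileEntry, the flag names, and the default ratio of 100 are hypothetical choices for illustration.

```javascript
// Flag a system.profile entry using the two heuristics above.
// maxRatio (docs examined per doc returned) is an illustrative threshold.
function flagProfileEntry(entry, maxRatio = 100) {
  const flags = [];
  const examined = entry.docsExamined ?? 0;
  // Avoid dividing by zero when the operation returned nothing.
  const ratio = examined / Math.max(entry.nreturned ?? 0, 1);
  if (ratio > maxRatio) flags.push("possible-missing-index");
  if ((entry.writeConflicts ?? 0) > 0) flags.push("write-contention");
  return { ns: entry.ns, millis: entry.millis, ratio, flags };
}

const flagged = flagProfileEntry({
  ns: "shop.orders", millis: 840,
  docsExamined: 50000, nreturned: 5, writeConflicts: 0,
});
console.log(flagged.ratio, flagged.flags); // 10000 [ 'possible-missing-index' ]
```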

Replica set metrics

rs.status()

// derive:
members[i].health, members[i].state, members[i].uptime
members[i].optimeDate                           // for replication lag
PRIMARY.optimeDate - SECONDARY.optimeDate       // = lag (milliseconds in the shell; divide by 1000 for seconds)
  • Track lag as maxLag across all secondaries — alert on it.
  • Oplog window: db.getReplicationInfo().tFirst to tLast (timeDiff gives the span in seconds). If a secondary falls further behind than the window, the entries it needs have been overwritten and it needs a full resync.
  • Election count from serverStatus.electionMetrics — repeated elections mean instability.
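
The lag and oplog-window checks above can be sketched as two small functions over an rs.status() document. This assumes optimeDate values are JavaScript Date objects and that the oplog window in seconds comes from db.getReplicationInfo().timeDiff; the names maxLagSeconds and withinOplogWindow are illustrative.

```javascript
// Compute the worst secondary lag (seconds) from an rs.status() document.
function maxLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return null; // no primary: lag is undefined mid-election
  const lags = status.members
    .filter((m) => m.stateStr === "SECONDARY")
    // Subtracting two Dates yields milliseconds.
    .map((m) => (primary.optimeDate - m.optimeDate) / 1000);
  return lags.length ? Math.max(...lags) : 0;
}

// True when the slowest secondary is still covered by the oplog.
function withinOplogWindow(lagSec, oplogWindowSec) {
  return lagSec < oplogWindowSec;
}

const status = { members: [
  { stateStr: "PRIMARY", optimeDate: new Date("2026-01-01T00:00:30Z") },
  { stateStr: "SECONDARY", optimeDate: new Date("2026-01-01T00:00:28Z") },
  { stateStr: "SECONDARY", optimeDate: new Date("2026-01-01T00:00:18Z") },
]};
const lag = maxLagSeconds(status);
console.log(lag, withinOplogWindow(lag, 3600)); // 12 true
```

Alerting well before lag approaches the window (for example at a fraction of it) leaves time to act before a resync becomes unavoidable.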

Sharded cluster metrics

From mongos:

sh.status()

// extract per shard:
chunks count
moveChunk activity (from changelog)
balancer state
config.collections (sharded collections, shard keys)

The most common sharded-cluster pathology is a hot shard — one shard handling most of the writes. Look at per-shard opcounters and chunk counts.
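
Hot-shard detection reduces to comparing each shard's share of total writes. A minimal sketch, assuming per-shard write rates have already been derived from each shard's opcounters; findHotShards and the 50% default threshold are assumptions for illustration.

```javascript
// Given per-shard write rates (ops/sec), return the shards whose share of
// total writes exceeds the threshold. The 0.5 default is an assumption.
function findHotShards(writesPerShard, threshold = 0.5) {
  const total = Object.values(writesPerShard).reduce((a, b) => a + b, 0);
  if (total === 0) return [];
  return Object.entries(writesPerShard)
    .map(([shard, writes]) => ({ shard, share: writes / total }))
    .filter((s) => s.share > threshold);
}

console.log(findHotShards({ shard0: 900, shard1: 60, shard2: 40 }));
// [ { shard: 'shard0', share: 0.9 } ]
```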

Aggregation pipeline performance

$group, $lookup, and $unwind stages are where aggregations become expensive. Use $indexStats and explain():

db.orders.explain("executionStats").aggregate([
  { $match: { customerId: ObjectId(...) } },
  { $group: { _id: "$status", n: { $sum: 1 } } }
])

// look for:
executionStats.totalDocsExamined / nReturned     // ratio < 10 ideal
stages[i].executionTimeMillisEstimate            // hot stages
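
Finding the hot stage can be automated by ranking the stages array of the explain output. A rough sketch, assuming the explain result has a top-level stages array of plain objects; some pipelines produce a different shape, in which case this returns an empty list. The name rankStages is hypothetical.

```javascript
// Rank the stages of an aggregate explain("executionStats") result by
// their executionTimeMillisEstimate, largest first.
function rankStages(explain) {
  return (explain.stages ?? [])
    .map((stage) => ({
      // The stage name is the key beginning with "$", e.g. "$group".
      name: Object.keys(stage).find((k) => k.startsWith("$")) ?? "unknown",
      millis: Number(stage.executionTimeMillisEstimate ?? 0),
      nReturned: Number(stage.nReturned ?? 0),
    }))
    .sort((a, b) => b.millis - a.millis);
}

const ranked = rankStages({ stages: [
  { $cursor: {}, nReturned: 1200, executionTimeMillisEstimate: 35 },
  { $group: {}, nReturned: 4, executionTimeMillisEstimate: 48 },
]});
console.log(ranked[0].name); // $group
```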

FAQ

Profiler vs scraping currentOp: which is the source of truth?
Profiler is persistent, currentOp is the live tip. Use both: profiler for postmortems and slow-query analysis, currentOp at 1 Hz for real-time activity dashboards.
Profiler overhead?
Level 1 (slow ops only) with a sane slowms threshold (100–500ms) adds <2% on most workloads. Level 2 (all ops) is for diagnosis only.
MongoDB Atlas: does monitoring change?
Atlas exposes a managed Performance Advisor with a UI on top of the same surfaces. The profile collection is read-only on free tiers.
How does Obsfly MongoDB integration work?
Same Go agent. Connects via the standard MongoDB driver, scrapes the four surfaces at configurable intervals, ships envelopes to the receiver. Read-only credentials are enough.
