MongoDB

Production MongoDB performance monitoring: a 2026 guide

The four data surfaces (serverStatus, db.stats, currentOp, profiler), sensible defaults, and how to think about replica lag, the oplog window, and aggregation pipeline cost.

Published 2026-04-08 · 14 min read

MongoDB monitoring is split across four surfaces — serverStatus(), db.stats(), currentOp(), and the profiler. Each tells a different story; none alone is enough. This is what we scrape from each in production, and how to reason about replica lag, oplog window, and aggregation pipeline cost.

On this page

The four surfaces
serverStatus — host metrics
db.stats — per-database storage
currentOp — live activity
Profiler — slow query log
Replica set metrics
Sharded cluster metrics
Aggregation pipeline performance
FAQ

The four surfaces

Surface	Cardinality	Cost	Use it for
serverStatus()	1 doc / call	Cheap	Host-level rollups: connections, opcounters, WiredTiger cache, network
db.stats()	1 doc / database	Cheap	Storage size, index size, collection count
currentOp()	N docs / call	Medium	Live in-flight ops, lock waits, slow op detection
Profiler	Continuous	Medium-high	Persisted slow-op log per database (system.profile)

serverStatus — host metrics

Run from admin. The whole document is huge; the fields worth scraping every 15s are bounded.

db.adminCommand({ serverStatus: 1 })

// extract:
opcounters.{insert,query,update,delete,getmore,command}
opcountersRepl.*                                // for replicas
connections.{current,available,totalCreated}
network.{bytesIn,bytesOut,numRequests}
wiredTiger.cache.{
  "bytes currently in the cache",
  "tracked dirty bytes in the cache",
  "pages evicted by application threads",
  "unmodified pages evicted",
  "modified pages evicted"
}
locks.Global.acquireCount.{r,w,R,W}
asserts.{regular,warning,msg,user,rollovers}
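metrics.queryExecutor.scanned                   // index keys examined
metrics.queryExecutor.scannedObjects            // documents examined
metrics.queryExecutor.collectionScans.total     // collection scans (MongoDB 4.4+)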
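The opcounters values are cumulative since mongod started, so a scraper usually keeps the previous sample and emits per-second rates. A minimal sketch of that step, assuming two serverStatus documents taken intervalSecs apart; opcounterRates is a hypothetical helper name and the sample numbers are invented for illustration.

```javascript
// Hypothetical helper: turn two cumulative opcounters samples into per-second rates.
// prev and curr are serverStatus documents; intervalSecs is the time between them.
function opcounterRates(prev, curr, intervalSecs) {
  const rates = {};
  for (const op of Object.keys(curr.opcounters)) {
    // Number() is used in case the shell hands back 64-bit wrapper values.
    const before = Number(prev.opcounters[op] ?? 0);
    const after = Number(curr.opcounters[op]);
    // A counter that went backwards means mongod restarted; count from zero.
    const delta = after >= before ? after - before : after;
    rates[op] = delta / intervalSecs;
  }
  return rates;
}

// Hand-written sample documents (illustrative numbers, not real output).
const prevSample = { opcounters: { insert: 1000, query: 5000, update: 200 } };
const currSample = { opcounters: { insert: 1150, query: 5600, update: 200 } };
const rates = opcounterRates(prevSample, currSample, 15);
console.log(rates);
```

In mongosh, the two inputs would be successive results of db.adminCommand({ serverStatus: 1 }), 15 seconds apart.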

db.stats — per-database storage

db.stats()

// fields worth tracking:
collections, indexes, dataSize, storageSize, indexSize, totalSize

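Track these per database, daily. Sudden growth in indexSize usually means someone added an index; compare the new total against the WiredTiger cache size, because indexes that no longer fit in cache slow reads down. Sudden growth in storageSize without matching growth in dataSize usually means fragmentation.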
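The daily comparison can be sketched as a small function over two db.stats() snapshots. This is an illustration only: storageFlags is a hypothetical helper, and the 20% growth threshold and 1% "flat" tolerance are assumed defaults, not MongoDB settings.

```javascript
// Hypothetical helper: compare two daily db.stats() snapshots for one database.
// growthThreshold (0.2 = 20%) is an assumed default; tune it per deployment.
function storageFlags(yesterday, today, growthThreshold = 0.2) {
  const growth = (field) =>
    yesterday[field] > 0 ? (today[field] - yesterday[field]) / yesterday[field] : 0;
  const flags = [];
  if (growth("indexSize") > growthThreshold) {
    flags.push("indexSize jumped: look for a newly built index");
  }
  // storageSize growing while dataSize stays flat suggests fragmentation.
  if (growth("storageSize") > growthThreshold && growth("dataSize") <= 0.01) {
    flags.push("storageSize grew without dataSize: possible fragmentation");
  }
  return flags;
}

// Invented sample snapshots, in bytes.
const day1 = { dataSize: 1000, storageSize: 1200, indexSize: 300 };
const day2 = { dataSize: 1005, storageSize: 1600, indexSize: 310 };
const storageReport = storageFlags(day1, day2);
console.log(storageReport);
```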

currentOp — live activity

The 1 Hz poll for live ops. Filter aggressively or you’ll DoS your own monitoring.

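db.currentOp({
  active: true,
  $or: [
    { secs_running: { $gt: 1 } },               // > 1s
    { waitingForLock: true },                   // blocked on a lock right now
    // lockStats is keyed by scope first (Global / Database / Collection)
    { "lockStats.Global.acquireWaitCount.r": { $gt: 0 } },
    { "lockStats.Global.acquireWaitCount.w": { $gt: 0 } }
  ]
})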

For each op, scrape:

opid, op, ns (namespace = db.collection)
secs_running: wall-clock duration so far
command: the BSON of the actual operation
planSummary: the winning plan in brief, for example COLLSCAN or IXSCAN with the key pattern
waitingForLock, lockStats: lock waits per scope (Global / Database / Collection)
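A sketch of how those fields might be flattened before shipping, assuming the usual currentOp result shape with an inprog array. summarizeOps is a hypothetical helper, and the sample operation is hand-written rather than captured output.

```javascript
// Hypothetical helper: reduce a db.currentOp() result to flat records.
function summarizeOps(currentOpResult) {
  return (currentOpResult.inprog ?? []).map((op) => ({
    opid: op.opid,
    op: op.op,
    ns: op.ns,
    secsRunning: op.secs_running ?? 0,
    planSummary: op.planSummary ?? null,
    waitingForLock: Boolean(op.waitingForLock),
    // Keep only the command name: the full BSON can be large and may hold user data.
    commandName: op.command ? Object.keys(op.command)[0] : null,
  }));
}

// Hand-written sample result (illustrative values only).
const sampleResult = {
  inprog: [
    {
      opid: 4211, op: "query", ns: "shop.orders", secs_running: 3,
      planSummary: "COLLSCAN", waitingForLock: false,
      command: { find: "orders", filter: { status: "open" } },
    },
  ],
};
const opRecords = summarizeOps(sampleResult);
console.log(opRecords);
```

Dropping the command body is a design choice for this sketch; keep it if the receiver needs the full query shape and the data is safe to export.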

Profiler — slow query log

Set per-database. The profiler writes to system.profile in that database. Level 1 means “log slow ops only.”

db.setProfilingLevel(1, { slowms: 100, sampleRate: 1.0 })

// then read:
db.system.profile
  .find({ millis: { $gt: 200 } })
  .sort({ ts: -1 }).limit(50)

For each profile entry, the meaningful fields are ts, op, ns, command, planSummary, keysExamined, docsExamined, nreturned, millis, and writeConflicts.

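High docsExamined with low nreturned usually means a missing or unselective index. High writeConflicts means many operations are contending to update the same documents; the storage engine retries the losers, so this is the closest thing MongoDB has to deadlock pressure on a write-heavy collection.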
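Those two heuristics can be sketched as a labeling function over system.profile entries. classifyProfileEntry is a hypothetical helper, and both thresholds are assumptions to tune per workload.

```javascript
// Hypothetical helper: label a system.profile entry using the two heuristics.
// maxRatio and maxConflicts are assumed defaults, not MongoDB settings.
function classifyProfileEntry(entry, maxRatio = 100, maxConflicts = 10) {
  const examined = entry.docsExamined ?? 0;
  const returned = Math.max(entry.nreturned ?? 0, 1); // avoid dividing by zero
  const labels = [];
  if (examined / returned > maxRatio) labels.push("likely-missing-index");
  if ((entry.writeConflicts ?? 0) > maxConflicts) labels.push("write-contention");
  return labels;
}

// Invented sample entries.
const scanHeavy = { op: "query", docsExamined: 50000, nreturned: 12, millis: 840, writeConflicts: 0 };
const contended = { op: "update", docsExamined: 1, nreturned: 0, millis: 230, writeConflicts: 37 };
const scanLabels = classifyProfileEntry(scanHeavy);
const contendedLabels = classifyProfileEntry(contended);
console.log(scanLabels, contendedLabels);
```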

Replica set metrics

rs.status()

// derive:
members[i].health, members[i].state, members[i].uptime
members[i].optimeDate                           // for replication lag
PRIMARY.optimeDate - SECONDARY.optimeDate       // = lag; Date subtraction gives ms, divide by 1000 for seconds

Track lag as maxLag across all secondaries, and alert on it.
Oplog window: db.getReplicationInfo().tFirst to tLast. If a secondary falls further behind than the window, it can no longer catch up from the oplog and needs a full resync.
Election count from serverStatus.electionMetrics: repeated elections mean instability.
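A minimal sketch of the maxLag derivation, assuming the members[i].stateStr and members[i].optimeDate fields of rs.status(). maxReplicationLagSecs and lagThreatensWindow are hypothetical helpers, the 50% safety factor is an assumption, and the sample timestamps are invented.

```javascript
// Hypothetical helper: worst-case lag in seconds across all secondaries.
function maxReplicationLagSecs(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return null; // no primary: lag is undefined, alert on that separately
  const lags = status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => (primary.optimeDate - m.optimeDate) / 1000); // Date diff is in ms
  return lags.length ? Math.max(...lags) : 0;
}

// Hypothetical helper: true when lag has used up more than safetyFactor of the window.
// windowSecs would come from db.getReplicationInfo().timeDiff (seconds).
function lagThreatensWindow(lagSecs, windowSecs, safetyFactor = 0.5) {
  return lagSecs > windowSecs * safetyFactor;
}

// Hand-written sample shaped like rs.status() output.
const sampleStatus = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY", optimeDate: new Date("2026-04-08T10:00:30Z") },
    { name: "db2:27017", stateStr: "SECONDARY", optimeDate: new Date("2026-04-08T10:00:28Z") },
    { name: "db3:27017", stateStr: "SECONDARY", optimeDate: new Date("2026-04-08T10:00:05Z") },
  ],
};
const maxLag = maxReplicationLagSecs(sampleStatus);
console.log(maxLag, lagThreatensWindow(maxLag, 3600));
```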

Sharded cluster metrics

From mongos:

sh.status()

// extract per shard:
chunks count
moveChunk activity (from changelog)
balancer state
config.collections (sharded collections, shard keys)

The most common sharded-cluster pathology is a hot shard — one shard handling most of the writes. Look at per-shard opcounters and chunk counts.
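One way to put a number on a hot shard is the share of total writes each shard takes over the same interval. The sketch below assumes you already collect per-shard opcounters deltas keyed by shard name; hottestShard and the shard names are hypothetical.

```javascript
// Hypothetical helper: find the shard taking the largest share of writes.
// perShardOpcounters maps shard name -> opcounters deltas over one interval.
function hottestShard(perShardOpcounters) {
  const writes = Object.entries(perShardOpcounters).map(([shard, c]) => ({
    shard,
    writes: c.insert + c.update + c.delete,
  }));
  const total = writes.reduce((sum, s) => sum + s.writes, 0);
  if (total === 0) return null; // no writes in this interval
  const top = writes.reduce((a, b) => (b.writes > a.writes ? b : a));
  return { shard: top.shard, share: top.writes / total };
}

// Invented per-shard deltas.
const shardSample = {
  shard01: { insert: 9000, update: 500, delete: 0 },
  shard02: { insert: 300, update: 150, delete: 50 },
};
const hot = hottestShard(shardSample);
console.log(hot);
```

With evenly distributed writes the share should sit near 1 divided by the number of shards; how far above that counts as "hot" is a judgment call for your cluster.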

Aggregation pipeline performance

$group, $lookup, and $unwind stages are where aggregations become expensive. Use $indexStats and explain():

db.orders.explain("executionStats").aggregate([
  { $match: { customerId: ObjectId(...) } },
  { $group: { _id: "$status", n: { $sum: 1 } } }
])

// look for:
executionStats.totalDocsExamined / nReturned     // ratio < 10 ideal
stages[i].executionTimeMillisEstimate            // hot stages
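A rough sketch of ranking stages from an explain("executionStats") result. It assumes the result carries a top-level stages array whose entries hold the operator as a $-prefixed key; some pipelines and sharded deployments report a different shape, in which case this returns an empty list. rankStages is a hypothetical helper and the sample document is hand-written.

```javascript
// Hypothetical helper: rank pipeline stages by their reported time estimate.
function rankStages(explainResult) {
  return (explainResult.stages ?? [])
    .map((stage) => ({
      // Each stage document carries its operator (for example "$group") as a key.
      name: Object.keys(stage).find((k) => k.startsWith("$")) ?? "unknown",
      millis: Number(stage.executionTimeMillisEstimate ?? 0),
      nReturned: Number(stage.nReturned ?? 0),
    }))
    .sort((a, b) => b.millis - a.millis);
}

// Hand-written sample shaped like aggregate explain output (values invented).
const sampleExplain = {
  stages: [
    { $cursor: {}, nReturned: 1200, executionTimeMillisEstimate: 40 },
    { $group: { _id: "$status", n: { $sum: 1 } }, nReturned: 4, executionTimeMillisEstimate: 310 },
  ],
};
const ranked = rankStages(sampleExplain);
console.log(ranked);
```

Treat the estimates as a rough guide to where time goes rather than an exact per-stage breakdown.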

FAQ

Profiler vs scraping currentOp: which is the source of truth?

Profiler is persistent, currentOp is the live tip. Use both: profiler for postmortems and slow-query analysis, currentOp at 1 Hz for real-time activity dashboards.

Profiler overhead?

Level 1 (slow ops only) with a sane slowms threshold (100–500ms) adds <2% on most workloads. Level 2 (all ops) is for diagnosis only.

MongoDB Atlas: does monitoring change?

Atlas exposes a managed Performance Advisor with a UI on top of the same surfaces. The profile collection is read-only on free tiers.

How does Obsfly MongoDB integration work?

Same Go agent. Connects via the standard MongoDB driver, scrapes the four surfaces at configurable intervals, ships envelopes to the receiver. Read-only credentials are enough.
