
Monitoring & Profiling

FILE  38_monitoring_profiling
TOPIC  serverStatus · currentOp · Profiler · system.profile · mongostat · Slow Queries
LEVEL  Intermediate/Advanced
01
serverStatus
Real-time server health metrics — the primary diagnostic command
health
// Full server status (large output — filter to what you need)
db.serverStatus()

// Key sections to watch:
db.serverStatus().connections
// { current: 47, available: 51153, totalCreated: 2841 }
// Watch: current near available → connection pool exhaustion

db.serverStatus().opcounters
// { insert: 1243, query: 89123, update: 3421, delete: 104, getmore: 521 }
// Watch: high getmore relative to query → large result sets fetched in many batches (not paginated)

db.serverStatus().mem
// { resident: 4096, virtual: 8192, mapped: 0 } (in MB)
// resident = RAM in use; should be < available RAM with headroom

db.serverStatus().wiredTiger.cache
// "bytes currently in cache": 3,221,225,472 (3GB)
// "maximum bytes configured": 4,294,967,296 (4GB WiredTiger cache max)
// Compare "pages read into cache" with eviction stats ("modified pages evicted",
// "unmodified pages evicted") → high eviction rate = cache pressure

db.serverStatus().wiredTiger["transaction"]
// "transaction begins": total starts, "transaction rollbacks": watch for high rollbacks

db.serverStatus().locks
// Lock wait times by database — high waits indicate contention

db.serverStatus().globalLock
// currentQueue: { total, readers, writers } → HOW MANY OPS ARE WAITING FOR LOCKS

db.serverStatus().network
// bytesIn, bytesOut — total network throughput since last restart

// Watch specific metrics over time (poor man's monitoring):
while (true) {
  const s = db.serverStatus()
  print(new Date().toISOString(),
    "conns:", s.connections.current,
    "queue:", s.globalLock.currentQueue.total,
    "cache%:", Math.round(s.wiredTiger.cache["bytes currently in cache"] /
               s.wiredTiger.cache["maximum bytes configured"] * 100)
  )
  sleep(5000)   // mongosh built-in: pause 5s between samples
}

Key Metrics and Thresholds

Metric              | Where                              | Concern threshold
Active connections  | connections.current                | >80% of connections.available
Lock queue          | globalLock.currentQueue.total      | >0 consistently; investigate the blocking op
WiredTiger cache    | wiredTiger.cache                   | >95% full → eviction pressure → slowdown
Page faults         | extra_info.page_faults             | rising → working set exceeds RAM
Replication lag     | rs.printSecondaryReplicationInfo() | >10s under normal load
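The thresholds above can be scripted. A minimal sketch that evaluates a serverStatus-shaped document; the field paths match the real command output, but the sample values below are fabricated for illustration:

```javascript
// Return a list of threshold warnings for a serverStatus document.
function checkThresholds(s) {
  const warnings = []
  const conns = s.connections
  // "available" is remaining headroom, so capacity = current + available
  if (conns.current > 0.8 * (conns.current + conns.available))
    warnings.push("connections above 80% of capacity")
  if (s.globalLock.currentQueue.total > 0)
    warnings.push("ops queued on locks: " + s.globalLock.currentQueue.total)
  const c = s.wiredTiger.cache
  if (c["bytes currently in cache"] > 0.95 * c["maximum bytes configured"])
    warnings.push("WiredTiger cache above 95%: eviction pressure")
  return warnings
}

// Fabricated sample resembling db.serverStatus() output:
const sample = {
  connections: { current: 47, available: 51153 },
  globalLock: { currentQueue: { total: 3, readers: 1, writers: 2 } },
  wiredTiger: { cache: {
    "bytes currently in cache": 4.2e9,
    "maximum bytes configured": 4294967296
  } }
}
console.log(checkThresholds(sample))
// → [ 'ops queued on locks: 3', 'WiredTiger cache above 95%: eviction pressure' ]
```

In mongosh, `checkThresholds(db.serverStatus())` could run inside the polling loop shown earlier.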
02
db.stats & Collection Stats
Storage and size metrics per database and collection
stats
// Database-level storage stats
db.stats()
// dataSize: uncompressed data bytes
// storageSize: actual bytes on disk (WiredTiger compressed — usually smaller)
// indexSize: bytes used by all indexes
// collections: number of collections
// objects: total document count

db.stats(1024 * 1024)   // output in MB (pass scale factor)

// Collection-level stats
db.orders.stats()
// count: document count
// avgObjSize: average document size in bytes — large = projection needed
// totalIndexSize: all index bytes for this collection
// indexSizes: { "index_name": bytes } — per-index breakdown
// storageSize: compressed on-disk bytes

db.orders.stats({ indexDetails: true })  // include WiredTiger cache stats per index

// Find largest databases (by uncompressed data size)
db.adminCommand({ listDatabases: 1 }).databases.forEach(d => {
  const s = db.getSiblingDB(d.name).stats()
  print(d.name, (s.dataSize / 1024 / 1024).toFixed(1) + "MB",
        "idx:", (s.indexSize / 1024 / 1024).toFixed(1) + "MB")
})

// Find largest documents in a specific collection
db.orders.aggregate([
  { $project: { bsonSize: { $bsonSize: "$$ROOT" }, _id: 1, customerId: 1 } },
  { $sort: { bsonSize: -1 } },
  { $limit: 5 }
])

// Working set estimation: how much RAM is needed to hold hot data in cache
// Check: indexSize + (N * P99 working doc size) for hot collections
// Rule of thumb: WiredTiger cache should hold all indexes + hot working set
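The rule of thumb above is simple arithmetic. A sketch with purely illustrative numbers (none taken from a real deployment; in practice, pull indexSize from coll.stats() and estimate the hot-document count from access patterns):

```javascript
// Working set estimate: all index bytes plus the hot documents,
// sized at the P99 document size to stay conservative.
function workingSetBytes(indexSizeBytes, hotDocCount, p99DocSizeBytes) {
  return indexSizeBytes + hotDocCount * p99DocSizeBytes
}

const indexSize = 1.5 * 1024 ** 3   // 1.5 GB of indexes (illustrative)
const hotDocs   = 2_000_000         // docs touched in a typical day (illustrative)
const p99Size   = 4 * 1024          // 4 KB P99 document size (illustrative)
const cacheMax  = 4 * 1024 ** 3     // 4 GB WiredTiger cache

const ws = workingSetBytes(indexSize, hotDocs, p99Size)
console.log((ws / 1024 ** 3).toFixed(2) + " GB needed,",
            ws <= cacheMax ? "fits in cache" : "exceeds cache, expect evictions")
// → 9.13 GB needed, exceeds cache, expect evictions
```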
03
currentOp & killOp
Find and terminate long-running operations
ops
// Show all active operations (requires clusterMonitor or higher)
db.currentOp()
db.currentOp({ active: true })           // only active (not idle connections)
db.currentOp({ secs_running: { $gt: 5 } })  // ops running > 5 seconds

// currentOp key fields per operation:
// opid:        operation ID (use to kill)
// op:          "query" | "update" | "insert" | "remove" | "command"
// ns:          "db.collection"
// secs_running: seconds elapsed
// waitingForLock: true/false — is it blocked?
// locks:       which locks held and at what granularity
// planSummary: "COLLSCAN" | "IXSCAN { field: 1 }" — quick index check
// command:     the actual command document (query, sort, etc.)

// Find operations causing lock contention
db.currentOp({ waitingForLock: true })

// Find long COLLSCAN queries
db.currentOp({
  "planSummary": /COLLSCAN/,
  "secs_running": { $gt: 10 }
})

// Kill a stuck operation (use opid from currentOp)
db.killOp(12345)
// Returns: { info: "attempting to kill op" }
// Note: killOp is async — the op may not stop immediately

// $currentOp aggregation (more powerful, more field access)
db.adminCommand({ aggregate: 1,
  pipeline: [
    { $currentOp: { allUsers: true, idleSessions: false } },
    { $match: { secs_running: { $gt: 30 } } },
    { $project: { opid: 1, op: 1, ns: 1, secs_running: 1, planSummary: 1, client: 1 } }
  ],
  cursor: {}
})
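The $currentOp output is worth post-processing before killing anything. A sketch of a selection helper over the returned docs; the sample ops below are fabricated, and in mongosh you would feed it the real pipeline results and call db.killOp() on each returned opid:

```javascript
// Pick the opids of long-running collection scans from a list of
// currentOp-shaped documents.
function collscansToKill(ops, minSecs) {
  return ops
    .filter(o => o.secs_running > minSecs && /COLLSCAN/.test(o.planSummary || ""))
    .map(o => o.opid)
}

// Fabricated sample resembling $currentOp output:
const ops = [
  { opid: 101, ns: "shopDb.orders", secs_running: 45, planSummary: "COLLSCAN" },
  { opid: 102, ns: "shopDb.orders", secs_running: 90, planSummary: "IXSCAN { customerId: 1 }" },
  { opid: 103, ns: "shopDb.users",  secs_running: 2,  planSummary: "COLLSCAN" }
]
console.log(collscansToKill(ops, 30))   // → [ 101 ]
```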
WARN
Killing an operation mid-write may leave partial data. A multi-document update outside a transaction is not atomic: documents already modified before the kill stay modified, and MongoDB does not roll them back. Operations inside a multi-document transaction are rolled back on kill. A single-document update is atomic: the document is either fully updated or unchanged. Always investigate why an operation is slow before killing it.
04
Database Profiler
Log all or slow operations to system.profile
profiler

The database profiler captures query execution details to the system.profile capped collection. It has three levels: off, slow-only, all.

Level    | Command                                  | Captures            | Overhead
0 (off)  | db.setProfilingLevel(0)                  | Nothing             | None
1 (slow) | db.setProfilingLevel(1, { slowms: 100 }) | Operations > slowms | Low (production-safe)
2 (all)  | db.setProfilingLevel(2)                  | Every operation     | High; avoid in production
// Enable profiling at level 1 — log ops slower than 200ms
db.setProfilingLevel(1, { slowms: 200 })

// Check current profiling level
db.getProfilingLevel()   // returns: 1
db.getProfilingStatus()  // { was: 1, slowms: 200 }

// Configure in mongod.conf (persistent — survives restart):
// operationProfiling:
//   mode: slowOp             # off | slowOp | all
//   slowOpThresholdMs: 100   # log ops over 100ms

// profiling filter — only profile operations on specific collection (4.4.2+)
db.setProfilingLevel(1, {
  filter: {
    ns: "shopDb.orders",
    millis: { $gt: 50 }     // only orders operations > 50ms
  }
})
// When filter is set, slowms and sampleRate are ignored for profiling

// system.profile is a capped collection (size configurable):
// Default size: 1MB (very small — rotate quickly under load)
// To change size: disable profiling, drop, recreate with new size:
db.setProfilingLevel(0)
db.system.profile.drop()
db.createCollection("system.profile", { capped: true, size: 52428800 })  // 50MB
db.setProfilingLevel(1, { slowms: 100 })
05
Querying system.profile
Extract slow query patterns from the profiler log
analysis
// system.profile key fields:
// op:         "query" | "update" | "insert" | "remove" | "command"
// ns:         "db.collection"
// millis:     execution time in milliseconds
// planSummary: "COLLSCAN" | "IXSCAN { field: 1 }"
// docsExamined: documents read from storage
// keysExamined: index keys scanned
// nreturned:  documents returned
// ts:         timestamp of operation
// command:    the actual query/command document

// Find slowest queries sorted by execution time
db.system.profile.find().sort({ millis: -1 }).limit(10)

// Find all COLLSCAN queries (missing index!)
db.system.profile.find({ planSummary: "COLLSCAN" })
  .sort({ millis: -1 })
  .limit(20)

// Find queries with high scan ratio (docsExamined >> nreturned)
db.system.profile.aggregate([
  {
    $match: {
      docsExamined: { $gt: 0 },
      nreturned: { $gt: 0 }
    }
  },
  {
    $project: {
      ns: 1,
      millis: 1,
      docsExamined: 1,
      nreturned: 1,
      ratio: { $divide: ["$docsExamined", "$nreturned"] },
      command: 1
    }
  },
  { $match: { ratio: { $gt: 100 } } },  // 100+ docs scanned per returned doc
  { $sort: { ratio: -1 } },
  { $limit: 10 }
])

// Top slow collections (aggregate by namespace)
db.system.profile.aggregate([
  { $group: {
    _id: "$ns",
    count:   { $sum: 1 },
    avgMs:   { $avg: "$millis" },
    maxMs:   { $max: "$millis" },
    totalMs: { $sum: "$millis" }
  }},
  { $sort: { totalMs: -1 } }
])

// Find all UPDATE operations missing an index
db.system.profile.find(
  { op: "update", planSummary: "COLLSCAN", millis: { $gt: 50 } },
  { ns: 1, millis: 1, "command.q": 1 }
)
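Those command.q / command.filter predicates map directly to index candidates. A sketch that tallies candidate key patterns from profiler-shaped entries (the sample entries are fabricated; real ones come from the COLLSCAN queries above):

```javascript
// Count how often each set of filter fields appears in COLLSCAN
// profiler entries; frequent patterns are the best index candidates.
function indexCandidates(profileDocs) {
  const counts = {}
  for (const doc of profileDocs) {
    // updates carry the predicate in command.q, finds in command.filter
    const q = (doc.command && (doc.command.q || doc.command.filter)) || {}
    const pattern = Object.keys(q).sort().join(",")
    if (pattern) counts[pattern] = (counts[pattern] || 0) + 1
  }
  return counts
}

// Fabricated sample resembling system.profile documents:
const entries = [
  { ns: "shopDb.orders", planSummary: "COLLSCAN",
    command: { q: { status: "pending", customerId: "C001" } } },
  { ns: "shopDb.orders", planSummary: "COLLSCAN",
    command: { filter: { customerId: "C002", status: "paid" } } }
]
console.log(indexCandidates(entries))   // → { 'customerId,status': 2 }
```

Field order within a compound index still matters (equality before range); the tally only shows which field sets recur.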
06
mongostat & mongotop
Command-line real-time monitoring tools
CLI tools

mongostat prints per-second operation counts and server metrics. mongotop shows time spent per collection. Both are non-intrusive and safe in production.

// mongostat — print stats every second
mongostat --uri "mongodb://user:pass@host:27017/?authSource=admin"

// Output columns:
// insert/query/update/delete/getmore/command: ops per second
// dirty: WiredTiger dirty cache %
// used:  WiredTiger cache in use %
// flushes: WiredTiger checkpoint flushes per interval
// vsize/res: virtual/resident memory (MB)
// qrw: read/write queue depth (ops waiting for lock)
// arw: active read/write ops
// conn: connections
// net_in/net_out: network bytes
// time: timestamp

// mongostat every 5 seconds, stop after 60 samples:
mongostat --uri "..." 5 60

// mongostat for replica set — show all members:
mongostat --uri "mongodb://host1:27017,host2:27017/?replicaSet=myRS" --discover

// mongotop — time spent per collection (seconds interval)
mongotop --uri "mongodb://user:pass@host:27017/?authSource=admin" 2

// Output per collection:
// ns:       db.collection
// total:    total ms spent in last interval
// read:     ms on reads
// write:    ms on writes
// Collections with highest totals = hot collections → indexing priority

Key mongostat Warning Signs

Metric          | Warning sign             | Likely cause
qrw (queue)     | Consistently > 0         | Lock contention; slow queries blocking others
dirty (cache %) | > 20%                    | Write throughput exceeds checkpoint rate; reduce write load
used (cache %)  | > 95%                    | Working set larger than cache; eviction pressure; add RAM or shard
conn            | Growing without plateau  | Connection leak in application; check connection pool config
getmore         | High count               | Clients not exhausting cursors; inefficient pagination
07
Slow Query Analysis
End-to-end workflow: detect → diagnose → fix
workflow
// Step 1: Find slow queries from profiler or logs
// MongoDB logs slow queries (>100ms default) to mongod.log:
// { "msg": "Slow query", "attr": { "ns": "shopDb.orders", "durationMillis": 4231,
//   "planSummary": "COLLSCAN", "command": { "find": "orders", ... } } }

// Step 2: Reproduce with explain()
db.orders.find({ status: "completed", customerId: "C001" })
  .explain("executionStats")

// Step 3: Check the plan — identify bottleneck stage
// COLLSCAN  → no index; create one
// IXSCAN but high docsExamined/nReturned ratio → index not selective enough
// SORT stage  → in-memory sort; add sort field to index
// FETCH stage on many docs → projection doesn't cover; consider covered index

// Step 4: Create the index
db.orders.createIndex({ customerId: 1, status: 1 })
// Re-run explain() to verify IXSCAN and improved stats

// Step 5: Check for unused indexes (overhead without benefit)
db.orders.aggregate([{ $indexStats: {} }])
// Look for: accesses.ops: 0 — never used since last restart
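Step 5 can be scripted over the $indexStats output. A sketch with fabricated stats docs; note that in mongosh accesses.ops is a Long, hence the Number() coercion:

```javascript
// Return the names of indexes never used since the last restart,
// skipping the mandatory _id_ index.
function unusedIndexes(indexStats) {
  return indexStats
    .filter(s => s.name !== "_id_" && Number(s.accesses.ops) === 0)
    .map(s => s.name)
}

// Fabricated sample resembling $indexStats output:
const stats = [
  { name: "_id_",                  accesses: { ops: 9914 } },
  { name: "customerId_1_status_1", accesses: { ops: 4312 } },
  { name: "legacyField_1",         accesses: { ops: 0 } }
]
console.log(unusedIndexes(stats))   // → [ 'legacyField_1' ]
```

In mongosh: `unusedIndexes(db.orders.aggregate([{ $indexStats: {} }]).toArray())`. A zero count may just mean a recent restart, so check uptime before dropping anything.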

// Step 6: Monitor with Atlas Performance Advisor (Atlas only)
// Automatically suggests indexes based on recent slow queries
// Access: Atlas UI → Performance Advisor tab

// Ongoing: set alerting thresholds in Atlas / OpsManager:
// Alert if: avg query execution > 500ms over 5 min window
// Alert if: cache utilization > 90%
// Alert if: replication lag > 30s
// Alert if: connections > 80% of maxConnections
TIP
The fastest way to identify missing indexes in production is to grep the MongoDB log for "planSummary":"COLLSCAN" filtered by high durationMillis. The command field shows the full query with the filter predicates — these are your index candidates. Combine with $indexStats to verify the candidate doesn't already exist unused.