
Monitoring & Profiling

FILE  38_monitoring_profiling
TOPIC  serverStatus · currentOp · Profiler · system.profile · mongostat · Slow Queries
LEVEL  Intermediate/Advanced
01
serverStatus
Real-time server health metrics — the primary diagnostic command
health
// Full server status (large output — filter to what you need)
db.serverStatus()

// Key sections to watch:
db.serverStatus().connections
// { current: 47, available: 51153, totalCreated: 2841 }
// Watch: current near available → connection pool exhaustion

db.serverStatus().opcounters
// { insert: 1243, query: 89123, update: 3421, delete: 104, getmore: 521 }
// Watch: high getmore relative to query → large result sets fetched in many batches (not paginated)

db.serverStatus().mem
// { resident: 4096, virtual: 8192, mapped: 0 } (in MB)
// resident = RAM in use; should be < available RAM with headroom

db.serverStatus().wiredTiger.cache
// "bytes currently in cache": 3,221,225,472 (3GB)
// "maximum bytes configured": 4,294,967,296 (4GB WiredTiger cache max)
// Compare "pages read into cache" with eviction stats ("modified pages evicted",
// "unmodified pages evicted") → high eviction rate = cache pressure

db.serverStatus().wiredTiger["transaction"]
// "transaction begins": total starts, "transaction rollbacks": watch for high rollbacks

db.serverStatus().locks
// Lock wait times by database — high waits indicate contention

db.serverStatus().globalLock
// currentQueue: { total, readers, writers } → HOW MANY OPS ARE WAITING FOR LOCKS

db.serverStatus().network
// bytesIn, bytesOut — total network throughput since last restart

// Watch specific metrics over time (poor man's monitoring):
while (true) {
  const s = db.serverStatus()
  print(new Date().toISOString(),
    "conns:", s.connections.current,
    "queue:", s.globalLock.currentQueue.total,
    "cache%:", Math.round(s.wiredTiger.cache["bytes currently in cache"] /
               s.wiredTiger.cache["maximum bytes configured"] * 100)
  )
  sleep(5000)   // mongosh built-in: pause 5s between samples
}

Key Metrics and Thresholds

Metric              | Where                              | Concern threshold
Active connections  | connections.current                | >80% of connections.available
Lock queue          | globalLock.currentQueue.total      | >0 consistently; investigate the blocking op
WiredTiger cache    | wiredTiger.cache                   | >95% full → eviction pressure → slowdown
Page faults         | extra_info.page_faults             | rising → working set exceeds RAM
Replication lag     | rs.printSecondaryReplicationInfo() | >10s under normal load
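The thresholds above can be scripted. A minimal sketch that evaluates a serverStatus-shaped document; the field paths match the real command output, but the sample values below are fabricated for illustration:

```javascript
// Return a list of threshold warnings for a serverStatus document.
function checkThresholds(s) {
  const warnings = []
  const conns = s.connections
  // "available" is remaining headroom, so capacity = current + available
  if (conns.current > 0.8 * (conns.current + conns.available))
    warnings.push("connections above 80% of capacity")
  if (s.globalLock.currentQueue.total > 0)
    warnings.push("ops queued on locks: " + s.globalLock.currentQueue.total)
  const c = s.wiredTiger.cache
  if (c["bytes currently in cache"] > 0.95 * c["maximum bytes configured"])
    warnings.push("WiredTiger cache above 95%: eviction pressure")
  return warnings
}

// Fabricated sample resembling db.serverStatus() output:
const sample = {
  connections: { current: 47, available: 51153 },
  globalLock: { currentQueue: { total: 3, readers: 1, writers: 2 } },
  wiredTiger: { cache: {
    "bytes currently in cache": 4.2e9,
    "maximum bytes configured": 4294967296
  } }
}
console.log(checkThresholds(sample))
// → [ 'ops queued on locks: 3', 'WiredTiger cache above 95%: eviction pressure' ]
```

In mongosh, `checkThresholds(db.serverStatus())` could run inside the polling loop shown earlier.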
02
db.stats & Collection Stats
Storage and size metrics per database and collection
stats
// Database-level storage stats
db.stats()
// dataSize: uncompressed data bytes
// storageSize: actual bytes on disk (WiredTiger compressed — usually smaller)
// indexSize: bytes used by all indexes
// collections: number of collections
// objects: total document count

db.stats(1024 * 1024)   // output in MB (pass scale factor)

// Collection-level stats
db.orders.stats()
// count: document count
// avgObjSize: average document size in bytes — large = projection needed
// totalIndexSize: all index bytes for this collection
// indexSizes: { "index_name": bytes } — per-index breakdown
// storageSize: compressed on-disk bytes

db.orders.stats({ indexDetails: true })  // include WiredTiger cache stats per index

// Find largest databases (by uncompressed data size)
db.adminCommand({ listDatabases: 1 }).databases.forEach(d => {
  const s = db.getSiblingDB(d.name).stats()
  print(d.name, (s.dataSize / 1024 / 1024).toFixed(1) + "MB",
        "idx:", (s.indexSize / 1024 / 1024).toFixed(1) + "MB")
})

// Find largest documents in a specific collection
db.orders.aggregate([
  { $project: { bsonSize: { $bsonSize: "$$ROOT" }, _id: 1, customerId: 1 } },
  { $sort: { bsonSize: -1 } },
  { $limit: 5 }
])

// Working set estimation: how much RAM is needed to hold hot data in cache
// Check: indexSize + (N * P99 working doc size) for hot collections
// Rule of thumb: WiredTiger cache should hold all indexes + hot working set
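The rule of thumb above is simple arithmetic. A sketch with purely illustrative numbers (none taken from a real deployment; in practice, pull indexSize from coll.stats() and estimate the hot-document count from access patterns):

```javascript
// Working set estimate: all index bytes plus the hot documents,
// sized at the P99 document size to stay conservative.
function workingSetBytes(indexSizeBytes, hotDocCount, p99DocSizeBytes) {
  return indexSizeBytes + hotDocCount * p99DocSizeBytes
}

const indexSize = 1.5 * 1024 ** 3   // 1.5 GB of indexes (illustrative)
const hotDocs   = 2_000_000         // docs touched in a typical day (illustrative)
const p99Size   = 4 * 1024          // 4 KB P99 document size (illustrative)
const cacheMax  = 4 * 1024 ** 3     // 4 GB WiredTiger cache

const ws = workingSetBytes(indexSize, hotDocs, p99Size)
console.log((ws / 1024 ** 3).toFixed(2) + " GB needed,",
            ws <= cacheMax ? "fits in cache" : "exceeds cache, expect evictions")
// → 9.13 GB needed, exceeds cache, expect evictions
```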
03
currentOp & killOp
Find and terminate long-running operations
ops
// Show all active operations (requires clusterMonitor or higher)
db.currentOp()
db.currentOp({ active: true })           // only active (not idle connections)
db.currentOp({ secs_running: { $gt: 5 } })  // ops running > 5 seconds

// currentOp key fields per operation:
// opid:        operation ID (use to kill)
// op:          "query" | "update" | "insert" | "remove" | "command"
// ns:          "db.collection"
// secs_running: seconds elapsed
// waitingForLock: true/false — is it blocked?
// locks:       which locks held and at what granularity
// planSummary: "COLLSCAN" | "IXSCAN { field: 1 }" — quick index check
// command:     the actual command document (query, sort, etc.)

// Find operations causing lock contention
db.currentOp({ waitingForLock: true })

// Find long COLLSCAN queries
db.currentOp({
  "planSummary": /COLLSCAN/,
  "secs_running": { $gt: 10 }
})

// Kill a stuck operation (use opid from currentOp)
db.killOp(12345)
// Returns: { info: "attempting to kill op" }
// Note: killOp is async — the op may not stop immediately

// $currentOp aggregation (more powerful, more field access)
db.adminCommand({ aggregate: 1,
  pipeline: [
    { $currentOp: { allUsers: true, idleSessions: false } },
    { $match: { secs_running: { $gt: 30 } } },
    { $project: { opid: 1, op: 1, ns: 1, secs_running: 1, planSummary: 1, client: 1 } }
  ],
  cursor: {}
})
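The $currentOp output is worth post-processing before killing anything. A sketch of a selection helper over the returned docs; the sample ops below are fabricated, and in mongosh you would feed it the real pipeline results and call db.killOp() on each returned opid:

```javascript
// Pick the opids of long-running collection scans from a list of
// currentOp-shaped documents.
function collscansToKill(ops, minSecs) {
  return ops
    .filter(o => o.secs_running > minSecs && /COLLSCAN/.test(o.planSummary || ""))
    .map(o => o.opid)
}

// Fabricated sample resembling $currentOp output:
const ops = [
  { opid: 101, ns: "shopDb.orders", secs_running: 45, planSummary: "COLLSCAN" },
  { opid: 102, ns: "shopDb.orders", secs_running: 90, planSummary: "IXSCAN { customerId: 1 }" },
  { opid: 103, ns: "shopDb.users",  secs_running: 2,  planSummary: "COLLSCAN" }
]
console.log(collscansToKill(ops, 30))   // → [ 101 ]
```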
WARN
Killing an operation mid-write may leave partial data. A multi-document update outside a transaction is not atomic: documents already modified before the kill stay modified, and MongoDB does not roll them back. Operations inside a multi-document transaction are rolled back on kill. A single-document update is atomic: the document is either fully updated or unchanged. Always investigate why an operation is slow before killing it.
04
Database Profiler
Log all or slow operations to system.profile
profiler

The database profiler captures query execution details to the system.profile capped collection. It has three levels: off, slow-only, all.

Level    | Command                                  | Captures            | Overhead
0 (off)  | db.setProfilingLevel(0)                  | Nothing             | None
1 (slow) | db.setProfilingLevel(1, { slowms: 100 }) | Operations > slowms | Low (production-safe)
2 (all)  | db.setProfilingLevel(2)                  | Every operation     | High; avoid in production
// Enable profiling at level 1 — log ops slower than 200ms
db.setProfilingLevel(1, { slowms: 200 })

// Check current profiling level
db.getProfilingLevel()   // returns: 1
db.getProfilingStatus()  // { was: 1, slowms: 200 }

// Configure in mongod.conf (persistent — survives restart):
// operationProfiling:
//   mode: slowOp             # off | slowOp | all
//   slowOpThresholdMs: 100   # log ops over 100ms

// profiling filter — only profile operations on specific collection (4.4.2+)
db.setProfilingLevel(1, {
  filter: {
    ns: "shopDb.orders",
    millis: { $gt: 50 }     // only orders operations > 50ms
  }
})
// When filter is set, slowms and sampleRate are ignored for profiling

// system.profile is a capped collection (size configurable):
// Default size: 1MB (very small — rotate quickly under load)
// To change size: disable profiling, drop, recreate with new size:
db.setProfilingLevel(0)
db.system.profile.drop()
db.createCollection("system.profile", { capped: true, size: 52428800 })  // 50MB
db.setProfilingLevel(1, { slowms: 100 })
05
Querying system.profile
Extract slow query patterns from the profiler log
analysis
// system.profile key fields:
// op:         "query" | "update" | "insert" | "remove" | "command"
// ns:         "db.collection"
// millis:     execution time in milliseconds
// planSummary: "COLLSCAN" | "IXSCAN { field: 1 }"
// docsExamined: documents read from storage
// keysExamined: index keys scanned
// nreturned:  documents returned
// ts:         timestamp of operation
// command:    the actual query/command document

// Find slowest queries sorted by execution time
db.system.profile.find().sort({ millis: -1 }).limit(10)

// Find all COLLSCAN queries (missing index!)
db.system.profile.find({ planSummary: "COLLSCAN" })
  .sort({ millis: -1 })
  .limit(20)

// Find queries with high scan ratio (docsExamined >> nreturned)
db.system.profile.aggregate([
  {
    $match: {
      docsExamined: { $gt: 0 },
      nreturned: { $gt: 0 }
    }
  },
  {
    $project: {
      ns: 1,
      millis: 1,
      docsExamined: 1,
      nreturned: 1,
      ratio: { $divide: ["$docsExamined", "$nreturned"] },
      command: 1
    }
  },
  { $match: { ratio: { $gt: 100 } } },  // 100+ docs scanned per returned doc
  { $sort: { ratio: -1 } },
  { $limit: 10 }
])

// Top slow collections (aggregate by namespace)
db.system.profile.aggregate([
  { $group: {
    _id: "$ns",
    count:   { $sum: 1 },
    avgMs:   { $avg: "$millis" },
    maxMs:   { $max: "$millis" },
    totalMs: { $sum: "$millis" }
  }},
  { $sort: { totalMs: -1 } }
])

// Find all UPDATE operations missing an index
db.system.profile.find(
  { op: "update", planSummary: "COLLSCAN", millis: { $gt: 50 } },
  { ns: 1, millis: 1, "command.q": 1 }
)
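Those command.q / command.filter predicates map directly to index candidates. A sketch that tallies candidate key patterns from profiler-shaped entries (the sample entries are fabricated; real ones come from the COLLSCAN queries above):

```javascript
// Count how often each set of filter fields appears in COLLSCAN
// profiler entries; frequent patterns are the best index candidates.
function indexCandidates(profileDocs) {
  const counts = {}
  for (const doc of profileDocs) {
    // updates carry the predicate in command.q, finds in command.filter
    const q = (doc.command && (doc.command.q || doc.command.filter)) || {}
    const pattern = Object.keys(q).sort().join(",")
    if (pattern) counts[pattern] = (counts[pattern] || 0) + 1
  }
  return counts
}

// Fabricated sample resembling system.profile documents:
const entries = [
  { ns: "shopDb.orders", planSummary: "COLLSCAN",
    command: { q: { status: "pending", customerId: "C001" } } },
  { ns: "shopDb.orders", planSummary: "COLLSCAN",
    command: { filter: { customerId: "C002", status: "paid" } } }
]
console.log(indexCandidates(entries))   // → { 'customerId,status': 2 }
```

Field order within a compound index still matters (equality before range); the tally only shows which field sets recur.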
06
mongostat & mongotop
Command-line real-time monitoring tools
CLI tools

mongostat prints per-second operation counts and server metrics. mongotop shows time spent per collection. Both are non-intrusive and safe in production.

// mongostat — print stats every second
mongostat --uri "mongodb://user:pass@host:27017/?authSource=admin"

// Output columns:
// insert/query/update/delete/getmore/command: ops per second
// dirty: WiredTiger dirty cache %
// used:  WiredTiger cache in use %
// flushes: WiredTiger checkpoint flushes per interval
// vsize/res: virtual/resident memory (MB)
// qrw: read/write queue depth (ops waiting for lock)
// arw: active read/write ops
// conn: connections
// net_in/net_out: network bytes
// time: timestamp

// mongostat every 5 seconds, stop after 60 samples:
mongostat --uri "..." 5 60

// mongostat for replica set — show all members:
mongostat --uri "mongodb://host1:27017,host2:27017/?replicaSet=myRS" --discover

// mongotop — time spent per collection (seconds interval)
mongotop --uri "mongodb://user:pass@host:27017/?authSource=admin" 2

// Output per collection:
// ns:       db.collection
// total:    total ms spent in last interval
// read:     ms on reads
// write:    ms on writes
// Collections with highest totals = hot collections → indexing priority

Key mongostat Warning Signs

Metric          | Warning sign             | Likely cause
qrw (queue)     | Consistently > 0         | Lock contention; slow queries blocking others
dirty (cache %) | > 20%                    | Write throughput exceeds checkpoint rate; reduce write load
used (cache %)  | > 95%                    | Working set larger than cache; eviction pressure; add RAM or shard
conn            | Growing without plateau  | Connection leak in application; check connection pool config
getmore         | High count               | Clients not exhausting cursors; inefficient pagination
07
Slow Query Analysis
End-to-end workflow: detect → diagnose → fix
workflow
// Step 1: Find slow queries from profiler or logs
// MongoDB logs slow queries (>100ms default) to mongod.log:
// { "msg": "Slow query", "attr": { "ns": "shopDb.orders", "durationMillis": 4231,
//   "planSummary": "COLLSCAN", "command": { "find": "orders", ... } } }

// Step 2: Reproduce with explain()
db.orders.find({ status: "completed", customerId: "C001" })
  .explain("executionStats")

// Step 3: Check the plan — identify bottleneck stage
// COLLSCAN  → no index; create one
// IXSCAN but high docsExamined/nReturned ratio → index not selective enough
// SORT stage  → in-memory sort; add sort field to index
// FETCH stage on many docs → projection doesn't cover; consider covered index

// Step 4: Create the index
db.orders.createIndex({ customerId: 1, status: 1 })
// Re-run explain() to verify IXSCAN and improved stats

// Step 5: Check for unused indexes (overhead without benefit)
db.orders.aggregate([{ $indexStats: {} }])
// Look for: accesses.ops: 0 — never used since last restart
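Step 5 can be scripted over the $indexStats output. A sketch with fabricated stats docs; note that in mongosh accesses.ops is a Long, hence the Number() coercion:

```javascript
// Return the names of indexes never used since the last restart,
// skipping the mandatory _id_ index.
function unusedIndexes(indexStats) {
  return indexStats
    .filter(s => s.name !== "_id_" && Number(s.accesses.ops) === 0)
    .map(s => s.name)
}

// Fabricated sample resembling $indexStats output:
const stats = [
  { name: "_id_",                  accesses: { ops: 9914 } },
  { name: "customerId_1_status_1", accesses: { ops: 4312 } },
  { name: "legacyField_1",         accesses: { ops: 0 } }
]
console.log(unusedIndexes(stats))   // → [ 'legacyField_1' ]
```

In mongosh: `unusedIndexes(db.orders.aggregate([{ $indexStats: {} }]).toArray())`. A zero count may just mean a recent restart, so check uptime before dropping anything.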

// Step 6: Monitor with Atlas Performance Advisor (Atlas only)
// Automatically suggests indexes based on recent slow queries
// Access: Atlas UI → Performance Advisor tab

// Ongoing: set alerting thresholds in Atlas / OpsManager:
// Alert if: avg query execution > 500ms over 5 min window
// Alert if: cache utilization > 90%
// Alert if: replication lag > 30s
// Alert if: connections > 80% of maxConnections
TIP
The fastest way to identify missing indexes in production is to grep the MongoDB log for "planSummary":"COLLSCAN" filtered by high durationMillis. The command field shows the full query with the filter predicates — these are your index candidates. Combine with $indexStats to verify the candidate doesn't already exist unused.