
Sharding
Horizontal Scale

FILE  34_sharding
TOPIC  Architecture · Shard Key · Hashed vs Ranged · Chunks · Zone Sharding · Query Routing
LEVEL  Advanced
01
Sharding Architecture
Three-component cluster: mongos · config servers · shards
architecture

Sharding is MongoDB's mechanism for horizontal scaling — distributing data across multiple servers (shards). Each shard is a replica set. The cluster has three components:

  • mongos: Query router — clients connect to mongos (not shards directly). Routes queries to correct shard(s). Stateless; multiple can run in parallel.
  • Config Servers: Store the cluster's metadata — which chunks of data are on which shard. Always a replica set (3 nodes). Critical — loss of all config servers stops the cluster.
  • Shards: Each shard is a replica set holding a subset of the data. Add more shards to scale out storage and write throughput.
// Connect to mongos (not directly to shard mongod)
mongosh "mongodb://mongos1:27017,mongos2:27017/mydb"

// Enable sharding on a database
sh.enableSharding("mydb")

// Shard a collection (must happen before data is inserted for best distribution)
sh.shardCollection("mydb.orders", { customerId: 1, createdAt: 1 })

// Check sharding status
sh.status()           // overview of shards, databases, collections, chunks
db.printShardingStatus()
02
Shard Key Selection
The most important decision in a sharded cluster
shard key

The shard key determines how MongoDB distributes documents across shards. Its definition (which fields) is fixed once the collection is sharded — resharding to a new key is available from 5.0+, but is expensive — so a bad shard key cannot easily be fixed.

Shard Key Requirements

Property        | Requirement                         | Problem if Violated
Cardinality     | Many distinct values                | Too few chunks → cannot distribute evenly
Frequency       | No single value dominates           | One shard gets all docs with that value → hotspot
Monotonicity    | Value should NOT always increase    | All inserts hit the same "last" chunk → write hotspot
Query targeting | Include shard key in common queries | Queries broadcast to ALL shards (scatter-gather)
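The monotonicity and frequency rows can be sketched in plain JavaScript (runnable in Node, no MongoDB needed — the chunk boundaries and `toyHash` mixer are invented for illustration, not MongoDB's real hashed-index function):

```javascript
// Toy model: route 100 inserts with ever-increasing keys (e.g. timestamps)
// under ranged vs hashed sharding.
const NUM_SHARDS = 4;

// Illustrative 32-bit integer mixer — NOT MongoDB's actual hash.
function toyHash(n) {
  const h = Math.imul(n ^ (n >>> 16), 0x45d9f3b) >>> 0;
  return ((h ^ (h >>> 16)) >>> 0) % NUM_SHARDS;
}

// Ranged routing with chunk boundaries at 250/500/750 over keys 0..999.
function rangedShard(key) {
  return Math.min(Math.floor(key / 250), NUM_SHARDS - 1);
}

const ranged = new Array(NUM_SHARDS).fill(0);
const hashed = new Array(NUM_SHARDS).fill(0);

for (let key = 900; key < 1000; key++) { // monotonic keys: 900, 901, ...
  ranged[rangedShard(key)]++;  // every insert lands in the last chunk
  hashed[toyHash(key)]++;      // inserts spread across the shards
}

console.log("ranged:", ranged); // [0, 0, 0, 100] — write hotspot
console.log("hashed:", hashed); // split across shards, no hotspot
```

The ranged counters show the write hotspot from the Monotonicity row: every new key falls into the last chunk, so one shard absorbs all inserts while the rest sit idle.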
// ❌ BAD shard keys
{ shardKey: { _id: 1 } }            // ObjectId is monotonic → write hotspot
{ shardKey: { createdAt: 1 } }      // timestamp monotonic → write hotspot
{ shardKey: { status: 1 } }         // 3-5 values → low cardinality
{ shardKey: { country: 1 } }        // 60% "US" → hotspot

// ✅ GOOD shard keys
{ shardKey: { userId: "hashed" } }  // hashed: even distribution, any cardinality
{ shardKey: { customerId: 1, createdAt: 1 } }  // compound: targeted + spread
{ shardKey: { region: 1, userId: 1 } }         // zone-friendly
DANGER
The shard key field value is immutable per document in MongoDB < 4.2. In 4.2+, you can update the shard key value (with limitations), but the shard key definition (which fields) cannot change after shardCollection(). Resharding (changing the key) is available in 5.0+ but is expensive. Choose carefully before sharding a collection with data.
03
Hashed vs Ranged Sharding
Two strategies with fundamentally different trade-offs
strategies
Feature         | Ranged Sharding                                | Hashed Sharding
Distribution    | By value range — adjacent values on same shard | By hash of value — pseudo-random, even distribution
Range queries   | Efficient — one shard has the whole range      | Scatter-gather — range spans multiple shards
Write hotspots  | Possible (monotonic values go to last shard)   | Eliminated (hashed values spread across shards)
Shard key type  | Any orderable field                            | { field: "hashed" } syntax
Float precision | Normal                                         | 2.3 and 2.9 hash identically (truncated to int)
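The range-query trade-off in the table can be sketched with a toy chunk map (plain JavaScript, runnable without MongoDB; the chunk boundaries and shard names are invented):

```javascript
// Toy chunk map for ranged sharding: each chunk holds a contiguous
// [min, max) slice of the shard key and lives on exactly one shard.
const rangedChunks = [
  { min: 0,   max: 250,  shard: "shard-0" },
  { min: 250, max: 500,  shard: "shard-1" },
  { min: 500, max: 750,  shard: "shard-2" },
  { min: 750, max: 1000, shard: "shard-3" },
];

// A range query [lo, hi) only needs the shards whose chunks overlap it.
function rangedTargets(lo, hi) {
  return rangedChunks
    .filter(c => c.min < hi && c.max > lo)
    .map(c => c.shard);
}

// Under hashed sharding, adjacent key values land on unrelated shards,
// so a range query cannot be narrowed: every shard must be asked.
function hashedTargets() {
  return rangedChunks.map(c => c.shard);
}

console.log(rangedTargets(300, 400)); // ["shard-1"] — targeted
console.log(hashedTargets());         // all four shards — scatter-gather
```

This is the same overlap test mongos performs against its chunk map: ranged sharding keeps a value range inside few chunks, hashed sharding scatters it.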
// Ranged sharding: good for range queries, risk of write hotspot
sh.shardCollection("mydb.orders", { createdAt: 1 })
// All Jan 2024 orders on shard 1, Feb 2024 on shard 2, etc.
// Date range query → only 1 shard contacted (efficient)
// Current-date inserts → all writes hit newest-range shard (hotspot!)

// Hashed sharding: even write distribution, no range efficiency
sh.shardCollection("mydb.users", { userId: "hashed" })
// userId "abc123" hashes to shard 2, "def456" hashes to shard 0, etc.
// All inserts spread evenly (no hotspot)
// Range query on userId → scatter-gather all shards

// Compound shard key: balance both concerns
sh.shardCollection("mydb.events", { userId: 1, createdAt: 1 })
// userId: high cardinality, targeted per-user queries
// createdAt: secondary field spreads within user's range
04
Chunks & Balancing
How data units are split and migrated between shards
chunks

MongoDB distributes data at the granularity of chunks — contiguous ranges of shard key values. Each chunk lives on one shard. As a chunk grows past the chunk size limit, it is split. When shards have an imbalanced chunk count, the balancer migrates chunks to even the distribution.

// Default chunk size: 128MB in MongoDB 6.0+ (64MB in earlier versions)
// Set chunk size (in MB) — applies to future splits:
use config
db.settings.updateOne(
  { _id: "chunksize" },
  { $set: { value: 64 } },
  { upsert: true }
)

// View chunks for a collection:
use config
// MongoDB 5.0+ keys config.chunks by collection UUID rather than namespace:
const meta = db.collections.findOne({ _id: "mydb.orders" })
db.chunks.find({ uuid: meta.uuid }).sort({ min: 1 })
// Each chunk: { _id, uuid, min (shard key range start),
//               max (range end), shard (which shard holds it) }

// Pre-split chunks before bulk import (avoid split storms):
sh.splitAt("mydb.orders", { customerId: "C5000" })  // split at specific value
sh.splitFind("mydb.orders", { customerId: "C5000" }) // split the chunk containing this doc at its median

Balancer

// Balancer runs automatically in background — migrates chunks for even distribution
// Check balancer status:
sh.getBalancerState()      // true/false
sh.isBalancerRunning()     // currently migrating?

// Disable balancer during bulk operations (avoid migration overhead):
sh.stopBalancer()
// ... bulk import ...
sh.startBalancer()

// Set balancer window (only run during off-peak hours):
use config
db.settings.updateOne(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },  // 11 PM – 6 AM
  { upsert: true }
)
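The balancer's goal — an even chunk spread per shard — can be sketched as a toy loop (plain JavaScript; MongoDB's real policy also weighs zones, data size, and concurrent-migration limits):

```javascript
// Toy balancer: repeatedly migrate one chunk from the most-loaded shard
// to the least-loaded one until counts differ by at most `threshold`.
function balance(chunkCounts, threshold = 1) {
  const counts = [...chunkCounts];
  let migrations = 0;
  while (Math.max(...counts) - Math.min(...counts) > threshold) {
    counts[counts.indexOf(Math.max(...counts))]--; // donor shard
    counts[counts.indexOf(Math.min(...counts))]++; // recipient shard
    migrations++;
  }
  return { counts, migrations };
}

// An uneven cluster converges to an even spread, one migration at a time:
console.log(balance([12, 2, 4, 2]));
// { counts: [ 5, 5, 5, 5 ], migrations: 7 }
```

Each iteration stands in for one chunk migration — which is why disabling the balancer during bulk imports matters: every migration copies chunk data between shards while your import is competing for the same I/O.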
05
Zone Sharding
Route specific shard key ranges to specific shards
zones

Zone sharding assigns shard key value ranges to specific shards. This enables geographic data placement (EU data stays in EU shards), tiered storage (hot data on NVMe, cold data on HDD), and multi-tenant isolation.

// Geographic zone sharding: EU users on EU shard, US users on US shard

// 1. Add zone tag to shards
sh.addShardTag("shard-eu-1", "EU")   // or: sh.addShardToZone()
sh.addShardTag("shard-eu-2", "EU")
sh.addShardTag("shard-us-1", "US")
sh.addShardTag("shard-us-2", "US")

// 2. Assign shard key ranges to zones (modern equivalent: sh.updateZoneKeyRange())
// Shard key: { region: 1, userId: 1 }
sh.addTagRange(
  "mydb.users",
  { region: "eu", userId: MinKey },  // min bound
  { region: "eu", userId: MaxKey },  // max bound
  "EU"                               // zone name
)
sh.addTagRange(
  "mydb.users",
  { region: "us", userId: MinKey },
  { region: "us", userId: MaxKey },
  "US"
)
// Balancer ensures EU-tagged chunks only live on EU-zone shards

// Tiered storage: hot data on fast shards
sh.addShardTag("shard-nvme-1", "HOT")
sh.addTagRange("mydb.events",
  { createdAt: ISODate("2024-01-01") }, { createdAt: MaxKey }, "HOT")
// Events after Jan 2024 (hot/recent) → NVMe shards
// Older events → default (HDD) shards
// NOTE: zone ranges are static — move the boundary periodically as data ages
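The placement rule the balancer enforces can be sketched in a few lines (plain JavaScript; the shard names and tags mirror the example above):

```javascript
// Zone placement rule: a chunk whose key range falls inside a zone's range
// may only live on shards tagged with that zone.
const shardTags = {
  "shard-eu-1": ["EU"],
  "shard-eu-2": ["EU"],
  "shard-us-1": ["US"],
  "shard-us-2": ["US"],
};

// Which shards are eligible to hold a chunk belonging to `zone`?
function allowedShards(zone) {
  return Object.keys(shardTags).filter(s => shardTags[s].includes(zone));
}

console.log(allowedShards("EU")); // ["shard-eu-1", "shard-eu-2"]
console.log(allowedShards("US")); // ["shard-us-1", "shard-us-2"]
```

An untagged shard can still receive chunks that fall outside every zone range, which is how the "default (HDD) shards" in the tiered-storage example end up holding the cold data.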
06
Query Routing
Targeted queries vs scatter-gather — the performance difference
routing

mongos routes queries to shards. Whether it contacts 1 shard or all shards depends on whether the query includes the shard key.

// Shard key: { customerId: 1, createdAt: 1 }

// ✅ TARGETED — includes shard key prefix → 1 shard contacted
db.orders.find({ customerId: "C001" })
db.orders.find({ customerId: "C001", createdAt: { $gte: ISODate("2024-01-01") } })
// mongos reads chunk map → knows exactly which shard(s) hold C001's data

// ❌ SCATTER-GATHER — no shard key → ALL shards contacted, results merged
db.orders.find({ status: "active" })      // status not in shard key
db.orders.find({ amount: { $gt: 100 } }) // amount not in shard key
// mongos sends query to all N shards, collects results, merges → expensive

Using explain() to Verify Routing

// Check if a query is targeted or scatter-gather:
db.orders.find({ customerId: "C001" }).explain("executionStats")
// Look for: "shards" section in explain output
// shards: { "shard-1": { ... } }  → targeted (only 1 shard)
// shards: { "shard-1": ..., "shard-2": ..., "shard-3": ... } → scatter-gather

// Alternatively: explain().queryPlanner.winningPlan.shards.length
WARN
Scatter-gather queries scale poorly — adding more shards makes these queries slower, not faster, because mongos must contact more shards and merge larger result sets. Design queries and shard keys together so that the application's most frequent queries are targeted. A sharded cluster where 90% of queries are scatter-gather provides no performance benefit over a replica set.
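One way to quantify the warning: if each shard independently answers within your latency budget with probability p, a scatter-gather query meets the budget only when ALL contacted shards do. This is a back-of-envelope model (it ignores correlation between shards), but it shows the trend:

```javascript
// Tail-latency model: a scatter-gather query is as slow as its slowest
// shard, so the chance of staying within the SLA is p^N.
// A targeted query depends on one shard and stays at p.
function slaHitRate(p, shardsContacted) {
  return Math.pow(p, shardsContacted);
}

console.log(slaHitRate(0.99, 1).toFixed(3));   // "0.990" — targeted query
console.log(slaHitRate(0.99, 10).toFixed(3));  // "0.904" — 10-shard scatter-gather
console.log(slaHitRate(0.99, 100).toFixed(3)); // "0.366" — 100 shards
```

Adding shards makes the exponent larger, which is exactly why scatter-gather queries get slower as the cluster grows.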
07
Sharding Commands
Essential sh.* and admin commands for cluster management
commands
Command                                  | Purpose
sh.status()                              | Overview: shards, databases, collections, chunk distribution
sh.enableSharding("db")                  | Enable sharding on a database
sh.shardCollection("db.col", key)        | Shard a collection with the given shard key
sh.addShard("rs/host:port")              | Add a new shard (replica set) to the cluster
db.adminCommand({ removeShard: "name" }) | Begin draining a shard (migrates all chunks away)
sh.splitAt("db.col", keyVal)             | Split a chunk at a specific shard key value
sh.moveChunk("db.col", find, toShard)    | Manually move a chunk to a specific shard
sh.stopBalancer()                        | Disable automatic chunk balancing
sh.startBalancer()                       | Re-enable balancing
sh.getBalancerState()                    | Is the balancer enabled?
db.collection.getShardDistribution()     | Chunk and data distribution per shard
// Reshard a collection (MongoDB 5.0+): change the shard key online
db.adminCommand({
  reshardCollection: "mydb.orders",
  key: { customerId: 1 }    // new shard key
})
// Note: resharding is resource-intensive — plan during low traffic
// MongoDB copies data to new layout while accepting writes; commits when done