
Replica Sets
Replication

FILE  33_replication
TOPIC  Architecture · Elections · Write Concern · Read Preference · Oplog · Monitoring
LEVEL  Intermediate/Advanced
01
Replica Set Architecture
High availability and data redundancy through automatic replication

A replica set is a group of MongoDB nodes (typically 3 or 5) that maintain the same data set. One node is elected Primary and receives all writes; the remaining nodes are Secondaries that continuously replicate from the primary. If the primary fails, an automatic election promotes a secondary to primary within 10–12 seconds.

// Minimum recommended replica set: 3 nodes (1 primary + 2 secondaries)
// Replica set configuration (run on primary):
rs.initiate({
  _id: "myReplicaSet",
  members: [
    { _id: 0, host: "mongo1:27017", priority: 2 },    // preferred primary
    { _id: 1, host: "mongo2:27017", priority: 1 },
    { _id: 2, host: "mongo3:27017", priority: 1 }
  ]
})

// Check replica set status:
rs.status()          // full status including lag, oplog, last heartbeat
rs.conf()            // current configuration
rs.isMaster()        // deprecated; use db.hello() in 5.0+
db.hello()           // current primary, all members, connection info
NOTE
Always use an odd number of voting members (3, 5, 7). An even count raises the majority threshold without adding fault tolerance — a 4-member set still tolerates only one failure, just like a 3-member set — and a symmetric partition (2/2) can leave neither side with a majority, making election impossible.
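The majority math behind that note can be sketched in plain JavaScript. `majorityOf` and `faultTolerance` are hypothetical helpers for illustration, not shell commands:

```javascript
// Votes needed to win an election, and how many voting members can fail
// while a majority remains reachable.
function majorityOf(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}
function faultTolerance(votingMembers) {
  return votingMembers - majorityOf(votingMembers);
}

// A 4th voting member raises the majority requirement (2 -> 3) without
// raising fault tolerance -- hence the odd-number recommendation.
console.log(majorityOf(3), faultTolerance(3)); // 2 1
console.log(majorityOf(4), faultTolerance(4)); // 3 1
console.log(majorityOf(5), faultTolerance(5)); // 3 2
```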
02
Node Roles
Primary · Secondary · Arbiter · Hidden · Delayed
Role      | Accepts Writes?                  | Holds Data?  | Can Vote?      | Use Case
Primary   | Yes (only)                       | Yes          | Yes            | Single write entry point for all operations
Secondary | No (reads with read preference)  | Yes          | Yes            | Failover, read scaling, backup
Arbiter   | No                               | No           | Yes            | Tiebreaker vote only — no data; cheap server
Hidden    | No                               | Yes          | Yes (optional) | Dedicated reporting/analytics node — invisible to clients
Delayed   | No                               | Yes (lagged) | Yes (optional) | Point-in-time recovery buffer (e.g., 1-hour delay)
// Add an arbiter (needs no data storage, just voting capacity)
rs.addArb("mongo-arbiter:27017")

// Configure a hidden node (for reporting queries)
cfg = rs.conf()
cfg.members[2].hidden   = true
cfg.members[2].priority = 0     // must be 0 for hidden nodes
rs.reconfig(cfg)

// Configure a delayed secondary (1-hour replication lag)
cfg = rs.conf()
cfg.members[3].secondaryDelaySecs = 3600   // MongoDB 5.0+; older versions use slaveDelay
cfg.members[3].priority = 0
cfg.members[3].hidden   = true
rs.reconfig(cfg)
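The constraints in the snippets above can be captured in a small sanity check. `validateMember` is a hypothetical helper, assuming the rules stated here: hidden and delayed members need `priority: 0`, and delayed members should also be hidden so drivers never route reads to them:

```javascript
// Validate a member-config object against the hidden/delayed rules above.
// Returns a list of human-readable violations (empty = config looks OK).
function validateMember(m) {
  const errors = [];
  if (m.hidden && m.priority !== 0) {
    errors.push("hidden members require priority 0");
  }
  if (m.secondaryDelaySecs > 0) {
    if (m.priority !== 0) errors.push("delayed members require priority 0");
    if (!m.hidden) errors.push("delayed members should be hidden");
  }
  return errors;
}

validateMember({ hidden: true, priority: 0 });
// -> []
validateMember({ secondaryDelaySecs: 3600, priority: 1, hidden: false });
// -> ["delayed members require priority 0", "delayed members should be hidden"]
```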
03
Elections
Automatic primary failover in 10–12 seconds

When the primary becomes unreachable (crash, network partition, planned maintenance), the replica set holds an election to choose a new primary. Elections require a majority of voting members to agree.

// Election trigger conditions:
// 1. Primary stops sending heartbeats for electionTimeoutMillis (default 10s)
// 2. Primary steps down (rs.stepDown())
// 3. Network partition isolates primary from majority

// Election process:
// 1. Secondary detects missing heartbeats → starts election campaign
// 2. Candidate requests votes from other members
// 3. Members vote for candidate with most up-to-date oplog (highest opTime)
// 4. Candidate needs votes from majority (2 out of 3) to win
// 5. New primary begins accepting writes — total downtime: 10–12 seconds

// During election: writes fail, reads may fail (depends on readPreference)
// Applications should implement retry logic for primary-not-found errors

// Manually force a step-down (for planned maintenance):
rs.stepDown(60)     // step down; ineligible for re-election for 60 seconds
// Optional 2nd arg secondaryCatchUpPeriodSecs: how long to wait (default 10s)
// for a secondary to catch up before actually stepping down
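The retry advice above can be sketched as a generic wrapper. This is a minimal illustration, not driver code — modern drivers already offer retryable writes (`retryWrites=true`), and the error-message matching here is an assumption for demonstration:

```javascript
// Retry an async operation across an election window: on "not primary"-style
// errors, wait briefly and try again, up to maxRetries attempts.
async function withRetry(operation, maxRetries = 5, delayMs = 500) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      const retryable = /not primary/i.test(err.message); // illustrative check
      if (!retryable || attempt >= maxRetries) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage sketch: withRetry(() => orders.insertOne(doc))
```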

Priority and Vote Weight

// Priority: higher = more likely to become primary (0 = never eligible)
// Vote weight: 0 = non-voting member (can still replicate data)
cfg = rs.conf()
cfg.members[0].priority = 10   // strongly preferred primary
cfg.members[1].priority = 1
cfg.members[2].priority = 0    // never primary (e.g., analytics node)
cfg.members[2].votes    = 0    // non-voting (max 7 voting members in a set)
rs.reconfig(cfg)

CAP Theorem and Replica Sets

MongoDB replica sets are a CP system during partition: they prioritize Consistency over Availability. When a primary is isolated from the majority (network partition), it steps down and the cluster becomes temporarily unavailable for writes until a new primary is elected from the majority partition — preventing split-brain writes.
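The split-brain guard reduces to one comparison: after a partition, only the side holding a strict majority of votes may elect a primary. A minimal sketch:

```javascript
// A partitioned side may elect a primary only with a strict vote majority.
function canElectPrimary(sideVotes, totalVotes) {
  return sideVotes > totalVotes / 2;
}

// 5-node set partitioned 3/2: the 3-node side elects a new primary; the
// 2-node side (even if it holds the old primary) steps down and rejects writes.
canElectPrimary(3, 5); // true
canElectPrimary(2, 5); // false
```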

04
Write Concern
How many nodes must acknowledge a write for it to be considered durable
// w: number — specific count of nodes must acknowledge
db.orders.insertOne(doc, { writeConcern: { w: 1 } })        // primary only
db.orders.insertOne(doc, { writeConcern: { w: 2 } })        // primary + 1 secondary
db.orders.insertOne(doc, { writeConcern: { w: "majority" } }) // majority (safest)

// j: true — journal must be flushed to disk before acknowledgment
db.orders.insertOne(doc, { writeConcern: { w: "majority", j: true } })

// wtimeout: max milliseconds to wait for acknowledgment
db.orders.insertOne(doc, { writeConcern: { w: "majority", wtimeout: 5000 } })
// Returns a write concern error ("wtimeout") if not acknowledged within 5 seconds
// NOTE: wtimeout does NOT roll back the write — it just stops waiting
Write Concern          | Durability                | Latency                           | When to Use
w: 1                   | Low — primary only        | Fastest                           | Bulk imports, analytics inserts, non-critical
w: "majority"          | High — majority committed | Slightly higher (replication ack) | Financial data, user updates, anything important
w: "majority", j: true | Highest — disk-flushed    | Highest                           | Critical financial transactions
w: 0                   | None — fire and forget    | Lowest                            | Metrics, logging, acceptable data loss
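One way to keep these choices consistent across an application is to centralize them. `WRITE_CONCERNS` and `writeConcernFor` are hypothetical names; the option shapes match the write concern documents shown above:

```javascript
// Map durability levels to write concern option objects, so call sites
// ask for a level instead of hand-writing { w, j, wtimeout } everywhere.
const WRITE_CONCERNS = {
  fireAndForget: { w: 0 },
  fast:          { w: 1 },
  durable:       { w: "majority", wtimeout: 5000 },
  critical:      { w: "majority", j: true, wtimeout: 10000 },
};

function writeConcernFor(level) {
  const wc = WRITE_CONCERNS[level];
  if (!wc) throw new Error("unknown durability level: " + level);
  return { writeConcern: wc };
}

// Usage sketch: db.orders.insertOne(doc, writeConcernFor("durable"))
```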
05
Read Preference
Which replica set member serves read operations
Mode               | Reads From                     | Stale Risk            | Use Case
primary (default)  | Primary only                   | None — always fresh   | All reads requiring up-to-date data
primaryPreferred   | Primary; fallback to secondary | Yes (during failover) | High-availability reads; tolerate brief staleness
secondary          | Secondaries only               | Yes — replication lag | Analytics, reporting, read scaling
secondaryPreferred | Secondary; fallback to primary | Yes                   | Read-heavy apps; occasional fresh fallback
nearest            | Lowest network latency member  | Yes                   | Geographically distributed reads (lowest RTT)
// Set read preference on individual query
db.reports.find({ type: "monthly" }).readPref("secondary")

// Set read preference on connection string
// mongodb://mongo1,mongo2,mongo3/?replicaSet=myRS&readPreference=secondaryPreferred

// Tag sets: read from specific datacenter
db.getMongo().setReadPref("secondary", [{ datacenter: "eu-west-1" }])
WARN
Reading from secondaries introduces replication lag — your application may see data that does not yet reflect the most recent primary writes. Lag is typically milliseconds on a healthy cluster but can spike to seconds or minutes under heavy write load or network issues. Never route writes or consistency-sensitive reads to secondaries.
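How `nearest` plus a tag set could pick a member can be sketched as: filter candidates by tags, then take the lowest measured round-trip time. The member shapes here are illustrative, not the driver's internal format:

```javascript
// Select the lowest-RTT member among those matching every tag in tagSet.
function selectMember(members, tagSet = {}) {
  const matches = members.filter((m) =>
    Object.entries(tagSet).every(([key, value]) => m.tags[key] === value)
  );
  return matches.reduce(
    (best, m) => (best === null || m.rttMs < best.rttMs ? m : best),
    null
  );
}

const members = [
  { host: "mongo1", rttMs: 40, tags: { datacenter: "us-east-1" } },
  { host: "mongo2", rttMs: 12, tags: { datacenter: "eu-west-1" } },
  { host: "mongo3", rttMs: 25, tags: { datacenter: "eu-west-1" } },
];
selectMember(members, { datacenter: "eu-west-1" }).host; // "mongo2"
```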
06
The Oplog
Operations log — the replication mechanism and its implications

The oplog (operations log) is a special capped collection in the local database on each replica set member. Every write operation on the primary is recorded as an idempotent entry in the oplog. Secondaries tail the primary's oplog and replay operations to stay in sync.

// View the oplog (on primary):
use local
db.oplog.rs.find().sort({ $natural: -1 }).limit(10)
// Each entry: { ts, op, ns, o (operation object), o2 (selector) }
// op values: "i" = insert, "u" = update, "d" = delete, "c" = command

// Check oplog window size and replication lag:
rs.printReplicationInfo()
// Output includes: configured oplog size, log length, optime range

rs.printSecondaryReplicationInfo()
// Output includes: replication lag per secondary
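Why idempotency matters can be shown with a toy replay. This is a simplified sketch, not MongoDB's actual apply logic: because the oplog records updates as absolute `$set` values (never relative operations like `$inc`), replaying the same entry twice leaves the document unchanged:

```javascript
// Apply a simplified oplog-style update entry to a plain document.
function applyEntry(doc, entry) {
  if (entry.op === "u") Object.assign(doc, entry.o.$set); // absolute values only
  return doc;
}

const entry = { op: "u", ns: "shop.orders", o: { $set: { qty: 7 } } };
const doc = { _id: 1, qty: 3 };

applyEntry(doc, entry); // { _id: 1, qty: 7 }
applyEntry(doc, entry); // still { _id: 1, qty: 7 } -- replay is safe
```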

Oplog Size

The oplog is a capped collection — when full, oldest entries are overwritten. The oplog window is how far back in time a secondary can be and still sync. If a secondary falls behind further than the oplog window, it requires full re-sync.

// Default oplog size: ~5% of available disk space (min 1GB, max 50GB)
// Change oplog size (MongoDB 3.6+, can be done while running):
db.adminCommand({ replSetResizeOplog: 1, size: 16384 })  // 16GB in MB

// Rule of thumb: oplog window should cover at least your longest maintenance window
// If secondaries go offline for planned maintenance (2hr), oplog must last 2+ hours
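That rule of thumb is simple arithmetic. A back-of-envelope estimate, assuming you know your average write churn (oplog bytes generated per hour):

```javascript
// Estimated oplog window in hours: how long the oplog retains history
// before the oldest entries are overwritten.
function oplogWindowHours(oplogSizeMB, churnMBPerHour) {
  return oplogSizeMB / churnMBPerHour;
}

// 16GB oplog with 2GB/hour of write churn -> ~8-hour window,
// comfortably above a 2-hour maintenance window.
oplogWindowHours(16384, 2048); // 8
```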
07
Monitoring & Operations
Key commands for replica set health and maintenance
// Replica set status overview
rs.status()
// Key fields per member:
//   stateStr: "PRIMARY" | "SECONDARY" | "ARBITER" | "RECOVERING" | "STARTUP2"
//   health: 1 (healthy) | 0 (down)
//   optimeDate: last operation time (gap vs primary = replication lag)
//   lastHeartbeatMessage: error detail if unhealthy

// Replication lag per secondary:
rs.printSecondaryReplicationInfo()
// Watch for: "lag is Xsec" — high lag means secondary is falling behind

// Check if current node is primary:
db.isMaster()        // deprecated — { ismaster: true/false, ... }
db.hello()           // v5.0+ replacement — { isWritablePrimary: true/false, ... }

// Force a node to sync from a specific member:
db.adminCommand({ replSetSyncFrom: "mongo2:27017" })
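The lag figure reported above is just the gap between optimes. A sketch of computing it from an `rs.status()`-shaped document (field names follow the key fields listed earlier; the input here is a hand-built example, not live output):

```javascript
// Per-secondary replication lag: primary optimeDate minus secondary optimeDate.
function replicationLagSecs(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  return status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => ({
      host: m.name,
      lagSecs: (primary.optimeDate - m.optimeDate) / 1000,
    }));
}

const status = {
  members: [
    { name: "mongo1:27017", stateStr: "PRIMARY",   optimeDate: new Date("2024-01-01T00:00:10Z") },
    { name: "mongo2:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T00:00:08Z") },
  ],
};
replicationLagSecs(status); // [{ host: "mongo2:27017", lagSecs: 2 }]
```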
Command                            | Purpose
rs.status()                        | Health of all members, replication lag, state
rs.conf()                          | Current replica set configuration
rs.printReplicationInfo()          | Oplog size, window, timestamps on primary
rs.printSecondaryReplicationInfo() | Lag per secondary
rs.add("host:port")                | Add a new member to the set
rs.remove("host:port")             | Remove a member
rs.stepDown()                      | Force primary to step down (triggers election)
rs.reconfig(cfg)                   | Apply updated replica set configuration