
Replica Sets
Replication

FILE  33_replication
TOPIC  Architecture · Elections · Write Concern · Read Preference · Oplog · Monitoring
LEVEL  Intermediate/Advanced
01
Replica Set Architecture
High availability and data redundancy through automatic replication

A replica set is a group of MongoDB nodes (typically 3 or 5) that maintain the same data set. One node is elected Primary and receives all writes; the remaining nodes are Secondaries that continuously replicate from the primary. If the primary fails, an automatic election promotes a secondary to primary within 10–12 seconds.

// Minimum recommended replica set: 3 nodes (1 primary + 2 secondaries)
// Replica set configuration (run on primary):
rs.initiate({
  _id: "myReplicaSet",
  members: [
    { _id: 0, host: "mongo1:27017", priority: 2 },    // preferred primary
    { _id: 1, host: "mongo2:27017", priority: 1 },
    { _id: 2, host: "mongo3:27017", priority: 1 }
  ]
})

// Check replica set status:
rs.status()          // full status including lag, oplog, last heartbeat
rs.conf()            // current configuration
rs.isMaster()        // deprecated; use db.hello() in 5.0+
db.hello()           // current primary, all members, connection info
NOTE
Always use an odd number of voting members (3, 5, 7). An even count raises the majority threshold without adding fault tolerance — a 4-member set still tolerates only one failure, just like a 3-member set — and a symmetric partition (2/2) can leave neither side with a majority, making election impossible.
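The majority math behind that note can be sketched in plain JavaScript. `majorityOf` and `faultTolerance` are hypothetical helpers for illustration, not shell commands:

```javascript
// Votes needed to win an election, and how many voting members can fail
// while a majority remains reachable.
function majorityOf(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}
function faultTolerance(votingMembers) {
  return votingMembers - majorityOf(votingMembers);
}

// A 4th voting member raises the majority requirement (2 -> 3) without
// raising fault tolerance -- hence the odd-number recommendation.
console.log(majorityOf(3), faultTolerance(3)); // 2 1
console.log(majorityOf(4), faultTolerance(4)); // 3 1
console.log(majorityOf(5), faultTolerance(5)); // 3 2
```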
02
Node Roles
Primary · Secondary · Arbiter · Hidden · Delayed
Role      | Accepts Writes?                  | Holds Data?  | Can Vote?      | Use Case
Primary   | Yes (only)                       | Yes          | Yes            | Single write entry point for all operations
Secondary | No (reads with read preference)  | Yes          | Yes            | Failover, read scaling, backup
Arbiter   | No                               | No           | Yes            | Tiebreaker vote only — no data; cheap server
Hidden    | No                               | Yes          | Yes (optional) | Dedicated reporting/analytics node — invisible to clients
Delayed   | No                               | Yes (lagged) | Yes (optional) | Point-in-time recovery buffer (e.g., 1-hour delay)
// Add an arbiter (needs no data storage, just voting capacity)
rs.addArb("mongo-arbiter:27017")

// Configure a hidden node (for reporting queries)
cfg = rs.conf()
cfg.members[2].hidden   = true
cfg.members[2].priority = 0     // must be 0 for hidden nodes
rs.reconfig(cfg)

// Configure a delayed secondary (1-hour replication lag)
cfg = rs.conf()
cfg.members[3].secondaryDelaySecs = 3600   // MongoDB 5.0+; older versions use slaveDelay
cfg.members[3].priority = 0
cfg.members[3].hidden   = true
rs.reconfig(cfg)
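The constraints in the snippets above can be captured in a small sanity check. `validateMember` is a hypothetical helper, assuming the rules stated here: hidden and delayed members need `priority: 0`, and delayed members should also be hidden so drivers never route reads to them:

```javascript
// Validate a member-config object against the hidden/delayed rules above.
// Returns a list of human-readable violations (empty = config looks OK).
function validateMember(m) {
  const errors = [];
  if (m.hidden && m.priority !== 0) {
    errors.push("hidden members require priority 0");
  }
  if (m.secondaryDelaySecs > 0) {
    if (m.priority !== 0) errors.push("delayed members require priority 0");
    if (!m.hidden) errors.push("delayed members should be hidden");
  }
  return errors;
}

validateMember({ hidden: true, priority: 0 });
// -> []
validateMember({ secondaryDelaySecs: 3600, priority: 1, hidden: false });
// -> ["delayed members require priority 0", "delayed members should be hidden"]
```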
03
Elections
Automatic primary failover in 10–12 seconds

When the primary becomes unreachable (crash, network partition, planned maintenance), the replica set holds an election to choose a new primary. Elections require a majority of voting members to agree.

// Election trigger conditions:
// 1. Primary stops sending heartbeats for electionTimeoutMillis (default 10s)
// 2. Primary steps down (rs.stepDown())
// 3. Network partition isolates primary from majority

// Election process:
// 1. Secondary detects missing heartbeats → starts election campaign
// 2. Candidate requests votes from other members
// 3. Members vote for candidate with most up-to-date oplog (highest opTime)
// 4. Candidate needs votes from majority (2 out of 3) to win
// 5. New primary begins accepting writes — total downtime: 10–12 seconds

// During election: writes fail, reads may fail (depends on readPreference)
// Applications should implement retry logic for primary-not-found errors

// Manually force a step-down (for planned maintenance):
rs.stepDown(60)     // step down; ineligible for re-election for 60 seconds
// Optional 2nd arg secondaryCatchUpPeriodSecs: how long to wait (default 10s)
// for a secondary to catch up before actually stepping down
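The retry advice above can be sketched as a generic wrapper. This is a minimal illustration, not driver code — modern drivers already offer retryable writes (`retryWrites=true`), and the error-message matching here is an assumption for demonstration:

```javascript
// Retry an async operation across an election window: on "not primary"-style
// errors, wait briefly and try again, up to maxRetries attempts.
async function withRetry(operation, maxRetries = 5, delayMs = 500) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      const retryable = /not primary/i.test(err.message); // illustrative check
      if (!retryable || attempt >= maxRetries) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage sketch: withRetry(() => orders.insertOne(doc))
```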

Priority and Vote Weight

// Priority: higher = more likely to become primary (0 = never eligible)
// Vote weight: 0 = non-voting member (can still replicate data)
cfg = rs.conf()
cfg.members[0].priority = 10   // strongly preferred primary
cfg.members[1].priority = 1
cfg.members[2].priority = 0    // never primary (e.g., analytics node)
cfg.members[2].votes    = 0    // non-voting (max 7 voting members in a set)
rs.reconfig(cfg)

CAP Theorem and Replica Sets

MongoDB replica sets are a CP system during partition: they prioritize Consistency over Availability. When a primary is isolated from the majority (network partition), it steps down and the cluster becomes temporarily unavailable for writes until a new primary is elected from the majority partition — preventing split-brain writes.
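The split-brain guard reduces to one comparison: after a partition, only the side holding a strict majority of votes may elect a primary. A minimal sketch:

```javascript
// A partitioned side may elect a primary only with a strict vote majority.
function canElectPrimary(sideVotes, totalVotes) {
  return sideVotes > totalVotes / 2;
}

// 5-node set partitioned 3/2: the 3-node side elects a new primary; the
// 2-node side (even if it holds the old primary) steps down and rejects writes.
canElectPrimary(3, 5); // true
canElectPrimary(2, 5); // false
```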

04
Write Concern
How many nodes must acknowledge a write for it to be considered durable
// w: number — specific count of nodes must acknowledge
db.orders.insertOne(doc, { writeConcern: { w: 1 } })        // primary only
db.orders.insertOne(doc, { writeConcern: { w: 2 } })        // primary + 1 secondary
db.orders.insertOne(doc, { writeConcern: { w: "majority" } }) // majority (safest)

// j: true — journal must be flushed to disk before acknowledgment
db.orders.insertOne(doc, { writeConcern: { w: "majority", j: true } })

// wtimeout: max milliseconds to wait for acknowledgment
db.orders.insertOne(doc, { writeConcern: { w: "majority", wtimeout: 5000 } })
// Returns a write concern error ("wtimeout") if not acknowledged within 5 seconds
// NOTE: wtimeout does NOT roll back the write — it just stops waiting
Write Concern          | Durability                | Latency                           | When to Use
w: 1                   | Low — primary only        | Fastest                           | Bulk imports, analytics inserts, non-critical
w: "majority"          | High — majority committed | Slightly higher (replication ack) | Financial data, user updates, anything important
w: "majority", j: true | Highest — disk-flushed    | Highest                           | Critical financial transactions
w: 0                   | None — fire and forget    | Lowest                            | Metrics, logging, acceptable data loss
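One way to keep these choices consistent across an application is to centralize them. `WRITE_CONCERNS` and `writeConcernFor` are hypothetical names; the option shapes match the write concern documents shown above:

```javascript
// Map durability levels to write concern option objects, so call sites
// ask for a level instead of hand-writing { w, j, wtimeout } everywhere.
const WRITE_CONCERNS = {
  fireAndForget: { w: 0 },
  fast:          { w: 1 },
  durable:       { w: "majority", wtimeout: 5000 },
  critical:      { w: "majority", j: true, wtimeout: 10000 },
};

function writeConcernFor(level) {
  const wc = WRITE_CONCERNS[level];
  if (!wc) throw new Error("unknown durability level: " + level);
  return { writeConcern: wc };
}

// Usage sketch: db.orders.insertOne(doc, writeConcernFor("durable"))
```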
05
Read Preference
Which replica set member serves read operations
Mode               | Reads From                     | Stale Risk            | Use Case
primary (default)  | Primary only                   | None — always fresh   | All reads requiring up-to-date data
primaryPreferred   | Primary; fallback to secondary | Yes (during failover) | High-availability reads; tolerate brief staleness
secondary          | Secondaries only               | Yes — replication lag | Analytics, reporting, read scaling
secondaryPreferred | Secondary; fallback to primary | Yes                   | Read-heavy apps; occasional fresh fallback
nearest            | Lowest network latency member  | Yes                   | Geographically distributed reads (lowest RTT)
// Set read preference on individual query
db.reports.find({ type: "monthly" }).readPref("secondary")

// Set read preference on connection string
// mongodb://mongo1,mongo2,mongo3/?replicaSet=myRS&readPreference=secondaryPreferred

// Tag sets: read from specific datacenter
db.getMongo().setReadPref("secondary", [{ datacenter: "eu-west-1" }])
WARN
Reading from secondaries introduces replication lag — your application may see data that does not yet reflect the most recent primary writes. Lag is typically milliseconds on a healthy cluster but can spike to seconds or minutes under heavy write load or network issues. Never route writes or consistency-sensitive reads to secondaries.
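How `nearest` plus a tag set could pick a member can be sketched as: filter candidates by tags, then take the lowest measured round-trip time. The member shapes here are illustrative, not the driver's internal format:

```javascript
// Select the lowest-RTT member among those matching every tag in tagSet.
function selectMember(members, tagSet = {}) {
  const matches = members.filter((m) =>
    Object.entries(tagSet).every(([key, value]) => m.tags[key] === value)
  );
  return matches.reduce(
    (best, m) => (best === null || m.rttMs < best.rttMs ? m : best),
    null
  );
}

const members = [
  { host: "mongo1", rttMs: 40, tags: { datacenter: "us-east-1" } },
  { host: "mongo2", rttMs: 12, tags: { datacenter: "eu-west-1" } },
  { host: "mongo3", rttMs: 25, tags: { datacenter: "eu-west-1" } },
];
selectMember(members, { datacenter: "eu-west-1" }).host; // "mongo2"
```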
06
The Oplog
Operations log — the replication mechanism and its implications

The oplog (operations log) is a special capped collection in the local database on each replica set member. Every write operation on the primary is recorded as an idempotent entry in the oplog. Secondaries tail the primary's oplog and replay operations to stay in sync.

// View the oplog (on primary):
use local
db.oplog.rs.find().sort({ $natural: -1 }).limit(10)
// Each entry: { ts, op, ns, o (operation object), o2 (selector) }
// op values: "i" = insert, "u" = update, "d" = delete, "c" = command

// Check oplog window size and replication lag:
rs.printReplicationInfo()
// Output includes: configured oplog size, log length, optime range

rs.printSecondaryReplicationInfo()
// Output includes: replication lag per secondary
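Why idempotency matters can be shown with a toy replay. This is a simplified sketch, not MongoDB's actual apply logic: because the oplog records updates as absolute `$set` values (never relative operations like `$inc`), replaying the same entry twice leaves the document unchanged:

```javascript
// Apply a simplified oplog-style update entry to a plain document.
function applyEntry(doc, entry) {
  if (entry.op === "u") Object.assign(doc, entry.o.$set); // absolute values only
  return doc;
}

const entry = { op: "u", ns: "shop.orders", o: { $set: { qty: 7 } } };
const doc = { _id: 1, qty: 3 };

applyEntry(doc, entry); // { _id: 1, qty: 7 }
applyEntry(doc, entry); // still { _id: 1, qty: 7 } -- replay is safe
```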

Oplog Size

The oplog is a capped collection — when full, oldest entries are overwritten. The oplog window is how far back in time a secondary can be and still sync. If a secondary falls behind further than the oplog window, it requires full re-sync.

// Default oplog size: ~5% of available disk space (min 1GB, max 50GB)
// Change oplog size (MongoDB 3.6+, can be done while running):
db.adminCommand({ replSetResizeOplog: 1, size: 16384 })  // 16GB in MB

// Rule of thumb: oplog window should cover at least your longest maintenance window
// If secondaries go offline for planned maintenance (2hr), oplog must last 2+ hours
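That rule of thumb is simple arithmetic. A back-of-envelope estimate, assuming you know your average write churn (oplog bytes generated per hour):

```javascript
// Estimated oplog window in hours: how long the oplog retains history
// before the oldest entries are overwritten.
function oplogWindowHours(oplogSizeMB, churnMBPerHour) {
  return oplogSizeMB / churnMBPerHour;
}

// 16GB oplog with 2GB/hour of write churn -> ~8-hour window,
// comfortably above a 2-hour maintenance window.
oplogWindowHours(16384, 2048); // 8
```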
07
Monitoring & Operations
Key commands for replica set health and maintenance
// Replica set status overview
rs.status()
// Key fields per member:
//   stateStr: "PRIMARY" | "SECONDARY" | "ARBITER" | "RECOVERING" | "STARTUP2"
//   health: 1 (healthy) | 0 (down)
//   optimeDate: last operation time (gap vs primary = replication lag)
//   lastHeartbeatMessage: error detail if unhealthy

// Replication lag per secondary:
rs.printSecondaryReplicationInfo()
// Watch for: "lag is Xsec" — high lag means secondary is falling behind

// Check if current node is primary:
db.isMaster()        // deprecated — { ismaster: true/false, ... }
db.hello()           // v5.0+ replacement — { isWritablePrimary: true/false, ... }

// Force a node to sync from a specific member:
db.adminCommand({ replSetSyncFrom: "mongo2:27017" })
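The lag figure reported above is just the gap between optimes. A sketch of computing it from an `rs.status()`-shaped document (field names follow the key fields listed earlier; the input here is a hand-built example, not live output):

```javascript
// Per-secondary replication lag: primary optimeDate minus secondary optimeDate.
function replicationLagSecs(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  return status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => ({
      host: m.name,
      lagSecs: (primary.optimeDate - m.optimeDate) / 1000,
    }));
}

const status = {
  members: [
    { name: "mongo1:27017", stateStr: "PRIMARY",   optimeDate: new Date("2024-01-01T00:00:10Z") },
    { name: "mongo2:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T00:00:08Z") },
  ],
};
replicationLagSecs(status); // [{ host: "mongo2:27017", lagSecs: 2 }]
```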
Command                            | Purpose
rs.status()                        | Health of all members, replication lag, state
rs.conf()                          | Current replica set configuration
rs.printReplicationInfo()          | Oplog size, window, timestamps on primary
rs.printSecondaryReplicationInfo() | Lag per secondary
rs.add("host:port")                | Add a new member to the set
rs.remove("host:port")             | Remove a member
rs.stepDown()                      | Force primary to step down (triggers election)
rs.reconfig(cfg)                   | Apply updated replica set configuration