
Data Modeling
Patterns

FILE  30_data_modeling_patterns
TOPIC  Extended Reference · Computed · Subset · Bucket · Outlier · Polymorphic · Schema Versioning
LEVEL  Intermediate/Advanced
01
Pattern Overview
Named solutions to recurring MongoDB schema problems
overview

MongoDB data modeling patterns are named, repeatable solutions to common schema design problems. Unlike relational normalization rules, these patterns are access-pattern-driven — each solves a specific combination of read/write performance challenges.

Pattern            | Problem Solved                                      | Core Mechanism
Extended Reference | Frequent lookup of foreign document fields          | Embed hot fields alongside the reference ID
Computed           | Expensive aggregations on every read                | Pre-compute and store aggregated values
Subset             | Unbounded array in hot document                     | Embed last-N items; full list in child collection
Bucket             | High-frequency time-series writes                   | Group N time-adjacent records per document
Outlier            | Few documents vastly larger than most               | Overflow flag + separate overflow collection
Polymorphic        | Documents with varying structures in one collection | Discriminator field + shared index
Schema Versioning  | Schema evolution without migration                  | Store schema version number in each document
NOTE
Patterns are not mutually exclusive. A single schema can simultaneously use Computed (for like counts), Subset (for recent comments), and Extended Reference (for author info). Combine as needed per access pattern.
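As a sketch of such a combination, a hypothetical blog-post document (all field names illustrative) might carry three patterns at once:

```javascript
// Hypothetical blog-post document combining three patterns (illustrative fields)
const post = {
  _id: "post1",
  authorId: "u42",            // Extended Reference: full reference kept...
  author: {                   // ...plus hot author fields embedded
    name: "Alice Johnson",
    tier: "gold",
  },
  likeCount: 1204,            // Computed: maintained via $inc on each like
  recentComments: [           // Subset: newest few; full list in a child collection
    { userId: "u7", text: "Great post!" },
    { userId: "u9", text: "Very helpful." },
  ],
}
console.log(post.likeCount)   // 1204
```

Each field group is maintained by its own pattern's write path; the read path is a single document fetch.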
02
Extended Reference Pattern
Embed frequently-read foreign fields alongside the reference ID
extended ref

Problem: Documents reference another collection by ID, but every read also needs 2–3 fields from the referenced document (e.g., customer name when displaying an order). This forces a second query or $lookup on every read.

Solution: Keep the reference ID for full-document access, but embed the most-accessed fields as a denormalized snapshot directly in the referencing document.

// Without Extended Reference: 2 queries to render order + customer name
const order = db.orders.findOne({ _id: orderId })
const cust  = db.customers.findOne({ _id: order.customerId })  // extra round-trip

// With Extended Reference: 1 query — customer snapshot embedded
{
  _id:        ObjectId("order1"),
  customerId: ObjectId("cust1"),    // ← full reference kept
  customer: {                       // ← hot fields snapshot
    name:   "Alice Johnson",
    email:  "alice@example.com",
    tier:   "gold"
  },
  lineItems: [ ... ],
  total: 249.99,
  createdAt: ISODate("2024-03-01")
}
// Render order: 1 document read → all needed data present

What to Embed vs What to Reference

Embed (Hot Fields)                           | Keep as Reference Only
Fields needed on every display of the parent | Full document with all fields
Fields that rarely change (name, tier)       | Frequently-changing fields (balance, address)
Fields needed without a lookup               | Fields only needed in detail views
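One way to keep the embed/reference split consistent is to build the snapshot through a single helper. This is a sketch only; the helper name and field choice follow the table above:

```javascript
// Build the denormalized snapshot embedded in each order.
// Only hot, rarely-changing fields — never balance or address.
function customerSnapshot(customer) {
  return { name: customer.name, email: customer.email, tier: customer.tier }
}

const snap = customerSnapshot({
  _id: "cust1", name: "Alice Johnson", email: "alice@example.com",
  tier: "gold", balance: 120.5, address: "42 Main St",
})
// snap contains only name, email, tier
```

Routing every order insert through one helper ensures no call site accidentally embeds a frequently-changing field.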

Handling Stale Embedded Data

// When a customer changes their name → update all affected embedded snapshots
db.orders.updateMany(
  { "customer.email": "old@example.com" },
  { $set: { "customer.name": "Alice Smith", "customer.email": "new@example.com" } }
)
// Tradeoff: write fan-out on customer update
// Acceptable when: customer changes are rare; order display is very frequent
// Not acceptable when: embedded field changes constantly (e.g., live stock price)
03
Computed Pattern
Pre-calculate aggregated values on write; read in O(1)
computed

Problem: A value that every read needs (e.g., total order count, like count, average rating) requires an expensive aggregation that scans thousands of documents.

Solution: Calculate the value once on each write event and store it in the document. Reads become a simple field access.

// Problem: every product page view runs this expensive count
const reviewCount = db.reviews.countDocuments({ productId: productId })  // O(N)
const avgRating   = db.reviews.aggregate([
  { $match: { productId: productId } },
  { $group: { _id: null, avg: { $avg: "$rating" } } }
])  // scans all reviews every page view

// Solution: pre-computed fields maintained on write
{
  _id:         ObjectId("prod1"),
  name:        "Wireless Headphones",
  reviewCount: 2847,      // ← maintained via $inc
  avgRating:   4.3,       // ← recalculated on new review
  totalRevenue: 451456.53 // ← maintained via $inc
}

// On new review submission: atomic update
db.products.updateOne(
  { _id: productId },
  [
    {
      $set: {
        reviewCount: { $add: ["$reviewCount", 1] },
        // New avg = (oldAvg * oldCount + newRating) / (oldCount + 1)
        avgRating: {
          $divide: [
            { $add: [{ $multiply: ["$avgRating", "$reviewCount"] }, newRating] },
            { $add: ["$reviewCount", 1] }
          ]
        }
      }
    }
  ]
)
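The incremental-average arithmetic used in the pipeline above can be sanity-checked in plain JavaScript:

```javascript
// newAvg = (oldAvg * oldCount + newRating) / (oldCount + 1)
function applyReview(product, newRating) {
  const count = product.reviewCount
  return {
    ...product,
    reviewCount: count + 1,
    avgRating: (product.avgRating * count + newRating) / (count + 1),
  }
}

// 4 reviews averaging 4.0 (rating sum 16); a new 5-star review → 21 / 5 = 4.2
const updated = applyReview({ reviewCount: 4, avgRating: 4.0 }, 5)
console.log(updated.reviewCount, updated.avgRating)  // 5 4.2
```

Because the update stores only the running count and average, the new average is recoverable without ever re-scanning the reviews collection.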

Write Frequency vs Read Frequency Trade-off

Scenario                                         | Suitable for Computed? | Reason
Like count: 1000 reads/sec, 5 writes/sec         | Yes                    | Read-heavy; write overhead is tiny
Stock price: updates every second on many docs   | No                     | High write rate; pre-computation adds more writes
Monthly revenue total: updated on each sale      | Yes                    | Sales are infrequent; reporting is frequent
Real-time follower count for analytics dashboard | Maybe                  | Acceptable if slightly stale values are OK
04
Subset Pattern
Embed the N most relevant items; store the rest in a child collection
subset

Problem: An array of related items is needed for display, but the full list is too large to embed (hits 16MB or bloats the working set) yet too important to always require a second query.

Solution: Embed a bounded subset (e.g., the 5 most recent) directly in the parent, and store the full list in a separate collection queried on demand.

// Product document with subset of reviews
{
  _id:          ObjectId("prod1"),
  name:         "Wireless Headphones",
  avgRating:    4.3,
  reviewCount:  2847,
  recentReviews: [           // ← bounded subset — capped at 5 (3 shown here)
    { userId: ObjectId("u1"), rating: 5, text: "Amazing sound!", date: ISODate("2024-03-10") },
    { userId: ObjectId("u2"), rating: 4, text: "Great value",    date: ISODate("2024-03-09") },
    { userId: ObjectId("u3"), rating: 5, text: "Love it",        date: ISODate("2024-03-08") }
  ]
}

// reviews collection (full list — indexed on productId)
{ _id: ObjectId("r1"), productId: ObjectId("prod1"), userId: ObjectId("u1"),
  rating: 5, text: "Amazing sound!", date: ISODate("2024-03-10") }

// On new review: maintain subset with $push + $slice
db.products.updateOne(
  { _id: productId },
  {
    $push: {
      recentReviews: {
        $each:  [newReview],
        $sort:  { date: -1 },   // most recent first
        $slice: 5               // capped at exactly 5
      }
    },
    $inc: { reviewCount: 1 }
  }
)

// Load all reviews (user clicked "show all"):
db.reviews.find({ productId: productId }).sort({ date: -1 }).limit(20)
db.reviews.createIndex({ productId: 1, date: -1 })   // for fast pagination
TIP
The subset size (5 in this example) should match the UX: if the product page shows "5 most helpful reviews", embed exactly 5. If it shows "3 top reviews", embed 3. The subset size should be fixed and predetermined — don't change it dynamically, or the document size becomes unpredictable.
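In plain JavaScript, the $push with $each / $sort / $slice maintenance shown above behaves like the following sketch (numeric dates used for brevity):

```javascript
// Mirror $push with $each / $sort / $slice: newest first, length capped
function pushBounded(arr, item, limit) {
  return [...arr, item].sort((a, b) => b.date - a.date).slice(0, limit)
}

let recent = [{ date: 3 }, { date: 2 }, { date: 1 }]
recent = pushBounded(recent, { date: 4 }, 3)
// dates now [4, 3, 2] — oldest entry dropped, length stays at the cap
```

The cap is enforced on every write, so the parent document's size stays bounded no matter how many reviews accumulate.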
05
Bucket Pattern
Group time-series records into per-period documents
bucket

Problem: Time-series data (IoT sensors, metrics, financial ticks) generates one document per reading — millions of tiny documents that bloat the index, exhaust RAM, and slow range queries.

Solution: Group readings from the same time period (hour, day) into a single "bucket" document. Dramatically reduces document count and index size.

// ❌ Naive: 1 doc per sensor reading → 525,600 docs/sensor/year
{ sensorId: "S1", ts: ISODate("2024-03-01T10:00:00Z"), temp: 22.4, humid: 60 }
{ sensorId: "S1", ts: ISODate("2024-03-01T10:01:00Z"), temp: 22.5, humid: 60 }

// ✅ Bucket: 1 doc per hour per sensor → 8,760 docs/sensor/year
{
  sensorId:  "S1",
  hour:      ISODate("2024-03-01T10:00:00Z"),   // bucket key
  count:     60,
  readings:  [
    { m: 0, temp: 22.4, humid: 60 },
    { m: 1, temp: 22.5, humid: 60 },
    // ... 58 more
  ],
  stats: { minTemp: 22.1, maxTemp: 23.0, avgTemp: 22.6 }  // pre-computed
}

// Insert reading into current bucket (or create if not exists)
db.sensor_data.updateOne(
  { sensorId: "S1", hour: currentHour, count: { $lt: 60 } },
  {
    $push: { readings: { m: currentMinute, temp: 22.6, humid: 59 } },
    $inc:  { count: 1 },
    $min:  { "stats.minTemp": 22.6 },
    $max:  { "stats.maxTemp": 22.6 }
  },
  { upsert: true }
)

// Indexes for time-range queries
db.sensor_data.createIndex({ sensorId: 1, hour: -1 })

// Range query: last 24 hours of readings for sensor S1
db.sensor_data.find({
  sensorId: "S1",
  hour: { $gte: ISODate("2024-03-01T00:00:00Z") }
}).sort({ hour: -1 })
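Reads often need the individual points back out of the buckets; a plain-JavaScript sketch of unpacking one bucket into per-minute readings:

```javascript
// Expand one hourly bucket back into individual timestamped readings
function unpackBucket(bucket) {
  return bucket.readings.map(r => ({
    sensorId: bucket.sensorId,
    ts: new Date(bucket.hour.getTime() + r.m * 60 * 1000),  // minute offset → timestamp
    temp: r.temp,
    humid: r.humid,
  }))
}

const points = unpackBucket({
  sensorId: "S1",
  hour: new Date("2024-03-01T10:00:00Z"),
  readings: [{ m: 0, temp: 22.4, humid: 60 }, { m: 1, temp: 22.5, humid: 60 }],
})
// points[1].ts → 2024-03-01T10:01:00Z
```

Storing only the minute offset `m` per reading keeps each bucket small; the full timestamp is reconstructed from the bucket key on read.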
NOTE
MongoDB 5.0+ has native Time Series Collections, which implement the bucket pattern automatically with additional optimizations (automatic bucketing, compression, dedicated time-window queries). For new time-series workloads, prefer a native time series collection over a manual bucket implementation: db.createCollection("sensor_data", { timeseries: { timeField: "ts", metaField: "sensorId", granularity: "minutes" } })
06
Outlier & Polymorphic Patterns
Handling exceptional documents and mixed-structure collections
outlier

Outlier Pattern

Problem: 99% of documents fit the normal schema (e.g., a user with 500 followers), but 1% are vastly larger (a celebrity with 50M followers). Designing the schema for the outlier wastes space for everyone else. Designing for the normal case breaks for outliers.

Solution: Design for the common case. Add a hasOverflow flag to outlier documents and store the overflow in a separate collection.

// Normal user: followers array embedded (bounded ≤ 500)
{
  _id:      ObjectId("u1"),
  username: "regular_user",
  followerCount: 487,
  followers: [ ObjectId("f1"), ObjectId("f2"), ... ]   // embedded, fine
}

// Celebrity: overflow flag — followers in separate collection
{
  _id:         ObjectId("u99"),
  username:    "celebrity",
  followerCount: 50000000,
  hasOverflow: true,           // ← outlier flag
  followers:   []              // or first 500 for quick preview
}

// Application code checks flag:
if (user.hasOverflow) {
  followers = db.follower_overflow.find({ userId: user._id }).limit(pageSize)
} else {
  followers = user.followers
}
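The write path mirrors the read path. A plain-JavaScript simulation of the overflow decision (EMBED_LIMIT shrunk to 3 for illustration; 500 in the schema above):

```javascript
const EMBED_LIMIT = 3  // 500 in the real schema — small here for illustration

// Embed until the bound is hit, then flip the flag and spill to overflow
function addFollower(user, overflowColl, followerId) {
  user.followerCount += 1
  if (!user.hasOverflow && user.followers.length < EMBED_LIMIT) {
    user.followers.push(followerId)
  } else {
    user.hasOverflow = true
    overflowColl.push({ userId: user._id, followerId })
  }
}

const user = { _id: "u1", followerCount: 0, hasOverflow: false, followers: [] }
const overflow = []
for (const f of ["f1", "f2", "f3", "f4"]) addFollower(user, overflow, f)
// user.followers holds 3, overflow holds 1, hasOverflow is true
```

Once the flag flips, the document never grows again; all subsequent followers land in the overflow collection.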

Polymorphic Pattern

Problem: Multiple entity types with different structures belong logically in the same collection (e.g., different product types: books, electronics, clothing — each with different fields).

Solution: Store all types in one collection. Add a discriminator field (e.g., type or productType) that tells the application which fields to expect. Create partial indexes per type.

// All products in one collection — different fields per type
{ _id: ObjectId("b1"), type: "book",        title: "MongoDB Basics", author: "J. Doe", isbn: "978-..." }
{ _id: ObjectId("e1"), type: "electronics", title: "Headphones",     brand: "Sony",    wattage: 15 }
{ _id: ObjectId("c1"), type: "clothing",    title: "T-Shirt",        size: "M",        material: "cotton" }

// Partial index on type-specific field
db.products.createIndex(
  { isbn: 1 },
  { partialFilterExpression: { type: "book" }, unique: true }
)

// Benefits: single collection query, single index for cross-type search
db.products.find({ $text: { $search: "wireless" } })  // searches all types (requires a text index)
db.products.find({ type: "electronics", brand: "Sony" })
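On the application side, the discriminator typically drives a dispatch table. This is a sketch; the formatters and label format are illustrative:

```javascript
// One formatter per discriminator value
const formatters = {
  book:        p => `${p.title} by ${p.author}`,
  electronics: p => `${p.brand} ${p.title}`,
  clothing:    p => `${p.title} (${p.size}, ${p.material})`,
}

// Look up the handler by the discriminator field
function label(product) {
  const fmt = formatters[product.type]
  if (!fmt) throw new Error(`unknown product type: ${product.type}`)
  return fmt(product)
}

const text = label({ type: "book", title: "MongoDB Basics", author: "J. Doe" })
// "MongoDB Basics by J. Doe"
```

Failing loudly on an unknown type catches documents whose discriminator the application hasn't been taught to handle.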
07
Schema Versioning
Evolving schemas without full migration downtime
versioning

Problem: The application evolves and the schema must change. Running a migration script to update millions of documents is risky, slow, and requires downtime. Old documents and new documents coexist during migration.

Solution: Store a schemaVersion field in every document. Application code handles multiple versions. Documents are migrated lazily (on read) or via a background job.

// Version 1 documents (old schema)
{ _id: ObjectId("u1"), schemaVersion: 1, name: "Alice Johnson", phone: "555-1234" }

// Version 2 documents (new schema — phone split into object)
{
  _id: ObjectId("u2"), schemaVersion: 2,
  name: "Bob Smith",
  phone: { countryCode: "+1", number: "555-5678" }
}

// Application handles both versions:
function getPhone(user) {
  if (user.schemaVersion === 1) {
    return user.phone                          // string
  } else {
    return `${user.phone.countryCode} ${user.phone.number}`
  }
}

// Optional: lazy migration on read (upgrade on access)
function getUser(userId) {
  const user = db.users.findOne({ _id: userId })
  if (user.schemaVersion < 2) {
    const newPhone = { countryCode: "+1", number: user.phone }
    db.users.updateOne({ _id: userId }, {
      $set: { phone: newPhone, schemaVersion: 2 }
    })
    user.phone = newPhone          // return the upgraded shape to the caller
    user.schemaVersion = 2
  }
  return user
}

// Background job: migrate remaining v1 docs during off-peak hours
db.users.find({ schemaVersion: { $lt: 2 } }).forEach(user => {
  db.users.updateOne({ _id: user._id }, {
    $set: { phone: { countryCode: "+1", number: user.phone }, schemaVersion: 2 }
  })
})
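When more versions accumulate, per-version upgraders can be chained so that any old document normalizes to the latest shape in memory. A sketch, where the v1→v2 step matches the migration above:

```javascript
// upgraders[N] maps a version-N document to version N+1
const upgraders = {
  1: doc => ({
    ...doc,
    phone: { countryCode: "+1", number: doc.phone },
    schemaVersion: 2,
  }),
}

// Apply upgraders in sequence until no further step exists
function normalize(doc) {
  while (upgraders[doc.schemaVersion]) doc = upgraders[doc.schemaVersion](doc)
  return doc
}

const u = normalize({ _id: "u1", schemaVersion: 1, name: "Alice", phone: "555-1234" })
// u.schemaVersion === 2; u.phone === { countryCode: "+1", number: "555-1234" }
```

Adding a v3 later only requires adding `upgraders[2]`; existing call sites keep working unchanged.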

Pattern Selection Quick Reference

Symptom                                           | Apply Pattern
Every read does a second $lookup for 2–3 fields   | Extended Reference
$count/$sum runs on every page load               | Computed
Array grows without bound in hot document         | Subset (bounded embed) + child collection
Millions of tiny time-series documents            | Bucket (or native Time Series)
A few documents are 100x larger than average      | Outlier
Multiple entity types in one logical group        | Polymorphic
Schema changed; old documents still in production | Schema Versioning