
Data Modeling
Patterns

FILE  30_data_modeling_patterns
TOPIC  Extended Reference · Computed · Subset · Bucket · Outlier · Polymorphic · Schema Versioning
LEVEL  Intermediate/Advanced
01
Pattern Overview
Named solutions to recurring MongoDB schema problems
overview

MongoDB data modeling patterns are named, repeatable solutions to common schema design problems. Unlike relational normalization rules, these patterns are access-pattern-driven — each solves a specific combination of read/write performance challenges.

Pattern            | Problem Solved                                      | Core Mechanism
Extended Reference | Frequent lookup of foreign document fields          | Embed hot fields alongside the reference ID
Computed           | Expensive aggregations on every read                | Pre-compute and store aggregated values
Subset             | Unbounded array in hot document                     | Embed last-N items; full list in child collection
Bucket             | High-frequency time-series writes                   | Group N time-adjacent records per document
Outlier            | Few documents vastly larger than most               | Overflow flag + separate overflow collection
Polymorphic        | Documents with varying structures in one collection | Discriminator field + shared index
Schema Versioning  | Schema evolution without migration                  | Store schema version number in each document
NOTE
Patterns are not mutually exclusive. A single schema can simultaneously use Computed (for like counts), Subset (for recent comments), and Extended Reference (for author info). Combine as needed per access pattern.
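As a sketch of such a combination, a hypothetical blog-post document (all field names illustrative) might carry three patterns at once:

```javascript
// Hypothetical blog-post document combining three patterns (illustrative fields)
const post = {
  _id: "post1",
  authorId: "u42",            // Extended Reference: full reference kept...
  author: {                   // ...plus hot author fields embedded
    name: "Alice Johnson",
    tier: "gold",
  },
  likeCount: 1204,            // Computed: maintained via $inc on each like
  recentComments: [           // Subset: newest few; full list in a child collection
    { userId: "u7", text: "Great post!" },
    { userId: "u9", text: "Very helpful." },
  ],
}
console.log(post.likeCount)   // 1204
```

Each field group is maintained by its own pattern's write path; the read path is a single document fetch.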
02
Extended Reference Pattern
Embed frequently-read foreign fields alongside the reference ID
extended ref

Problem: Documents reference another collection by ID, but every read also needs 2–3 fields from the referenced document (e.g., customer name when displaying an order). This forces a second query or $lookup on every read.

Solution: Keep the reference ID for full-document access, but embed the most-accessed fields as a denormalized snapshot directly in the referencing document.

// Without Extended Reference: 2 queries to render order + customer name
const order = db.orders.findOne({ _id: orderId })
const cust  = db.customers.findOne({ _id: order.customerId })  // extra round-trip

// With Extended Reference: 1 query — customer snapshot embedded
{
  _id:        ObjectId("order1"),
  customerId: ObjectId("cust1"),    // ← full reference kept
  customer: {                       // ← hot fields snapshot
    name:   "Alice Johnson",
    email:  "alice@example.com",
    tier:   "gold"
  },
  lineItems: [ ... ],
  total: 249.99,
  createdAt: ISODate("2024-03-01")
}
// Render order: 1 document read → all needed data present

What to Embed vs What to Reference

Embed (Hot Fields)                           | Keep as Reference Only
Fields needed on every display of the parent | Full document with all fields
Fields that rarely change (name, tier)       | Frequently-changing fields (balance, address)
Fields needed without a lookup               | Fields only needed in detail views
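One way to keep the embed/reference split consistent is to build the snapshot through a single helper. This is a sketch only; the helper name and field choice follow the table above:

```javascript
// Build the denormalized snapshot embedded in each order.
// Only hot, rarely-changing fields — never balance or address.
function customerSnapshot(customer) {
  return { name: customer.name, email: customer.email, tier: customer.tier }
}

const snap = customerSnapshot({
  _id: "cust1", name: "Alice Johnson", email: "alice@example.com",
  tier: "gold", balance: 120.5, address: "42 Main St",
})
// snap contains only name, email, tier
```

Routing every order insert through one helper ensures no call site accidentally embeds a frequently-changing field.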

Handling Stale Embedded Data

// When a customer changes their name → update all affected embedded snapshots
db.orders.updateMany(
  { "customer.email": "old@example.com" },
  { $set: { "customer.name": "Alice Smith", "customer.email": "new@example.com" } }
)
// Tradeoff: write fan-out on customer update
// Acceptable when: customer changes are rare; order display is very frequent
// Not acceptable when: embedded field changes constantly (e.g., live stock price)
03
Computed Pattern
Pre-calculate aggregated values on write; read in O(1)
computed

Problem: A value that every read needs (e.g., total order count, like count, average rating) requires an expensive aggregation that scans thousands of documents.

Solution: Calculate the value once on each write event and store it in the document. Reads become a simple field access.

// Problem: every product page view runs this expensive count
const reviewCount = db.reviews.countDocuments({ productId: productId })  // O(N)
const avgRating   = db.reviews.aggregate([
  { $match: { productId: productId } },
  { $group: { _id: null, avg: { $avg: "$rating" } } }
])  // scans all reviews every page view

// Solution: pre-computed fields maintained on write
{
  _id:         ObjectId("prod1"),
  name:        "Wireless Headphones",
  reviewCount: 2847,      // ← maintained via $inc
  avgRating:   4.3,       // ← recalculated on new review
  totalRevenue: 451456.53 // ← maintained via $inc
}

// On new review submission: atomic update
db.products.updateOne(
  { _id: productId },
  [
    {
      $set: {
        reviewCount: { $add: ["$reviewCount", 1] },
        // New avg = (oldAvg * oldCount + newRating) / (oldCount + 1)
        avgRating: {
          $divide: [
            { $add: [{ $multiply: ["$avgRating", "$reviewCount"] }, newRating] },
            { $add: ["$reviewCount", 1] }
          ]
        }
      }
    }
  ]
)
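The incremental-average arithmetic used in the pipeline above can be sanity-checked in plain JavaScript:

```javascript
// newAvg = (oldAvg * oldCount + newRating) / (oldCount + 1)
function applyReview(product, newRating) {
  const count = product.reviewCount
  return {
    ...product,
    reviewCount: count + 1,
    avgRating: (product.avgRating * count + newRating) / (count + 1),
  }
}

// 4 reviews averaging 4.0 (rating sum 16); a new 5-star review → 21 / 5 = 4.2
const updated = applyReview({ reviewCount: 4, avgRating: 4.0 }, 5)
console.log(updated.reviewCount, updated.avgRating)  // 5 4.2
```

Because the update stores only the running count and average, the new average is recoverable without ever re-scanning the reviews collection.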

Write Frequency vs Read Frequency Trade-off

Scenario                                         | Suitable for Computed? | Reason
Like count: 1000 reads/sec, 5 writes/sec         | Yes                    | Read-heavy; write overhead is tiny
Stock price: updates every second on many docs   | No                     | High write rate; pre-computation adds more writes
Monthly revenue total: updated on each sale      | Yes                    | Sales are infrequent; reporting is frequent
Real-time follower count for analytics dashboard | Maybe                  | Acceptable if slightly stale values are OK
04
Subset Pattern
Embed the N most relevant items; store the rest in a child collection
subset

Problem: An array of related items is needed for display, but the full list is too large to embed (hits 16MB or bloats the working set) yet too important to always require a second query.

Solution: Embed a bounded subset (e.g., the 5 most recent) directly in the parent, and store the full list in a separate collection queried on demand.

// Product document with subset of reviews
{
  _id:          ObjectId("prod1"),
  name:         "Wireless Headphones",
  avgRating:    4.3,
  reviewCount:  2847,
  recentReviews: [           // ← bounded subset — capped at 5 (3 shown here)
    { userId: ObjectId("u1"), rating: 5, text: "Amazing sound!", date: ISODate("2024-03-10") },
    { userId: ObjectId("u2"), rating: 4, text: "Great value",    date: ISODate("2024-03-09") },
    { userId: ObjectId("u3"), rating: 5, text: "Love it",        date: ISODate("2024-03-08") }
  ]
}

// reviews collection (full list — indexed on productId)
{ _id: ObjectId("r1"), productId: ObjectId("prod1"), userId: ObjectId("u1"),
  rating: 5, text: "Amazing sound!", date: ISODate("2024-03-10") }

// On new review: maintain subset with $push + $slice
db.products.updateOne(
  { _id: productId },
  {
    $push: {
      recentReviews: {
        $each:  [newReview],
        $sort:  { date: -1 },   // most recent first
        $slice: 5               // capped at exactly 5
      }
    },
    $inc: { reviewCount: 1 }
  }
)

// Load all reviews (user clicked "show all"):
db.reviews.find({ productId: productId }).sort({ date: -1 }).limit(20)
db.reviews.createIndex({ productId: 1, date: -1 })   // for fast pagination
TIP
The subset size (5 in this example) should match the UX: if the product page shows "5 most helpful reviews", embed exactly 5. If it shows "3 top reviews", embed 3. The subset size should be fixed and predetermined — don't change it dynamically, or the document size becomes unpredictable.
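In plain JavaScript, the $push with $each / $sort / $slice maintenance shown above behaves like the following sketch (numeric dates used for brevity):

```javascript
// Mirror $push with $each / $sort / $slice: newest first, length capped
function pushBounded(arr, item, limit) {
  return [...arr, item].sort((a, b) => b.date - a.date).slice(0, limit)
}

let recent = [{ date: 3 }, { date: 2 }, { date: 1 }]
recent = pushBounded(recent, { date: 4 }, 3)
// dates now [4, 3, 2] — oldest entry dropped, length stays at the cap
```

The cap is enforced on every write, so the parent document's size stays bounded no matter how many reviews accumulate.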
05
Bucket Pattern
Group time-series records into per-period documents
bucket

Problem: Time-series data (IoT sensors, metrics, financial ticks) generates one document per reading — millions of tiny documents that bloat the index, exhaust RAM, and slow range queries.

Solution: Group readings from the same time period (hour, day) into a single "bucket" document. Dramatically reduces document count and index size.

// ❌ Naive: 1 doc per sensor reading → 525,600 docs/sensor/year
{ sensorId: "S1", ts: ISODate("2024-03-01T10:00:00Z"), temp: 22.4, humid: 60 }
{ sensorId: "S1", ts: ISODate("2024-03-01T10:01:00Z"), temp: 22.5, humid: 60 }

// ✅ Bucket: 1 doc per hour per sensor → 8,760 docs/sensor/year
{
  sensorId:  "S1",
  hour:      ISODate("2024-03-01T10:00:00Z"),   // bucket key
  count:     60,
  readings:  [
    { m: 0, temp: 22.4, humid: 60 },
    { m: 1, temp: 22.5, humid: 60 },
    // ... 58 more
  ],
  stats: { minTemp: 22.1, maxTemp: 23.0, avgTemp: 22.6 }  // pre-computed
}

// Insert reading into current bucket (or create if not exists)
db.sensor_data.updateOne(
  { sensorId: "S1", hour: currentHour, count: { $lt: 60 } },
  {
    $push: { readings: { m: currentMinute, temp: 22.6, humid: 59 } },
    $inc:  { count: 1 },
    $min:  { "stats.minTemp": 22.6 },
    $max:  { "stats.maxTemp": 22.6 }
  },
  { upsert: true }
)

// Indexes for time-range queries
db.sensor_data.createIndex({ sensorId: 1, hour: -1 })

// Range query: last 24 hours of readings for sensor S1
db.sensor_data.find({
  sensorId: "S1",
  hour: { $gte: ISODate("2024-03-01T00:00:00Z") }
}).sort({ hour: -1 })
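Reads often need the individual points back out of the buckets; a plain-JavaScript sketch of unpacking one bucket into per-minute readings:

```javascript
// Expand one hourly bucket back into individual timestamped readings
function unpackBucket(bucket) {
  return bucket.readings.map(r => ({
    sensorId: bucket.sensorId,
    ts: new Date(bucket.hour.getTime() + r.m * 60 * 1000),  // minute offset → timestamp
    temp: r.temp,
    humid: r.humid,
  }))
}

const points = unpackBucket({
  sensorId: "S1",
  hour: new Date("2024-03-01T10:00:00Z"),
  readings: [{ m: 0, temp: 22.4, humid: 60 }, { m: 1, temp: 22.5, humid: 60 }],
})
// points[1].ts → 2024-03-01T10:01:00Z
```

Storing only the minute offset `m` per reading keeps each bucket small; the full timestamp is reconstructed from the bucket key on read.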
NOTE
MongoDB 5.0+ has native Time Series Collections, which implement the bucket pattern automatically with additional optimizations (automatic bucketing, compression, dedicated time-window queries). For new time-series workloads, prefer a native time series collection over a manual bucket implementation: db.createCollection("sensor_data", { timeseries: { timeField: "ts", metaField: "sensorId", granularity: "minutes" } })
06
Outlier & Polymorphic Patterns
Handling exceptional documents and mixed-structure collections
outlier

Outlier Pattern

Problem: 99% of documents fit the normal schema (e.g., a user with 500 followers), but 1% are vastly larger (a celebrity with 50M followers). Designing the schema for the outlier wastes space for everyone else. Designing for the normal case breaks for outliers.

Solution: Design for the common case. Add a hasOverflow flag to outlier documents and store the overflow in a separate collection.

// Normal user: followers array embedded (bounded ≤ 500)
{
  _id:      ObjectId("u1"),
  username: "regular_user",
  followerCount: 487,
  followers: [ ObjectId("f1"), ObjectId("f2"), ... ]   // embedded, fine
}

// Celebrity: overflow flag — followers in separate collection
{
  _id:         ObjectId("u99"),
  username:    "celebrity",
  followerCount: 50000000,
  hasOverflow: true,           // ← outlier flag
  followers:   []              // or first 500 for quick preview
}

// Application code checks flag:
if (user.hasOverflow) {
  followers = db.follower_overflow.find({ userId: user._id }).limit(pageSize)
} else {
  followers = user.followers
}
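The write path mirrors the read path. A plain-JavaScript simulation of the overflow decision (EMBED_LIMIT shrunk to 3 for illustration; 500 in the schema above):

```javascript
const EMBED_LIMIT = 3  // 500 in the real schema — small here for illustration

// Embed until the bound is hit, then flip the flag and spill to overflow
function addFollower(user, overflowColl, followerId) {
  user.followerCount += 1
  if (!user.hasOverflow && user.followers.length < EMBED_LIMIT) {
    user.followers.push(followerId)
  } else {
    user.hasOverflow = true
    overflowColl.push({ userId: user._id, followerId })
  }
}

const user = { _id: "u1", followerCount: 0, hasOverflow: false, followers: [] }
const overflow = []
for (const f of ["f1", "f2", "f3", "f4"]) addFollower(user, overflow, f)
// user.followers holds 3, overflow holds 1, hasOverflow is true
```

Once the flag flips, the document never grows again; all subsequent followers land in the overflow collection.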

Polymorphic Pattern

Problem: Multiple entity types with different structures belong logically in the same collection (e.g., different product types: books, electronics, clothing — each with different fields).

Solution: Store all types in one collection. Add a discriminator field (e.g., type or productType) that tells the application which fields to expect. Create partial indexes per type.

// All products in one collection — different fields per type
{ _id: ObjectId("b1"), type: "book",        title: "MongoDB Basics", author: "J. Doe", isbn: "978-..." }
{ _id: ObjectId("e1"), type: "electronics", title: "Headphones",     brand: "Sony",    wattage: 15 }
{ _id: ObjectId("c1"), type: "clothing",    title: "T-Shirt",        size: "M",        material: "cotton" }

// Partial index on type-specific field
db.products.createIndex(
  { isbn: 1 },
  { partialFilterExpression: { type: "book" }, unique: true }
)

// Benefits: single collection query, single index for cross-type search
db.products.find({ $text: { $search: "wireless" } })  // searches all types (requires a text index)
db.products.find({ type: "electronics", brand: "Sony" })
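On the application side, the discriminator typically drives a dispatch table. This is a sketch; the formatters and label format are illustrative:

```javascript
// One formatter per discriminator value
const formatters = {
  book:        p => `${p.title} by ${p.author}`,
  electronics: p => `${p.brand} ${p.title}`,
  clothing:    p => `${p.title} (${p.size}, ${p.material})`,
}

// Look up the handler by the discriminator field
function label(product) {
  const fmt = formatters[product.type]
  if (!fmt) throw new Error(`unknown product type: ${product.type}`)
  return fmt(product)
}

const text = label({ type: "book", title: "MongoDB Basics", author: "J. Doe" })
// "MongoDB Basics by J. Doe"
```

Failing loudly on an unknown type catches documents whose discriminator the application hasn't been taught to handle.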
07
Schema Versioning
Evolving schemas without full migration downtime
versioning

Problem: The application evolves and the schema must change. Running a migration script to update millions of documents is risky, slow, and requires downtime. Old documents and new documents coexist during migration.

Solution: Store a schemaVersion field in every document. Application code handles multiple versions. Documents are migrated lazily (on read) or via a background job.

// Version 1 documents (old schema)
{ _id: ObjectId("u1"), schemaVersion: 1, name: "Alice Johnson", phone: "555-1234" }

// Version 2 documents (new schema — phone split into object)
{
  _id: ObjectId("u2"), schemaVersion: 2,
  name: "Bob Smith",
  phone: { countryCode: "+1", number: "555-5678" }
}

// Application handles both versions:
function getPhone(user) {
  if (user.schemaVersion === 1) {
    return user.phone                          // string
  } else {
    return `${user.phone.countryCode} ${user.phone.number}`
  }
}

// Optional: lazy migration on read (upgrade on access)
function getUser(userId) {
  const user = db.users.findOne({ _id: userId })
  if (user.schemaVersion < 2) {
    const newPhone = { countryCode: "+1", number: user.phone }
    db.users.updateOne({ _id: userId }, {
      $set: { phone: newPhone, schemaVersion: 2 }
    })
    user.phone = newPhone          // return the upgraded shape to the caller
    user.schemaVersion = 2
  }
  return user
}

// Background job: migrate remaining v1 docs during off-peak hours
db.users.find({ schemaVersion: { $lt: 2 } }).forEach(user => {
  db.users.updateOne({ _id: user._id }, {
    $set: { phone: { countryCode: "+1", number: user.phone }, schemaVersion: 2 }
  })
})
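When more versions accumulate, per-version upgraders can be chained so that any old document normalizes to the latest shape in memory. A sketch, where the v1→v2 step matches the migration above:

```javascript
// upgraders[N] maps a version-N document to version N+1
const upgraders = {
  1: doc => ({
    ...doc,
    phone: { countryCode: "+1", number: doc.phone },
    schemaVersion: 2,
  }),
}

// Apply upgraders in sequence until no further step exists
function normalize(doc) {
  while (upgraders[doc.schemaVersion]) doc = upgraders[doc.schemaVersion](doc)
  return doc
}

const u = normalize({ _id: "u1", schemaVersion: 1, name: "Alice", phone: "555-1234" })
// u.schemaVersion === 2; u.phone === { countryCode: "+1", number: "555-1234" }
```

Adding a v3 later only requires adding `upgraders[2]`; existing call sites keep working unchanged.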

Pattern Selection Quick Reference

Symptom                                           | Apply Pattern
Every read does a second $lookup for 2–3 fields   | Extended Reference
$count/$sum runs on every page load               | Computed
Array grows without bound in hot document         | Subset (bounded embed) + child collection
Millions of tiny time-series documents            | Bucket (or native Time Series)
A few documents are 100x larger than average      | Outlier
Multiple entity types in one logical group        | Polymorphic
Schema changed; old documents still in production | Schema Versioning