← back

$group
Aggregation & Accumulators

FILE  22_aggregation_group
TOPIC  _id Key · Accumulators · Null Group · Memory Limits · Compound Keys · Patterns
LEVEL  Foundation
01
What is $group
Collapse many documents into one per group key
group

$group is the aggregation pipeline's equivalent of SQL GROUP BY. It consumes all input documents and collapses them into one output document per unique _id value. Accumulators compute values across all documents in each group.

Basic Syntax

{
  $group: {
    _id:          <expression>,             // REQUIRED — the grouping key
    <fieldName>:  { <accumulator>: <expr> } // one or more accumulated fields
  }
}

// Group orders by status, count and sum
db.orders.aggregate([
  {
    $group: {
      _id:         "$status",
      orderCount:  { $sum: 1 },
      totalAmount: { $sum: "$amount" },
      avgAmount:   { $avg: "$amount" }
    }
  }
])
// Output — one document per distinct status value:
// { _id: "completed", orderCount: 142, totalAmount: 58240, avgAmount: 410.1 }
// { _id: "pending",   orderCount:  31, totalAmount:  9800, avgAmount: 316.1 }

What Survives After $group

$group is destructive — only the _id field and explicitly accumulated fields survive in the output. All other fields from the input documents are gone.

// Input documents have: _id, name, status, amount, region, createdAt
// After $group on status — only _id and declared accumulators remain
{
  $group: {
    _id:   "$status",
    count: { $sum: 1 }
  }
}
// Output: { _id: "completed", count: 142 }
// Gone: name, amount, region, createdAt — no reference to originals
NOTE
After $group, the only way to reference original document fields is through accumulator outputs. If you need a scalar field from the source (like the document's name), use $first or $last to capture it during grouping. You cannot reach back to original fields after the stage is past.
02
The _id Key
Grouping granularity — null, scalar, expression, compound
key

The _id field in $group defines the grouping key. Every document with the same _id value produces a single output document. The value can be any expression.

_id: null — Whole Collection Aggregation

// _id: null groups ALL documents into a single output document
db.orders.aggregate([
  {
    $group: {
      _id:         null,
      totalRevenue: { $sum: "$amount" },
      orderCount:  { $sum: 1 },
      avgOrder:    { $avg: "$amount" }
    }
  }
])
// Output: one document summarising the entire collection
// { _id: null, totalRevenue: 108420, orderCount: 342, avgOrder: 316.7 }

Scalar Field Reference

// Group by a single field — most common pattern
{ $group: { _id: "$category", count: { $sum: 1 } } }
{ $group: { _id: "$region",   total: { $sum: "$sales" } } }
{ $group: { _id: "$userId",   orderCount: { $sum: 1 } } }

Expression as _id

// Group by computed value — e.g., first letter of name
{
  $group: {
    _id:   { $substrCP: ["$name", 0, 1] },  // first character
    count: { $sum: 1 }
  }
}

// Group by price tier (bucketing)
{
  $group: {
    _id: {
      $switch: {
        branches: [
          { case: { $lt: ["$price", 10]  }, then: "budget" },
          { case: { $lt: ["$price", 50]  }, then: "mid" },
          { case: { $lt: ["$price", 200] }, then: "premium" }
        ],
        default: "luxury"
      }
    },
    count: { $sum: 1 },
    avgPrice: { $avg: "$price" }
  }
}
TIP
Any valid expression can be the _id. Anything that evaluates to the same value groups together. This means you can group by computed fields, string transformations, date extractions, or any expression — not just existing field values.
03
Accumulators
$sum · $avg · $min · $max · $first · $last · $push · $addToSet · $count
accumulators

Counting and Summing

// $sum — add values; $sum: 1 counts documents
{ $sum: "$amount" }   // sum of amount field across group
{ $sum: 1 }           // count all documents in group
{ $sum: { $cond: [{ $eq: ["$status", "completed"] }, 1, 0] } }
// conditional count — only completed orders

// $count — shorthand for { $sum: 1 } in $group context (MongoDB 5.0+)
{ $count: { } }       // count documents in group

Statistical Accumulators

// $avg — arithmetic mean; ignores null/missing values
{ $avg: "$score" }

// $min and $max — extreme values; BSON type order applies
{ $min: "$price" }
{ $max: "$price" }
{ $min: "$createdAt" }   // earliest date
{ $max: "$createdAt" }   // latest date

// $stdDevPop — population standard deviation (all docs are the population)
{ $stdDevPop: "$responseTime" }

// $stdDevSamp — sample standard deviation (group is a sample of larger set)
{ $stdDevSamp: "$responseTime" }

$first and $last

// $first — value from the FIRST document in each group
// $last  — value from the LAST document in each group
// Both depend on document order entering the stage — $sort BEFORE $group

// Get the largest order per customer (sort desc first, then $first picks it)
db.orders.aggregate([
  { $sort:  { amount: -1 } },       // highest amount first
  {
    $group: {
      _id:       "$customerId",
      topOrder:  { $first: "$amount" },    // first = highest after sort
      orderDate: { $first: "$createdAt" }  // date of that top order
    }
  }
])
WARN
$first and $last return the value from the first/last document as they arrive at the $group stage. Without an explicit $sort before $group, the ordering is non-deterministic. Always $sort before $group when $first/$last must return a meaningful value.

$push and $addToSet

// $push — collect all values into an array (allows duplicates)
{
  $group: {
    _id:   "$customerId",
    items: { $push: "$productId" }   // array of all product IDs per customer
  }
}
// Output: { _id: "C001", items: ["P1", "P2", "P1", "P3"] }
// Note: "P1" appears twice (duplicates preserved)

// $addToSet — unique values only (unordered set)
{
  $group: {
    _id:        "$customerId",
    uniqueItems: { $addToSet: "$productId" }
  }
}
// Output: { _id: "C001", uniqueItems: ["P1", "P2", "P3"] }
// Order is not guaranteed in $addToSet output

Accumulator Quick Reference

AccumulatorReturnsNull/Missing behaviour
$sumSum of values; 0 if missingNull/missing = 0
$avgArithmetic meanNull/missing values ignored
$minMinimum valueNull/missing ignored
$maxMaximum valueNull/missing ignored
$countDocument count in groupAlways counts
$firstValue from first doc in groupIf first doc has null/missing → null
$lastValue from last doc in groupIf last doc has null/missing → null
$pushArray of all values (duplicates OK)Null/missing pushed as null
$addToSetArray of unique values (unordered)Null added once if present
$stdDevPopPopulation std deviationNull/missing ignored
$stdDevSampSample std deviationNull/missing ignored
04
Null and Missing Groups
How $group handles null and missing _id values
gotcha

Documents where the group key field is null or missing are all grouped together under the single key null. This can produce unexpected groupings if not anticipated.

// Data: some orders have no "region" field
{ _id: 1, amount: 100, region: "north" }
{ _id: 2, amount: 200, region: null }
{ _id: 3, amount: 150 }                   // region field missing
{ _id: 4, amount: 300, region: "south" }

db.orders.aggregate([
  { $group: { _id: "$region", total: { $sum: "$amount" } } }
])

// Output:
// { _id: "north",  total: 100 }
// { _id: "south",  total: 300 }
// { _id: null,     total: 350 }  ← docs 2 AND 3 grouped together
WARN
Documents with null for the group key and documents with the group key field missing entirely are all collapsed into the same { _id: null } group. If this group appears unexpectedly in your results, pre-filter with $match: { region: { $exists: true, $ne: null } } before $group.

Using the null Group Intentionally

// Intentional null group — all documents into one summary
db.orders.aggregate([
  { $group: { _id: null, total: { $sum: "$amount" }, count: { $sum: 1 } } }
])
// { _id: null, total: 108420, count: 342 }
// Use this pattern for collection-wide aggregation (grand total, average, etc.)
05
Memory Limits
100MB per-stage limit · allowDiskUse · blocking nature
performance

$group is a blocking stage — it must consume the entire input stream before emitting any output. All intermediate group state is held in memory, subject to the 100MB per-stage limit.

The 100MB Limit

// Default: each aggregation stage can use at most 100MB of RAM
// $group holds all group keys + accumulated values in memory simultaneously
// Exceeding 100MB throws: "Exceeded memory limit for $group stage"

// Fix: enable disk spillover
db.orders.aggregate(
  [ { $group: { _id: "$customerId", total: { $sum: "$amount" } } } ],
  { allowDiskUse: true }   // allows spilling to disk past 100MB
)
// Disk spill is much slower than in-memory — pre-filter first
DANGER
allowDiskUse: true is not a free fix — disk I/O is orders of magnitude slower than RAM. It should be a last resort. The right approach is: $match aggressively before $group, $project to minimum fields, and ensure you are grouping on a well-indexed field so fewer documents reach the $group stage.

Reducing $group Memory Usage

// Pattern: filter aggressively BEFORE grouping
db.events.aggregate([

  // 1. Filter early — uses index, reduces stream
  { $match: { eventDate: { $gte: ISODate("2024-01-01") }, type: "purchase" } },

  // 2. Drop unneeded fields BEFORE grouping
  { $project: { userId: 1, amount: 1, _id: 0 } },

  // 3. Group on reduced dataset
  { $group: { _id: "$userId", total: { $sum: "$amount" } } }
])
// Each stage reduces the data volume passed to the next

$group is Blocking — Understand the Implications

StageBlocking?Implication
$matchNo — streamingFirst output immediately
$projectNo — streamingFirst output immediately
$sortYes (without limit)Consumes all input first
$groupYes — alwaysMust see all input; holds state in memory
$lookupNo — streamingEach doc looked up independently
$unwindNo — streamingFan-out, not blocking
06
Compound Keys & Date Grouping
Group by multiple fields and time periods
compound

The _id key can be an embedded document, creating a compound grouping key — equivalent to SQL's GROUP BY field1, field2.

Compound Grouping Key

// Group by region + status — two-field compound key
db.orders.aggregate([
  {
    $group: {
      _id:   { region: "$region", status: "$status" },
      count: { $sum: 1 },
      total: { $sum: "$amount" }
    }
  }
])
// Output:
// { _id: { region: "north", status: "completed" }, count: 42, total: 18200 }
// { _id: { region: "north", status: "pending" },   count:  8, total:  2400 }
// { _id: { region: "south", status: "completed" }, count: 71, total: 31500 }
// ...
// Access compound key after: "$_id.region", "$_id.status"

Group by Date Period

// Group by year + month (monthly report)
db.orders.aggregate([
  {
    $group: {
      _id: {
        year:  { $year:  "$createdAt" },
        month: { $month: "$createdAt" }
      },
      revenue:    { $sum: "$amount" },
      orderCount: { $sum: 1 }
    }
  },
  { $sort: { "_id.year": 1, "_id.month": 1 } }
])

// Group by day of week — e.g., which days are busiest
db.orders.aggregate([
  {
    $group: {
      _id:   { $dayOfWeek: "$createdAt" },  // 1=Sun, 7=Sat
      count: { $sum: 1 }
    }
  }
])

// Group by year only
db.orders.aggregate([
  { $group: { _id: { $year: "$createdAt" }, total: { $sum: "$amount" } } }
])

Access Compound Key in Later Stages

// After $group with compound _id, reference nested fields with dot notation
db.orders.aggregate([
  {
    $group: {
      _id:   { year: { $year: "$createdAt" }, month: { $month: "$createdAt" } },
      total: { $sum: "$amount" }
    }
  },
  { $sort: { "_id.year": 1, "_id.month": 1 } },
  {
    $project: {
      _id:   0,
      year:  "$_id.year",    // promote nested to top level
      month: "$_id.month",
      total: 1
    }
  }
])
07
Common Patterns
Revenue report · Deduplication · Distinct values · Top-N per group
patterns

Revenue Report by Category

db.orders.aggregate([
  { $match: { status: "completed" } },
  {
    $group: {
      _id:        "$category",
      revenue:    { $sum: "$amount" },
      orderCount: { $sum: 1 },
      avgOrder:   { $avg: "$amount" },
      maxOrder:   { $max: "$amount" }
    }
  },
  { $sort: { revenue: -1 } },
  { $limit: 10 }
])

Deduplication via $group

// Remove duplicate emails — keep first occurrence (sorted by _id)
db.signups.aggregate([
  { $sort: { _id: 1 } },
  {
    $group: {
      _id:       "$email",
      docId:     { $first: "$_id" },    // earliest _id for this email
      name:      { $first: "$name" },
      createdAt: { $first: "$createdAt" }
    }
  },
  { $project: { _id: "$docId", email: "$_id", name: 1, createdAt: 1 } }
])

Collect Distinct Values

// Get all distinct tags across all articles
db.articles.aggregate([
  { $unwind: "$tags" },
  { $group: { _id: "$tags" } },
  { $sort:  { _id: 1 } }
])
// Equivalent to: db.articles.distinct("tags") — use distinct() for simplicity

// Collect all unique tags PER category (array per group)
db.articles.aggregate([
  { $unwind: "$tags" },
  {
    $group: {
      _id:  "$category",
      tags: { $addToSet: "$tags" }   // unique set, not ordered
    }
  }
])

Top-N per Group (Top Customer per Region)

// Get the top-spending customer per region
db.orders.aggregate([

  // 1. Group by customer + region to get customer totals
  {
    $group: {
      _id:    { region: "$region", customerId: "$customerId" },
      spend:  { $sum: "$amount" }
    }
  },

  // 2. Sort by spend descending within each region
  { $sort: { "_id.region": 1, spend: -1 } },

  // 3. Re-group by region, collecting top customer (first = highest spend)
  {
    $group: {
      _id:            "$_id.region",
      topCustomer:    { $first: "$_id.customerId" },
      topCustomerSpend: { $first: "$spend" }
    }
  }
])

Count Only — $group vs $count

// Total document count — three equivalent ways
db.col.aggregate([ { $group: { _id: null, n: { $sum: 1 } } } ])
db.col.aggregate([ { $count: "n" } ])              // cleaner syntax
await collection.countDocuments({})                // driver method — fastest

// Count per group — must use $group
db.col.aggregate([ { $group: { _id: "$status", count: { $sum: 1 } } } ])
TIP
When you need a simple total count, countDocuments() is faster than an aggregation pipeline — it runs a query with no data processing. Use $group when you need counts per group or combined with other accumulators in the same stage. Use the $count stage when you need the final count of documents after filtering/transformation.