MongoDB $sample

In MongoDB, the $sample aggregation pipeline stage randomly selects the specified number of documents from its input.

Example

Suppose we have a collection called employees with the following documents:

{ "_id" : 1, "name" : "Bob", "salary" : 55000 }
{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 4, "name" : "Christopher", "salary" : 45000 }
{ "_id" : 5, "name" : "Beck", "salary" : 82000 }
{ "_id" : 6, "name" : "Homer", "salary" : 1 }
{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }
{ "_id" : 8, "name" : "Zoro", "salary" : 300000 }
{ "_id" : 9, "name" : "Xena", "salary" : 382000 }

We can use the $sample stage to randomly select a specified number of documents from that collection.

Example:

db.employees.aggregate(
   [
      { 
        $sample: { size: 3 } 
      }
   ]
)

Result:

{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }

In this case I specified a sample size of 3. We can see that three documents were returned in random order.

Here’s the result when I run the same code again:

{ "_id" : 1, "name" : "Bob", "salary" : 55000 }
{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }
{ "_id" : 9, "name" : "Xena", "salary" : 382000 }

We get a different selection of documents.

We can return more documents by specifying a larger size.

Example:

db.employees.aggregate(
   [
      { 
        $sample: { size: 5 } 
      }
   ]
)

Result:

{ "_id" : 9, "name" : "Xena", "salary" : 382000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 4, "name" : "Christopher", "salary" : 45000 }
{ "_id" : 8, "name" : "Zoro", "salary" : 300000 }
{ "_id" : 5, "name" : "Beck", "salary" : 82000 }
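Because $sample selects from whatever documents reach it, it can also follow other stages. Here's a minimal sketch, assuming we only want to sample employees who earn more than 50000. The threshold and the pipeline variable name are illustrative choices, not part of the examples above.

```javascript
// Illustrative pipeline: filter first, then sample from the filtered documents.
const pipeline = [
  { $match: { salary: { $gt: 50000 } } }, // keep only documents with salary above 50000
  { $sample: { size: 2 } }                // randomly pick 2 of the remaining documents
];

// In the mongo shell, this would be run as:
// db.employees.aggregate(pipeline)
```

The documents returned will vary from run to run, but each one should have a salary greater than 50000.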

Randomly Return All Documents

If the requested sample size matches or exceeds the number of documents in the collection, all documents are returned in random order.

Example:

db.employees.aggregate(
   [
      { 
        $sample: { size: 100 } 
      }
   ]
)

Result:

{ "_id" : 4, "name" : "Christopher", "salary" : 45000 }
{ "_id" : 8, "name" : "Zoro", "salary" : 300000 }
{ "_id" : 5, "name" : "Beck", "salary" : 82000 }
{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }
{ "_id" : 6, "name" : "Homer", "salary" : 1 }
{ "_id" : 9, "name" : "Xena", "salary" : 382000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }
{ "_id" : 1, "name" : "Bob", "salary" : 55000 }
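One way to picture this behavior is as a shuffle followed by taking the first N items. The following plain JavaScript sketch is only a conceptual model, not MongoDB's actual implementation, and the sampleSketch function name is made up for illustration.

```javascript
// Conceptual model only: shuffle a copy of the input, then keep the first `size` items.
function sampleSketch(docs, size) {
  const copy = docs.slice(); // leave the original array untouched
  // Fisher-Yates shuffle
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  // slice() stops at the end of the array, so an oversized request returns everything
  return copy.slice(0, size);
}

const ids = [1, 2, 3, 4, 5, 6, 7, 8, 9];
console.log(sampleSketch(ids, 3).length);   // 3
console.log(sampleSketch(ids, 100).length); // 9
```

Requesting 100 items from 9 simply returns all 9 in shuffled order, which mirrors the result above.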

How $sample Calculates the Result

The $sample stage uses one of two methods to produce the result. The actual method used depends on the scenario.

The following outlines which method is used for each scenario.

Scenario: All of the following conditions are met:
– $sample is the first stage of the pipeline
– The specified sample size is less than 5% of the total documents in the collection
– The collection contains more than 100 documents
Method used: $sample uses a pseudo-random cursor to select documents.

Scenario: Any of the above conditions is not met.
Method used: $sample performs a collection scan followed by a random sort to select the specified number of documents.
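The conditions above can be restated as a small decision function. This is just a sketch to make the rule concrete: sampleMethod and its parameter names are hypothetical and are not part of any MongoDB API.

```javascript
// Hypothetical helper that restates the conditions above; not a MongoDB API.
function sampleMethod({ isFirstStage, sampleSize, totalDocs }) {
  const usesCursor =
    isFirstStage &&                   // $sample is the first stage of the pipeline
    sampleSize < totalDocs * 0.05 &&  // sample size is less than 5% of the documents
    totalDocs > 100;                  // collection contains more than 100 documents
  return usesCursor ? "pseudo-random cursor" : "collection scan + random sort";
}

// Our 9-document employees collection is too small for the cursor method.
console.log(sampleMethod({ isFirstStage: true, sampleSize: 3, totalDocs: 9 }));
// collection scan + random sort

// A hypothetical 10,000-document collection, sampled for 100 documents (1%).
console.log(sampleMethod({ isFirstStage: true, sampleSize: 100, totalDocs: 10000 }));
// pseudo-random cursor
```

Going by these conditions, all of the examples in this article would use the collection scan and random sort, because the employees collection contains only 9 documents.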

Duplicates

The MongoDB documentation warns that $sample may output the same document more than once in its result set.
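If duplicates would be a problem, one option is to filter them out on the client after the results come back. Here's a minimal sketch, assuming the _id values are simple values such as the numbers used in this article. The dedupeById function and the sampled array are made up for illustration.

```javascript
// Hypothetical helper: keep only the first occurrence of each _id.
// Assumes _id values are primitives (numbers or strings), as in this article's examples.
function dedupeById(docs) {
  const seen = new Set();
  return docs.filter((doc) => {
    if (seen.has(doc._id)) return false; // already seen, drop it
    seen.add(doc._id);
    return true;
  });
}

// A made-up result set containing a repeated document.
const sampled = [
  { _id: 2, name: "Sarah" },
  { _id: 7, name: "Bartholomew" },
  { _id: 2, name: "Sarah" }
];
console.log(dedupeById(sampled).length); // 2
```

Bear in mind that removing duplicates this way can leave fewer documents than the requested sample size.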