In MongoDB, the $sample aggregation pipeline stage randomly selects the specified number of documents from its input.
Example
Suppose we have a collection called employees with the following documents:
{ "_id" : 1, "name" : "Bob", "salary" : 55000 }
{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 4, "name" : "Christopher", "salary" : 45000 }
{ "_id" : 5, "name" : "Beck", "salary" : 82000 }
{ "_id" : 6, "name" : "Homer", "salary" : 1 }
{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }
{ "_id" : 8, "name" : "Zoro", "salary" : 300000 }
{ "_id" : 9, "name" : "Xena", "salary" : 382000 }
We can use the $sample stage to randomly select a specified number of documents from that collection.
Example:
db.employees.aggregate(
   [
      {
         $sample: { size: 3 }
      }
   ]
)
Result:
{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }
In this case, I set the sample size to 3. We can see that three documents were returned in random order.
Here’s the result when I run the same code again:
{ "_id" : 1, "name" : "Bob", "salary" : 55000 }
{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }
{ "_id" : 9, "name" : "Xena", "salary" : 382000 }
We get a different selection of documents.
We can return more documents by specifying a larger size.
Example:
db.employees.aggregate(
   [
      {
         $sample: { size: 5 }
      }
   ]
)
Result:
{ "_id" : 9, "name" : "Xena", "salary" : 382000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 4, "name" : "Christopher", "salary" : 45000 }
{ "_id" : 8, "name" : "Zoro", "salary" : 300000 }
{ "_id" : 5, "name" : "Beck", "salary" : 82000 }
Randomly Return All Documents
If the requested sample size equals or exceeds the number of documents in the collection, all documents are returned in random order.
Example:
db.employees.aggregate(
   [
      {
         $sample: { size: 100 }
      }
   ]
)
Result:
{ "_id" : 4, "name" : "Christopher", "salary" : 45000 }
{ "_id" : 8, "name" : "Zoro", "salary" : 300000 }
{ "_id" : 5, "name" : "Beck", "salary" : 82000 }
{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }
{ "_id" : 6, "name" : "Homer", "salary" : 1 }
{ "_id" : 9, "name" : "Xena", "salary" : 382000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }
{ "_id" : 1, "name" : "Bob", "salary" : 55000 }
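To make the behaviour above easier to picture, here is a rough model of it in plain JavaScript. This is only a sketch of what we observe from the outside, not how MongoDB implements $sample internally. The helper name sampleDocs is my own invention, and the array simply mirrors the employees collection. Also note that this model never repeats a document, which the real stage does not guarantee.

```javascript
// Illustrative model only: NOT MongoDB's actual implementation of $sample.
// sampleDocs is a hypothetical helper name used for this sketch.
function sampleDocs(docs, size) {
  const shuffled = docs.slice(); // copy so the input array is left untouched

  // Fisher-Yates shuffle: put the copy into a random order
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }

  // Taking the first `size` items returns everything when size >= docs.length
  return shuffled.slice(0, size);
}

// In-memory copy of the employees collection from this article
const employees = [
  { _id: 1, name: "Bob", salary: 55000 },
  { _id: 2, name: "Sarah", salary: 128000 },
  { _id: 3, name: "Fritz", salary: 25000 },
  { _id: 4, name: "Christopher", salary: 45000 },
  { _id: 5, name: "Beck", salary: 82000 },
  { _id: 6, name: "Homer", salary: 1 },
  { _id: 7, name: "Bartholomew", salary: 1582000 },
  { _id: 8, name: "Zoro", salary: 300000 },
  { _id: 9, name: "Xena", salary: 382000 },
];

const everyone = sampleDocs(employees, 100);
console.log(everyone.length); // 9
```

Asking for 100 documents from a set of 9 simply yields all 9, in a shuffled order that changes from run to run.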
How $sample Calculates the Result
The $sample stage uses one of two methods to produce the result. The actual method used depends on the scenario.
The following table outlines which method is used for each scenario.
| Scenario | Method used to produce the results |
|---|---|
| All of the following conditions are met: – $sample is the first stage of the pipeline – The specified sample size is less than 5% of the total documents in the collection – The collection contains more than 100 documents | $sample uses a pseudo-random cursor to select documents. |
| Any of the above conditions is not met. | $sample performs a collection scan followed by a random sort to select the specified number of documents. |
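The decision in the table can be written out as a small plain JavaScript function. This is only a sketch that restates the table's conditions. The function name sampleMethod and its parameter names are hypothetical, and the thresholds are copied from the table, so they may not match every MongoDB version.

```javascript
// Hypothetical helper that restates the table's conditions.
// It does not query MongoDB; it only encodes the rules listed above.
function sampleMethod({ isFirstStage, sampleSize, totalDocs }) {
  const usesCursor =
    isFirstStage &&                    // $sample is the first pipeline stage
    sampleSize < 0.05 * totalDocs &&   // sample size is under 5% of the documents
    totalDocs > 100;                   // collection has more than 100 documents

  return usesCursor
    ? "pseudo-random cursor"
    : "collection scan + random sort";
}

// The 9-document employees collection fails the size conditions
console.log(sampleMethod({ isFirstStage: true, sampleSize: 3, totalDocs: 9 }));
// collection scan + random sort

// A larger collection with a small sample meets all three conditions
console.log(sampleMethod({ isFirstStage: true, sampleSize: 10, totalDocs: 1000 }));
// pseudo-random cursor
```

So every example in this article would fall into the second row of the table, because the employees collection has far fewer than 100 documents.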
Duplicates
The MongoDB documentation warns that $sample may output the same document more than once in its result set.
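If duplicates would be a problem, one option is to filter them out in application code after the results come back. The following is a minimal sketch in plain JavaScript. The helper name dedupeById is hypothetical, and the sampled array is made up for illustration rather than taken from a real query. It assumes simple _id values like the numbers used in this article; values such as ObjectId would need to be converted to strings before being compared.

```javascript
// Hypothetical helper: keep only the first occurrence of each _id.
// Assumes _id values are primitives (numbers or strings).
function dedupeById(docs) {
  const seen = new Set();
  return docs.filter((doc) => {
    if (seen.has(doc._id)) {
      return false; // already returned this document once
    }
    seen.add(doc._id);
    return true;
  });
}

// Made-up result set containing a repeated document (for illustration only)
const sampled = [
  { _id: 7, name: "Bartholomew", salary: 1582000 },
  { _id: 3, name: "Fritz", salary: 25000 },
  { _id: 7, name: "Bartholomew", salary: 1582000 },
];

console.log(dedupeById(sampled).map((doc) => doc._id)); // [ 7, 3 ]
```

Bear in mind that removing duplicates this way can leave us with fewer documents than the sample size we asked for.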