In MongoDB, the $sample aggregation pipeline stage randomly selects the specified number of documents from its input.
Example
Suppose we have a collection called employees with the following documents:
{ "_id" : 1, "name" : "Bob", "salary" : 55000 }
{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 4, "name" : "Christopher", "salary" : 45000 }
{ "_id" : 5, "name" : "Beck", "salary" : 82000 }
{ "_id" : 6, "name" : "Homer", "salary" : 1 }
{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }
{ "_id" : 8, "name" : "Zoro", "salary" : 300000 }
{ "_id" : 9, "name" : "Xena", "salary" : 382000 }
We can use the $sample stage to randomly select a specified number of documents from that collection.
Example:
db.employees.aggregate(
  [
    {
      $sample: { size: 3 }
    }
  ]
)
Result:
{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }
In this case, I specified a sample size of 3, and we can see that three documents were returned in random order.
Here’s the result when I run the same code again:
{ "_id" : 1, "name" : "Bob", "salary" : 55000 }
{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }
{ "_id" : 9, "name" : "Xena", "salary" : 382000 }
We get a different selection of documents.
We can return more documents by increasing the size value.
Example:
db.employees.aggregate(
  [
    {
      $sample: { size: 5 }
    }
  ]
)
Result:
{ "_id" : 9, "name" : "Xena", "salary" : 382000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 4, "name" : "Christopher", "salary" : 45000 }
{ "_id" : 8, "name" : "Zoro", "salary" : 300000 }
{ "_id" : 5, "name" : "Beck", "salary" : 82000 }
Randomly Return All Documents
If the requested sample size equals or exceeds the number of documents in the collection, all documents are returned in random order.
Example:
db.employees.aggregate(
  [
    {
      $sample: { size: 100 }
    }
  ]
)
Result:
{ "_id" : 4, "name" : "Christopher", "salary" : 45000 }
{ "_id" : 8, "name" : "Zoro", "salary" : 300000 }
{ "_id" : 5, "name" : "Beck", "salary" : 82000 }
{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }
{ "_id" : 6, "name" : "Homer", "salary" : 1 }
{ "_id" : 9, "name" : "Xena", "salary" : 382000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }
{ "_id" : 1, "name" : "Bob", "salary" : 55000 }
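As a rough mental model for this behaviour, the sketch below shuffles an in-memory array and keeps the first size items. It is plain JavaScript rather than MongoDB's actual implementation, and the names sampleByShuffle and docs are hypothetical stand-ins used only for illustration.

```javascript
// Illustration only: an in-memory stand-in, not how MongoDB implements $sample.
// sampleByShuffle is a hypothetical helper name.
function sampleByShuffle(docs, size) {
  const copy = docs.slice(); // leave the input array untouched
  // Fisher-Yates shuffle: swap each position with a random earlier (or same) one.
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  // slice() stops at the end of the array, so an oversized request returns everything.
  return copy.slice(0, size);
}

// Nine placeholder documents standing in for the employees collection.
const docs = Array.from({ length: 9 }, (_, i) => ({ _id: i + 1 }));

console.log(sampleByShuffle(docs, 3).length);   // 3
console.log(sampleByShuffle(docs, 100).length); // 9
```

Asking for 100 items from nine documents simply yields all nine in shuffled order, which matches the result shown above.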
How $sample Calculates the Result
The $sample stage uses one of two methods to produce the result. The actual method used depends on the scenario.
The following table outlines which method is used for each scenario.
Scenario | Method used to produce the results |
---|---|
All of the following conditions are met: (1) $sample is the first stage of the pipeline; (2) the specified sample size is less than 5% of the total documents in the collection; (3) the collection contains more than 100 documents. | $sample uses a pseudo-random cursor to select documents. |
Any of the above conditions is not met. | $sample performs a collection scan followed by a random sort to select the specified number of documents. |
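The rules in the table can be restated as a small decision function. This is only a sketch: sampleMethod and its parameter names are hypothetical, it is not a MongoDB API, and it does nothing more than encode the three conditions listed above.

```javascript
// Hypothetical helper (not part of MongoDB): restates the table's three conditions.
function sampleMethod({ isFirstStage, sampleSize, collectionCount }) {
  const usesCursor =
    isFirstStage &&
    // "less than 5% of the documents", written as size * 20 < count
    // to avoid floating-point comparisons.
    sampleSize * 20 < collectionCount &&
    collectionCount > 100;
  return usesCursor ? "pseudo-random cursor" : "collection scan + random sort";
}

// The nine-document employees collection is too small for the cursor method.
console.log(sampleMethod({ isFirstStage: true, sampleSize: 3, collectionCount: 9 }));
// collection scan + random sort

// Assumed figures: a 1,000-document collection sampled in the first stage.
console.log(sampleMethod({ isFirstStage: true, sampleSize: 3, collectionCount: 1000 }));
// pseudo-random cursor
```

By these rules, every example in this article falls into the second row of the table, because the employees collection holds fewer than 100 documents.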
Duplicates
The MongoDB documentation warns that $sample may output the same document more than once in its result set.
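If repeated documents would be a problem, one option is to filter the sampled results on the client. The sketch below is plain JavaScript; dedupeById is a hypothetical helper, and it assumes each _id has a distinct string form (so the number 1 and the string "1" would be treated as the same key).

```javascript
// Hypothetical helper: keeps the first occurrence of each _id.
// Assumption: String(_id) is unique per document.
function dedupeById(docs) {
  const seen = new Set();
  return docs.filter((doc) => {
    const key = String(doc._id);
    if (seen.has(key)) return false; // already emitted, drop the repeat
    seen.add(key);
    return true;
  });
}

// Made-up sample output containing a repeated document.
const sampled = [
  { _id: 2, name: "Sarah" },
  { _id: 9, name: "Xena" },
  { _id: 2, name: "Sarah" }
];

console.log(dedupeById(sampled).length); // 2

// In the shell, the same helper could be applied to a real sample:
// dedupeById(db.employees.aggregate([{ $sample: { size: 3 } }]).toArray())
```

Note that filtering can leave fewer documents than the requested sample size.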