In MongoDB, the $sampleRate operator matches a random selection of input documents in an aggregation pipeline. It is typically used inside a $match stage.
The number of documents selected approximates the sample rate expressed as a percentage of the total number of documents.
The $sampleRate operator was introduced in MongoDB 4.4.2.
When you use $sampleRate, you provide the sample rate as a floating point number between 0 and 1, inclusive. The selection process uses a uniform random distribution, and the sample rate you provide represents the probability that a given document will be selected as it passes through the pipeline.
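The per-document probability described above can be modelled in plain JavaScript. This is only a conceptual sketch, not MongoDB's actual implementation: the helper name sampleByRate and the optional random parameter are hypothetical and exist purely for illustration.

```javascript
// Conceptual model of $sampleRate (not MongoDB's internal code):
// each document is kept independently with probability `rate`.
// `sampleByRate` and its `random` parameter are hypothetical names.
function sampleByRate(docs, rate, random = Math.random) {
  // Reject anything outside 0..1 (this check also rejects NaN).
  if (!(rate >= 0 && rate <= 1)) {
    throw new RangeError("rate must be a number between 0 and 1");
  }
  // random() returns a value in [0, 1), so a rate of 0 keeps nothing
  // and a rate of 1 keeps everything.
  return docs.filter(() => random() < rate);
}

const employees = [
  { _id: 1, name: "Bob", salary: 55000 },
  { _id: 2, name: "Sarah", salary: 128000 },
  { _id: 3, name: "Fritz", salary: 25000 }
];

// The number of documents printed will vary from run to run.
console.log(sampleByRate(employees, 0.33));
```

Because every document gets its own random draw, nothing in this model guarantees a particular number of results.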
Example
Suppose we have a collection called employees with the following documents:
{ "_id" : 1, "name" : "Bob", "salary" : 55000 }
{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 4, "name" : "Christopher", "salary" : 45000 }
{ "_id" : 5, "name" : "Beck", "salary" : 82000 }
{ "_id" : 6, "name" : "Homer", "salary" : 1 }
{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }
{ "_id" : 8, "name" : "Zoro", "salary" : 300000 }
{ "_id" : 9, "name" : "Xena", "salary" : 382000 }
We can use the $sampleRate operator within a $match stage to randomly select roughly a given proportion of the documents in that collection.
Example:
db.employees.aggregate(
   [
      {
         $match: { $sampleRate: 0.33 }
      }
   ]
)
Result:
{ "_id" : 1, "name" : "Bob", "salary" : 55000 }
{ "_id" : 6, "name" : "Homer", "salary" : 1 }
{ "_id" : 8, "name" : "Zoro", "salary" : 300000 }
By providing a sample rate of 0.33, we specified that roughly a third of the documents should be returned.
However, the number of documents actually returned can vary from run to run, because each document is selected at random. In small collections, the proportion returned can differ widely from the sample rate, whereas in larger collections it should be closer to the rate you specified.
To demonstrate this, here’s the result set I get when I run the same code again:
{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 4, "name" : "Christopher", "salary" : 45000 }
{ "_id" : 5, "name" : "Beck", "salary" : 82000 }
{ "_id" : 6, "name" : "Homer", "salary" : 1 }
{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }
And again:
{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }
{ "_id" : 8, "name" : "Zoro", "salary" : 300000 }
And once again:
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 6, "name" : "Homer", "salary" : 1 }
{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }
This is a small collection, so the number of documents returned varies considerably between runs.
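To put rough numbers on that variation, here is a small sketch that assumes each document is selected independently with the same probability (an assumption about the model, not a statement about MongoDB's internals). Under that assumption the number of documents returned follows a binomial distribution. The function name expectedSampleStats is hypothetical.

```javascript
// Assumption: each of the n documents is kept independently with
// probability p, so the count returned is binomially distributed.
// `expectedSampleStats` is a hypothetical helper name.
function expectedSampleStats(n, p) {
  const mean = n * p;                        // expected number of documents
  const stdDev = Math.sqrt(n * p * (1 - p)); // typical deviation from the mean
  // Spread relative to the mean; this shrinks as the collection grows.
  const relativeSpread = mean > 0 ? stdDev / mean : 0;
  return { mean, stdDev, relativeSpread };
}

// Our 9-document collection versus a hypothetical 90,000-document one.
console.log(expectedSampleStats(9, 0.33));
console.log(expectedSampleStats(90000, 0.33));
```

For the 9 documents above, this gives an expected count of about 3 with a standard deviation of about 1.4, so results of 2 or 6 documents are not surprising. For a hypothetical collection of 90,000 documents, the expected count is about 29,700 with a standard deviation of only about 141, which is under half a percent of the total returned.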
If you require an exact number of documents to be returned, use the $sample stage instead.
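As a minimal sketch of that alternative, the pipeline below uses $sample with a size of 3 against the same employees collection. The variable name exactSamplePipeline is only for illustration, and the call is guarded so that the snippet also loads outside the MongoDB shell, where db is not defined.

```javascript
// Pipeline that asks for exactly three randomly chosen documents.
// `exactSamplePipeline` is a hypothetical variable name.
const exactSamplePipeline = [
  { $sample: { size: 3 } }
];

// Only run the aggregation when a MongoDB shell provides `db`.
if (typeof db !== "undefined") {
  db.employees.aggregate(exactSamplePipeline).forEach(doc => printjson(doc));
}
```

Unlike $sampleRate, this should return three documents on every run, as long as the collection contains at least that many.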