MongoDB $sampleRate

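In MongoDB, the $sampleRate operator matches a random selection of input documents. It is not a pipeline stage in its own right; it is used inside a query, such as the $match stage of an aggregation pipeline.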

The number of documents selected approximates the sample rate expressed as a percentage of the total number of documents.

The $sampleRate operator was introduced in MongoDB 4.4.2.

When you use $sampleRate, you provide the sample rate as a floating point number between 0 and 1. The selection process uses a uniform random distribution, and the sample rate you provide represents the probability that a given document will be selected as it passes through the pipeline.
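
To make the selection rule concrete, here is a minimal sketch in plain JavaScript of the behavior described above: each document is kept independently with a probability equal to the sample rate. The function name sampleByRate and its rng parameter are hypothetical and only illustrate the idea; this is not MongoDB's actual implementation.

```javascript
// Conceptual model only: keep each document independently with
// probability `rate`. sampleByRate and rng are illustrative names,
// not part of MongoDB.
function sampleByRate(docs, rate, rng = Math.random) {
  // rng() returns a uniform number in [0, 1), so it falls below
  // `rate` with probability `rate`.
  return docs.filter(() => rng() < rate);
}

const docs = [
  { _id: 1, name: "Bob" },
  { _id: 2, name: "Sarah" },
  { _id: 3, name: "Fritz" }
];

// The number of documents printed varies from run to run.
console.log(sampleByRate(docs, 0.33));
```

Because every document gets its own random draw, nothing forces the result to contain exactly a third of the input. A run may return every document or none at all.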

Example

Suppose we have a collection called employees with the following documents:

{ "_id" : 1, "name" : "Bob", "salary" : 55000 }
{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 4, "name" : "Christopher", "salary" : 45000 }
{ "_id" : 5, "name" : "Beck", "salary" : 82000 }
{ "_id" : 6, "name" : "Homer", "salary" : 1 }
{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }
{ "_id" : 8, "name" : "Zoro", "salary" : 300000 }
{ "_id" : 9, "name" : "Xena", "salary" : 382000 }

We can use the $sampleRate operator within a $match stage to randomly select an approximate proportion of the documents in that collection.

Example:

db.employees.aggregate(
   [
      { 
        $match: { $sampleRate: 0.33 } 
      }
   ]
)

Result:

{ "_id" : 1, "name" : "Bob", "salary" : 55000 }
{ "_id" : 6, "name" : "Homer", "salary" : 1 }
{ "_id" : 8, "name" : "Zoro", "salary" : 300000 }

By providing a sample rate of 0.33, we specified that roughly a third of the documents should be returned.

However, the actual number of documents returned can vary considerably, depending on how many documents are in the collection. Because each document is selected independently, a small collection can return noticeably more or fewer documents than the sample rate suggests, whereas a larger collection should return a proportion closer to the sample rate.

To demonstrate this, here’s the result set I get when I run the same code again:

{ "_id" : 2, "name" : "Sarah", "salary" : 128000 }
{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 4, "name" : "Christopher", "salary" : 45000 }
{ "_id" : 5, "name" : "Beck", "salary" : 82000 }
{ "_id" : 6, "name" : "Homer", "salary" : 1 }
{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }

And again:

{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }
{ "_id" : 8, "name" : "Zoro", "salary" : 300000 }

And once again:

{ "_id" : 3, "name" : "Fritz", "salary" : 25000 }
{ "_id" : 6, "name" : "Homer", "salary" : 1 }
{ "_id" : 7, "name" : "Bartholomew", "salary" : 1582000 }

This is a smaller collection, and so the results vary quite significantly.
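
The effect of collection size can be estimated with a little arithmetic. Assuming each document is selected independently with the same probability, as described earlier, the number of documents returned follows a binomial distribution. The following sketch is plain JavaScript rather than MongoDB code, and the helper name selectionSpread is made up for illustration.

```javascript
// Assumes independent selection per document, so the count of
// returned documents is binomial with n documents and probability rate.
function selectionSpread(n, rate) {
  const expected = n * rate;                        // mean of the binomial
  const stdDev = Math.sqrt(n * rate * (1 - rate));  // standard deviation
  return { expected, stdDev, relative: stdDev / expected };
}

// 9 documents, as in the employees collection
console.log(selectionSpread(9, 0.33));

// A hypothetical collection of one million documents
console.log(selectionSpread(1000000, 0.33));
```

For 9 documents the expected count is about 3 with a standard deviation of about 1.4, so counts well above or below 3 are plausible. For one million documents, the standard deviation is under 0.2% of the expected count.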

If you require an exact number of documents to be returned, use the $sample stage instead.
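
As a sketch of that alternative, the pipeline below uses the $sample stage with a size of 3 against the same employees collection. It is meant to be pasted into mongosh; the typeof guard is only there so the snippet also loads where no db object exists.

```javascript
// $sample takes the number of documents to return in its size field.
const pipeline = [
  { $sample: { size: 3 } }
];

// db is only defined inside the MongoDB shell.
if (typeof db !== "undefined") {
  db.employees.aggregate(pipeline).forEach(doc => printjson(doc));
}
```

Each run should return three randomly chosen documents, provided the collection holds at least three.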