Sampling Rows from a Table in DuckDB with the SAMPLE Clause

DuckDB’s SAMPLE clause is a handy feature that allows us to work with a random subset of our data. This is particularly useful when dealing with large datasets where processing the entire dataset might be time-consuming or unnecessary for exploratory data analysis, testing queries, or creating representative samples.

When we use this clause, we can specify the absolute number of rows to return, or a percentage of rows. We also have an option of sampling method to use.

Specifying a Row Count

As I just mentioned, one way to use the clause is to specify the number of rows we want returned. We can specify just the number, or we can follow up the number with the ROWS keyword:

SELECT * FROM sales USING SAMPLE 10;
SELECT * FROM sales USING SAMPLE 10 ROWS;

Both of those queries sample ten rows from the sales table.

Specifying a Percentage

Alternatively, we can specify percentages. We can use the percent sign (%) or spell it out as PERCENT:

SELECT * FROM sales USING SAMPLE 5%;
SELECT * FROM sales USING SAMPLE 5 PERCENT;

Both of those queries return a sample size of ten percent of the table.

Specifying the Sampling Method

We can also specify the sampling method to use. We have three sampling methods to choose from:

  • Reservoir
  • Bernoulli
  • System

Here are examples of specifying the sampling method:

SELECT * FROM sales USING SAMPLE 5 PERCENT (reservoir);
SELECT * FROM sales USING SAMPLE 5 PERCENT (bernoulli);
SELECT * FROM sales USING SAMPLE 5 PERCENT (system);

We can also use the following syntax:

SELECT * FROM sales USING SAMPLE reservoir(20%);

The sampling methods are explained further below.

Reservoir Sampling

Reservoir sampling selects a fixed-size random sample by maintaining a reservoir and replacing elements as new data arrives. Unlike Bernoulli and system sampling, it always produces the exact number of requested elements.

However, it is inefficient for large samples or percentage-based sampling since it must materialize the full sample and replace elements dynamically. Performance worsens with multi-processing, as threads must share the reservoir, adding overhead.

Avoid reservoir sampling for large datasets due to its memory and performance costs.

Bernoulli Sampling

Bernoulli sampling selects each row independently with a probability equal to the specified percentage, leading to variance in sample size. The expected row count matches the percentage of the table, but actual results vary.

Since it operates independently without shared state, Bernoulli sampling works efficiently with multiple threads.

Bernoulli sampling can only be used when a sampling percentage is specified. 

System Sampling

System sampling, a cluster-based variant of Bernoulli sampling, selects entire vectors based on the sampling percentage. This makes it more efficient, as individual row selections are not required.

While the expected row count remains the same, variance can be significant. For datasets smaller than ~10k rows, results may be extreme (e.g., all or no rows selected).

Avoid this method on smaller datasets.

Creating a Reproducible Result Set

It’s possible to have the same result set returned each time the query is run (assuming the underlying data set doesn’t change). This can be handy for debugging or testing when we need consistency in the output.

We can do this by providing a seed value along with the REPEATABLE keyword. This provides deterministic sampling by using a fixed random seed. This means running the same SAMPLE query with the same seed will always return the same subset of rows.

There are a couple of ways to do this, as illustrated below:

SELECT * FROM sales USING SAMPLE reservoir(10 ROWS) REPEATABLE(35);
SELECT * FROM sales USING SAMPLE 10 ROWS (reservoir, 35);

These queries both return the same result set.

In the second query, we didn’t use the REPEATABLE keyword. The syntax allows us to omit that keyword, and just use the seed value. By providing the seed value, that’s enough for DuckDB to know what we want to do.

If the underlying data doesn’t change, then those queries will return the same result whenever they’re run. If we want a different result set, we can use a different seed. For example, changing the seed to 36 will return a different result. But then changing it back to 35 will return the original result set again.

This technique can be very useful whenever you need reproducible results, such as:

  • Debugging or testing queries with consistent output
  • Training models where sample consistency is needed
  • Ensuring fair comparisons between different query runs

Without REPEATABLE, each query execution will produce a different random sample.