How to Get a Reproducible Result Set When Using the SAMPLE Clause in DuckDB

When working with large datasets in DuckDB, the SAMPLE clause offers an efficient way to query a subset of your data. However, unless you specifically construct your query to get repeatable results, this sampling will return a different set of results each time the query is run.

But we can change that. We can write our query to return the same random result set every time we run it.

This article explores how to achieve consistent, reproducible result sets when using the SAMPLE clause in DuckDB.

Setting a Seed for Reproducible Sampling

The key to getting consistent results with the SAMPLE clause is to set a seed. We can also use the REPEATABLE keyword to make it explicit. Setting a seed initializes the random number generator with the seed’s value, ensuring that the same “random” rows are selected each time.

There are a couple of ways to do this:

SELECT * FROM sales USING SAMPLE reservoir(5 ROWS) REPEATABLE(35);
SELECT * FROM sales USING SAMPLE 5 ROWS (reservoir, 35);

Both of those queries return the same result, and will continue to do so if we run them repeatedly.

In the second query, we didn’t use the REPEATABLE keyword. The syntax allows us to omit that keyword, and just use the seed value. By providing the seed value, DuckDB knows that we want to use that seed to produce repeatable results.

Example in Practice

Let’s run the above queries against an actual table to see the repeatable results. To do that, we’ll create a table and populate it with 10,000 rows of dummy data:

CREATE TABLE sales (
    sale_id INTEGER,
    product_id INTEGER,
    customer_id INTEGER,
    store_id INTEGER,
    product_name VARCHAR,
    category VARCHAR,
    price DECIMAL(10,2),
    quantity INTEGER,
    sale_date DATE,
    region VARCHAR
);

INSERT INTO sales
SELECT 
    i AS sale_id,
    (i % 50) + 1 AS product_id,
    (i % 500) + 1 AS customer_id,
    (i % 20) + 1 AS store_id,
    'Product-' || ((i % 50) + 1) AS product_name,
    CASE (i % 5)
        WHEN 0 THEN 'Electronics'
        WHEN 1 THEN 'Clothing'
        WHEN 2 THEN 'Home'
        WHEN 3 THEN 'Books'
        WHEN 4 THEN 'Food'
    END AS category,
    (random() * 1000)::DECIMAL(10,2) AS price,
    (random() * 10 + 1)::INTEGER AS quantity,
    DATE '2022-01-01' + INTERVAL (i % 365) DAY AS sale_date,  -- Fixed date calculation
    CASE (i % 4)
        WHEN 0 THEN 'North'
        WHEN 1 THEN 'South'
        WHEN 2 THEN 'East'
        WHEN 3 THEN 'West'
    END AS region
FROM range(1, 10001) t(i);

Now let’s run some queries against that table using the SAMPLE clause.

Here’s one of the queries I presented earlier:

SELECT * FROM sales USING SAMPLE reservoir(5 ROWS) REPEATABLE(35);

Result:

+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| 2092 | 43 | 93 | 13 | Product-43 | Home | 383.91 | 6 | 2022-09-25 | North |
| 1653 | 4 | 154 | 14 | Product-4 | Books | 612.94 | 9 | 2022-07-13 | South |
| 4463 | 14 | 464 | 4 | Product-14 | Books | 896.15 | 6 | 2022-03-25 | West |
| 2733 | 34 | 234 | 14 | Product-34 | Books | 822.35 | 7 | 2022-06-28 | South |
| 166 | 17 | 167 | 7 | Product-17 | Clothing | 144.15 | 8 | 2022-06-16 | East |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+

This is a random sample of the table. If we run the query again with the same seed value, the same result set will be returned.

To demonstrate this, here’s the output from my terminal after I ran the query twice:

D SELECT * FROM sales USING SAMPLE reservoir(5 ROWS) REPEATABLE(35);
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| 2092 | 43 | 93 | 13 | Product-43 | Home | 383.91 | 6 | 2022-09-25 | North |
| 1653 | 4 | 154 | 14 | Product-4 | Books | 612.94 | 9 | 2022-07-13 | South |
| 4463 | 14 | 464 | 4 | Product-14 | Books | 896.15 | 6 | 2022-03-25 | West |
| 2733 | 34 | 234 | 14 | Product-34 | Books | 822.35 | 7 | 2022-06-28 | South |
| 166 | 17 | 167 | 7 | Product-17 | Clothing | 144.15 | 8 | 2022-06-16 | East |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
D SELECT * FROM sales USING SAMPLE reservoir(5 ROWS) REPEATABLE(35);
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| 2092 | 43 | 93 | 13 | Product-43 | Home | 383.91 | 6 | 2022-09-25 | North |
| 1653 | 4 | 154 | 14 | Product-4 | Books | 612.94 | 9 | 2022-07-13 | South |
| 4463 | 14 | 464 | 4 | Product-14 | Books | 896.15 | 6 | 2022-03-25 | West |
| 2733 | 34 | 234 | 14 | Product-34 | Books | 822.35 | 7 | 2022-06-28 | South |
| 166 | 17 | 167 | 7 | Product-17 | Clothing | 144.15 | 8 | 2022-06-16 | East |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+

We’ll get the same result if we use the more concise syntax:

SELECT * FROM sales USING SAMPLE 5 ROWS (reservoir, 35);

Output:

+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| 2092 | 43 | 93 | 13 | Product-43 | Home | 383.91 | 6 | 2022-09-25 | North |
| 1653 | 4 | 154 | 14 | Product-4 | Books | 612.94 | 9 | 2022-07-13 | South |
| 4463 | 14 | 464 | 4 | Product-14 | Books | 896.15 | 6 | 2022-03-25 | West |
| 2733 | 34 | 234 | 14 | Product-34 | Books | 822.35 | 7 | 2022-06-28 | South |
| 166 | 17 | 167 | 7 | Product-17 | Clothing | 144.15 | 8 | 2022-06-16 | East |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+

If we change the seed, we’ll get a different result:

SELECT * FROM sales USING SAMPLE 5 ROWS (reservoir, 36);

Output:

+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| 8988 | 39 | 489 | 9 | Product-39 | Books | 223.53 | 2 | 2022-08-17 | North |
| 9662 | 13 | 163 | 3 | Product-13 | Home | 918.38 | 9 | 2022-06-22 | East |
| 6396 | 47 | 397 | 17 | Product-47 | Clothing | 614.42 | 3 | 2022-07-11 | North |
| 5638 | 39 | 139 | 19 | Product-39 | Books | 380.72 | 4 | 2022-06-13 | East |
| 6641 | 42 | 142 | 2 | Product-42 | Clothing | 578.90 | 6 | 2022-03-13 | South |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+

And if we revert the seed to the original value, we get the original result set again:

SELECT * FROM sales USING SAMPLE 5 ROWS (reservoir, 35);

Output:

+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| 2092 | 43 | 93 | 13 | Product-43 | Home | 383.91 | 6 | 2022-09-25 | North |
| 1653 | 4 | 154 | 14 | Product-4 | Books | 612.94 | 9 | 2022-07-13 | South |
| 4463 | 14 | 464 | 4 | Product-14 | Books | 896.15 | 6 | 2022-03-25 | West |
| 2733 | 34 | 234 | 14 | Product-34 | Books | 822.35 | 7 | 2022-06-28 | South |
| 166 | 17 | 167 | 7 | Product-17 | Clothing | 144.15 | 8 | 2022-06-16 | East |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+

Returning Non-Reproducible Results

If we want a different result set each time we run the query, then we can simply remove the seed (and the REPEATABLE keyword if you’re using it).

Here’s the output in my terminal when I run the query multiple times without a seed:

D SELECT * FROM sales USING SAMPLE reservoir(5 ROWS);
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| 871 | 22 | 372 | 12 | Product-22 | Clothing | 678.96 | 7 | 2022-05-22 | West |
| 3723 | 24 | 224 | 4 | Product-24 | Books | 351.38 | 11 | 2022-03-15 | West |
| 9039 | 40 | 40 | 20 | Product-40 | Food | 94.33 | 4 | 2022-10-07 | West |
| 6340 | 41 | 341 | 1 | Product-41 | Electronics | 515.97 | 2 | 2022-05-16 | North |
| 1369 | 20 | 370 | 10 | Product-20 | Food | 776.30 | 9 | 2022-10-02 | South |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
D SELECT * FROM sales USING SAMPLE reservoir(5 ROWS);
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| 2197 | 48 | 198 | 18 | Product-48 | Home | 526.15 | 9 | 2022-01-08 | South |
| 6119 | 20 | 120 | 20 | Product-20 | Food | 953.39 | 1 | 2022-10-07 | West |
| 9557 | 8 | 58 | 18 | Product-8 | Home | 137.04 | 3 | 2022-03-09 | South |
| 8229 | 30 | 230 | 10 | Product-30 | Food | 800.54 | 6 | 2022-07-19 | South |
| 8212 | 13 | 213 | 13 | Product-13 | Home | 677.58 | 9 | 2022-07-02 | North |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+

We get a different result each time the query is run.

Of course, we can always switch back to a seed when we want repeatable results again.

Using Percentages

In the previous examples, I specified a fixed number of rows to return (5 rows). We can also use seeds when specifying a percentage.

Example:

SELECT * FROM sales USING SAMPLE 1% (bernoulli, 35);

Output:

+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| 65 | 16 | 66 | 6 | Product-16 | Electronics | 35.40 | 4 | 2022-03-07 | South |
| 94 | 45 | 95 | 15 | Product-45 | Food | 734.46 | 6 | 2022-04-05 | East |
| 254 | 5 | 255 | 15 | Product-5 | Food | 63.19 | 7 | 2022-09-12 | East |
| 367 | 18 | 368 | 8 | Product-18 | Home | 65.99 | 8 | 2022-01-03 | West |
| 519 | 20 | 20 | 20 | Product-20 | Food | 871.93 | 7 | 2022-06-04 | West |
| 1023 | 24 | 24 | 4 | Product-24 | Books | 725.42 | 2 | 2022-10-21 | West |
| 1029 | 30 | 30 | 10 | Product-30 | Food | 365.08 | 2 | 2022-10-27 | South |
...

This query returned too many rows to paste them all here, but you get the picture.

If I run the query again, it produces the same result set:

+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| 65 | 16 | 66 | 6 | Product-16 | Electronics | 35.40 | 4 | 2022-03-07 | South |
| 94 | 45 | 95 | 15 | Product-45 | Food | 734.46 | 6 | 2022-04-05 | East |
| 254 | 5 | 255 | 15 | Product-5 | Food | 63.19 | 7 | 2022-09-12 | East |
| 367 | 18 | 368 | 8 | Product-18 | Home | 65.99 | 8 | 2022-01-03 | West |
| 519 | 20 | 20 | 20 | Product-20 | Food | 871.93 | 7 | 2022-06-04 | West |
| 1023 | 24 | 24 | 4 | Product-24 | Books | 725.42 | 2 | 2022-10-21 | West |
| 1029 | 30 | 30 | 10 | Product-30 | Food | 365.08 | 2 | 2022-10-27 | South |
| 1195 | 46 | 196 | 16 | Product-46 | Electronics | 129.45 | 3 | 2022-04-11 | West |
| 1215 | 16 | 216 | 16 | Product-16 | Electronics | 581.68 | 2 | 2022-05-01 | West |
...

When to Use Reproducible Sampling

Reproducible sampling can be particularly useful for:

  • Debugging complex queries
  • Sharing results with colleagues
  • Ensuring consistent benchmarks
  • Creating stable test datasets

Multi-Threading

The DuckDB documentation warns the following about using seeds with multi-threading:

Specifying the seed only guarantees that the sample is the same if multi-threading is not enabled (i.e., SET threads = 1). In the case of multiple threads running over a sample, samples are not necessarily consistent even with a fixed seed.

So that’s something to bear in mind.