Fix “Sample method … cannot be used with a discrete sample count” When Using the SAMPLE Clause in DuckDB

If you’re getting an error that reads something like “Sample method System cannot be used with a discrete sample count” when using the SAMPLE clause in DuckDB, it looks like you’re specifying an invalid sampling method for the context with which you’re using the SAMPLE clause. Perhaps you’re using system or bernoulli, when you should be using reservoir.

The DuckDB documentation states:

All three sampling methods support sampling over a percentage, but only reservoir sampling supports sampling a fixed number of rows.

If you think this is your issue, try switching to the reservoir sampling method. Alternatively, specify a percentage instead of a fixed number of rows.

Example of Error

Here’s an example of code that produces the error:

SELECT * FROM sales USING SAMPLE 10 ROWS (system, 35);

Result:

Parser Error: Sample method System cannot be used with a discrete sample count, either switch to reservoir sampling or use a sample_size

In this case, I passed system as the sampling method, which resulted in the error.

I get the same error if I try to use bernoulli:

SELECT * FROM sales USING SAMPLE 10 ROWS (bernoulli, 35);

Result:

Parser Error: Sample method Bernoulli cannot be used with a discrete sample count, either switch to reservoir sampling or use a sample_size

Solution 1

I can fix this issue by using the reservoir sampling method instead:

SELECT * FROM sales USING SAMPLE 10 ROWS (reservoir, 35);

Here’s the result I get in my test environment:

+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| 2815 | 16 | 316 | 16 | Product-16 | Electronics | 385.42 | 11 | 2022-09-18 | West |
| 7422 | 23 | 423 | 3 | Product-23 | Home | 848.89 | 9 | 2022-05-03 | East |
| 2031 | 32 | 32 | 12 | Product-32 | Clothing | 830.30 | 10 | 2022-07-26 | West |
| 2572 | 23 | 73 | 13 | Product-23 | Home | 406.92 | 1 | 2022-01-18 | North |
| 247 | 48 | 248 | 8 | Product-48 | Home | 999.04 | 8 | 2022-09-05 | West |
| 3248 | 49 | 249 | 9 | Product-49 | Books | 612.31 | 8 | 2022-11-25 | North |
| 5864 | 15 | 365 | 5 | Product-15 | Food | 931.83 | 3 | 2022-01-25 | North |
| 5393 | 44 | 394 | 14 | Product-44 | Books | 984.55 | 11 | 2022-10-11 | South |
| 3845 | 46 | 346 | 6 | Product-46 | Electronics | 10.79 | 8 | 2022-07-15 | South |
| 1556 | 7 | 57 | 17 | Product-7 | Clothing | 855.63 | 4 | 2022-04-07 | North |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+

Using the reservoir sampling method returned the desired results. In this case, I specified a seed of 35, so if I run the query multiple times, I’ll get the same result (assuming the underlying data set doesn’t change). If I wanted the results to change, then I’d simply remove the seed.

Solution 2

Another way to deal with this issue is to specify a percentage, instead of a fixed number of rows. Doing that will allow you to use system or bernoulli.

Example:

SELECT * FROM sales USING SAMPLE 1% (bernoulli, 35);

Here’s the result I get in my test environment:

+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| 65 | 16 | 66 | 6 | Product-16 | Electronics | 35.40 | 4 | 2022-03-07 | South |
| 94 | 45 | 95 | 15 | Product-45 | Food | 734.46 | 6 | 2022-04-05 | East |
| 254 | 5 | 255 | 15 | Product-5 | Food | 63.19 | 7 | 2022-09-12 | East |
| 367 | 18 | 368 | 8 | Product-18 | Home | 65.99 | 8 | 2022-01-03 | West |
| 519 | 20 | 20 | 20 | Product-20 | Food | 871.93 | 7 | 2022-06-04 | West |
| 1023 | 24 | 24 | 4 | Product-24 | Books | 725.42 | 2 | 2022-10-21 | West |
| 1029 | 30 | 30 | 10 | Product-30 | Food | 365.08 | 2 | 2022-10-27 | South |
| 1195 | 46 | 196 | 16 | Product-46 | Electronics | 129.45 | 3 | 2022-04-11 | West |
| 1215 | 16 | 216 | 16 | Product-16 | Electronics | 581.68 | 2 | 2022-05-01 | West |
...

In this case, the full result set was too large to paste here. Regardless, it demonstrates that system and bernoulli can be used when specifying a percentage.