If you’re getting an error that reads something like “Sample method System cannot be used with a discrete sample count” when using the SAMPLE
clause in DuckDB, it looks like you’re specifying an invalid sampling method for the context with which you’re using the SAMPLE
clause. Perhaps you’re using system or bernoulli, when you should be using reservoir.
The DuckDB documentation states:
All three sampling methods support sampling over a percentage, but only reservoir sampling supports sampling a fixed number of rows.
If you think this is your issue, try switching to the reservoir sampling method. Alternatively, specify a percentage instead of a fixed number of rows.
Example of Error
Here’s an example of code that produces the error:
SELECT * FROM sales USING SAMPLE 10 ROWS (system, 35);
Result:
Parser Error: Sample method System cannot be used with a discrete sample count, either switch to reservoir sampling or use a sample_size
In this case, I passed system
as the sampling method, which resulted in the error.
I get the same error if I try to use bernoulli
:
SELECT * FROM sales USING SAMPLE 10 ROWS (bernoulli, 35);
Result:
Parser Error: Sample method Bernoulli cannot be used with a discrete sample count, either switch to reservoir sampling or use a sample_size
Solution 1
I can fix this issue by using the reservoir
sampling method instead:
SELECT * FROM sales USING SAMPLE 10 ROWS (reservoir, 35);
Here’s the result I get in my test environment:
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| 2815 | 16 | 316 | 16 | Product-16 | Electronics | 385.42 | 11 | 2022-09-18 | West |
| 7422 | 23 | 423 | 3 | Product-23 | Home | 848.89 | 9 | 2022-05-03 | East |
| 2031 | 32 | 32 | 12 | Product-32 | Clothing | 830.30 | 10 | 2022-07-26 | West |
| 2572 | 23 | 73 | 13 | Product-23 | Home | 406.92 | 1 | 2022-01-18 | North |
| 247 | 48 | 248 | 8 | Product-48 | Home | 999.04 | 8 | 2022-09-05 | West |
| 3248 | 49 | 249 | 9 | Product-49 | Books | 612.31 | 8 | 2022-11-25 | North |
| 5864 | 15 | 365 | 5 | Product-15 | Food | 931.83 | 3 | 2022-01-25 | North |
| 5393 | 44 | 394 | 14 | Product-44 | Books | 984.55 | 11 | 2022-10-11 | South |
| 3845 | 46 | 346 | 6 | Product-46 | Electronics | 10.79 | 8 | 2022-07-15 | South |
| 1556 | 7 | 57 | 17 | Product-7 | Clothing | 855.63 | 4 | 2022-04-07 | North |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
Using the reservoir
sampling method returned the desired results. In this case, I specified a seed of 35, so if I run the query multiple times, I’ll get the same result (assuming the underlying data set doesn’t change). If I wanted the results to change, then I’d simply remove the seed.
Solution 2
Another way to deal with this issue is to specify a percentage, instead of a fixed number of rows. Doing that will allow you to use system or bernoulli.
Example:
SELECT * FROM sales USING SAMPLE 1% (bernoulli, 35);
Here’s the result I get in my test environment:
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| 65 | 16 | 66 | 6 | Product-16 | Electronics | 35.40 | 4 | 2022-03-07 | South |
| 94 | 45 | 95 | 15 | Product-45 | Food | 734.46 | 6 | 2022-04-05 | East |
| 254 | 5 | 255 | 15 | Product-5 | Food | 63.19 | 7 | 2022-09-12 | East |
| 367 | 18 | 368 | 8 | Product-18 | Home | 65.99 | 8 | 2022-01-03 | West |
| 519 | 20 | 20 | 20 | Product-20 | Food | 871.93 | 7 | 2022-06-04 | West |
| 1023 | 24 | 24 | 4 | Product-24 | Books | 725.42 | 2 | 2022-10-21 | West |
| 1029 | 30 | 30 | 10 | Product-30 | Food | 365.08 | 2 | 2022-10-27 | South |
| 1195 | 46 | 196 | 16 | Product-46 | Electronics | 129.45 | 3 | 2022-04-11 | West |
| 1215 | 16 | 216 | 16 | Product-16 | Electronics | 581.68 | 2 | 2022-05-01 | West |
...
In this case, the full result set was too large to paste here. Regardless, it demonstrates that system and bernoulli can be used when specifying a percentage.