LIMIT vs SAMPLE in DuckDB: What’s the Difference?

DuckDB’s LIMIT and SAMPLE clauses both allow users to work with subsets of data, but they serve fundamentally different purposes and operate on different principles. While LIMIT returns a fixed number of rows from the beginning of a result set, SAMPLE provides a random sample from the dataset.

The LIMIT Clause

LIMIT is a straightforward clause that restricts the number of rows returned by a query. It works by taking the first N rows from the result set after all other query operations (filtering, sorting, etc.) have been applied.

Here’s an example of a query that uses the LIMIT clause:

SELECT * FROM sales 
ORDER BY sale_id
LIMIT 5;

Example output:

+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| 1 | 2 | 2 | 2 | Product-2 | Clothing | 379.64 | 5 | 2022-01-02 | South |
| 2 | 3 | 3 | 3 | Product-3 | Home | 371.73 | 4 | 2022-01-03 | East |
| 3 | 4 | 4 | 4 | Product-4 | Books | 728.64 | 7 | 2022-01-04 | West |
| 4 | 5 | 5 | 5 | Product-5 | Food | 311.09 | 2 | 2022-01-05 | North |
| 5 | 6 | 6 | 6 | Product-6 | Electronics | 828.57 | 2 | 2022-01-06 | South |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+

In this example I specified LIMIT 5 to return the first five rows from the table.

We can also specify an offset when using the LIMIT clause:

SELECT * FROM sales 
ORDER BY sale_id
LIMIT 5 OFFSET 10;

Result:

+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| 11 | 12 | 12 | 12 | Product-12 | Clothing | 670.78 | 4 | 2022-01-12 | West |
| 12 | 13 | 13 | 13 | Product-13 | Home | 951.29 | 2 | 2022-01-13 | North |
| 13 | 14 | 14 | 14 | Product-14 | Books | 648.17 | 8 | 2022-01-14 | South |
| 14 | 15 | 15 | 15 | Product-15 | Food | 983.63 | 4 | 2022-01-15 | East |
| 15 | 16 | 16 | 16 | Product-16 | Electronics | 997.18 | 4 | 2022-01-16 | West |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+

Using an offset allows us to get the first N rows starting from a position based on the offset amount. Specifically, it ignores the number of rows specified in the OFFSET amount. So OFFSET 10 ignores the first 10 rows.

Key characteristics of LIMIT:

  • Deterministic: The same query always returns the same rows. At least, this is the case when used with the ORDER BY clause. Omitting an ORDER BY clause could return non-deterministic results.
  • Sequential: Takes rows from the start of the result set
  • Order-dependent: Results depend on the ORDER BY clause
  • Commonly used for pagination and preview purposes

The SAMPLE Clause

SAMPLE, on the other hand, is designed for statistical sampling of data. It allows users to retrieve a representative subset based on a percentage or row count, using various sampling methods.

Here’s an example of querying the same table with SAMPLE instead of LIMIT:

SELECT * FROM sales 
USING SAMPLE 5 ROWS;

Example output:

+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| 9244 | 45 | 245 | 5 | Product-45 | Food | 619.30 | 2 | 2022-04-30 | North |
| 3369 | 20 | 370 | 10 | Product-20 | Food | 325.75 | 10 | 2022-03-26 | South |
| 1775 | 26 | 276 | 16 | Product-26 | Electronics | 115.83 | 8 | 2022-11-12 | West |
| 2346 | 47 | 347 | 7 | Product-47 | Clothing | 457.84 | 10 | 2022-06-06 | East |
| 5694 | 45 | 195 | 15 | Product-45 | Food | 160.57 | 1 | 2022-08-08 | East |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+

We can see that it took five random rows from throughout the table. If we run it again, it will take five other rows. To illustrate this, here’s the output from my terminal when I ran it two more times:

D SELECT * FROM sales 
USING SAMPLE 5 ROWS;
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| 417 | 18 | 418 | 18 | Product-18 | Home | 338.57 | 5 | 2022-02-22 | South |
| 3604 | 5 | 105 | 5 | Product-5 | Food | 368.24 | 9 | 2022-11-16 | North |
| 7313 | 14 | 314 | 14 | Product-14 | Books | 985.18 | 11 | 2022-01-14 | South |
| 944 | 45 | 445 | 5 | Product-45 | Food | 535.96 | 9 | 2022-08-03 | North |
| 9727 | 28 | 228 | 8 | Product-28 | Home | 706.36 | 6 | 2022-08-26 | West |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
D SELECT * FROM sales
USING SAMPLE 5 ROWS;
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| 6608 | 9 | 109 | 9 | Product-9 | Books | 296.19 | 8 | 2022-02-08 | North |
| 637 | 38 | 138 | 18 | Product-38 | Home | 846.30 | 5 | 2022-09-30 | South |
| 3535 | 36 | 36 | 16 | Product-36 | Electronics | 805.70 | 7 | 2022-09-08 | West |
| 516 | 17 | 17 | 17 | Product-17 | Clothing | 34.73 | 5 | 2022-06-01 | North |
| 339 | 40 | 340 | 20 | Product-40 | Food | 66.43 | 6 | 2022-12-06 | West |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+

However, it is possible to return repeatable results when using the SAMPLE clause. To do this we need to set a seed. The seed is used to initialize the random number generator, which results in the same “random” rows being returned.

Here’s an example of that:

SELECT * FROM sales USING SAMPLE reservoir(5 ROWS) REPEATABLE(35);

Result:

+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| 2092 | 43 | 93 | 13 | Product-43 | Home | 383.91 | 6 | 2022-09-25 | North |
| 1653 | 4 | 154 | 14 | Product-4 | Books | 612.94 | 9 | 2022-07-13 | South |
| 4463 | 14 | 464 | 4 | Product-14 | Books | 896.15 | 6 | 2022-03-25 | West |
| 2733 | 34 | 234 | 14 | Product-34 | Books | 822.35 | 7 | 2022-06-28 | South |
| 166 | 17 | 167 | 7 | Product-17 | Clothing | 144.15 | 8 | 2022-06-16 | East |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+

If we run the query again with the same seed value, the same result set will be returned.

Here’s the output from my terminal when I run the query two more times:

D SELECT * FROM sales USING SAMPLE reservoir(5 ROWS) REPEATABLE(35);
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| 2092 | 43 | 93 | 13 | Product-43 | Home | 383.91 | 6 | 2022-09-25 | North |
| 1653 | 4 | 154 | 14 | Product-4 | Books | 612.94 | 9 | 2022-07-13 | South |
| 4463 | 14 | 464 | 4 | Product-14 | Books | 896.15 | 6 | 2022-03-25 | West |
| 2733 | 34 | 234 | 14 | Product-34 | Books | 822.35 | 7 | 2022-06-28 | South |
| 166 | 17 | 167 | 7 | Product-17 | Clothing | 144.15 | 8 | 2022-06-16 | East |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
D SELECT * FROM sales USING SAMPLE reservoir(5 ROWS) REPEATABLE(35);
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| 2092 | 43 | 93 | 13 | Product-43 | Home | 383.91 | 6 | 2022-09-25 | North |
| 1653 | 4 | 154 | 14 | Product-4 | Books | 612.94 | 9 | 2022-07-13 | South |
| 4463 | 14 | 464 | 4 | Product-14 | Books | 896.15 | 6 | 2022-03-25 | West |
| 2733 | 34 | 234 | 14 | Product-34 | Books | 822.35 | 7 | 2022-06-28 | South |
| 166 | 17 | 167 | 7 | Product-17 | Clothing | 144.15 | 8 | 2022-06-16 | East |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+

So I ran the query three times and got the same result all three times. However, in all cases SAMPLE returned completely different to the results to the LIMIT clause.

Using Percentages

In addition to specifying a fixed row count like in the previous examples, both SAMPLE and LIMIT also accept a percentage amount. So there’s no difference between the functions in that regard. When we specify a percentage, the query returns that percentage of the rows from the dataset.

Here’s an example of using a percentage with the LIMIT clause:

SELECT * FROM sales 
ORDER BY sale_id
LIMIT 1%;

Partial output:

+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| 1 | 2 | 2 | 2 | Product-2 | Clothing | 379.64 | 5 | 2022-01-02 | South |
| 2 | 3 | 3 | 3 | Product-3 | Home | 371.73 | 4 | 2022-01-03 | East |
| 3 | 4 | 4 | 4 | Product-4 | Books | 728.64 | 7 | 2022-01-04 | West |

...

| 98 | 49 | 99 | 19 | Product-49 | Books | 194.05 | 9 | 2022-04-09 | East |
| 99 | 50 | 100 | 20 | Product-50 | Food | 102.27 | 6 | 2022-04-10 | West |
| 100 | 1 | 101 | 1 | Product-1 | Electronics | 666.52 | 4 | 2022-04-11 | North |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+

The table contains 10,000 rows, and so that query returned 100 rows. Rather than displaying all 100 rows, I displayed just the first and last three.

And here’s an example of using a percentage with the SAMPLE clause:

SELECT * FROM sales USING SAMPLE reservoir(1%);

Result:

+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| 839 | 40 | 340 | 20 | Product-40 | Food | 25.08 | 5 | 2022-04-20 | West |
| 2324 | 25 | 325 | 5 | Product-25 | Food | 879.04 | 5 | 2022-05-15 | North |
| 7109 | 10 | 110 | 10 | Product-10 | Food | 825.38 | 2 | 2022-06-24 | South |

...

| 9939 | 40 | 440 | 20 | Product-40 | Food | 79.92 | 1 | 2022-03-26 | West |
| 7470 | 21 | 471 | 11 | Product-21 | Electronics | 169.55 | 1 | 2022-06-20 | East |
| 845 | 46 | 346 | 6 | Product-46 | Electronics | 69.36 | 5 | 2022-04-26 | South |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+

Again, that’s the first and last three results from a result set of 100.

In this example I also specified the sampling method to use (Reservoir). This is another area that differs from the LIMIT clause (sampling methods don’t apply to the LIMIT clause).

Key characteristics of SAMPLE:

  • Probabilistic: Uses random sampling techniques
  • Designed for statistical analysis and data exploration
  • Offers multiple sampling methods (Bernoulli, Reservoir, System)
  • Can produce reproducible random samples with a seed
  • More efficient for large datasets where proportional representation matters

Using Both

It’s possible to use both SAMPLE and LIMIT together in the same query:

SELECT * FROM sales USING SAMPLE reservoir(10 ROWS) REPEATABLE(35)
ORDER BY sale_id
LIMIT 3;

Result:

+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+
| 247 | 48 | 248 | 8 | Product-48 | Home | 9.27 | 6 | 2022-09-05 | West |
| 1556 | 7 | 57 | 17 | Product-7 | Clothing | 36.34 | 3 | 2022-04-07 | North |
| 2031 | 32 | 32 | 12 | Product-32 | Clothing | 336.32 | 3 | 2022-07-26 | West |
+---------+------------+-------------+----------+--------------+----------+--------+----------+------------+--------+

In this example, the SAMPLE clause took a sample of ten random rows, and the LIMIT clause limited the output to just three of those rows.

Let’s increase the LIMIT to five rows:

SELECT * FROM sales USING SAMPLE reservoir(10 ROWS) REPEATABLE(35)
ORDER BY sale_id
LIMIT 5;

Result:

+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| sale_id | product_id | customer_id | store_id | product_name | category | price | quantity | sale_date | region |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+
| 247 | 48 | 248 | 8 | Product-48 | Home | 9.27 | 6 | 2022-09-05 | West |
| 1556 | 7 | 57 | 17 | Product-7 | Clothing | 36.34 | 3 | 2022-04-07 | North |
| 2031 | 32 | 32 | 12 | Product-32 | Clothing | 336.32 | 3 | 2022-07-26 | West |
| 2572 | 23 | 73 | 13 | Product-23 | Home | 600.61 | 6 | 2022-01-18 | North |
| 2815 | 16 | 316 | 16 | Product-16 | Electronics | 709.15 | 9 | 2022-09-18 | West |
+---------+------------+-------------+----------+--------------+-------------+--------+----------+------------+--------+

In both queries the SAMPLE clause uses a seed of 35, and so the same results are returned as far as the SAMPLE clause goes. It was then up to the LIMIT clause to restrict those results to the amounts we provide (i.e. 3 and 5).