How to Get a Reproducible Result Set When Using the SAMPLE Clause in DuckDB

When working with large datasets in DuckDB, the SAMPLE clause offers an efficient way to query a subset of your data. However, unless you specifically construct your query to get repeatable results, this sampling will return a different set of results each time the query is run.

But we can change that. We can write our query to return the same random result set every time we run it.

This article explores how to achieve consistent, reproducible result sets when using the SAMPLE clause in DuckDB.

Continue reading

Fix “Sample method … cannot be used with a discrete sample count” When Using the SAMPLE Clause in DuckDB

If you’re getting an error that reads something like “Sample method System cannot be used with a discrete sample count” when using the SAMPLE clause in DuckDB, it looks like you’re specifying an invalid sampling method for the context with which you’re using the SAMPLE clause. Perhaps you’re using system or bernoulli, when you should be using reservoir.

Continue reading

Adding Quotes Around Field Values When Outputting DuckDB Query Results as a List

In this article we look at how to automatically enclose values in single quotes when outputting DuckDB query results as a list. Using this method, each value is formatted as SQL literals. This can be useful when you want to escape the values for SQL or for some other purpose.

We also look at how CSV output is sometimes quoted with double quotes, and how we can adjust the results by doing things like removing the headers and changing the list separator.

Continue reading

Sampling Rows from a Table in DuckDB with the SAMPLE Clause

DuckDB’s SAMPLE clause is a handy feature that allows us to work with a random subset of our data. This is particularly useful when dealing with large datasets where processing the entire dataset might be time-consuming or unnecessary for exploratory data analysis, testing queries, or creating representative samples.

When we use this clause, we can specify the absolute number of rows to return, or a percentage of rows. We also have an option of sampling method to use.

Continue reading