RESERVOIR_QUANTILE() Examples in DuckDB

DuckDB includes a reservoir_quantile() function that allows us to compute approximate quantiles efficiently. It provides the approximate quantile using reservoir sampling. This function can be handy when working with large datasets where exact quantile computation would be too slow or resource-intensive.

In this article, we will explore how the reservoir_quantile() function works, along with examples to demonstrate its usage.

What is the reservoir_quantile() Function?

DuckDB’s reservoir_quantile() function is an approximate quantile computation function that uses reservoir sampling to estimate quantiles. Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list of n items, where n is either a very large or unknown number. This makes it particularly suitable for streaming data or large datasets where you want to compute quantiles without scanning the entire dataset.

The function is useful for scenarios where you need to quickly estimate percentiles, medians, or other quantiles without the overhead of exact computation.

Syntax

The syntax for the reservoir_quantile() function is as follows:

reservoir_quantile(value, quantile, sample_size)
  • value: The column or expression for which you want to compute the quantile.
  • quantile: The quantile to compute, expressed as a value between 0 and 1. For example, 0.5 represents the median.
  • sample_size: This is an optional argument that specifies the size of the reservoir sample to use for the approximation. Larger sample sizes will yield more accurate results but will require more memory. The default sample size is 8192.

Sample Data

Let’s walk through an example to demonstrate how to use the reservoir_quantile() function in DuckDB.

First, let’s create and populate a table so we can use it for our examples:

CREATE TABLE sales (
    sale_id INTEGER,
    sale_amount DOUBLE
);

INSERT INTO sales (sale_id, sale_amount)
VALUES
    (1, 100.0),
    (2, 200.0),
    (3, 300.0),
    (4, 400.0),
    (5, 500.0),
    (6, 600.0),
    (7, 700.0),
    (8, 800.0),
    (9, 900.0),
    (10, 1000.0);

This table contains 10 rows of sales data, with each row representing a sale and its corresponding amount.

Example 1: Get the Median Value

Now, let’s compute the median sale amount using the reservoir_quantile() function. The median is the 0.5 quantile.

SELECT 
    reservoir_quantile(sale_amount, 0.5) AS median_sale_amount
FROM sales;

Result:

+--------------------+
| median_sale_amount |
+--------------------+
| 500.0 |
+--------------------+

We can see that the query returned the median value as expected.

Example 2: Calculate the 70th Percentile

We can change the second argument in order to get a different percentile. In this case, we’ll use 0.7 to get the 70th percentile:

SELECT 
    reservoir_quantile(sale_amount, 0.7) AS percentile_70
FROM sales;

Result:

+---------------+
| percentile_70 |
+---------------+
| 700.0 |
+---------------+

Example 3: Change the Sample Size

The default sample size is 8192, which is a lot higher than the number of rows in our table. In this case, the same size would be the whole table.

We can change the sample size by passing a third argument:

SELECT 
    reservoir_quantile(sale_amount, 0.5, 8) AS median_smaller_sample
FROM sales;

Result:

+-----------------------+
| median_smaller_sample |
+-----------------------+
| 400.0 |
+-----------------------+

This time we got the median value from a sample size of 8. This resulted in a lower median value than in our earlier example.

Here’s another one:

SELECT 
    reservoir_quantile(sale_amount, 0.5, 5) AS median_smaller_sample
FROM sales;

Result:

+-----------------------+
| median_smaller_sample |
+-----------------------+
| 300.0 |
+-----------------------+

Example 4: Compute Multiple Quantiles

We can compute multiple quantiles in a single query:

SELECT
    reservoir_quantile(sale_amount, 0.25) AS first_quartile,
    reservoir_quantile(sale_amount, 0.5) AS median,
    reservoir_quantile(sale_amount, 0.75) AS third_quartile
FROM sales;

Result:

+----------------+--------+----------------+
| first_quartile | median | third_quartile |
+----------------+--------+----------------+
| 300.0 | 500.0 | 700.0 |
+----------------+--------+----------------+

Example 5: A Larger Data Set

Let’s look at an example with a larger data set.

To do this, we’ll repopulate our table with 100,000 rows of randomly generated values:

-- Remove all rows from the sales table
TRUNCATE TABLE sales;

-- Populate the table with 100,000 rows of random sales data
INSERT INTO sales (sale_id, sale_amount)
SELECT
    row_number() OVER () AS sale_id,
    round(random() * 100000, 2) + 1 AS sale_amount
FROM generate_series(1, 100000);

-- Check the first 10 rows
SELECT * FROM sales 
ORDER BY sale_id ASC
LIMIT 10;

Output:

+---------+-------------+
| sale_id | sale_amount |
+---------+-------------+
| 1 | 4431.56 |
| 2 | 47772.01 |
| 3 | 19513.71 |
| 4 | 50824.51 |
| 5 | 87226.35 |
| 6 | 18746.87 |
| 7 | 44474.4 |
| 8 | 92862.16 |
| 9 | 7507.74 |
| 10 | 45177.5 |
+---------+-------------+

That’s the first ten rows.

Now let’s get the median value using the default sample size of 8192:

SELECT 
reservoir_quantile(sale_amount, 0.5) AS median_sale_amount
FROM sales;

Result:

+--------------------+
| median_sale_amount |
+--------------------+
| 49965.3 |
+--------------------+

Now let’s change the sample size to 100,000 (the number of rows in the table):

SELECT 
    reservoir_quantile(sale_amount, 0.5, 100000) AS median_sale_amount
FROM sales;

Result:

+--------------------+
| median_sale_amount |
+--------------------+
| 50323.87 |
+--------------------+

We get a slightly different number.

Here’s a query that uses various sample sizes:

SELECT 
    reservoir_quantile(sale_amount, 0.5, 10) AS sample_size_10,
    reservoir_quantile(sale_amount, 0.5, 100) AS sample_size_100,
    reservoir_quantile(sale_amount, 0.5, 1000) AS sample_size_1000,
    reservoir_quantile(sale_amount, 0.5, 10000) AS sample_size_10000
FROM sales;

Result:

+----------------+-----------------+------------------+-------------------+
| sample_size_10 | sample_size_100 | sample_size_1000 | sample_size_10000 |
+----------------+-----------------+------------------+-------------------+
| 47300.72 | 52445.52 | 50670.65 | 50322.79 |
+----------------+-----------------+------------------+-------------------+

We can see that the smaller the sample size, the less accurate the result. However, the smaller sample sizes should run faster than the larger ones.

When to Use reservoir_quantile()

The reservoir_quantile() function can be particularly useful in the following scenarios:

  • Large Datasets: When working with large datasets, exact quantile computation can be slow and resource-intensive. reservoir_quantile() provides a faster alternative with acceptable accuracy.
  • Streaming Data: In scenarios where data is continuously arriving (e.g., real-time analytics), reservoir_quantile() can be used to compute approximate quantiles without needing to store the entire dataset.
  • Memory Constraints: When memory is a constraint, you can control the memory usage by adjusting the sample size.