DuckDB is an in-process SQL OLAP database management system designed for analytical workloads. One of its handy features is the ability to compute approximate quantiles efficiently using the approx_quantile()
function. This function is particularly useful when working with large datasets where exact quantile computation would be computationally expensive.
In this article, we will explore how the approx_quantile()
function works, its syntax, and provide examples to demonstrate its usage.
What is the approx_quantile()
Function?
The approx_quantile()
function in DuckDB computes an approximate quantile of a given column.
Quantiles are points that divide a dataset into equal-sized intervals. For example, the median is the 0.5 quantile, meaning it divides the data into two equal parts. The approx_quantile()
function is useful when you need to estimate quantiles quickly, especially in large datasets, without the overhead of exact computation.
Syntax
The syntax for the approx_quantile()
function is as follows:
approx_quantile(column, quantile)
column
: The column for which you want to compute the quantile.quantile
: The quantile value you want to compute, which must be between 0 and 1.
Sample Data
Before we dive into examples, let’s create a table and populate it with some data. We’ll use this table to demonstrate how the approx_quantile()
function works:
-- Create the table
CREATE TABLE sales (
id INTEGER,
amount DOUBLE,
order_date DATE,
region TEXT
);
-- Insert sample data
INSERT INTO sales (id, amount, order_date, region) VALUES
(1, 200.50, '2021-01-01', 'North'),
(2, 50.00, '2022-02-15', 'South'),
(3, 150.33, '2023-03-10', 'North'),
(4, 970.75, '2023-04-22', 'East'),
(5, 250.00, '2023-05-05', 'South'),
(6, 400.00, '2023-06-18', 'East'),
(7, 350.30, '2023-07-30', 'North'),
(8, 500.99, '2023-08-12', 'West'),
(9, 600.50, '2024-09-25', 'West'),
(10, 1450.00, '2024-10-10', 'South');
-- Return all data
SELECT * FROM sales;
Output:
+----+--------+------------+--------+
| id | amount | order_date | region |
+----+--------+------------+--------+
| 1 | 200.5 | 2021-01-01 | North |
| 2 | 50.0 | 2022-02-15 | South |
| 3 | 150.33 | 2023-03-10 | North |
| 4 | 970.75 | 2023-04-22 | East |
| 5 | 250.0 | 2023-05-05 | South |
| 6 | 400.0 | 2023-06-18 | East |
| 7 | 350.3 | 2023-07-30 | North |
| 8 | 500.99 | 2023-08-12 | West |
| 9 | 600.5 | 2024-09-25 | West |
| 10 | 1450.0 | 2024-10-10 | South |
+----+--------+------------+--------+
Example 1: Computing the Median
The median is the 0.5 quantile, meaning it divides the data into two equal parts. It’s the number in the middle of a set of numbers.
Let’s compute the median of the amount
column using the approx_quantile()
function:
SELECT
approx_quantile(amount, 0.5) AS median_amount
FROM sales;
Output:
+---------------+
| median_amount |
+---------------+
| 375.15 |
+---------------+
Example 2: Computing the 90th Percentile
The 90th percentile is the value below which 90% of the data falls. Let’s compute the 90th percentile of the amount
column.
SELECT
approx_quantile(amount, 0.9) AS percentile_90
FROM sales;
Output:
+-------------------+
| percentile_90 |
+-------------------+
| 1210.374885737896 |
+-------------------+
Example 3: Computing Multiple Quantiles
You can also compute multiple quantiles in a single query. Let’s compute the 25th, 50th, and 75th percentiles of the amount
column.
SELECT
approx_quantile(amount, 0.25) AS percentile_25,
approx_quantile(amount, 0.5) AS percentile_50,
approx_quantile(amount, 0.75) AS percentile_75
FROM sales;
Output:
+---------------+---------------+---------------+
| percentile_25 | percentile_50 | percentile_75 |
+---------------+---------------+---------------+
| 200.5 | 375.15 | 600.5 |
+---------------+---------------+---------------+
Example 4: Using approx_quantile()
with GROUP BY
The approx_quantile()
function can also be used in conjunction with the GROUP BY
clause to compute quantiles for different groups within the data.
Let’s compute the median amount
for each region
.
SELECT
region,
approx_quantile(amount, 0.5) AS median_amount
FROM sales
GROUP BY region;
Output:
+--------+---------------+
| region | median_amount |
+--------+---------------+
| North | 200.5 |
| South | 250.0 |
| East | 685.375 |
| West | 550.745 |
+--------+---------------+
Dates
We can use approx_quantile()
on dates, timestamps, etc:
SELECT
approx_quantile(order_date, 0.5) AS median_order_date
FROM sales;
Output:
+-------------------+
| median_order_date |
+-------------------+
| 2023-05-27 |
+-------------------+
The Quantile Must be Between 0 and 1
The quantile provided at the second argument must be between 0 and 1. If it’s outside of this range, we’ll get an error.
Example of error:
SELECT
approx_quantile(amount, 2) AS median_amount
FROM sales;
Output:
Binder Error: APPROXIMATE QUANTILE can only take parameters in range [0, 1]
Passing the Wrong Argument Type
Passing an argument of the wrong type results in an error:
SELECT
approx_quantile(region, 0.5) AS median_amount
FROM sales;
Output:
Binder Error: No function matches the given name and argument types 'approx_quantile(VARCHAR, DECIMAL(2,1))'. You might need to add explicit type casts.
Candidate functions:
approx_quantile(DECIMAL, FLOAT) -> DECIMAL
approx_quantile(SMALLINT, FLOAT) -> SMALLINT
approx_quantile(INTEGER, FLOAT) -> INTEGER
approx_quantile(BIGINT, FLOAT) -> BIGINT
approx_quantile(HUGEINT, FLOAT) -> HUGEINT
approx_quantile(DOUBLE, FLOAT) -> DOUBLE
approx_quantile(DATE, FLOAT) -> DATE
approx_quantile(TIME, FLOAT) -> TIME
approx_quantile(TIME WITH TIME ZONE, FLOAT) -> TIME WITH TIME ZONE
approx_quantile(TIMESTAMP, FLOAT) -> TIMESTAMP
approx_quantile(TIMESTAMP WITH TIME ZONE, FLOAT) -> TIMESTAMP WITH TIME ZONE
approx_quantile(DECIMAL, FLOAT[]) -> DECIMAL[]
approx_quantile(TINYINT, FLOAT[]) -> TINYINT[]
approx_quantile(SMALLINT, FLOAT[]) -> SMALLINT[]
approx_quantile(INTEGER, FLOAT[]) -> INTEGER[]
approx_quantile(BIGINT, FLOAT[]) -> BIGINT[]
approx_quantile(HUGEINT, FLOAT[]) -> HUGEINT[]
approx_quantile(FLOAT, FLOAT[]) -> FLOAT[]
approx_quantile(DOUBLE, FLOAT[]) -> DOUBLE[]
approx_quantile(DATE, FLOAT[]) -> DATE[]
approx_quantile(TIME, FLOAT[]) -> TIME[]
approx_quantile(TIME WITH TIME ZONE, FLOAT[]) -> TIME WITH TIME ZONE[]
approx_quantile(TIMESTAMP, FLOAT[]) -> TIMESTAMP[]
approx_quantile(TIMESTAMP WITH TIME ZONE, FLOAT[]) -> TIMESTAMP WITH TIME ZONE[]
LINE 2: approx_quantile(region, 0.5) AS...
^
We can see that the error message provides us with some guidance as to which data types are supported.