APPROX_QUANTILE() Examples in DuckDB

DuckDB is an in-process SQL OLAP database management system designed for analytical workloads. One of its handy features is the ability to compute approximate quantiles efficiently using the approx_quantile() function. This function is particularly useful when working with large datasets where exact quantile computation would be computationally expensive.

In this article, we will explore how the approx_quantile() function works, its syntax, and provide examples to demonstrate its usage.

What is the approx_quantile() Function?

The approx_quantile() function in DuckDB computes an approximate quantile of a given column.

Quantiles are points that divide a dataset into equal-sized intervals. For example, the median is the 0.5 quantile, meaning it divides the data into two equal parts. The approx_quantile() function is useful when you need to estimate quantiles quickly, especially in large datasets, without the overhead of exact computation.

Syntax

The syntax for the approx_quantile() function is as follows:

approx_quantile(column, quantile)
  • column: The column for which you want to compute the quantile.
  • quantile: The quantile value you want to compute, which must be between 0 and 1.

Sample Data

Before we dive into examples, let’s create a table and populate it with some data. We’ll use this table to demonstrate how the approx_quantile() function works:

-- Create the table
CREATE TABLE sales (
    id INTEGER,
    amount DOUBLE,
    order_date DATE,
    region TEXT
);

-- Insert sample data
INSERT INTO sales (id, amount, order_date, region) VALUES
(1, 200.50, '2021-01-01', 'North'),
(2, 50.00, '2022-02-15', 'South'),
(3, 150.33, '2023-03-10', 'North'),
(4, 970.75, '2023-04-22', 'East'),
(5, 250.00, '2023-05-05', 'South'),
(6, 400.00, '2023-06-18', 'East'),
(7, 350.30, '2023-07-30', 'North'),
(8, 500.99, '2023-08-12', 'West'),
(9, 600.50, '2024-09-25', 'West'),
(10, 1450.00, '2024-10-10', 'South');

-- Return all data
SELECT * FROM sales;

Output:

+----+--------+------------+--------+
| id | amount | order_date | region |
+----+--------+------------+--------+
| 1 | 200.5 | 2021-01-01 | North |
| 2 | 50.0 | 2022-02-15 | South |
| 3 | 150.33 | 2023-03-10 | North |
| 4 | 970.75 | 2023-04-22 | East |
| 5 | 250.0 | 2023-05-05 | South |
| 6 | 400.0 | 2023-06-18 | East |
| 7 | 350.3 | 2023-07-30 | North |
| 8 | 500.99 | 2023-08-12 | West |
| 9 | 600.5 | 2024-09-25 | West |
| 10 | 1450.0 | 2024-10-10 | South |
+----+--------+------------+--------+

Example 1: Computing the Median

The median is the 0.5 quantile, meaning it divides the data into two equal parts. It’s the number in the middle of a set of numbers.

Let’s compute the median of the amount column using the approx_quantile() function:

SELECT 
    approx_quantile(amount, 0.5) AS median_amount
FROM sales;

Output:

+---------------+
| median_amount |
+---------------+
| 375.15 |
+---------------+

Example 2: Computing the 90th Percentile

The 90th percentile is the value below which 90% of the data falls. Let’s compute the 90th percentile of the amount column.

SELECT 
    approx_quantile(amount, 0.9) AS percentile_90
FROM sales;

Output:

+-------------------+
| percentile_90 |
+-------------------+
| 1210.374885737896 |
+-------------------+

Example 3: Computing Multiple Quantiles

You can also compute multiple quantiles in a single query. Let’s compute the 25th, 50th, and 75th percentiles of the amount column.

SELECT
    approx_quantile(amount, 0.25) AS percentile_25,
    approx_quantile(amount, 0.5) AS percentile_50,
    approx_quantile(amount, 0.75) AS percentile_75
FROM sales;

Output:

+---------------+---------------+---------------+
| percentile_25 | percentile_50 | percentile_75 |
+---------------+---------------+---------------+
| 200.5 | 375.15 | 600.5 |
+---------------+---------------+---------------+

Example 4: Using approx_quantile() with GROUP BY

The approx_quantile() function can also be used in conjunction with the GROUP BY clause to compute quantiles for different groups within the data.

Let’s compute the median amount for each region.

SELECT
    region,
    approx_quantile(amount, 0.5) AS median_amount
FROM sales
GROUP BY region;

Output:

+--------+---------------+
| region | median_amount |
+--------+---------------+
| North | 200.5 |
| South | 250.0 |
| East | 685.375 |
| West | 550.745 |
+--------+---------------+

Dates

We can use approx_quantile() on dates, timestamps, etc:

SELECT 
    approx_quantile(order_date, 0.5) AS median_order_date
FROM sales;

Output:

+-------------------+
| median_order_date |
+-------------------+
| 2023-05-27 |
+-------------------+

The Quantile Must be Between 0 and 1

The quantile provided at the second argument must be between 0 and 1. If it’s outside of this range, we’ll get an error.

Example of error:

SELECT 
    approx_quantile(amount, 2) AS median_amount
FROM sales;

Output:

Binder Error: APPROXIMATE QUANTILE can only take parameters in range [0, 1]

Passing the Wrong Argument Type

Passing an argument of the wrong type results in an error:

SELECT 
    approx_quantile(region, 0.5) AS median_amount
FROM sales;

Output:

Binder Error: No function matches the given name and argument types 'approx_quantile(VARCHAR, DECIMAL(2,1))'. You might need to add explicit type casts.
Candidate functions:
approx_quantile(DECIMAL, FLOAT) -> DECIMAL
approx_quantile(SMALLINT, FLOAT) -> SMALLINT
approx_quantile(INTEGER, FLOAT) -> INTEGER
approx_quantile(BIGINT, FLOAT) -> BIGINT
approx_quantile(HUGEINT, FLOAT) -> HUGEINT
approx_quantile(DOUBLE, FLOAT) -> DOUBLE
approx_quantile(DATE, FLOAT) -> DATE
approx_quantile(TIME, FLOAT) -> TIME
approx_quantile(TIME WITH TIME ZONE, FLOAT) -> TIME WITH TIME ZONE
approx_quantile(TIMESTAMP, FLOAT) -> TIMESTAMP
approx_quantile(TIMESTAMP WITH TIME ZONE, FLOAT) -> TIMESTAMP WITH TIME ZONE
approx_quantile(DECIMAL, FLOAT[]) -> DECIMAL[]
approx_quantile(TINYINT, FLOAT[]) -> TINYINT[]
approx_quantile(SMALLINT, FLOAT[]) -> SMALLINT[]
approx_quantile(INTEGER, FLOAT[]) -> INTEGER[]
approx_quantile(BIGINT, FLOAT[]) -> BIGINT[]
approx_quantile(HUGEINT, FLOAT[]) -> HUGEINT[]
approx_quantile(FLOAT, FLOAT[]) -> FLOAT[]
approx_quantile(DOUBLE, FLOAT[]) -> DOUBLE[]
approx_quantile(DATE, FLOAT[]) -> DATE[]
approx_quantile(TIME, FLOAT[]) -> TIME[]
approx_quantile(TIME WITH TIME ZONE, FLOAT[]) -> TIME WITH TIME ZONE[]
approx_quantile(TIMESTAMP, FLOAT[]) -> TIMESTAMP[]
approx_quantile(TIMESTAMP WITH TIME ZONE, FLOAT[]) -> TIMESTAMP WITH TIME ZONE[]

LINE 2: approx_quantile(region, 0.5) AS...
^

We can see that the error message provides us with some guidance as to which data types are supported.