Exploring ARRAY_AGG() in DuckDB

array_agg() is an aggregate function in DuckDB that allows you to combine values from multiple rows into a list. This article explores how array_agg() works, along with some examples that demonstrate its usage.

Syntax

The syntax is simple:

array_agg(arg)

Where arg is the name of the column that contains the data you want to put into a list.

The function also has an alias called list(), so the following syntax is valid:

list(arg)

Both forms return the same result.

Basic Functionality

At its core, array_agg() collects values from multiple rows and combines them into a single list. Unlike simple aggregations like sum() or avg() that return a single scalar value, array_agg() preserves all the individual values in their original form.

Despite its name, array_agg() actually returns a LIST data type. In DuckDB, a list is similar to an array. The difference is that a list can contain any number of elements, whereas an array has a fixed number of elements. According to the DuckDB documentation, DuckDB provides array... functions (such as array_agg()) for PostgreSQL compatibility. In PostgreSQL, arrays can be variable length whereas in DuckDB they are a fixed length.

In other words, DuckDB lists are a variable length version of arrays. However, all elements in a list or array must be of the same underlying type.

Example

Suppose we have the following data:

CREATE TABLE sales (
    product_id INT,
    category VARCHAR,
    sale_amount DECIMAL
);

INSERT INTO sales VALUES 
    (1, 'Electronics', 100),
    (2, 'Electronics', 200),
    (3, 'Clothing', 50),
    (4, 'Electronics', 150),
    (5, 'Electronics', NULL),
    (6, 'Clothing', 120);

SELECT * from sales;

Output:

+------------+-------------+-------------+
| product_id | category | sale_amount |
+------------+-------------+-------------+
| 1 | Electronics | 100.000 |
| 2 | Electronics | 200.000 |
| 3 | Clothing | 50.000 |
| 4 | Electronics | 150.000 |
| 5 | Electronics | null |
| 6 | Clothing | 120.000 |
+------------+-------------+-------------+

We can use array_agg() to combine all the sale amounts into a list:

SELECT 
    array_agg(sale_amount) AS sales_list
FROM sales;

This returns:

+----------------------------------------------------+
| sales_list |
+----------------------------------------------------+
| [100.000, 200.000, 50.000, 150.000, NULL, 120.000] |
+----------------------------------------------------+

Grouping

We can use a GROUP BY clause to break it down by category:

SELECT 
    category,
    array_agg(sale_amount) AS sales_list
FROM sales
GROUP BY category;

Result:

+-------------+-----------------------------------+
| category | sales_list |
+-------------+-----------------------------------+
| Electronics | [100.000, 200.000, 150.000, NULL] |
| Clothing | [50.000, 120.000] |
+-------------+-----------------------------------+

Ordering

We can use the ORDER BY clause within array_agg() to sort the results within each list:

SELECT 
    category,
    array_agg(sale_amount ORDER BY sale_amount DESC) AS sorted_sales
FROM sales
GROUP BY category;

Result:

+-------------+-----------------------------------+
| category | sorted_sales |
+-------------+-----------------------------------+
| Electronics | [200.000, 150.000, 100.000, NULL] |
| Clothing | [120.000, 50.000] |
+-------------+-----------------------------------+

To sort the actual rows returned, we can apply an ORDER BY clause against the whole query:

SELECT 
    category,
    array_agg(sale_amount ORDER BY sale_amount DESC) AS sorted_sales
FROM sales
GROUP BY category
ORDER BY category ASC;

Result:

+-------------+-----------------------------------+
| category | sorted_sales |
+-------------+-----------------------------------+
| Clothing | [120.000, 50.000] |
| Electronics | [200.000, 150.000, 100.000, NULL] |
+-------------+-----------------------------------+

Here, we’ve applied two sets of sorting; one for the individual lists in the sorted_sales column, and another for the rows returned by the query (they’re sorted by the category column).

Handling NULL Values

The array_agg() includes NULL values in the resulting array by default. Sometimes you might want to filter them out.

One way to do this is with the WHERE clause against the query:

SELECT 
    category,
    array_agg(sale_amount) AS sales_array
FROM sales
WHERE sale_amount IS NOT NULL
GROUP BY category;

Output:

+-------------+-----------------------------+
| category | sales_array |
+-------------+-----------------------------+
| Electronics | [100.000, 200.000, 150.000] |
| Clothing | [50.000, 120.000] |
+-------------+-----------------------------+

We can alternatively use the FILTER clause in the array_agg() column itself:

SELECT 
    category,
    array_agg(sale_amount) FILTER (WHERE sale_amount IS NOT NULL) as sales_array
FROM sales
GROUP BY category;

Output:

+-------------+-----------------------------+
| category | sales_array |
+-------------+-----------------------------+
| Electronics | [100.000, 200.000, 150.000] |
| Clothing | [50.000, 120.000] |
+-------------+-----------------------------+

Multiple Columns in Arrays

We can create arrays of structured data by combining multiple columns:

SELECT 
    category,
    array_agg(struct_pack(
        product_id := product_id,
        amount := sale_amount
    )) as sale_details
FROM sales
GROUP BY category;

Result:

+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| category | sale_details |
+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| Clothing | [{'product_id': 3, 'amount': 50.000}, {'product_id': 6, 'amount': 120.000}] |
| Electronics | [{'product_id': 1, 'amount': 100.000}, {'product_id': 2, 'amount': 200.000}, {'product_id': 4, 'amount': 150.000}, {'product_id': 5, 'amount': NULL}] |
+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+

Window Function Usage

array_agg() can be used as a window function to create rolling or cumulative arrays:

SELECT 
    *,
    array_agg(sale_amount) 
        OVER (PARTITION BY category ORDER BY sale_amount) as cumulative_sales
FROM sales;

Result:

+------------+-------------+-------------+-----------------------------------+
| product_id | category | sale_amount | cumulative_sales |
+------------+-------------+-------------+-----------------------------------+
| 1 | Electronics | 100.000 | [100.000] |
| 4 | Electronics | 150.000 | [100.000, 150.000] |
| 2 | Electronics | 200.000 | [100.000, 150.000, 200.000] |
| 5 | Electronics | null | [100.000, 150.000, 200.000, NULL] |
| 3 | Clothing | 50.000 | [50.000] |
| 6 | Clothing | 120.000 | [50.000, 120.000] |
+------------+-------------+-------------+-----------------------------------+