array_agg()
is an aggregate function in DuckDB that allows you to combine values from multiple rows into a list. This article explores how array_agg()
works, along with some examples that demonstrate its usage.
Syntax
The syntax is simple:
array_agg(arg)
Where arg
is the name of the column that contains the data you want to put into a list.
The function also has an alias called list()
, so the following syntax is valid:
list(arg)
Both forms return the same result.
Basic Functionality
At its core, array_agg()
collects values from multiple rows and combines them into a single list. Unlike simple aggregations like sum()
or avg()
that return a single scalar value, array_agg()
preserves all the individual values in their original form.
Despite its name, array_agg()
actually returns a LIST
data type. In DuckDB, a list is similar to an array. The difference is that a list can contain any number of elements, whereas an array has a fixed number of elements. According to the DuckDB documentation, DuckDB provides array...
functions (such as array_agg()
) for PostgreSQL compatibility. In PostgreSQL, arrays can be variable length whereas in DuckDB they are a fixed length.
In other words, DuckDB lists are a variable length version of arrays. However, all elements in a list or array must be of the same underlying type.
Example
Suppose we have the following data:
CREATE TABLE sales (
product_id INT,
category VARCHAR,
sale_amount DECIMAL
);
INSERT INTO sales VALUES
(1, 'Electronics', 100),
(2, 'Electronics', 200),
(3, 'Clothing', 50),
(4, 'Electronics', 150),
(5, 'Electronics', NULL),
(6, 'Clothing', 120);
SELECT * from sales;
Output:
+------------+-------------+-------------+
| product_id | category | sale_amount |
+------------+-------------+-------------+
| 1 | Electronics | 100.000 |
| 2 | Electronics | 200.000 |
| 3 | Clothing | 50.000 |
| 4 | Electronics | 150.000 |
| 5 | Electronics | null |
| 6 | Clothing | 120.000 |
+------------+-------------+-------------+
We can use array_agg()
to combine all the sale amounts into a list:
SELECT
array_agg(sale_amount) AS sales_list
FROM sales;
This returns:
+----------------------------------------------------+
| sales_list |
+----------------------------------------------------+
| [100.000, 200.000, 50.000, 150.000, NULL, 120.000] |
+----------------------------------------------------+
Grouping
We can use a GROUP BY
clause to break it down by category:
SELECT
category,
array_agg(sale_amount) AS sales_list
FROM sales
GROUP BY category;
Result:
+-------------+-----------------------------------+
| category | sales_list |
+-------------+-----------------------------------+
| Electronics | [100.000, 200.000, 150.000, NULL] |
| Clothing | [50.000, 120.000] |
+-------------+-----------------------------------+
Ordering
We can use the ORDER BY
clause within array_agg()
to sort the results within each list:
SELECT
category,
array_agg(sale_amount ORDER BY sale_amount DESC) AS sorted_sales
FROM sales
GROUP BY category;
Result:
+-------------+-----------------------------------+
| category | sorted_sales |
+-------------+-----------------------------------+
| Electronics | [200.000, 150.000, 100.000, NULL] |
| Clothing | [120.000, 50.000] |
+-------------+-----------------------------------+
To sort the actual rows returned, we can apply an ORDER BY
clause against the whole query:
SELECT
category,
array_agg(sale_amount ORDER BY sale_amount DESC) AS sorted_sales
FROM sales
GROUP BY category
ORDER BY category ASC;
Result:
+-------------+-----------------------------------+
| category | sorted_sales |
+-------------+-----------------------------------+
| Clothing | [120.000, 50.000] |
| Electronics | [200.000, 150.000, 100.000, NULL] |
+-------------+-----------------------------------+
Here, we’ve applied two sets of sorting; one for the individual lists in the sorted_sales
column, and another for the rows returned by the query (they’re sorted by the category
column).
Handling NULL
Values
The array_agg()
includes NULL values in the resulting array by default. Sometimes you might want to filter them out.
One way to do this is with the WHERE
clause against the query:
SELECT
category,
array_agg(sale_amount) AS sales_array
FROM sales
WHERE sale_amount IS NOT NULL
GROUP BY category;
Output:
+-------------+-----------------------------+
| category | sales_array |
+-------------+-----------------------------+
| Electronics | [100.000, 200.000, 150.000] |
| Clothing | [50.000, 120.000] |
+-------------+-----------------------------+
We can alternatively use the FILTER
clause in the array_agg()
column itself:
SELECT
category,
array_agg(sale_amount) FILTER (WHERE sale_amount IS NOT NULL) as sales_array
FROM sales
GROUP BY category;
Output:
+-------------+-----------------------------+
| category | sales_array |
+-------------+-----------------------------+
| Electronics | [100.000, 200.000, 150.000] |
| Clothing | [50.000, 120.000] |
+-------------+-----------------------------+
Multiple Columns in Arrays
We can create arrays of structured data by combining multiple columns:
SELECT
category,
array_agg(struct_pack(
product_id := product_id,
amount := sale_amount
)) as sale_details
FROM sales
GROUP BY category;
Result:
+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| category | sale_details |
+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| Clothing | [{'product_id': 3, 'amount': 50.000}, {'product_id': 6, 'amount': 120.000}] |
| Electronics | [{'product_id': 1, 'amount': 100.000}, {'product_id': 2, 'amount': 200.000}, {'product_id': 4, 'amount': 150.000}, {'product_id': 5, 'amount': NULL}] |
+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
Window Function Usage
array_agg()
can be used as a window function to create rolling or cumulative arrays:
SELECT
*,
array_agg(sale_amount)
OVER (PARTITION BY category ORDER BY sale_amount) as cumulative_sales
FROM sales;
Result:
+------------+-------------+-------------+-----------------------------------+
| product_id | category | sale_amount | cumulative_sales |
+------------+-------------+-------------+-----------------------------------+
| 1 | Electronics | 100.000 | [100.000] |
| 4 | Electronics | 150.000 | [100.000, 150.000] |
| 2 | Electronics | 200.000 | [100.000, 150.000, 200.000] |
| 5 | Electronics | null | [100.000, 150.000, 200.000, NULL] |
| 3 | Clothing | 50.000 | [50.000] |
| 6 | Clothing | 120.000 | [50.000, 120.000] |
+------------+-------------+-------------+-----------------------------------+