Performance Tip for Extracting Multiple Values from JSON in DuckDB

DuckDB has a bunch of functions that allow us to extract data from JSON documents. For example, there’s the json_extract() function, which extracts JSON from the specified JSON document.

Often times we’ll need to extract multiple values within the same query. For example, we may need to extract both a user’s name and age, so that they’re returned in two separate columns.

Extracting multiple values from the same JSON document can have a performance hit if we have to make multiple calls to the same function. The DuckDB documentation provides us with a tip for avoiding this performance hit. Let’s walk through an example.

Sample Data

Suppose we create and populate the following table:

CREATE TABLE users (
    user_details JSON
);

INSERT INTO users VALUES
    ( '{"name": "Bruno", "scores": [11, 17, 20] }' ),
    ( '{"name": "Kaylee", "scores": [8, 7, 14]}' ),
    ( '{"name": "Lisa", "scores": [21, 18, 32]}' );

Let’s use the json_extract() function to extract data from that table.

Less-Efficient Approach

First we’ll use the less-efficient method to extract various fields from the JSON document:

SELECT 
    json_extract(user_details, '$.name') AS name,
    json_extract(user_details, '$.scores') AS scores
FROM users;

Output:

+----------+-----+
| name | age |
+----------+-----+
| "Bruno" | 36 |
| "Kaylee" | 20 |
| "Lisa" | 25 |
+----------+-----+

In this case, we used two calls to json_extract() in the SELECT list. According to DuckDB’s documentation, this is not an efficient method, because the JSON has to be parsed multiple times. In this case it will need to be parsed twice; once for each call to json_extract(). This approach uses more memory and can result in a slower query.

A More Efficient Approach

A more efficient approach is to extract a list of paths, as recommended by the aforementioned DuckDB documentation.

So we can modify our query as follows:

WITH extracted AS (
    SELECT json_extract(user_details, ['name', 'scores']) AS extracted_list
    FROM users
)
SELECT
    extracted_list[1] AS name,
    extracted_list[2] AS scores
FROM extracted;

Output:

+----------+------------+
| name | scores |
+----------+------------+
| "Bruno" | [11,17,20] |
| "Kaylee" | [8,7,14] |
| "Lisa" | [21,18,32] |
+----------+------------+

Same result, but faster and more memory-efficient.