Most DuckDB distributions come with the JSON extension, and this extension is loaded upon first use. That means we can go ahead and run queries against JSON data right out of the box. One common task we’ll face when working with JSON is extracting data from within the JSON documents. This can include extracting scalar values, or extracting nested JSON from within the outer document.
DuckDB provides us with multiple ways to extract such data. The option we use will largely depend on our use case. Either way, here are seven options for extracting data from JSON documents in DuckDB.
Option 1: The json_extract() Function
We can tell by the name that the json_extract() function is purpose-built for extracting data from JSON. The way it works is, we pass the JSON document as the first argument, and the path as the second, and it will return the value from that path.
Example:
SET VARIABLE json_doc = '{
"name": "Flex Jackson",
"age": 32,
"contact": {
"email": "[email protected]",
"phone": "555-1234"
},
"tags": ["developer", "premium"]
}';
SELECT
json_extract(getvariable('json_doc'), '$.name') AS name,
json_extract(getvariable('json_doc'), '$.age') AS age,
json_extract(getvariable('json_doc'), '$.contact') AS contact,
json_extract(getvariable('json_doc'), '$.tags') AS tags;
Result:
+----------------+-----+-------------------------------------------------+-------------------------+
| name | age | contact | tags |
+----------------+-----+-------------------------------------------------+-------------------------+
| "Flex Jackson" | 32 | {"email":"[email protected]","phone":"555-1234"} | ["developer","premium"] |
+----------------+-----+-------------------------------------------------+-------------------------+
The json_extract() function returns its result as JSON. That’s why the value of the name field is surrounded in double quotes; that’s how strings are represented in JSON documents. We can also see that the result includes nested JSON documents (the contact and tags fields).
Option 2: The json_extract_path() Function
The json_extract_path() function is an alias for json_extract(). So we can use it in exactly the same way we used json_extract():
SET VARIABLE json_doc = '{
"name": "Flex Jackson",
"age": 32,
"contact": {
"email": "[email protected]",
"phone": "555-1234"
},
"tags": ["developer", "premium"]
}';
SELECT
json_extract_path(getvariable('json_doc'), '$.name') AS name,
json_extract_path(getvariable('json_doc'), '$.age') AS age,
json_extract_path(getvariable('json_doc'), '$.contact') AS contact,
json_extract_path(getvariable('json_doc'), '$.tags') AS tags;
Result:
+----------------+-----+-------------------------------------------------+-------------------------+
| name | age | contact | tags |
+----------------+-----+-------------------------------------------------+-------------------------+
| "Flex Jackson" | 32 | {"email":"[email protected]","phone":"555-1234"} | ["developer","premium"] |
+----------------+-----+-------------------------------------------------+-------------------------+
Option 3: The -> Operator
The -> operator does the same thing as the json_extract() function (except with a different syntax):
SET VARIABLE json_doc = '{
"name": "Flex Jackson",
"age": 32,
"contact": {
"email": "[email protected]",
"phone": "555-1234"
},
"tags": ["developer", "premium"]
}';
SELECT
getvariable('json_doc') -> '$.name' AS name,
getvariable('json_doc') -> '$.age' AS age,
getvariable('json_doc') -> '$.contact' AS contact,
getvariable('json_doc') -> '$.tags' AS tags;
Result:
+----------------+-----+-------------------------------------------------+-------------------------+
| name | age | contact | tags |
+----------------+-----+-------------------------------------------------+-------------------------+
| "Flex Jackson" | 32 | {"email":"[email protected]","phone":"555-1234"} | ["developer","premium"] |
+----------------+-----+-------------------------------------------------+-------------------------+
Option 4: The json_extract_string() Function
The json_extract_string() function is similar to json_extract(), except that it returns its result as a string (VARCHAR):
SET VARIABLE json_doc = '{
"name": "Flex Jackson",
"age": 32,
"contact": {
"email": "[email protected]",
"phone": "555-1234"
},
"tags": ["developer", "premium"]
}';
SELECT
json_extract_string(getvariable('json_doc'), '$.name') AS name,
json_extract_string(getvariable('json_doc'), '$.age') AS age,
json_extract_string(getvariable('json_doc'), '$.contact') AS contact,
json_extract_string(getvariable('json_doc'), '$.tags') AS tags;
Result:
+--------------+-----+-------------------------------------------------+-------------------------+
| name | age | contact | tags |
+--------------+-----+-------------------------------------------------+-------------------------+
| Flex Jackson | 32 | {"email":"[email protected]","phone":"555-1234"} | ["developer","premium"] |
+--------------+-----+-------------------------------------------------+-------------------------+
The first column provides a clue that the function returns a string instead of JSON. This value in this column is not surrounded in double quotes like it was when we used the json_extract() function. Of course, we can always use the typeof() function to check the return type of any column. Or, if we’re using the DuckDB CLI, we could change the output mode to duckbox (which displays the data type in the column headers).
Option 5: The json_extract_path_text() Function
The json_extract_path_text() function is an alias for json_extract_string(), so we can replace the previous example with the following:
SET VARIABLE json_doc = '{
"name": "Flex Jackson",
"age": 32,
"contact": {
"email": "[email protected]",
"phone": "555-1234"
},
"tags": ["developer", "premium"]
}';
SELECT
json_extract_path_text(getvariable('json_doc'), '$.name') AS name,
json_extract_path_text(getvariable('json_doc'), '$.age') AS age,
json_extract_path_text(getvariable('json_doc'), '$.contact') AS contact,
json_extract_path_text(getvariable('json_doc'), '$.tags') AS tags;
Result:
+--------------+-----+-------------------------------------------------+-------------------------+
| name | age | contact | tags |
+--------------+-----+-------------------------------------------------+-------------------------+
| Flex Jackson | 32 | {"email":"[email protected]","phone":"555-1234"} | ["developer","premium"] |
+--------------+-----+-------------------------------------------------+-------------------------+
Option 6: The ->> Operator
The ->> operator does exactly what json_extract_string() does. So we can replace the previous two examples with the following:
SET VARIABLE json_doc = '{
"name": "Flex Jackson",
"age": 32,
"contact": {
"email": "[email protected]",
"phone": "555-1234"
},
"tags": ["developer", "premium"]
}';
SELECT
getvariable('json_doc') ->> '$.name' AS name,
getvariable('json_doc') ->> '$.age' AS age,
getvariable('json_doc') ->> '$.contact' AS contact,
getvariable('json_doc') ->> '$.tags' AS tags;
Result:
+--------------+-----+-------------------------------------------------+-------------------------+
| name | age | contact | tags |
+--------------+-----+-------------------------------------------------+-------------------------+
| Flex Jackson | 32 | {"email":"[email protected]","phone":"555-1234"} | ["developer","premium"] |
+--------------+-----+-------------------------------------------------+-------------------------+
Option 7: The json_value() Function
The json_value() function returns scalar data from JSON documents. We can use it in a similar way to the other functions listed above, but it will only return scalar data. If a value isn’t scalar, then it returns NULL:
SET VARIABLE json_doc = '{
"name": "Flex Jackson",
"age": 32,
"contact": {
"email": "[email protected]",
"phone": "555-1234"
},
"tags": ["developer", "premium"]
}';
SELECT
json_value(getvariable('json_doc'), '$.name') AS name,
json_value(getvariable('json_doc'), '$.age') AS age,
json_value(getvariable('json_doc'), '$.contact') AS contact,
json_value(getvariable('json_doc'), '$.tags') AS tags;
Result:
+----------------+-----+---------+------+
| name | age | contact | tags |
+----------------+-----+---------+------+
| "Flex Jackson" | 32 | NULL | NULL |
+----------------+-----+---------+------+
The data is returned as JSON, which we can tell by the double quotes around the user’s name here. But we can also see that the last two columns contain NULL values. That’s because the values in those fields are not scalar. Those fields contain JSON, and so that doesn’t work for this function. That’s not to say that we can’t dig deeper into those nested JSON documents:
SET VARIABLE json_doc = '{
"name": "Flex Jackson",
"age": 32,
"contact": {
"email": "[email protected]",
"phone": "555-1234"
},
"tags": ["developer", "premium"]
}';
SELECT
json_value(getvariable('json_doc'), '$.name') AS name,
json_value(getvariable('json_doc'), '$.age') AS age,
json_value(getvariable('json_doc'), '$.contact.email') AS email,
json_value(getvariable('json_doc'), '$.contact.phone') AS phone,
json_value(getvariable('json_doc'), '$.tags[0]') AS tag_1,
json_value(getvariable('json_doc'), '$.tags[1]') AS tag_2;
Result:
+----------------+-----+--------------------+------------+-------------+-----------+
| name | age | email | phone | tag_1 | tag_2 |
+----------------+-----+--------------------+------------+-------------+-----------+
| "Flex Jackson" | 32 | "[email protected]" | "555-1234" | "developer" | "premium" |
+----------------+-----+--------------------+------------+-------------+-----------+
JSON arrays are zero based, and so that’s why tag_1 was extracted with $.tags[0] and tag_2 was extracted with $.tags[1].