The json_group_structure()
function in DuckDB is an aggregate function that inspects all JSON values within a group and returns a JSON representation of their structure. It essentially infers a “schema” for the JSON objects in that group. This can be useful for understanding the shape and consistency of your JSON data.
Syntax
The basic syntax goes like this:
json_group_structure(json)
This is an aggregate function that’s typically used in queries with a GROUP BY
clause.
Example 1
Let’s illustrate with some examples.
We’ll start by creating and populating a table called items
:
CREATE TABLE items (
id INTEGER,
category VARCHAR,
data JSON
);
INSERT INTO items VALUES
(1, 'A', '{"name": "apple", "qty": 10}'),
(2, 'A', '{"name": "banana", "qty": 20, "fresh": true}'),
(3, 'B', '{"name": "carrot", "color": "orange"}'),
(4, 'A', '{"name": "orange", "qty": 15, "organic": true}'),
(5, 'B', '{"name": "broccoli", "color": "green", "qty": 5}');
One way to use json_group_structure()
is apply it to the whole dataset:
SELECT json_group_structure(data) FROM items;
Result:
{"name":"VARCHAR","qty":"UBIGINT","fresh":"BOOLEAN","color":"VARCHAR","organic":"BOOLEAN"}
We can see that it returned a JSON structure that’s inclusive of all the JSON documents. For example, not all documents have a fresh
or color
field, but those fields are included in the result.
Another way to use the function is with a GROUP BY
query:
SELECT
category,
json_group_structure(data) AS structure
FROM items
GROUP BY category
ORDER BY category;
Result:
+----------+--------------------------------------------------------------------------+
| category | structure |
+----------+--------------------------------------------------------------------------+
| A | {"name":"VARCHAR","qty":"UBIGINT","fresh":"BOOLEAN","organic":"BOOLEAN"} |
| B | {"name":"VARCHAR","color":"VARCHAR","qty":"UBIGINT"} |
+----------+--------------------------------------------------------------------------+
This time we get a different structure, depending on which group each JSON document belongs to.
Example 2
Suppose we have a table named events
with a column payload
containing JSON data:
CREATE TABLE events (
event_type VARCHAR,
payload JSON
);
INSERT INTO events VALUES
('login', '{"user_id": 1, "timestamp": "2024-05-23T10:00:00Z", "ip_address": "192.168.1.1"}'),
('login', '{"user_id": 2, "timestamp": "2024-05-23T10:05:00Z", "ip_address": "192.168.1.2", "session_id": "abc"}'),
('click', '{"element_id": "button_A", "timestamp": "2024-05-23T10:02:00Z", "user_id": 1}'),
('click', '{"element_id": "link_B", "timestamp": "2024-05-23T10:07:00Z", "user_id": 2, "target_url": "https://example.com"}'),
('purchase', '{"item_id": 101, "quantity": 1, "user_id": 1, "price": 29.99, "details": {"category": "electronics", "rating": 5}}'),
('purchase', '{"item_id": 202, "quantity": 2, "user_id": 3, "price": 15.00, "details": {"category": "books"}}');
Let’s find the JSON structure for each event_type
:
SELECT
event_type,
json_group_structure(payload) AS structure
FROM events
GROUP BY event_type
ORDER BY event_type;
Output:
+------------+-------------------------------------------------------------------------------------------------------------------------------------+
| event_type | structure |
+------------+-------------------------------------------------------------------------------------------------------------------------------------+
| click | {"element_id":"VARCHAR","timestamp":"VARCHAR","user_id":"UBIGINT","target_url":"VARCHAR"} |
| login | {"user_id":"UBIGINT","timestamp":"VARCHAR","ip_address":"VARCHAR","session_id":"VARCHAR"} |
| purchase | {"item_id":"UBIGINT","quantity":"UBIGINT","user_id":"UBIGINT","price":"DOUBLE","details":{"category":"VARCHAR","rating":"UBIGINT"}} |
+------------+-------------------------------------------------------------------------------------------------------------------------------------+
We can remove the GROUP BY
clause if we want to see the structure across all events, regardless of type:
SELECT json_group_structure(payload) AS overall_structure
FROM events;
Result:
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| overall_structure |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"user_id":"UBIGINT","timestamp":"VARCHAR","ip_address":"VARCHAR","session_id":"VARCHAR","element_id":"VARCHAR","target_url":"VARCHAR","item_id":"UBIGINT","quantity":"UBIGINT","price":"DOUBLE","details":{"category":"VARCHAR","rating":"UBIGINT"}} |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Use Cases
The json_group_structure()
function could be useful in a variety of scenarios. For example:
- Data Exploration: Quickly understand the schema of JSON data that might not have a predefined or enforced structure.
- Data Quality Checks: Identify inconsistencies in your JSON data (e.g., a field that is sometimes a string and sometimes an integer).
- Schema Inference: If you’re planning to flatten JSON data into relational columns, this function can help you decide on the column names and data types.
- Debugging ETL Pipelines: When processing JSON data, if you encounter errors,
json_group_structure()
can help pinpoint structural issues.
Non-Aggregate Version: json_structure()
DuckDB also provides a similar function called json_structure()
. This function returns the structure of a JSON document, but it’s not an aggregate function. Therefore, it will output the JSON structure of the documents, row by row. Here’s an example of what I mean:
SELECT
event_type,
json_structure(payload) AS json_structure
FROM events;
Result:
+------------+-------------------------------------------------------------------------------------------------------------------------------------+
| event_type | json_structure |
+------------+-------------------------------------------------------------------------------------------------------------------------------------+
| login | {"user_id":"UBIGINT","timestamp":"VARCHAR","ip_address":"VARCHAR"} |
| login | {"user_id":"UBIGINT","timestamp":"VARCHAR","ip_address":"VARCHAR","session_id":"VARCHAR"} |
| click | {"element_id":"VARCHAR","timestamp":"VARCHAR","user_id":"UBIGINT"} |
| click | {"element_id":"VARCHAR","timestamp":"VARCHAR","user_id":"UBIGINT","target_url":"VARCHAR"} |
| purchase | {"item_id":"UBIGINT","quantity":"UBIGINT","user_id":"UBIGINT","price":"DOUBLE","details":{"category":"VARCHAR","rating":"UBIGINT"}} |
| purchase | {"item_id":"UBIGINT","quantity":"UBIGINT","user_id":"UBIGINT","price":"DOUBLE","details":{"category":"VARCHAR"}} |
+------------+-------------------------------------------------------------------------------------------------------------------------------------+