Understanding the JSON_GROUP_STRUCTURE() Function in DuckDB

The json_group_structure() function in DuckDB is an aggregate function that inspects all JSON values within a group and returns a JSON representation of their structure. It essentially infers a “schema” for the JSON objects in that group. This can be useful for understanding the shape and consistency of your JSON data.

Syntax

The basic syntax goes like this:

json_group_structure(json)

This is an aggregate function that’s typically used in queries with a GROUP BY clause.

Example 1

Let’s illustrate with some examples.

We’ll start by creating and populating a table called items:

CREATE TABLE items (
    id INTEGER,
    category VARCHAR,
    data JSON
);

INSERT INTO items VALUES
    (1, 'A', '{"name": "apple", "qty": 10}'),
    (2, 'A', '{"name": "banana", "qty": 20, "fresh": true}'),
    (3, 'B', '{"name": "carrot", "color": "orange"}'),
    (4, 'A', '{"name": "orange", "qty": 15, "organic": true}'),
    (5, 'B', '{"name": "broccoli", "color": "green", "qty": 5}');

One way to use json_group_structure() is apply it to the whole dataset:

SELECT json_group_structure(data) FROM items;

Result:

{"name":"VARCHAR","qty":"UBIGINT","fresh":"BOOLEAN","color":"VARCHAR","organic":"BOOLEAN"}

We can see that it returned a JSON structure that’s inclusive of all the JSON documents. For example, not all documents have a fresh or color field, but those fields are included in the result.

Another way to use the function is with a GROUP BY query:

SELECT
    category,
    json_group_structure(data) AS structure
FROM items
GROUP BY category
ORDER BY category;

Result:

+----------+--------------------------------------------------------------------------+
| category | structure |
+----------+--------------------------------------------------------------------------+
| A | {"name":"VARCHAR","qty":"UBIGINT","fresh":"BOOLEAN","organic":"BOOLEAN"} |
| B | {"name":"VARCHAR","color":"VARCHAR","qty":"UBIGINT"} |
+----------+--------------------------------------------------------------------------+

This time we get a different structure, depending on which group each JSON document belongs to.

Example 2

Suppose we have a table named events with a column payload containing JSON data:

CREATE TABLE events (
    event_type VARCHAR,
    payload JSON
);

INSERT INTO events VALUES
    ('login', '{"user_id": 1, "timestamp": "2024-05-23T10:00:00Z", "ip_address": "192.168.1.1"}'),
    ('login', '{"user_id": 2, "timestamp": "2024-05-23T10:05:00Z", "ip_address": "192.168.1.2", "session_id": "abc"}'),
    ('click', '{"element_id": "button_A", "timestamp": "2024-05-23T10:02:00Z", "user_id": 1}'),
    ('click', '{"element_id": "link_B", "timestamp": "2024-05-23T10:07:00Z", "user_id": 2, "target_url": "https://example.com"}'),
    ('purchase', '{"item_id": 101, "quantity": 1, "user_id": 1, "price": 29.99, "details": {"category": "electronics", "rating": 5}}'),
    ('purchase', '{"item_id": 202, "quantity": 2, "user_id": 3, "price": 15.00, "details": {"category": "books"}}');

Let’s find the JSON structure for each event_type:

SELECT
    event_type,
    json_group_structure(payload) AS structure
FROM events
GROUP BY event_type
ORDER BY event_type;

Output:

+------------+-------------------------------------------------------------------------------------------------------------------------------------+
| event_type | structure |
+------------+-------------------------------------------------------------------------------------------------------------------------------------+
| click | {"element_id":"VARCHAR","timestamp":"VARCHAR","user_id":"UBIGINT","target_url":"VARCHAR"} |
| login | {"user_id":"UBIGINT","timestamp":"VARCHAR","ip_address":"VARCHAR","session_id":"VARCHAR"} |
| purchase | {"item_id":"UBIGINT","quantity":"UBIGINT","user_id":"UBIGINT","price":"DOUBLE","details":{"category":"VARCHAR","rating":"UBIGINT"}} |
+------------+-------------------------------------------------------------------------------------------------------------------------------------+

We can remove the GROUP BY clause if we want to see the structure across all events, regardless of type:

SELECT json_group_structure(payload) AS overall_structure
FROM events;

Result:

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| overall_structure |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"user_id":"UBIGINT","timestamp":"VARCHAR","ip_address":"VARCHAR","session_id":"VARCHAR","element_id":"VARCHAR","target_url":"VARCHAR","item_id":"UBIGINT","quantity":"UBIGINT","price":"DOUBLE","details":{"category":"VARCHAR","rating":"UBIGINT"}} |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Use Cases

The json_group_structure() function could be useful in a variety of scenarios. For example:

  • Data Exploration: Quickly understand the schema of JSON data that might not have a predefined or enforced structure.
  • Data Quality Checks: Identify inconsistencies in your JSON data (e.g., a field that is sometimes a string and sometimes an integer).
  • Schema Inference: If you’re planning to flatten JSON data into relational columns, this function can help you decide on the column names and data types.
  • Debugging ETL Pipelines: When processing JSON data, if you encounter errors, json_group_structure() can help pinpoint structural issues.

Non-Aggregate Version: json_structure()

DuckDB also provides a similar function called json_structure(). This function returns the structure of a JSON document, but it’s not an aggregate function. Therefore, it will output the JSON structure of the documents, row by row. Here’s an example of what I mean:

SELECT 
    event_type, 
    json_structure(payload) AS json_structure
FROM events;

Result:

+------------+-------------------------------------------------------------------------------------------------------------------------------------+
| event_type | json_structure |
+------------+-------------------------------------------------------------------------------------------------------------------------------------+
| login | {"user_id":"UBIGINT","timestamp":"VARCHAR","ip_address":"VARCHAR"} |
| login | {"user_id":"UBIGINT","timestamp":"VARCHAR","ip_address":"VARCHAR","session_id":"VARCHAR"} |
| click | {"element_id":"VARCHAR","timestamp":"VARCHAR","user_id":"UBIGINT"} |
| click | {"element_id":"VARCHAR","timestamp":"VARCHAR","user_id":"UBIGINT","target_url":"VARCHAR"} |
| purchase | {"item_id":"UBIGINT","quantity":"UBIGINT","user_id":"UBIGINT","price":"DOUBLE","details":{"category":"VARCHAR","rating":"UBIGINT"}} |
| purchase | {"item_id":"UBIGINT","quantity":"UBIGINT","user_id":"UBIGINT","price":"DOUBLE","details":{"category":"VARCHAR"}} |
+------------+-------------------------------------------------------------------------------------------------------------------------------------+