Understanding DuckDB’s APPROX_COUNT_DISTINCT() Function

DuckDB is an in-process SQL OLAP database management system designed for fast analytical queries. One of its handy features is the approx_count_distinct() function, which provides an approximate count of distinct values in a column. This function is particularly useful when working with large datasets where an exact count would be computationally expensive.

In this article, we’ll explore how approx_count_distinct() works, its benefits, and how to use it with some simple examples.

What is approx_count_distinct()?

In DuckDB, approx_count_distinct() is an aggregation function that estimates the number of distinct values in a column. Unlike the exact count(DISTINCT column_name), which calculates the precise number of unique values, approx_count_distinct() uses probabilistic algorithms to provide an approximate result. This approximation is often sufficient for analytical purposes and can significantly reduce query execution time.

The function is based on the HyperLogLog algorithm, which is a widely used probabilistic data structure for estimating the cardinality of large datasets. The trade-off for using this approximation is a small margin of error, but the performance gains can be substantial, especially when dealing with massive datasets.

Benefits of Using approx_count_distinct()

The approx_count_distinct() function provides several benefits over count(DISTINCT). In particular:

  • Performance: The approx_count_distinct() function is much faster than count(DISTINCT) for large datasets because it avoids the need to track every unique value explicitly.
  • Scalability: It is well-suited for big data scenarios where exact counts are computationally expensive or impractical.
  • Memory Efficiency: The HyperLogLog algorithm uses a fixed amount of memory, making it memory-efficient even for datasets with billions of distinct values.

Example

To demonstrate the usage of approx_count_distinct(), let’s create a sample table and populate it with data. We’ll then run a query to compare the results of approx_count_distinct() with the exact count(DISTINCT).

Step 1: Create and Populate a Sample Table

First, let’s create a table named sales that stores sales transactions. Each transaction includes a unique transaction ID, a product ID, and the quantity sold.

-- Create the sales table
CREATE TABLE sales (
    transaction_id INT,
    product_id INT,
    quantity INT
);

-- Populate the sales table with 100,000 rows
INSERT INTO sales (transaction_id, product_id, quantity)
SELECT
    row_number() OVER () AS transaction_id,
    floor(random() * 100000) + 1 AS product_id,
    floor(random() * 10) + 1 AS quantity
FROM generate_series(1, 100000);

Step 2: Compare approx_count_distinct() with count(DISTINCT)

Now that we have our sample data, let’s compare the results of approx_count_distinct() and count(DISTINCT), along with count() without DISTINCT:

SELECT 
    count(product_id) AS total_count,
    count(DISTINCT product_id) AS exact_distinct_count,
    approx_count_distinct(product_id) AS approx_distinct_count
FROM sales;

Result:

+-------------+----------------------+-----------------------+
| total_count | exact_distinct_count | approx_distinct_count |
+-------------+----------------------+-----------------------+
| 100000 | 63128 | 62372 |
+-------------+----------------------+-----------------------+

We got a different result for each column.

  • The total_count column provides us with a full count of all rows, including distinct and non-distinct values. So we would expect it to return 100,000, as that’s how many rows we inserted into the table.
  • The exact_distinct_count column returns an exact count of the distinct values.
  • The approx_distinct_count column returns an approximate count of the distinct values. The amount we got here is slightly different to the amount we got in the exact_distinct_count column, but it’s not far off.

If the table contained millions or billions of rows, we could reasonably expect approx_distinct_count() to perform much better than count(DISTINCT).

When to Use approx_count_distinct()

The approx_count_distinct() function is ideal in the following scenarios:

  • Large Datasets: When working with datasets containing millions or billions of rows, the performance benefits of approx_count_distinct() could often outweigh the small margin of error.
  • Exploratory Analysis: During initial data exploration, an approximate count is often sufficient to understand the data distribution.
  • Real-Time Analytics: In real-time or near-real-time analytics, where speed is critical, approx_count_distinct() provides quick insights without the computational overhead of exact counts.

Conclusion

The approx_count_distinct() function in DuckDB is a handy tool for estimating the number of distinct values in a column. By leveraging the HyperLogLog algorithm, it provides a fast and memory-efficient way to approximate distinct counts, making it an excellent choice for large-scale data analysis. While it introduces a small margin of error, the performance gains are often worth the trade-off, especially in big data scenarios.