Exploring the COUNT() Function in DuckDB

In SQL databases, COUNT() is a commonly used aggregation function that returns the number of rows in a group. In this article, I run some examples of the COUNT() function in DuckDB. DuckDB is a high-performance analytical database system that’s designed to be fast, reliable, portable, and easy to use.

Syntax

We can use the following variations when using the COUNT() function:

count()
count(*)
count(arg)

The first two return the number of rows in a group. They’re basically synonyms – they do the same thing.

The third variation returns the number of non-null values in arg.

Basic COUNT() Usage

Let’s start by creating some sample data to work with:

CREATE TABLE customers (
    customer_id INTEGER,
    name VARCHAR,
    country VARCHAR,
    age INTEGER
);

INSERT INTO customers VALUES
    (1, 'Jacy Brittle', 'USA', 35),
    (2, 'Sofia Garcia', 'Spain', 28),
    (3, 'Wei Li', 'China', 42),
    (4, 'Sal Chen', 'USA', 31),
    (5, 'Ahmed Hassan', 'Egypt', 39);

Let’s now use the COUNT() function to get the number of customers:

SELECT COUNT() AS total_customers
FROM customers;

Result:

+-----------------+
| total_customers |
+-----------------+
| 5 |
+-----------------+

As mentioned, we have the option of using the asterisk wildcard (*). The function produces the same results whether we use it or not:

SELECT 
    COUNT(),
    COUNT(*)    
FROM customers;

Result:

+--------------+--------------+
| count_star() | count_star() |
+--------------+--------------+
| 5 | 5 |
+--------------+--------------+

We can see that both variations returned the same result. Interestingly, the result uses count_star() as the column name for both variations.

The COUNT(arg) Variation

We can also specify column names when using the COUNT() function. When we do this, the function returns the number of non-null values in the column.

Let’s create another table:

-- Create a table with some NULL values
CREATE TABLE orders (
    order_id INTEGER,
    customer_id INTEGER,
    amount DECIMAL,
    discount DECIMAL
);

INSERT INTO orders VALUES
    (1, 1, 100.00, NULL),
    (2, 1, 75.50, 10.00),
    (3, 2, 200.00, 25.00),
    (4, 3, 150.00, NULL),
    (5, 2, 300.00, 50.00);

Now let’s find out how many orders have a discount:

SELECT 
    COUNT(discount) AS orders_with_discount
FROM orders;

Result:

+----------------------+
| orders_with_discount |
+----------------------+
| 3 |
+----------------------+

This query simply counted up the non-null values in the discount column, which tells us that three orders received a discount.

The DISTINCT Clause

The DISTINCT clause is a popular clause in SQL that allows us to narrow the results down to just the unique values. DuckDB supports this clause, and so we can go ahead and use it with the COUNT() function to get the number of unique customers in the table.

Example:

SELECT 
    COUNT(customer_id) AS all_customer_rows,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM orders;

Result:

+-------------------+------------------+
| all_customer_rows | unique_customers |
+-------------------+------------------+
| 5 | 3 |
+-------------------+------------------+

We can see that there are five rows with (non-null) customer_id values, but only three of those are unique (due to some customer_ids being listed more than once). The reason these customer_ids are listed multiple times is because some customers have more than one order. The DISTINCT clause provides a quick and easy way to “de-dupe” these customer_id rows.

COUNT() with Filtering

We can use COUNT() to count the number of rows that would be returned from a filtered query.

Suppose we have a transactions table:

CREATE TABLE transactions (
    transaction_id INTEGER,
    customer_id INTEGER,
    amount DECIMAL,
    transaction_date DATE
);

INSERT INTO transactions VALUES
    (1, 1, 100.00, '2024-01-01'),
    (2, 1, 75.50, '2024-01-02'),
    (3, 2, 200.00, '2024-01-02'),
    (4, 3, 150.00, '2024-01-03'),
    (5, 2, 300.00, '2024-01-03'),
    (6, 1, 50.00, '2024-01-03'),
    (7, 4, 125.00, '2024-01-04');

The next two examples run filtered queries against this table.

COUNT() with a WHERE Clause

This query uses a WHERE clause to filter it to just those rows where the amount of the transaction is over 100:

SELECT 
    COUNT(*) AS high_value_transactions
FROM transactions
WHERE amount > 100;

Result:

+-------------------------+
| high_value_transactions |
+-------------------------+
| 4 |
+-------------------------+

COUNT() with a HAVING Clause

And here’s an example with a HAVING clause:

SELECT 
    customer_id,
    COUNT(*) AS transaction_count
FROM transactions
GROUP BY customer_id
HAVING COUNT(*) > 1;

Result:

+-------------+-------------------+
| customer_id | transaction_count |
+-------------+-------------------+
| 1 | 3 |
| 2 | 2 |
+-------------+-------------------+

This query returns a list of customers who have more than one transaction, along with their respective transaction count.

COUNT() with GROUP BY

Let’s create and populate a products table:

-- Create a products table
CREATE TABLE products (
    product_id INTEGER,
    category VARCHAR,
    supplier_id INTEGER,
    price DECIMAL
);

INSERT INTO products VALUES
    (1, 'Electronics', 1, 499.99),
    (2, 'Electronics', 2, 299.99),
    (3, 'Clothing', 1, 49.99),
    (4, 'Clothing', 3, 79.99),
    (5, 'Electronics', 2, 799.99),
    (6, 'Furniture', 3, 599.99);

Here’s an example of using COUNT() in a query with a GROUP BY clause:

SELECT 
    category,
    COUNT(*) AS product_count,
    COUNT(*) FILTER (WHERE price > 400) AS expensive_products
FROM products
GROUP BY category
ORDER BY category;

Result:

+-------------+---------------+--------------------+
| category | product_count | expensive_products |
+-------------+---------------+--------------------+
| Clothing | 2 | 0 |
| Electronics | 3 | 2 |
| Furniture | 1 | 1 |
+-------------+---------------+--------------------+

And here’s one that includes the supplier ID:

SELECT 
    category,
    supplier_id,
    COUNT(*) AS product_count,
    COUNT(*) FILTER (WHERE price > 500) AS expensive_products
FROM products
GROUP BY category, supplier_id
ORDER BY category, supplier_id;

Result:

+-------------+-------------+---------------+--------------------+
| category | supplier_id | product_count | expensive_products |
+-------------+-------------+---------------+--------------------+
| Clothing | 1 | 1 | 0 |
| Clothing | 3 | 1 | 0 |
| Electronics | 1 | 1 | 0 |
| Electronics | 2 | 2 | 1 |
| Furniture | 3 | 1 | 1 |
+-------------+-------------+---------------+--------------------+

A More Comprehensive Analysis

Suppose we have a sales table:

-- Create a sales table
CREATE TABLE sales (
    sale_id INTEGER,
    product_id INTEGER,
    customer_id INTEGER,
    sale_date DATE,
    quantity INTEGER,
    status VARCHAR
);

INSERT INTO sales VALUES
    (1, 1, 1, '2024-01-01', 2, 'Completed'),
    (2, 2, 2, '2024-01-01', 1, 'Pending'),
    (3, 1, 3, '2024-01-02', 3, 'Completed'),
    (4, 3, 1, '2024-01-02', 1, 'Cancelled'),
    (5, 2, 4, '2024-01-03', 2, 'Completed');

Here’s a query that combines multiple COUNT() operations to provide a comprehensive analysis:

SELECT 
    sale_date,
    COUNT(*) AS total_sales,
    COUNT(DISTINCT customer_id) AS unique_customers,
    COUNT(*) FILTER (WHERE status = 'Completed') AS completed_sales,
    COUNT(*) FILTER (WHERE status = 'Cancelled') AS cancelled_sales,
    COUNT(*) FILTER (WHERE quantity > 1) AS multi_item_sales
FROM sales
GROUP BY sale_date
ORDER BY sale_date;

Result:

+------------+-------------+------------------+-----------------+-----------------+------------------+
| sale_date | total_sales | unique_customers | completed_sales | cancelled_sales | multi_item_sales |
+------------+-------------+------------------+-----------------+-----------------+------------------+
| 2024-01-01 | 2 | 2 | 1 | 0 | 1 |
| 2024-01-02 | 2 | 2 | 1 | 1 | 1 |
| 2024-01-03 | 1 | 1 | 1 | 0 | 1 |
+------------+-------------+------------------+-----------------+-----------------+------------------+