How the DISTINCT Clause Handles NULL Values in SQL

The SQL DISTINCT clause allows us to remove redundant duplicate rows from our query results. For example if two rows contain the same values, then only one of them is returned.

But how does DISTINCT handle NULL values?

Let’s find out.

The Definition of Distinct

ISO/IEC 9075:1992 defines distinct as follows:

Two values are said to be not distinct if either: both are the null value, or they compare equal according to Subclause 8.2, “<comparison predicate>”. Otherwise they are distinct. Two rows (or partial rows) are distinct if at least one of their pairs of respective values is distinct. Otherwise they are not distinct. The result of evaluating whether or not two values or two rows are distinct is never unknown.

Therefore, two NULL values are not distinct. Or put another way, two NULL values are treated as duplicates.

So basically, NULL values are treated like any other value when it comes to evaluating them for distinctness.

The DISTINCT clause removes redundant duplicates. The above SQL standard defines redundant duplicates as follows:

All except one of any multiset of duplicate values or rows.

So if we’re returning just one column, and that column contains NULL in two or more rows, only one row will be returned.

Example

Suppose we have the following table:

SELECT * FROM Dogs;

Result:

+-------+-----------+----------+------+
| DogId | FirstName | LastName | Age  |
+-------+-----------+----------+------+
|     1 | Bark      | Smith    |    7 |
|     2 | Wag       | Conner   | NULL |
|     3 | Bark      | NULL     | NULL |
|     4 | Bark      | Jones    | NULL |
|     5 | Fetch     | Fraser   |   10 |
|     6 | Bite      | Tohill   |   10 |
+-------+-----------+----------+------+

We can see that the Age column has NULL in three rows. Therefore, if we do a SELECT DISTINCT on just that column, we would expect only one of the NULL rows to be returned.

Let’s see:

SELECT DISTINCT Age FROM Dogs;

Result:

+------+
| Age  |
+------+
|    7 |
| NULL |
|   10 |
+------+

As expected, only one row contains NULL. Likewise for the value of 10 (which is present in two rows in the underlying table).

Multiple Columns

When we include more columns in our SELECT list, DISTINCT will also evaluate those columns. Therefore, it is still possible to get multiple NULLs returned, depending on what other columns are selected.

For example:

SELECT DISTINCT 
    FirstName,
    Age
FROM Dogs;

Result:

+-----------+------+
| FirstName | Age  |
+-----------+------+
| Bark      |    7 |
| Wag       | NULL |
| Bark      | NULL |
| Fetch     |   10 |
| Bite      |   10 |
+-----------+------+

This time two NULLs are returned. That’s because we’ve added the FirstName column to the SELECT list, which means that both columns need to be evaluated for their distinctness (and only distinct rows are returned).

Using DISTINCT with Aggregate Functions

You’ll need to be aware of the following if you ever use DISTINCT with an aggregate function.

Even though the DISTINCT clause returns at least one row that’s NULL, we might get a different result if we use it with an aggregate function.

Example:

SELECT COUNT(DISTINCT Age) FROM Dogs;

Result:

+---------------------+
| COUNT(DISTINCT Age) |
+---------------------+
|                   2 |
+---------------------+

Here, the COUNT() function tells us that there are only two distinct rows, even though we saw three from an earlier example.

Just to be sure, here it is without COUNT():

SELECT DISTINCT Age FROM Dogs;

Result:

+------+
| Age  |
+------+
|    7 |
| NULL |
|   10 |
+------+
3 rows in set (0.00 sec)

The reason for the discrepancy is that aggregate functions like COUNT(column_name) eliminate NULL values when evaluating their argument. In other words, COUNT(column_name) only counts up the non-NULL values.

Bear in mind, if we use COUNT(*) then NULL values may be included in the total.

To demonstrate this, suppose we create a table with a single column and populate that column with NULL in all rows:

CREATE TABLE t1 (
  c1 VARCHAR(50)
);

INSERT INTO t1 VALUES (NULL);
INSERT INTO t1 VALUES (NULL);
INSERT INTO t1 VALUES (NULL);

Now let’s run a query with COUNT(column_name) against that table:

SELECT COUNT(c1) FROM t1;

Result:

+-----------+
| COUNT(c1) |
+-----------+
|         0 |
+-----------+

And now let’s use COUNT(*) instead:

SELECT COUNT(*) FROM t1;

Result:

+----------+
| COUNT(*) |
+----------+
|        3 |
+----------+

We got different results depending on whether we passed the column name or the asterisk (*) wildcard.