The SQL DISTINCT
clause allows us to remove redundant duplicate rows from our query results. For example if two rows contain the same values, then only one of them is returned.
But how does DISTINCT
handle NULL values?
Let’s find out.
The Definition of Distinct
ISO/IEC 9075:1992 defines distinct as follows:
Two values are said to be not distinct if either: both are the null value, or they compare equal according to Subclause 8.2, “<comparison predicate>”. Otherwise they are distinct. Two rows (or partial rows) are distinct if at least one of their pairs of respective values is distinct. Otherwise they are not distinct. The result of evaluating whether or not two values or two rows are distinct is never unknown.
Therefore, two NULL values are not distinct. Or put another way, two NULL values are treated as duplicates.
So basically, NULL values are treated like any other value when it comes to evaluating them for distinctness.
The DISTINCT
clause removes redundant duplicates. The above SQL standard defines redundant duplicates as follows:
All except one of any multiset of duplicate values or rows.
So if we’re returning just one column, and that column contains NULL
in two or more rows, only one row will be returned.
Example
Suppose we have the following table:
SELECT * FROM Dogs;
Result:
+-------+-----------+----------+------+ | DogId | FirstName | LastName | Age | +-------+-----------+----------+------+ | 1 | Bark | Smith | 7 | | 2 | Wag | Conner | NULL | | 3 | Bark | NULL | NULL | | 4 | Bark | Jones | NULL | | 5 | Fetch | Fraser | 10 | | 6 | Bite | Tohill | 10 | +-------+-----------+----------+------+
We can see that the Age
column has NULL
in three rows. Therefore, if we do a SELECT DISTINCT
on just that column, we would expect only one of the NULL
rows to be returned.
Let’s see:
SELECT DISTINCT Age FROM Dogs;
Result:
+------+ | Age | +------+ | 7 | | NULL | | 10 | +------+
As expected, only one row contains NULL
. Likewise for the value of 10
(which is present in two rows in the underlying table).
Multiple Columns
When we include more columns in our SELECT
list, DISTINCT
will also evaluate those columns. Therefore, it is still possible to get multiple NULL
s returned, depending on what other columns are selected.
For example:
SELECT DISTINCT
FirstName,
Age
FROM Dogs;
Result:
+-----------+------+ | FirstName | Age | +-----------+------+ | Bark | 7 | | Wag | NULL | | Bark | NULL | | Fetch | 10 | | Bite | 10 | +-----------+------+
This time two NULL
s are returned. That’s because we’ve added the FirstName
column to the SELECT
list, which means that both columns need to be evaluated for their distinctness (and only distinct rows are returned).
Using DISTINCT
with Aggregate Functions
You’ll need to be aware of the following if you ever use DISTINCT
with an aggregate function.
Even though the DISTINCT
clause returns at least one row that’s NULL
, we might get a different result if we use it with an aggregate function.
Example:
SELECT COUNT(DISTINCT Age) FROM Dogs;
Result:
+---------------------+ | COUNT(DISTINCT Age) | +---------------------+ | 2 | +---------------------+
Here, the COUNT()
function tells us that there are only two distinct rows, even though we saw three from an earlier example.
Just to be sure, here it is without COUNT()
:
SELECT DISTINCT Age FROM Dogs;
Result:
+------+ | Age | +------+ | 7 | | NULL | | 10 | +------+ 3 rows in set (0.00 sec)
The reason for the discrepancy is that aggregate functions like COUNT(column_name)
eliminate NULL values when evaluating their argument. In other words, COUNT(column_name)
only counts up the non-NULL
values.
Bear in mind, if we use COUNT(*)
then NULL values may be included in the total.
To demonstrate this, suppose we create a table with a single column and populate that column with NULL
in all rows:
CREATE TABLE t1 (
c1 VARCHAR(50)
);
INSERT INTO t1 VALUES (NULL);
INSERT INTO t1 VALUES (NULL);
INSERT INTO t1 VALUES (NULL);
Now let’s run a query with COUNT(column_name)
against that table:
SELECT COUNT(c1) FROM t1;
Result:
+-----------+ | COUNT(c1) | +-----------+ | 0 | +-----------+
And now let’s use COUNT(*)
instead:
SELECT COUNT(*) FROM t1;
Result:
+----------+ | COUNT(*) | +----------+ | 3 | +----------+
We got different results depending on whether we passed the column name or the asterisk (*
) wildcard.