A Quick Overview of UNION DISTINCT in SQL

If you’ve ever used the UNION clause in a SQL query, chances are you used UNION ALL or just UNION by itself. But some RDBMSs also accept a UNION DISTINCT option.

The UNION DISTINCT option is basically the equivalent of UNION by itself. That is, it removes redundant duplicate rows.

So in other words, the DISTINCT option is a way for us to explicitly specify the default behaviour of removing duplicates.

RDBMSs that support the UNION DISTINCT option include PostgreSQL, MySQL, and MariaDB.

Example

Suppose we run the following query:

SELECT CatName FROM Cats;

Result:

 catname 
---------
 Fluffy
 Purr
 Meow
 Scratch
(4 rows)

And then we run a second query, against a different table, like this:

SELECT DogName FROM Dogs;

Result:

 dogname 
---------
 Bark
 Wag
 Bark
 Fluffy
(4 rows)

We just ran two queries; the first one against the Cats table and the second query against the Dogs table.

We can use UNION DISTINCT to combine the results of both queries into a single result set, while eliminating redundant duplicate rows.

Like this:

SELECT CatName FROM Cats
UNION DISTINCT
SELECT DogName FROM Dogs;

Result:

 catname 
---------
 Meow
 Wag
 Bark
 Scratch
 Fluffy
 Purr
(6 rows)

We can see that duplicate rows have been removed; Fluffy appears just once and so does Bark.

To be more specific, in the original tables the name Fluffy appears once in the Cats table and once in the Dogs table. But it appears just once in the combined results after applying UNION DISTINCT.

And the name Bark appears twice in the Dogs table but just once in the combined results.

Omitting the DISTINCT Keyword

When we use the DISTINCT keyword, we’re explicitly removing duplicates. This produces the same effect that we’d get if we’d simply omitted the DISTINCT keyword.

In other words, we can use UNION to achieve the same result:

SELECT CatName FROM Cats
UNION
SELECT DogName FROM Dogs;

Result:

 catname 
---------
 Meow
 Wag
 Bark
 Scratch
 Fluffy
 Purr
(6 rows)

The default behaviour of UNION is to remove duplicates. So when we use UNION without the DISTINCT (or ALL) keyword, redundant duplicates are automatically removed.

So if your RDBMS doesn’t support the DISTINCT keyword, you’re not missing out on anything, other than the ability to remove duplicates explicitly rather than implicitly.

The ALL Keyword

If you don’t want to remove duplicates, use the ALL keyword:

SELECT CatName FROM Cats
UNION ALL
SELECT DogName FROM Dogs;

Result:

 catname 
---------
 Fluffy
 Purr
 Meow
 Scratch
 Bark
 Wag
 Bark
 Fluffy
(8 rows)

It’s likely that your RDBMS supports the ALL keyword, even if it doesn’t support the DISTINCT keyword. As far as I’m aware, any RDBMS that supports the UNION clause supports the ALL option.

The UNION DISTINCT Option vs DISTINCT Clause

We also have the option of using a DISTINCT clause in our SELECT queries. This is not necessarily the same as using the DISTINCT option with the UNION ALL clause.

For example, we could use the DISTINCT clause against each table, like this:

SELECT DISTINCT CatName FROM Cats
UNION ALL
SELECT DISTINCT DogName FROM Dogs;

Result:

 catname 
---------
 Fluffy
 Meow
 Purr
 Scratch
 Fluffy
 Bark
 Wag
(7 rows)

But this only removes duplicates from each table before applying UNION ALL.

In this case we do actually get duplicate values. We get two cases of Fluffy; one from the Cats table and one from the Dogs table.

Basically, applying the DISTINCT clause like this merely ensures that we eliminate duplicates from each table. But it doesn’t ensure that duplicates are removed from the combined result set.

However, we could use a sub-query to mimic the effect of UNION DISTINCT, while applying UNION ALL:

SELECT DISTINCT * FROM (
    SELECT CatName FROM Cats
    UNION ALL
    SELECT DogName FROM Dogs
) AS Result;

Result:

 catname 
---------
 Fluffy
 Meow
 Bark
 Purr
 Scratch
 Wag
(6 rows)

This is basically a long winded way to do a simple UNION or UNION DISTINCT. However, it could be useful to look at the query plan to see how each variation affects the performance. The result may vary from query to query, and so it’s at least nice to have this option up your sleeve.