If you’ve ever used the UNION
clause in a SQL query, chances are you used UNION ALL
or just UNION
by itself. But some RDBMSs also accept a UNION DISTINCT
option.
The UNION DISTINCT
option is basically the equivalent of UNION
by itself. That is, it removes redundant duplicate rows.
So in other words, the DISTINCT
option is a way for us to explicitly specify the default behaviour of removing duplicates.
RDBMSs that support the UNION DISTINCT
option include PostgreSQL, MySQL, and MariaDB.
Example
Suppose we run the following query:
SELECT CatName FROM Cats;
Result:
catname --------- Fluffy Purr Meow Scratch (4 rows)
And then we run a second query, against a different table, like this:
SELECT DogName FROM Dogs;
Result:
dogname --------- Bark Wag Bark Fluffy (4 rows)
We just ran two queries; the first one against the Cats
table and the second query against the Dogs
table.
We can use UNION DISTINCT
to combine the results of both queries into a single result set, while eliminating redundant duplicate rows.
Like this:
SELECT CatName FROM Cats
UNION DISTINCT
SELECT DogName FROM Dogs;
Result:
catname --------- Meow Wag Bark Scratch Fluffy Purr (6 rows)
We can see that duplicate rows have been removed; Fluffy
appears just once and so does Bark
.
To be more specific, in the original tables the name Fluffy
appears once in the Cats
table and once in the Dogs
table. But it appears just once in the combined results after applying UNION DISTINCT
.
And the name Bark
appears twice in the Dogs
table but just once in the combined results.
Omitting the DISTINCT
Keyword
When we use the DISTINCT
keyword, we’re explicitly removing duplicates. This produces the same effect that we’d get if we’d simply omitted the DISTINCT
keyword.
In other words, we can use UNION
to achieve the same result:
SELECT CatName FROM Cats
UNION
SELECT DogName FROM Dogs;
Result:
catname --------- Meow Wag Bark Scratch Fluffy Purr (6 rows)
The default behaviour of UNION
is to remove duplicates. So when we use UNION
without the DISTINCT
(or ALL
) keyword, redundant duplicates are automatically removed.
So if your RDBMS doesn’t support the DISTINCT
keyword, you’re not missing out on anything, other than the ability to remove duplicates explicitly rather than implicitly.
The ALL
Keyword
If you don’t want to remove duplicates, use the ALL
keyword:
SELECT CatName FROM Cats
UNION ALL
SELECT DogName FROM Dogs;
Result:
catname --------- Fluffy Purr Meow Scratch Bark Wag Bark Fluffy (8 rows)
It’s likely that your RDBMS supports the ALL
keyword, even if it doesn’t support the DISTINCT
keyword. As far as I’m aware, any RDBMS that supports the UNION
clause supports the ALL
option.
The UNION DISTINCT
Option vs DISTINCT
Clause
We also have the option of using a DISTINCT
clause in our SELECT
queries. This is not necessarily the same as using the DISTINCT
option with the UNION ALL
clause.
For example, we could use the DISTINCT
clause against each table, like this:
SELECT DISTINCT CatName FROM Cats
UNION ALL
SELECT DISTINCT DogName FROM Dogs;
Result:
catname --------- Fluffy Meow Purr Scratch Fluffy Bark Wag (7 rows)
But this only removes duplicates from each table before applying UNION ALL
.
In this case we do actually get duplicate values. We get two cases of Fluffy
; one from the Cats
table and one from the Dogs
table.
Basically, applying the DISTINCT
clause like this merely ensures that we eliminate duplicates from each table. But it doesn’t ensure that duplicates are removed from the combined result set.
However, we could use a sub-query to mimic the effect of UNION DISTINCT
, while applying UNION ALL
:
SELECT DISTINCT * FROM (
SELECT CatName FROM Cats
UNION ALL
SELECT DogName FROM Dogs
) AS Result;
Result:
catname --------- Fluffy Meow Bark Purr Scratch Wag (6 rows)
This is basically a long winded way to do a simple UNION
or UNION DISTINCT
. However, it could be useful to look at the query plan to see how each variation affects the performance. The result may vary from query to query, and so it’s at least nice to have this option up your sleeve.