Most of the major relational database management systems (RDBMSs) allow for a DISTINCT
clause to be included in our SQL queries.
We use the DISTINCT
keyword to return only unique rows. It eliminates duplicates from the results. If we have two or more rows with exactly the same data, we’ll only see one row in the results.
The DISTINCT
keyword is typically used in the SELECT
list of a query, but some RDBMSs allow for it to be placed in other parts of the query, such as the GROUP BY
clause and the WINDOW
clause.
Below are some basic examples that demonstrate how the SQL DISTINCT
clause works.
Sample Data
The examples on this page use the following data:
SELECT
VendorId,
ProductName,
ProductPrice
FROM Products
ORDER BY VendorId;
Result:
+----------+---------------------------------+--------------+ | VendorId | ProductName | ProductPrice | +----------+---------------------------------+--------------+ | 1001 | Left handed screwdriver | 25.99 | | 1001 | Right handed screwdriver | 25.99 | | 1001 | Long Weight (blue) | 14.75 | | 1001 | Long Weight (green) | 11.99 | | 1002 | Sledge Hammer | 33.49 | | 1003 | Chainsaw | 245.00 | | 1003 | Straw Dog Box | 55.99 | | 1003 | Hammock | 10.00 | | 1004 | Bottomless Coffee Mugs (4 Pack) | 10.00 | | 1004 | Tea Pot | 12.45 | +----------+---------------------------------+--------------+ 10 rows in set (0.00 sec)
Simple DISTINCT
Example
We can see from the above table that the VendorId
column contains several duplicate values. For example, 1001
appears four times, 1003
appears three times, and 1004
appears twice.
If we wanted a list of unique vendor IDs, without duplicates, we could do this:
SELECT
DISTINCT VendorId
FROM Products
ORDER BY VendorId;
Result:
+----------+ | VendorId | +----------+ | 1001 | | 1002 | | 1003 | | 1004 | +----------+ 4 rows in set (0.00 sec)
This time we only get four rows instead of the ten that we got in the previous query. This query returned just the unique vendor IDs, but no more.
Here’s what happens if we remove the DISTINCT
keyword:
SELECT
VendorId
FROM Products
ORDER BY VendorId;
Result:
+----------+ | VendorId | +----------+ | 1001 | | 1001 | | 1001 | | 1001 | | 1002 | | 1003 | | 1003 | | 1003 | | 1004 | | 1004 | +----------+ 10 rows in set (0.00 sec)
This time we got ten rows. That’s because some vendor IDs were listed multiple times, due to the fact that we didn’t “de-dupe” them.
Adding Columns
In the above example we selected just one column, which meant that the DISTINCT
modifier was applied to just that column. If we add additional columns to the SELECT
list, then the DISTINCT
modifier will apply to all of those columns.
Example:
SELECT
DISTINCT VendorId, ProductPrice
FROM Products
ORDER BY VendorId;
Result:
+----------+--------------+ | VendorId | ProductPrice | +----------+--------------+ | 1001 | 11.99 | | 1001 | 14.75 | | 1001 | 25.99 | | 1002 | 33.49 | | 1003 | 10.00 | | 1003 | 55.99 | | 1003 | 245.00 | | 1004 | 10.00 | | 1004 | 12.45 | +----------+--------------+ 9 rows in set (0.00 sec)
This time nine rows are returned. Only one row was de-duped. That’s because there was only one instance of a duplicate. In this case, vendor 1001
has two products with a price of 25.99
. Therefore, that combination is shown just once in the result.
Here’s what I mean:
SELECT
VendorId, ProductName, ProductPrice
FROM Products
WHERE ProductPrice = 25.99;
Result:
+----------+--------------------------+--------------+ | VendorId | ProductName | ProductPrice | +----------+--------------------------+--------------+ | 1001 | Left handed screwdriver | 25.99 | | 1001 | Right handed screwdriver | 25.99 | +----------+--------------------------+--------------+ 2 rows in set (0.00 sec)
We can see that vendor 1001
has two different products priced at 25.99
.
The DISTINCT ON
Option
PostgreSQL allows us to add an ON
to our DISTINCT
clause so that we can specify which column/s to apply the DISTINCT
clause to.
When we use this option, we use parentheses to specify the expression/s for which to apply the modifier to. It then keeps only the first row of each set of rows where the given expressions evaluate to equal.
Example:
SELECT
DISTINCT ON (VendorId) VendorId, ProductPrice
FROM Products
ORDER BY VendorId, ProductPrice;
Result:
vendorid | productprice ----------+-------------- 1001 | 11.99 1002 | 33.49 1003 | 10.00 1004 | 10.00 (4 rows)
We can change which values are returned within each vendor by modifying the ORDER BY
clause.
For example:
SELECT
DISTINCT ON (VendorId) VendorId, ProductPrice
FROM Products
ORDER BY VendorId, ProductPrice;
Result:
vendorid | productprice ----------+-------------- 1001 | 25.99 1002 | 33.49 1003 | 245.00 1004 | 12.45 (4 rows)
Changing the Column Order
Here’s the same example, but with the columns in a different order:
SELECT
DISTINCT ProductPrice, VendorId
FROM Products
ORDER BY ProductPrice;
Result:
+--------------+----------+ | ProductPrice | VendorId | +--------------+----------+ | 10.00 | 1003 | | 10.00 | 1004 | | 11.99 | 1001 | | 12.45 | 1004 | | 14.75 | 1001 | | 25.99 | 1001 | | 33.49 | 1002 | | 55.99 | 1003 | | 245.00 | 1003 | +--------------+----------+ 9 rows in set (0.00 sec)
I also sorted the results by a different column. Basically, changing the order of the columns didn’t change which rows were returned. We still eliminated duplicate rows, based on the columns that were being returned.
Using a Different Column
Whether or not any rows are de-duped depends entirely on the columns being returned and whether or they result in duplicate data in two or more rows.
Here’s what happens when I swap the ProductPrice
column for the ProductName
column:
SELECT
DISTINCT VendorId, ProductName
FROM Products
ORDER BY VendorId;
Result:
+----------+---------------------------------+ | VendorId | ProductName | +----------+---------------------------------+ | 1001 | Left handed screwdriver | | 1001 | Long Weight (blue) | | 1001 | Long Weight (green) | | 1001 | Right handed screwdriver | | 1002 | Sledge Hammer | | 1003 | Chainsaw | | 1003 | Hammock | | 1003 | Straw Dog Box | | 1004 | Bottomless Coffee Mugs (4 Pack) | | 1004 | Tea Pot | +----------+---------------------------------+ 10 rows in set (0.00 sec)
This time nothing was de-duped. All ten rows were returned. That’s because all of the product names are unique. This uniqueness meant that no two rows could contain exactly the same data. Therefore, the DISTINCT
keyword had no effect – we got the same results that we’d get if we removed it altogether.
Let’s remove the DISTINCT
keyword to demonstrate this:
SELECT
VendorId, ProductName
FROM Products
ORDER BY VendorId;
Result:
+----------+---------------------------------+ | VendorId | ProductName | +----------+---------------------------------+ | 1001 | Left handed screwdriver | | 1001 | Right handed screwdriver | | 1001 | Long Weight (blue) | | 1001 | Long Weight (green) | | 1002 | Sledge Hammer | | 1003 | Chainsaw | | 1003 | Straw Dog Box | | 1003 | Hammock | | 1004 | Bottomless Coffee Mugs (4 Pack) | | 1004 | Tea Pot | +----------+---------------------------------+ 10 rows in set (0.00 sec)
As expected, same result.
Counting Unique Values
We can use the DISTINCT
keyword inside the COUNT()
function in order to get a count of the unique values, without having to return the actual data itself.
Example:
SELECT
COUNT(VendorId) AS "Total",
COUNT(DISTINCT VendorId) AS "Unique"
FROM Products;
Result:
+-------+--------+ | Total | Unique | +-------+--------+ | 10 | 4 | +-------+--------+
Here, I used UNIQUE
in the first column but removed it in the second column. Therefore we can see both the total number of rows, and the number of unique rows.
The ALL
Keyword
When we use the DISTINCT
keyword, we’re overriding the default behaviour of the SQL query, which is to return all rows, regardless of duplicates.
This is because SQL queries implicitly use the ALL
keyword by default.
We can explicitly use ALL
instead of DISTINCT
in order to return all rows. Therefore we could modify our previous query as follows:
SELECT
COUNT(ALL VendorId) AS "Total",
COUNT(DISTINCT VendorId) AS "Unique"
FROM Products;
Result:
+-------+--------+ | Total | Unique | +-------+--------+ | 10 | 4 | +-------+--------+
So all I did was add the ALL
keyword to the first COUNT()
function. We got the same result that we got without this keyword. That’s because ALL
is the default behaviour.
Synonyms
Some RDBMSs allow for synonyms to be used instead of the DISTINCT
keyword.
For example, in MySQL we can use DISTINCTROW
instead of DISTINCT
if we prefer. And in Oracle Database we can use UNIQUE
as an alternative to DISTINCT
.
Here’s an example of using DISTINCTROW
in MySQL:
SELECT
DISTINCTROW VendorId
FROM Products
ORDER BY VendorId;
Result:
+----------+ | VendorId | +----------+ | 1001 | | 1002 | | 1003 | | 1004 | +----------+ 4 rows in set (0.00 sec)
As alluded to, it returns the same result as DISTINCT
, because it’s a synonym for DISTINCT
. That said, it’s probably a good idea to stick with using DISTINCT
unless you have reason not to.