Sometimes when we run a SQL query, we only want to see unique rows. But depending on the columns we’re selecting, we might end up with duplicate rows. And this could happen without even knowing it, especially with large data sets.
But it doesn’t have to be this way.
Fortunately most SQL databases provide us with an easy way to remove duplicates.
The DISTINCT
Clause
The most common way to remove duplicate rows from our query results is to use the DISTINCT
clause. The simplest way to use this is with the DISTINCT
keyword at the start of the SELECT
list.
Suppose we have a table like this:
SELECT * FROM Dogs;
Result:
+-------+-----------+----------+-----+ | DogId | FirstName | LastName | Age | +-------+-----------+----------+-----+ | 1 | Bark | Smith | 10 | | 2 | Wag | Frazier | 7 | | 3 | Bark | Jones | 10 | +-------+-----------+----------+-----+
This table contains two dogs named Bark, and coincidentally, they are both aged 10.
Now let’s run a query that selects just two columns from that table:
SELECT FirstName, Age
FROM Dogs
ORDER BY FirstName;
Result:
+-----------+-----+ | FirstName | Age | +-----------+-----+ | Bark | 10 | | Bark | 10 | | Wag | 7 | +-----------+-----+
Our query has resulted in duplicate rows; the first two rows have the same values in both columns (Bark, aged 10).
Having duplicate rows isn’t necessarily a bad thing. However, if we don’t want them, we can use the DISTINCT
clause to return just unique rows. In other words, it will remove any redundant duplicate rows, and return just one row, no matter how many duplicates there are.
Here’s how to use the DISTINCT
clause to return unique rows:
SELECT DISTINCT FirstName, Age
FROM Dogs
ORDER BY FirstName;
Result:
+-----------+-----+ | FirstName | Age | +-----------+-----+ | Bark | 10 | | Wag | 7 | +-----------+-----+
So, all we did was prefix the SELECT
list with the DISTINCT
keyword. Now only unique rows are returned by our query (we only get one row that contains Bark aged 10).
The GROUP BY
Clause
Another method we can use to eliminate duplicates is to use the GROUP BY
clause. The GROUP BY
clause enables us to group our results by a given column or columns.
Example:
SELECT FirstName, Age
FROM Dogs
GROUP BY FirstName, Age
ORDER BY FirstName, Age;
Result:
+-----------+-----+ | FirstName | Age | +-----------+-----+ | Bark | 10 | | Wag | 7 | +-----------+-----+
Same result as the previous example. The GROUP BY
clause had the effect of removing the redundant duplicate row from our query results.
One benefit of using this method is that we can incorporate the COUNT()
function into our SELECT
list to return the number of duplicate rows, if any:
SELECT FirstName, Age, COUNT(*)
FROM Dogs
GROUP BY FirstName, Age
ORDER BY FirstName, Age;
Result:
+-----------+-----+----------+ | FirstName | Age | COUNT(*) | +-----------+-----+----------+ | Bark | 10 | 2 | | Wag | 7 | 1 | +-----------+-----+----------+
The DISTINCT ON()
Option
At least one RDBMS (PostgreSQL) allows us to use an ON
option with our DISTINCT
clause. The ON
option allows us to specify which rows to evaluate when removing duplicates. It keeps only the first row of each set of rows where the given expressions evaluate to equal.
Here’s an example:
SELECT DISTINCT ON(FirstName) FirstName, LastName, Age
FROM Dogs
ORDER BY FirstName;
Result:
firstname | lastname | age -----------+----------+----- Bark | Smith | 10 Wag | Frazier | 7
In this example, we used ON(FirstName)
to dedupe rows that have the same value in the FirstName
column. This allowed us to return columns that don’t necessarily contain duplicate values. Specifically, the we could include the LastName
column in our query, even though that column contains unique values, while still deduping the rows based on the FirstName
column.
To illustrate what I mean, let’s remove ON(FirstName)
from the query:
SELECT DISTINCT FirstName, LastName, Age
FROM Dogs
ORDER BY FirstName;
Result:
firstname | lastname | age -----------+----------+----- Bark | Smith | 10 Bark | Jones | 10 Wag | Frazier | 7
This time we got all three rows, because they are all unique when all columns are taken into account. When using this syntax, the DISTINCT
clause takes into account all three rows.