Here are examples of using SQL to return duplicate rows when those rows have a primary key or other unique identifier column.
These queries work in most of the major RDBMSs, including SQL Server, Oracle, MySQL, MariaDB, PostgreSQL, and SQLite.
Sample Data
Suppose we have a table with the following data:
SELECT * FROM Dogs;
Result:
+-------+-----------+----------+ | DogId | FirstName | LastName | +-------+-----------+----------+ | 1 | Bark | Smith | | 2 | Bark | Smith | | 3 | Woof | Jones | | 4 | Ruff | Robinson | | 5 | Wag | Johnson | | 6 | Wag | Johnson | | 7 | Wag | Johnson | +-------+-----------+----------+
The duplicate rows share exactly the same values across all columns except for their primary key/unique ID column.
The primary key ensures that there are no duplicate rows, which is normally a good thing in RDBMSs. However, by definition this means that there are no duplicate rows. In our case, the primary key column is an incrementing number, and its value carries no meaning and is not significant. We therefore need to ignore that row if we want to find duplicates in the columns that are significant.
Count the Duplicates
Here’s a query that counts the duplicate rows. More specifically, it groups the rows by their relevant columns, then counts how many are in each group. Groups containing more than one are duplicates.
SELECT
FirstName,
LastName,
COUNT(*) AS Count
FROM Dogs
GROUP BY FirstName, LastName;
Result:
+-------------+------------+---------+ | FirstName | LastName | Count | |-------------+------------+---------| | Wag | Johnson | 3 | | Woof | Jones | 1 | | Ruff | Robinson | 1 | | Bark | Smith | 2 | +-------------+------------+---------+
We used SQL’s GROUP BY
clause in our SELECT
statement, and then the COUNT()
function to count how many rows are in each group.
We can see that Bark Smith occurs twice, and Wag Johnson three times.
In this case, we were able to ignore the primary key column by omitting it from our query.
Move Duplicates to the Top
We can use the ORDER BY
clause to order the results by the count in descending order. This makes the duplicate rows appear first.
SELECT
FirstName,
LastName,
COUNT(*) AS Count
FROM Dogs
GROUP BY FirstName, LastName
ORDER BY COUNT(*) DESC;
Result:
+-------------+------------+---------+ | FirstName | LastName | Count | |-------------+------------+---------| | Wag | Johnson | 3 | | Bark | Smith | 2 | | Woof | Jones | 1 | | Ruff | Robinson | 1 | +-------------+------------+---------+
List Only the Duplicates
We can use SQL’s HAVING
clause to exclude non-duplicates.
SELECT
FirstName,
LastName,
COUNT(*) AS Count
FROM Dogs
GROUP BY FirstName, LastName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;
Result:
+-------------+------------+---------+ | FirstName | LastName | Count | |-------------+------------+---------| | Wag | Johnson | 3 | | Bark | Smith | 2 | +-------------+------------+---------+
Provide a Running Count of Duplicate Rows
If you need to list out all duplicate rows, the above examples won’t cut it.
But not a problem. We can use the ROW_NUMBER()
function instead to do just that:
SELECT
DogId,
FirstName,
LastName,
ROW_NUMBER() OVER (
PARTITION BY FirstName, LastName
ORDER BY FirstName, LastName
) AS rn
FROM Dogs;
Result:
+---------+-------------+------------+------+ | DogId | FirstName | LastName | rn | |---------+-------------+------------+------| | 1 | Bark | Smith | 1 | | 2 | Bark | Smith | 2 | | 4 | Ruff | Robinson | 1 | | 5 | Wag | Johnson | 1 | | 6 | Wag | Johnson | 2 | | 7 | Wag | Johnson | 3 | | 3 | Woof | Jones | 1 | +---------+-------------+------------+------+
Using the PARTITION
clause allows us to partition the results based on the columns we’re trying to find duplicates in. Whenever duplicates are found, they are numbered as such. For example, if two rows are duplicates, they’re numbered 1 and 2. If three rows are duplicates, they’re numbered 1, 2, and 3 (as we can see with the dog called Bark).
In the above example, we can see the primary key column on the left, and the running count of the duplicates on the right.
And to list out just the excess duplicates, we can use the above query as a common table expression:
WITH cte AS
(
SELECT
DogId,
FirstName,
LastName,
ROW_NUMBER() OVER (
PARTITION BY FirstName, LastName
ORDER BY FirstName, LastName
) AS rn
FROM Dogs
)
SELECT * FROM cte WHERE rn <> 1;
Result:
+---------+-------------+------------+------+ | DogId | FirstName | LastName | rn | |---------+-------------+------------+------| | 2 | Bark | Smith | 2 | | 6 | Wag | Johnson | 2 | | 7 | Wag | Johnson | 3 | +---------+-------------+------------+------+
This query can be used to return a list of rows that would be removed if we ran a de-duping operation against the table.
In some DBMSs (such as SQL Server), we can replace SELECT *
on the last row with DELETE
in order to delete these rows.
3 Queries that Include the Primary Key
The previous examples could work just as easily on tables that don’t have a primary key, as they specifically select the columns that we want to compare for duplicates.
Below are three queries that reference the primary key column (while excluding it from their duplicate evaluation).
Query 1:
SELECT * FROM Dogs
WHERE EXISTS (
SELECT 1 FROM Dogs d2
WHERE Dogs.FirstName = d2.FirstName
AND Dogs.LastName = d2.LastName
AND Dogs.DogId > d2.DogId
);
Result:
+---------+-------------+------------+ | DogId | FirstName | LastName | |---------+-------------+------------| | 2 | Bark | Smith | | 6 | Wag | Johnson | | 7 | Wag | Johnson | +---------+-------------+------------+
Query 2:
SELECT * FROM Dogs
WHERE DogId > (
SELECT MIN(DogId) FROM Dogs d2
WHERE Dogs.FirstName = d2.FirstName
AND Dogs.LastName = d2.LastName
);
Result:
+---------+-------------+------------+ | DogId | FirstName | LastName | |---------+-------------+------------| | 2 | Bark | Smith | | 6 | Wag | Johnson | | 7 | Wag | Johnson | +---------+-------------+------------+
Query 3:
SELECT * FROM Dogs
WHERE DogId IN (
SELECT DogId FROM Dogs
WHERE DogId NOT IN (SELECT MIN(DogId) FROM Dogs
GROUP BY FirstName, LastName)
);
Result:
+---------+-------------+------------+ | DogId | FirstName | LastName | |---------+-------------+------------| | 2 | Bark | Smith | | 6 | Wag | Johnson | | 7 | Wag | Johnson | +---------+-------------+------------+
Similar to the previous example, these could be converted to a de-duping query by changing the SELECT *
to DELETE
.