Below are two options for removing duplicate rows from a table in PostgreSQL when those rows have a primary key or unique identifier column. The primary key is used in the query, but it’s ignored when comparing duplicates (due to the fact that primary keys prevent duplicate rows by definition).
The following examples delete duplicate rows but keep one. So in the case of say, three identical rows, it deletes two and keeps one.
Sample Data
Our examples use the following data:
SELECT * FROM Dogs;
Result:
dogid | firstname | lastname -------+-----------+---------- 1 | Bark | Smith 2 | Bark | Smith 3 | Woof | Jones 4 | Ruff | Robinson 5 | Wag | Johnson 6 | Wag | Johnson 7 | Wag | Johnson
We can see that the first two rows are duplicates, as well as the last three rows.
The DogId
column holds unique values (because it’s the table’s primary key), and therefore, strictly speaking, there are no duplicates. But in real life situations, you will often want to de-dupe tables that contain primary keys. Therefore, in this article we ignore the primary key and delete rows that contain duplicate values across the remaining columns.
Option 1
We can start by finding all rows that should be deleted:
SELECT * FROM Dogs
WHERE DogId IN (
SELECT DogId FROM Dogs
EXCEPT SELECT MIN(DogId) FROM Dogs
GROUP BY FirstName, LastName
)
ORDER BY FirstName, LastName;
Result:
dogid | firstname | lastname -------+-----------+---------- 2 | Bark | Smith 6 | Wag | Johnson 7 | Wag | Johnson
These are the excess rows from the list of duplicates. What I mean is that, these are the actual rows that will be deleted. For each set of duplicates listed here, there’s one that’s not listed (and will not be deleted). For example, the table contains three dogs called Wag Johnson, but there are only two Wag Johnson’s listed here – the two that will be deleted. The other Wag Johnson will remain in the database.
To remove those duplicate rows, we can switch the SELECT *
to DELETE
and remove the ORDER BY
clause:
DELETE FROM Dogs
WHERE DogId IN (
SELECT DogId FROM Dogs
EXCEPT SELECT MIN(DogId) FROM Dogs
GROUP BY FirstName, LastName
);
Result:
DELETE 3
To verify the result, we can select all remaining rows in the table:
SELECT * FROM Dogs;
Result:
dogid | firstname | lastname -------+-----------+---------- 1 | Bark | Smith 3 | Woof | Jones 4 | Ruff | Robinson 5 | Wag | Johnson
We can alternatively use the MAX()
function instead of the MIN()
function to change which rows are deleted.
Option 2
In this example we’ll assume that the table has been restored to its original state (with the duplicates).
Here’s another query that returns duplicate rows:
SELECT *
FROM Dogs d1, Dogs d2
WHERE d1.FirstName = d2.FirstName
AND d1.LastName = d2.LastName
AND d1.DogId <> d2.DogId
AND d1.DogId = (
SELECT MAX(DogId)
FROM Dogs d3
WHERE d3.FirstName = d1.FirstName
AND d3.LastName = d1.LastName
);
Result:
dogid | firstname | lastname | dogid | firstname | lastname -------+-----------+----------+-------+-----------+---------- 2 | Bark | Smith | 1 | Bark | Smith 7 | Wag | Johnson | 5 | Wag | Johnson 7 | Wag | Johnson | 6 | Wag | Johnson
We can modify that query to delete the duplicates:
DELETE FROM Dogs WHERE DogId IN (
SELECT d2.DogId
FROM Dogs d1, Dogs d2
WHERE d1.FirstName = d2.FirstName
AND d1.LastName = d2.LastName
AND d1.DogId <> d2.DogId
AND d1.DogId=(
SELECT MIN(DogId)
FROM Dogs d3
WHERE d3.FirstName = d1.FirstName
AND d3.LastName = d1.LastName
)
);
Result:
DELETE 3
The table has now been de-duped.
We can verify this by selecting all rows again:
SELECT * FROM Dogs;
Result:
dogid | firstname | lastname -------+-----------+---------- 1 | Bark | Smith 3 | Woof | Jones 4 | Ruff | Robinson 5 | Wag | Johnson
We can use MAX()
instead of MIN()
to delete the other rows from the duplicates if preferred.