The following examples use T-SQL to delete duplicate rows in SQL Server while ignoring the primary key or unique identifier column.
More specifically, the examples delete duplicate rows but keep one. So, given two identical rows, one is deleted and the other remains. This is often referred to as “de-duping” the table, “deduplication” of the table, etc.
Sample Data
Suppose we have a table with the following data:
SELECT * FROM Dogs;
Result:
+---------+-------------+------------+ | DogId | FirstName | LastName | |---------+-------------+------------| | 1 | Bark | Smith | | 2 | Bark | Smith | | 3 | Woof | Jones | | 4 | Ruff | Robinson | | 5 | Wag | Johnson | | 6 | Wag | Johnson | | 7 | Wag | Johnson | +---------+-------------+------------+
We can see that the first two rows are duplicates, and so are the last three rows.
Option 1
First, let’s run the following code to check which rows are going to be de-duped:
WITH cte AS
(
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY FirstName, LastName
ORDER BY FirstName, LastName
) AS Row_Number
FROM Dogs
)
SELECT * FROM cte WHERE Row_Number <> 1;
Result:
+---------+-------------+------------+--------------+ | DogId | FirstName | LastName | Row_Number | |---------+-------------+------------+--------------| | 2 | Bark | Smith | 2 | | 6 | Wag | Johnson | 2 | | 7 | Wag | Johnson | 3 | +---------+-------------+------------+--------------+
We used the ROW_NUMBER()
function with the PARTITION BY
clause to create our own row number that increments when any duplicates are found, and resets when a non-duplicate is found. A number greater than 1 indicates that it’s a duplicate, and so we only return rows that have a number greater than 1.
We can see that three rows will be deleted when we de-dupe this table.
Now let’s de-dupe the table:
WITH cte AS
(
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY FirstName, LastName
ORDER BY FirstName, LastName
) AS Row_Number
FROM Dogs
)
DELETE FROM cte WHERE Row_Number <> 1;
Result:
(3 rows affected)
As expected, three rows were deleted.
This query is almost identical to the previous one. All we did was change SELECT *
on the last line to DELETE
.
Now let’s select all rows from the table in order to verify that the correct rows were deleted:
SELECT * FROM Dogs;
Result:
+---------+-------------+------------+ | DogId | FirstName | LastName | |---------+-------------+------------| | 1 | Bark | Smith | | 3 | Woof | Jones | | 4 | Ruff | Robinson | | 5 | Wag | Johnson | +---------+-------------+------------+
We can see that each dog now appears just once in the table.
Option 2
Assuming that the table has been restored after the previous example, here’s another way of checking for duplicates:
SELECT * FROM Dogs
WHERE DogId IN (
SELECT DogId FROM Dogs
EXCEPT SELECT MIN(DogId) FROM Dogs
GROUP BY FirstName, LastName
);
Result:
+---------+-------------+------------+ | DogId | FirstName | LastName | |---------+-------------+------------| | 2 | Bark | Smith | | 6 | Wag | Johnson | | 7 | Wag | Johnson | +---------+-------------+------------+
In this case, we made use of the EXCEPT
operator along with the MIN()
function. We could replace MIN()
with MAX()
depending on which rows we want to be deleted.
To delete the rows, we can simply replace SELECT *
with DELETE
:
DELETE FROM Dogs
WHERE DogId IN (
SELECT DogId FROM Dogs
EXCEPT SELECT MIN(DogId) FROM Dogs
GROUP BY FirstName, LastName
);
Result:
(3 rows affected)
And check to see what’s left:
SELECT * FROM Dogs;
Result:
+---------+-------------+------------+ | DogId | FirstName | LastName | |---------+-------------+------------| | 1 | Bark | Smith | | 3 | Woof | Jones | | 4 | Ruff | Robinson | | 5 | Wag | Johnson | +---------+-------------+------------+
Option 3
Another way to do it is to join the table on itself and check for duplicates that way.
Assuming that the table has been restored after the previous example, here’s our third option for selecting duplicates:
SELECT *
FROM Dogs d1, Dogs d2
WHERE d1.FirstName = d2.FirstName
AND d1.LastName = d2.LastName
AND d1.DogId <> d2.DogId
AND d1.DogId = (
SELECT MAX(DogId)
FROM Dogs d3
WHERE d3.FirstName = d1.FirstName
AND d3.LastName = d1.LastName
);
Result:
+---------+-------------+------------+---------+-------------+------------+ | DogId | FirstName | LastName | DogId | FirstName | LastName | |---------+-------------+------------+---------+-------------+------------| | 2 | Bark | Smith | 1 | Bark | Smith | | 7 | Wag | Johnson | 5 | Wag | Johnson | | 7 | Wag | Johnson | 6 | Wag | Johnson | +---------+-------------+------------+---------+-------------+------------+
This result is not quite as clear as the one in the previous example, but we can still see which rows are duplicates.
Now we can modify that query so that we delete duplicate rows:
DELETE FROM Dogs WHERE DogId IN (
SELECT d2.DogId
FROM Dogs d1, Dogs d2
WHERE d1.FirstName = d2.FirstName
AND d1.LastName = d2.LastName
AND d1.DogId <> d2.DogId
AND d1.DogId=(
SELECT MAX(DogId)
FROM Dogs d3
WHERE d3.FirstName = d1.FirstName
AND d3.LastName = d1.LastName
)
);
Result:
(3 rows affected)
Once again, three rows were deleted.
Let’s check the table again:
SELECT * FROM Dogs;
Result:
+---------+-------------+------------+ | DogId | FirstName | LastName | |---------+-------------+------------| | 2 | Bark | Smith | | 3 | Woof | Jones | | 4 | Ruff | Robinson | | 7 | Wag | Johnson | +---------+-------------+------------+
You might notice that this time the other rows were deleted. In other words, we now have DogId
s 2, 3, 4, and 7 whereas in the previous examples we were left with 1, 3, 4, and 5.
We can easily change this example to delete the same rows as the previous examples. To do this, we can use the MIN()
function instead of the MAX()
function:
DELETE FROM Dogs WHERE DogId IN (
SELECT d2.DogId
FROM Dogs d1, Dogs d2
WHERE d1.FirstName = d2.FirstName
AND d1.LastName = d2.LastName
AND d1.DogId <> d2.DogId
AND d1.DogId=(
SELECT MIN(DogId)
FROM Dogs d3
WHERE d3.FirstName = d1.FirstName
AND d3.LastName = d1.LastName
)
);