Using a CTE with a DELETE Statement in SQL Server to Remove Duplicates

Common Table Expressions (CTEs) are a nifty SQL feature that allow us to define a temporary result set, which can then be used in subsequent queries, including DELETE statements. In the context of removing duplicate data, a CTE can be helpful when we want to identify and remove only the duplicated rows, keeping one version of each unique record.

Let’s go through a step-by-step example of using a CTE with a DELETE statement to remove duplicate data.

Step 1: Create a Sample Table

We’ll create a table called CustomerOrders where each customer may accidentally have duplicate orders.

CREATE TABLE CustomerOrders (
    OrderID INT IDENTITY(1,1) PRIMARY KEY,
    CustomerID INT,
    OrderDate DATE,
    Amount DECIMAL(10, 2)
);

Step 2: Insert Sample Data with Duplicates

Next, we’ll populate the table with data. Notice that some rows are intentionally duplicated.

INSERT INTO CustomerOrders (CustomerID, OrderDate, Amount)
VALUES
    (1, '2023-09-01', 100.00),
    (2, '2023-09-01', 150.00),
    (1, '2023-09-01', 100.00), -- Duplicate
    (3, '2023-09-02', 200.00),
    (2, '2023-09-01', 150.00), -- Duplicate
    (4, '2023-09-03', 250.00),
    (1, '2023-09-01', 100.00); -- Duplicate

In this dataset:

  • CustomerID = 1 has three identical orders.
  • CustomerID = 2 has two identical orders.

Step 3: Use a CTE to Identify Duplicates

The CTE will help us identify and rank the duplicate records. We’ll use the ROW_NUMBER() window function to assign a unique rank to each duplicate group, based on the CustomerID, OrderDate, and Amount.

WITH DuplicatesCTE AS (
    SELECT 
        OrderID,
        ROW_NUMBER() OVER (PARTITION BY CustomerID, OrderDate, Amount ORDER BY OrderID) AS RowNum
    FROM CustomerOrders
)
SELECT * FROM DuplicatesCTE;

Output:

OrderID  RowNum
------- ------
1 1
3 2
7 3
2 1
5 2
4 1
6 1

Here, any row that has a row number 2 or greater is a duplicate row.

In this CTE:

  • The ROW_NUMBER() function assigns a number to each row within the partition of CustomerID, OrderDate, and Amount, ordering by OrderID.
  • The first row in each partition has RowNum = 1, and all subsequent duplicates have higher RowNum values.

We can take it a step further and return only those rows that will be deleted:

WITH DuplicatesCTE AS (
    SELECT 
        OrderID,
        ROW_NUMBER() OVER (PARTITION BY CustomerID, OrderDate, Amount ORDER BY OrderID) AS RowNum
    FROM CustomerOrders
)
SELECT * FROM CustomerOrders
WHERE OrderID IN (SELECT OrderID FROM DuplicatesCTE WHERE RowNum > 1);

Output:

OrderID  CustomerID  OrderDate                 Amount
------- ---------- ------------------------ ------
3 1 2023-09-01T00:00:00.000Z 100
7 1 2023-09-01T00:00:00.000Z 100
5 2 2023-09-01T00:00:00.000Z 150

Step 4: Delete Duplicates

We can now use this CTE in a DELETE statement to remove rows where RowNum > 1, which are duplicates:

WITH DuplicatesCTE AS (
    SELECT 
        OrderID,
        ROW_NUMBER() OVER (PARTITION BY CustomerID, OrderDate, Amount ORDER BY OrderID) AS RowNum
    FROM CustomerOrders
)
DELETE FROM CustomerOrders
WHERE OrderID IN (SELECT OrderID FROM DuplicatesCTE WHERE RowNum > 1);

This query deletes the duplicate rows, keeping the first occurrence of each unique CustomerID, OrderDate, and Amount combination.

Step 5: Verify the Result

You can verify that duplicates have been removed by querying the table after the deletion:

SELECT * FROM CustomerOrders;

Output:

OrderID  CustomerID  OrderDate                 Amount
------- ---------- ------------------------ ------
1 1 2023-09-01T00:00:00.000Z 100
2 2 2023-09-01T00:00:00.000Z 150
4 3 2023-09-02T00:00:00.000Z 200
6 4 2023-09-03T00:00:00.000Z 250

All duplicate rows have been removed, while the original rows remain intact.

Summary of our DELETE Operation

Here’s a quick summary of the above example:

  • CTE: The CTE (DuplicatesCTE) defines a temporary result set, identifying the rows to be deleted using ROW_NUMBER().
  • ROW_NUMBER() Function: This function assigns a sequential integer to rows within each partition. For duplicates, the first instance gets 1, and subsequent duplicates get higher numbers.
  • DELETE Statement: The DELETE statement targets rows with RowNum > 1 to remove all but the first occurrence of each duplicate.

Best Practices and Other Considerations

Deleting data always comes with risks. One risk is that we delete the wrong data. Another risk is that the business later decides that they actually need the deleted data. Or if it’s a big and/or busy database, then there’s the risk that our delete operation is going to slow everything down and cause problems for the users. And these are just some of the more obvious risks.

In consideration of the above, here are some things you might like to think about before going ahead and deleting a bunch of data from a database:

  • Use transactions to prevent accidental data loss. This will prevent a situation where only some of the data is deleted – either all deletes occur or none at all.
  • Back up the data to a separate table before deletion. That way, if you need to restore the deleted data later, you have everything you need. You only need to back up the data to be deleted – not the full table.
  • Test the logic with a SELECT before performing the DELETE. That way you can see everything that’s going to be deleted before it’s deleted, providing you with the opportunity to catch any potential errors in the delete logic.
  • Delete in batches to avoid locking or timeout issues. This could be a good idea if you’re working with large tables.
  • Monitor query performance with SQL Server statistics. SET STATISTICS IO and SET STATISTICS TIME can help you assess the I/O and execution time of your DELETE operation, which is critical in large-scale environments.
  • Consider adding indexes to speed up the ROW_NUMBER() calculation.
  • Use query hints to control query execution and avoid resource issues. For example OPTION (MAXDOP) to limit parallelism or WITH (TABLOCK) to minimize locking issues.
  • Leverage the OUTPUT clause to track and audit deleted rows. The OUTPUT clause returns deleted rows, or expressions based on them, as part of the DELETE operation. You can use this output to insert the deleted rows into a table or table variable if you want. Bear in mind that these rows are still output even if an error occurs and the operation is rolled back. In such cases, you would probably want to discard the resulting data (e.g., not insert it into a backup table, etc).