Common Table Expressions (CTEs) are a nifty SQL feature that allows us to define a temporary result set, which can then be used in subsequent queries, including DELETE statements. In the context of removing duplicate data, a CTE can be helpful when we want to identify and remove only the duplicated rows, keeping one version of each unique record.
Let’s go through a step-by-step example of using a CTE with a DELETE statement to remove duplicate data.
Step 1: Create a Sample Table
We’ll create a table called CustomerOrders where each customer may accidentally have duplicate orders.
CREATE TABLE CustomerOrders (
    OrderID INT IDENTITY(1,1) PRIMARY KEY,
    CustomerID INT,
    OrderDate DATE,
    Amount DECIMAL(10, 2)
);
Step 2: Insert Sample Data with Duplicates
Next, we’ll populate the table with data. Notice that some rows are intentionally duplicated.
INSERT INTO CustomerOrders (CustomerID, OrderDate, Amount)
VALUES
    (1, '2023-09-01', 100.00),
    (2, '2023-09-01', 150.00),
    (1, '2023-09-01', 100.00), -- Duplicate
    (3, '2023-09-02', 200.00),
    (2, '2023-09-01', 150.00), -- Duplicate
    (4, '2023-09-03', 250.00),
    (1, '2023-09-01', 100.00); -- Duplicate
In this dataset:
- CustomerID = 1 has three identical orders.
- CustomerID = 2 has two identical orders.
Step 3: Use a CTE to Identify Duplicates
The CTE will help us identify and rank the duplicate records. We’ll use the ROW_NUMBER() window function to assign a unique rank to each duplicate group, based on CustomerID, OrderDate, and Amount.
WITH DuplicatesCTE AS (
    SELECT
        OrderID,
        ROW_NUMBER() OVER (PARTITION BY CustomerID, OrderDate, Amount ORDER BY OrderID) AS RowNum
    FROM CustomerOrders
)
SELECT * FROM DuplicatesCTE;
Output:
OrderID  RowNum
-------  ------
1        1
3        2
7        3
2        1
5        2
4        1
6        1
Here, any row with a row number of 2 or greater is a duplicate.
In this CTE:
- The ROW_NUMBER() function assigns a number to each row within the partition of CustomerID, OrderDate, and Amount, ordering by OrderID.
- The first row in each partition has RowNum = 1, and all subsequent duplicates have higher RowNum values.
We can take it a step further and return only those rows that will be deleted:
WITH DuplicatesCTE AS (
    SELECT
        OrderID,
        ROW_NUMBER() OVER (PARTITION BY CustomerID, OrderDate, Amount ORDER BY OrderID) AS RowNum
    FROM CustomerOrders
)
SELECT * FROM CustomerOrders
WHERE OrderID IN (SELECT OrderID FROM DuplicatesCTE WHERE RowNum > 1);
Output:
OrderID  CustomerID  OrderDate                 Amount
-------  ----------  ------------------------  ------
3        1           2023-09-01T00:00:00.000Z  100
7        1           2023-09-01T00:00:00.000Z  100
5        2           2023-09-01T00:00:00.000Z  150
Step 4: Delete Duplicates
We can now use this CTE in a DELETE statement to remove rows where RowNum > 1, which are the duplicates:
WITH DuplicatesCTE AS (
    SELECT
        OrderID,
        ROW_NUMBER() OVER (PARTITION BY CustomerID, OrderDate, Amount ORDER BY OrderID) AS RowNum
    FROM CustomerOrders
)
DELETE FROM CustomerOrders
WHERE OrderID IN (SELECT OrderID FROM DuplicatesCTE WHERE RowNum > 1);
This query deletes the duplicate rows, keeping the first occurrence of each unique CustomerID, OrderDate, and Amount combination.
Step 5: Verify the Result
You can verify that duplicates have been removed by querying the table after the deletion:
SELECT * FROM CustomerOrders;
Output:
OrderID  CustomerID  OrderDate                 Amount
-------  ----------  ------------------------  ------
1        1           2023-09-01T00:00:00.000Z  100
2        2           2023-09-01T00:00:00.000Z  150
4        3           2023-09-02T00:00:00.000Z  200
6        4           2023-09-03T00:00:00.000Z  250
All duplicate rows have been removed, while the original rows remain intact.
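If you’d rather not rely on eyeballing the output, you can also check for remaining duplicates by grouping on the deduplication columns. This query should return an empty result set after the delete:
SELECT CustomerID, OrderDate, Amount, COUNT(*) AS DuplicateCount
FROM CustomerOrders
GROUP BY CustomerID, OrderDate, Amount
HAVING COUNT(*) > 1;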
Summary of our DELETE Operation
Here’s a quick summary of the above example:
- CTE: The CTE (DuplicatesCTE) defines a temporary result set, identifying the rows to be deleted using ROW_NUMBER().
- ROW_NUMBER() Function: This function assigns a sequential integer to rows within each partition. For duplicates, the first instance gets 1, and subsequent duplicates get higher numbers.
- DELETE Statement: The DELETE statement targets rows with RowNum > 1 to remove all but the first occurrence of each duplicate.
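As a side note, SQL Server also lets you delete through the CTE itself when the CTE is based on a single table, which gives us a slightly more concise way to write the same cleanup:
WITH DuplicatesCTE AS (
    SELECT
        OrderID,
        ROW_NUMBER() OVER (PARTITION BY CustomerID, OrderDate, Amount ORDER BY OrderID) AS RowNum
    FROM CustomerOrders
)
-- Deleting from the CTE deletes the corresponding rows from CustomerOrders
DELETE FROM DuplicatesCTE
WHERE RowNum > 1;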
Best Practices and Other Considerations
Deleting data always comes with risks. One risk is that we delete the wrong data. Another is that the business later decides it actually needs the deleted data. And if it’s a big or busy database, there’s the risk that our delete operation slows everything down and causes problems for users. These are just some of the more obvious risks.
In consideration of the above, here are some things you might like to think about before going ahead and deleting a bunch of data from a database:
- Use transactions to prevent accidental data loss. A transaction prevents a situation where only some of the data is deleted – either all deletes occur or none at all (the first sketch after this list shows one way to do this).
- Back up the data to a separate table before deletion. That way, if you need to restore the deleted data later, you have everything you need. You only need to back up the data to be deleted – not the full table.
- Test the logic with a SELECT before performing the DELETE. That way you can see everything that’s going to be deleted before it’s deleted, giving you the opportunity to catch any errors in the delete logic.
- Delete in batches to avoid locking or timeout issues. This can be a good idea if you’re working with large tables (see the batching sketch after this list).
- Monitor query performance with SQL Server statistics. SET STATISTICS IO and SET STATISTICS TIME can help you assess the I/O and execution time of your DELETE operation, which is critical in large-scale environments (a short example follows this list).
- Consider adding indexes to speed up the ROW_NUMBER() calculation.
- Use query hints to control query execution and avoid resource issues. For example, OPTION (MAXDOP 1) to limit parallelism, or WITH (TABLOCK) to take a single table-level lock instead of many row locks.
- Leverage the OUTPUT clause to track and audit deleted rows. The OUTPUT clause returns deleted rows, or expressions based on them, as part of the DELETE operation. You can use this output to insert the deleted rows into a table or table variable if you want (see the final sketch after this list). Bear in mind that these rows are still output even if an error occurs and the operation is rolled back. In such cases, you would probably want to discard the resulting data (e.g., not insert it into a backup table).
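To make the first two points concrete, here’s one way a backup step and a transaction might fit together. The backup table name (CustomerOrders_Deleted) is just a hypothetical example:
-- Back up only the rows that are about to be deleted
WITH DuplicatesCTE AS (
    SELECT
        OrderID,
        ROW_NUMBER() OVER (PARTITION BY CustomerID, OrderDate, Amount ORDER BY OrderID) AS RowNum
    FROM CustomerOrders
)
SELECT *
INTO CustomerOrders_Deleted  -- hypothetical backup table
FROM CustomerOrders
WHERE OrderID IN (SELECT OrderID FROM DuplicatesCTE WHERE RowNum > 1);

BEGIN TRANSACTION;

WITH DuplicatesCTE AS (
    SELECT
        OrderID,
        ROW_NUMBER() OVER (PARTITION BY CustomerID, OrderDate, Amount ORDER BY OrderID) AS RowNum
    FROM CustomerOrders
)
DELETE FROM CustomerOrders
WHERE OrderID IN (SELECT OrderID FROM DuplicatesCTE WHERE RowNum > 1);

-- Inspect the result (e.g. SELECT * FROM CustomerOrders;), then either:
COMMIT TRANSACTION;
-- ...or ROLLBACK TRANSACTION; to undo the delete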
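For batched deletes, one common pattern is to delete TOP (n) qualifying rows in a loop until a pass deletes nothing. The batch size of 5000 is an arbitrary starting point – tune it for your environment:
DECLARE @BatchSize INT = 5000;

WHILE 1 = 1
BEGIN
    WITH DuplicatesCTE AS (
        SELECT
            OrderID,
            ROW_NUMBER() OVER (PARTITION BY CustomerID, OrderDate, Amount ORDER BY OrderID) AS RowNum
        FROM CustomerOrders
    )
    DELETE TOP (@BatchSize) FROM DuplicatesCTE
    WHERE RowNum > 1;

    -- Stop once there are no more duplicates to remove
    IF @@ROWCOUNT = 0 BREAK;
END;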
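To measure the cost of the operation, wrap the DELETE (or your test SELECT) in the statistics settings mentioned above; SQL Server reports the reads and CPU/elapsed time in the Messages tab in SSMS:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run the DELETE (or the test SELECT) here...

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;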
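Finally, here’s a minimal sketch of the OUTPUT clause capturing the deleted rows into a table variable so you can audit exactly what was removed:
DECLARE @DeletedRows TABLE (
    OrderID INT,
    CustomerID INT,
    OrderDate DATE,
    Amount DECIMAL(10, 2)
);

WITH DuplicatesCTE AS (
    SELECT
        OrderID,
        CustomerID,
        OrderDate,
        Amount,
        ROW_NUMBER() OVER (PARTITION BY CustomerID, OrderDate, Amount ORDER BY OrderID) AS RowNum
    FROM CustomerOrders
)
DELETE FROM DuplicatesCTE
OUTPUT deleted.OrderID, deleted.CustomerID, deleted.OrderDate, deleted.Amount
INTO @DeletedRows
WHERE RowNum > 1;

-- Review what was removed
SELECT * FROM @DeletedRows;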