A DBA may be often required to delete duplicate rows from a SQL Server database table. Several methods can be used keeping in view the resources and exact situation. Here I will create a specific scenario. After going through this scenario, you will be able to remove the duplicate rows in any given scenario on ground.
Run following script to generate duplicate data in a table named “dups” created in any database that you select. Here I will use “pubs”.
-- Create table in any database create table Dups (col1 varchar(50), col2 varchar(50), col3 varchar(50)) GO -- Populate the table for duplicate rows INSERT INTO dups SELECT 'Val1','val2','Val3' UNION ALL SELECT 'Val1','Val2','Val13' UNION ALL SELECT 'Val1','Val2','Val3' UNION ALL SELECT 'Val1','Val2','Val13' UNION ALL SELECT 'Val11','Val12','Val3' UNION ALL SELECT 'Val16','Val18','Val20' UNION ALL SELECT 'Val11','Val12','Val3' UNION ALL SELECT 'Val16','Val18','Val20' UNION ALL SELECT 'Val6','Val7','Val6' GO
Table is created and populated with duplicate rows in it now.
Total rows are 9
Distinct rows are 5, which we have to preserve at the end.
Now we have to remove all duplicate rows, leaving behind just 5 distinct rows in three columns.
Execute the following script to complete the operation
-- Remove duplicate rows ALTER TABLE dups ADD ser SMALLINT PRIMARY KEY identity(1,1) GO DELETE FROM dups WHERE ser not in (SELECT max(ser) FROM dups GROUP BY col1,col2,col3) GO
At the end now we have all duplicate rows removed and our table is with an identity column “ser”. You may also drop this column by using
-- Drop the temporary column created ALTER TABLE dups DROP COLUMN ser GO
but for that you have to first drop the constraint attached with it. As in my case it is as follows
-- Drop PK constraint ALTER TABLE dups DROP PK__dups__286302EC GO
So it is over. You may get include any number of columns in your distinct rows wish list. You may modify the ‘group by’ parameters to meet any number of columns to consider in duplicate row . Query uses the magic of ‘group by.’ Here it is appropriate to mention that “distinct” and “group by” both produce same result and almost same resource utilization. But one of important difference between both is that group by makes you able to apply an aggregate function on each group of data separately. That fact has used here with “max” in our duplicate removal query.
Click here to read an excellent article on this topic by Arshad Ali. It covers different ways to deal with duplicate rows.