The first query works by grouping the rows on the unique column and counting them. Any group with more than one row must be caused by duplicate values. If we're looking for duplicates across more than one column (or even all columns), then we have to use a query of the following form:
SELECT *
FROM mytable
WHERE (col1, col2, ..., colN) IN
(SELECT col1, col2, ..., colN
FROM mytable
GROUP BY col1, col2, ..., colN
HAVING count(*) > 1);
Here, col1, col2, and so on up to colN are the columns of the key.
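For example, to look for rows duplicated on a hypothetical two-column key of customerid and customer_name in a table named cust (both names are illustrative), the query would be:
SELECT *
FROM cust
WHERE (customerid, customer_name) IN
(SELECT customerid, customer_name
FROM cust
GROUP BY customerid, customer_name
HAVING count(*) > 1);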
Note that this type of query may need to sort the complete table on all the key columns. That will require sort space equal to the size of the table, so think twice before running this query on very large tables. You'll probably benefit from a large work_mem setting for this query, perhaps 128 MB or more.
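work_mem can be raised for just the current session before running the query and reset afterwards; the 128 MB figure here is only a starting point:
SET work_mem = '128MB';
-- run the duplicate-finding query here
RESET work_mem;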
The DELETE FROM ... USING query that we showed works only with PostgreSQL because it uses the ctid value, which is the internal identifier of each row in the table. If you wanted to run that query against more than one column, as we did earlier in the chapter, you'd need to extend the query in step 3, as follows:
SELECT customerid, customer_name, ..., min(ctid) AS min_ctid
FROM ...
GROUP BY customerid, customer_name, ...
...;
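As a concrete sketch, suppose the key is just two columns, customerid and customer_name (both names are illustrative), and the grouped duplicates are collected into a temporary table for the next step to use:
CREATE TEMPORARY TABLE dups_cust AS
SELECT customerid, customer_name, min(ctid) AS min_ctid
FROM new_cust
GROUP BY customerid, customer_name
HAVING count(*) > 1;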
Then, extend the query in step 4, like this:
DELETE FROM new_cust
...
WHERE new_cust.customerid = dups_cust.customerid
AND new_cust.customer_name = dups_cust.customer_name
AND ...
AND new_cust.ctid != dups_cust.min_ctid;
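Filled in for the same two illustrative columns, the whole DELETE might read as follows, assuming dups_cust holds the grouped rows from step 3:
DELETE FROM new_cust
USING dups_cust
WHERE new_cust.customerid = dups_cust.customerid
AND new_cust.customer_name = dups_cust.customer_name
AND new_cust.ctid != dups_cust.min_ctid;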
The preceding query works by grouping together all the rows with the same values and then finding the row with the lowest ctid value. The lowest ctid values lie closest to the start of the table, so duplicates will be removed from the far end of the table. When we run VACUUM, we may find that the table gets smaller, because we have removed rows from the far end.
The BEGIN and COMMIT commands wrap the LOCK and DELETE commands into a single transaction, which is required: otherwise, the lock would be released immediately after being taken.
Another reason to use a single transaction is that we can always roll back if anything goes wrong, which is a good thing when we are removing data from a live table.
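In outline, the whole operation might be wrapped like this; the SHARE ROW EXCLUSIVE lock mode is an assumption, chosen here to block concurrent writers while the duplicates are deleted:
BEGIN;
LOCK TABLE new_cust IN SHARE ROW EXCLUSIVE MODE;
DELETE FROM new_cust
USING dups_cust
WHERE ...
AND new_cust.ctid != dups_cust.min_ctid;
COMMIT;
If anything looks wrong before the COMMIT, issuing ROLLBACK instead leaves the table untouched.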