Delete Indexed Duplicate Files

My files are indexed, and I have a ton of duplicates that I need to delete. Since they are indexed, I assume I cannot simply delete them from DT, because they will reappear the next time the folder is indexed. These files have different names but the same content. I am looking for an automated way to delete around 1500 files. So far, pretty much all I can do is identify them in a smart folder named “duplicate”.

When you select the option to move items from the Trash of a DEVONthink database to the System Trash, if there are Indexed items you will be given the option to also move the external Finder files to the System Trash.

But remember that DEVONthink classifies files as duplicates if they are highly similar, such as in text content.

For example, a PDF document and the plain text document resulting from use of Data > Convert > to plain text will be marked as duplicates, even though they are of different filetypes and would certainly not be identified as duplicates in the Finder.

Another interesting example in one of my databases is a collection of email messages from a colleague. He was in the habit of writing his messages in MS Word and attaching the Word files to new email messages that were otherwise identical in text content. As a result, DEVONthink identifies all messages from him as “highly similar”.

Or suppose you have a collection of documents that are forms used to contain responses to a survey. It’s quite possible that all or many of those documents would be identified as duplicates, as they might differ only in a few words of text.

My caution is that before deciding you “need” to eliminate items identified as duplicates by DEVONthink, you should remember that DEVONthink doesn’t treat duplicates as necessarily exact copies, as does the Finder.

As you are dealing with Indexed files, you might consider a utility to check for exact copies of files in the Finder environment, rather than within DEVONthink.
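If you would rather script that check than install a utility, here is a minimal sketch in Python of how exact copies can be found: group files by size, then compare SHA-256 hashes of the contents. It is not a DEVONthink feature, and the function names (file_digest, find_exact_duplicates) are made up for illustration. It only reports groups of identical files and deletes nothing.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root):
    """Return lists of paths under root whose contents are byte-identical.

    File names are ignored; only content is compared.
    """
    # Files of different sizes cannot be identical, so group by size first
    # and hash only the files that share a size with at least one other.
    by_size = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file() and not p.is_symlink():
            by_size[p.stat().st_size].append(p)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for p in paths:
            by_hash[file_digest(p)].append(p)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Call find_exact_duplicates on the indexed folder and review each group before removing anything. Unlike DEVONthink's similarity-based matching, this would not pair a PDF with its plain text conversion, because their bytes differ.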

If you have a smart group with duplicates (confirmed duplicates, taking into account all of Bill’s advice), then for each set of duplicates select the ones you want to get rid of, right-click and use Move into Database, then trash the documents you’ve moved into the database. Or, rather than trashing them immediately, export them to a safe backup location first, and then trash the copies in the database.

@Bill, thanks very much. I had seen the Trash option to remove files from my drive but had forgotten about it in this case. Yes, DT tends to treat highly similar file content as duplicates; thanks for reminding me. I tried your method and it worked, but I found a tip from another thread easier, namely See Also & Classify. The duplicates were automatically positioned at the top for easy deletion. I did consider another application, Gemini, but DT is enough to solve the problem.