How to remove duplicates that are different file types


I use DTPO to store all my email, archiving it in a database. Last year, for a short period, I used Email Archiver, which converts all your emails into PDFs. I’ve since decided I want all my email in one place, so I imported the PDF emails into DTPO as well.

My problem is that some of the PDFs are duplicates of emails already in the database. The duplicate finder won’t pick this up. I have some 15,000 PDFs and have no desire to go through them manually. As not all of the PDFs are duplicates, I cannot just delete them all. In the listing, the eml file in DTPO and the PDF file carry the same date and timestamp. Email Archiver adds the date and a letter to the email name as below:

Original imported email name:

Google Apps: sign-up confirmation and next steps

Duplicate PDF name:

2011-12-02 09.40.12Z  Google Apps  sign-up confirmation and next steps.pdf

Is there any way to find all the PDFs that are duplicates so I can remove them?

Any help would be much appreciated!

How short is short? I’m thinking of a smart group that matches everything in the database that (a) was added in the window when you were using Email Archiver and (b) is either an .eml or a .pdf. That might not be the precise definition of the smart group, but perhaps something along those lines would help.

Another approach would be to do a regex renaming of .pdfs that have a prefix of the form you mentioned. (See Scripts > Rename > Rename using RegEx.) The renaming would get rid of the prefix, so a simple sort would put the two documents side by side.
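
To make the prefix-stripping idea concrete, here is a rough Python sketch of the pattern involved. It is not the DEVONthink rename script itself, and it assumes you have the email names and PDF names available as plain lists (for example from an export). The function names are made up for illustration. The pattern assumes every prefix looks like the one in the example above (date, time with dots, a trailing Z, then spaces). Because Email Archiver appears to replace the colon with a space, the sketch also compares names with punctuation and spacing ignored.

```python
import re

# Assumed prefix shape, taken from the single example in this thread:
# "2011-12-02 09.40.12Z  " (date, dotted time, letter Z, then whitespace).
PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}\.\d{2}\.\d{2}Z\s+")


def strip_prefix(pdf_name):
    """Drop the Email Archiver date prefix and the .pdf extension."""
    name = PREFIX.sub("", pdf_name)
    return re.sub(r"\.pdf$", "", name, flags=re.IGNORECASE)


def match_key(name):
    """Lowercase the name and ignore punctuation and spacing differences."""
    words = re.sub(r"[\W_]+", " ", name).split()
    return " ".join(words).lower()


def find_duplicate_pdfs(email_names, pdf_names):
    """Return the PDF names whose cleaned-up name matches an email name."""
    known = {match_key(n) for n in email_names}
    return [p for p in pdf_names if match_key(strip_prefix(p)) in known]


emails = ["Google Apps: sign-up confirmation and next steps"]
pdfs = [
    "2011-12-02 09.40.12Z  Google Apps  sign-up confirmation and next steps.pdf",
    "2011-12-03 10.15.00Z  Some other message.pdf",
]
for name in find_duplicate_pdfs(emails, pdfs):
    print(name)
```

Matching on name alone will also flag different emails that share a subject line, so it would be worth checking the date and timestamp as well before deleting anything.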

Perhaps if you selected an Email Archiver-created PDF and clicked See Also, DEVONthink would suggest the .eml as a match?

There’s no way, unfortunately, to tell DEVONthink that two documents are duplicates (or, for that matter, that two documents are not duplicates).