Identifying Duplicates

I imported two emails from Mail. The emails have different names, and their attachments have different names and different sizes. DEVONthink’s Duplicates folder is showing these emails as duplicates of one another. An image file is attached showing the names and file sizes.

I would be grateful if someone could tell me how to change the default settings of Duplicates so that it correctly identifies two different emails.

Thanks for your help.

Bill

DEVONthink determines similarity based only on the content of a document. Differences in names and in the size of attachments are not relevant to deciding whether content is similar enough to be marked as a duplicate. There is no user-accessible setting for changing how similarity is determined.

Documents such as forms that have perhaps only a few characters different in content may be marked as duplicates. As attachments to emails are not indexed, differences in attachments are not relevant.

Thus, duplicate documents are not necessarily identical documents. DEVONthink focusses on information in the content of a document and such information content is used by the Classify and See Also procedures.
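
To make the "content only" point concrete, here is a minimal Python sketch. It is not DEVONthink's actual algorithm, which is not described in this thread. The word-overlap (Jaccard) measure, the 0.9 threshold, and the names `Doc`, `content_similarity` and `looks_like_duplicate` are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a database item."""
    name: str
    size_bytes: int
    content: str


def content_similarity(a: Doc, b: Doc) -> float:
    """Jaccard overlap of the two documents' word sets, from 0.0 to 1.0.

    Name and size are deliberately ignored; only the text is compared.
    """
    words_a = set(a.content.lower().split())
    words_b = set(b.content.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty bodies count as identical content
    return len(words_a & words_b) / len(words_a | words_b)


def looks_like_duplicate(a: Doc, b: Doc, threshold: float = 0.9) -> bool:
    """Assumed rule: flag the pair when the overlap reaches the threshold."""
    return content_similarity(a, b) >= threshold


# Two emails with different names and sizes but the same body text.
mail_1 = Doc("Invoice March", 120_000, "Please see the attached file.")
mail_2 = Doc("Invoice April", 480_000, "Please see the attached file.")
mail_3 = Doc("Schedule", 2_000, "The meeting has moved to Thursday at ten.")

print(looks_like_duplicate(mail_1, mail_2))  # True
print(looks_like_duplicate(mail_1, mail_3))  # False
```

In this toy model the first two messages are flagged even though their names and sizes differ, which is the behaviour described in the original question.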

Thanks Bill. So if a Duplicate is not a “duplicate”, i.e. does not have to be an exact duplicate, I can foresee some serious problems if I proceed with my plan to import one set of 5,000 emails into DEVONthink Pro and it finds several hundred “duplicates”.

Firstly, I don’t really have the time to inspect each one manually to determine whether it is a “duplicate”, delete the ones that are, and retain the ones that are not. And secondly, if the duplicates are not “duplicates”, how can I stop them appearing in Duplicates?

What would you suggest to a novice user who’s new to DEVONthink and needs to find a workaround quickly? Thanks for your advice.

I would appreciate advice or suggestions from other users.

Bill in Switzerland

  1. We usually think of duplicate files as ones that can be eliminated to reduce redundancy of information and to save storage space.

  2. It is possible to send all duplicates to the Trash. But if you grok that items that are marked as duplicates may not be completely redundant, and if you aren’t in dire need of extra hard drive storage space, it’s probably a good idea to think before sending all duplicates to the Trash.

For example, I had a colleague who never sent out email messages with content in the body of the message. All the information was contained in an attached Word document. DEVONthink looks at his email messages and flags all of them as duplicates. That’s true, as they all have identical content. But I wouldn’t want to delete all but one of his messages. :)

Or suppose you send out a form survey to people, asking them to reply by entering “yes” or “no” to proposed options. DEVONthink will probably see their responses as duplicates.

If seeing duplicates marked in your database in blue bold font irritates you, there’s a less intrusive marking of duplicates using a symbol, and the Name will appear in regular black font.

Just try to grok that, in DEVONthink, “duplicate” is to be interpreted fuzzily. Documents marked as duplicates might be identical, or might differ a bit. DEVONthink is looking at document content similarity, not at name, modification date, file size or even file type. So a searchable PDF and a text file created from that PDF via Data > Convert > to plain text will be marked as duplicates; yes, they do have the same content. :)
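
Since the flag is fuzzy, one possible way to tell truly identical items from merely similar ones is to compare checksums outside DEVONthink. The sketch below is only an illustration under stated assumptions: it assumes the messages have first been exported to a folder as files, and the function names `sha256_of` and `exact_duplicate_groups` are hypothetical. It uses only the Python standard library and is not a DEVONthink feature.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def exact_duplicate_groups(folder: Path) -> list[list[Path]]:
    """Group files under `folder` whose bytes are exactly identical.

    Only groups with two or more members are returned, so every file
    in the result has at least one byte-for-byte twin.
    """
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling `exact_duplicate_groups(Path("exported_mail"))`, where `exported_mail` is a placeholder folder name, returns lists of files that are safe to treat as redundant copies. Files that differ by even one byte, for example in a header, land in different groups, so a PDF and its plain-text conversion would not be grouped here even though DEVONthink marks them as duplicates.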

Thanks again for taking time to post that. It’s very helpful. Is there any way to stop the “duplicates” from appearing in Duplicates? I’d like to add emails in lots of 1,000 and investigate the duplicates, but stop the ones I need to keep from appearing in the Duplicates folder.

Bill

No, the Duplicates smart group will include all items that are considered (by DEVONthink) to be duplicates.

Hello All,

I am a brand new noobie, really, but I face EXACTLY the OPPOSITE situation.
I populated a folder in a database by dragging and dropping emails, found by searching on various criteria, directly from Apple Mail. So I have about 500 entries (all .eml files), with a large percentage of IDENTICAL records.

However, DEVONthink Pro v2.0.3 seems unable to detect any duplicates. (Most likely, I am unable to coerce DTP into helping me identify and erase them…)

Is there a simple solution to this?

Thanks a bunch

Tom