What's Considered a Duplicate?

I made a PDF of a Word document, then brought both the PDF and the DOCX file into DEVONthink, where they both show as duplicates of each other. In one sense they are duplicates, in that when viewed, their text content is the same and their appearance is similar. However, they have different file sizes and certainly differ in their MD5 hash or CRC checksums. What is the basis for their being shown as duplicates?

The file size and checksums are not considered. The content is the same, therefore they’re considered duplicates.
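Roughly, the distinction works like the sketch below (a hypothetical Python illustration, not DEVONthink’s actual code): the byte-level checksum depends on the container format, while a fingerprint of the extracted, normalized text is the same for the DOCX and the PDF.

```python
import hashlib

def file_md5(path: str) -> str:
    """Byte-level MD5 of the file on disk -- differs between the DOCX and the PDF."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def content_fingerprint(text: str) -> str:
    """Hash of the normalized extracted text -- identical for both containers."""
    normalized = " ".join(text.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Pretend these strings came from each format's text extractor;
# whitespace differences introduced by extraction are normalized away.
docx_text = "Quarterly report,  final draft."
pdf_text  = "Quarterly report, final  draft."

print(content_fingerprint(docx_text) == content_fingerprint(pdf_text))  # True
# file_md5("report.docx") == file_md5("report.pdf") would be False.
```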

What is the best way to create a search (or something) that shows duplicates while taking into account that the files actually differ? In my case, DT marks pairs of Adobe Illustrator files and their exported PDF versions as duplicates even though they functionally aren’t: for interoperability with another unit, the PDFs are saved with different settings (specifically, the “Preserve Illustrator Editing functions” option is not enabled and a different compatibility setting is used).

Enabling the strict recognition of duplicates (see Preferences > General > General) should be sufficient to avoid such duplicates.

Thanks. That got rid of most of them. I’m going to dig deeper to see what the commonality is with the remaining not-quite-duplicate-duplicates.

Do you use the current version 3.9.1?

Yes.

Stricter recognition uses the file type, file size, and content hash to detect duplicates. This should report actual duplicates of a file, not close matches.
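As a hypothetical illustration of that stricter rule (not DEVONthink’s internal implementation, and assuming the content hash is taken over the raw bytes), grouping files by the triple of type, size, and hash only ever matches byte-identical files of the same kind:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def strict_key(path: Path) -> tuple[str, int, str]:
    """Composite key: file type (extension), size in bytes, and content hash."""
    data = path.read_bytes()
    return (path.suffix.lower(), len(data), hashlib.sha256(data).hexdigest())

def find_strict_duplicates(folder: str) -> list[list[Path]]:
    """Return groups of files that agree on all three criteria."""
    groups: defaultdict[tuple, list[Path]] = defaultdict(list)
    for p in Path(folder).rglob("*"):
        if p.is_file():
            groups[strict_key(p)].append(p)
    return [paths for paths in groups.values() if len(paths) > 1]

# Under such a key, an .ai file and its exported .pdf could never be grouped:
# the extension, size, and bytes all differ.
```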

Here is an example of two files showing as dupes when they are not, with stricter checking enabled:

The other items reported as duplicates of this file are no longer displayed.

Is this also the case after importing these files into a new database?

Well, in fact a bunch of new, valid duplicates appeared when I used the Index Files and Folders command to add the folder to a new DT database. In the end, I removed the folder entirely from the original database, used an external tool to deal with the duplicates, added it back, and things seem to be working as expected.
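The external tool isn’t named here; a minimal stand-in for that cleanup pass might look like the following, which only reports exact byte-level duplicates in a folder (the path is a placeholder) so they can be reviewed before re-indexing:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def report_exact_duplicates(folder: str) -> None:
    """Group files by size first, then hash only the size collisions."""
    by_size: defaultdict[int, list[Path]] = defaultdict(list)
    for p in Path(folder).rglob("*"):
        if p.is_file():
            by_size[p.stat().st_size].append(p)

    for size, candidates in by_size.items():
        if len(candidates) < 2:
            continue
        by_hash: defaultdict[str, list[Path]] = defaultdict(list)
        for p in candidates:
            by_hash[hashlib.sha256(p.read_bytes()).hexdigest()].append(p)
        for paths in by_hash.values():
            if len(paths) > 1:
                print(f"{len(paths)} identical files ({size} bytes):")
                for p in paths:
                    print(f"  {p}")

# report_exact_duplicates("/path/to/indexed/folder")  # hypothetical path
```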