23921 duplicates

Morning.

I’ve indexed a directory. Now the smart group »Duplicates« shows that this directory contains 23921 duplicates. At first glance, a lot of these files are not duplicates at all; there are three, four, five, or more instances of some of these files in the indexed directory.

Now I would like to keep only one instance of those files and replace all others by aliases.

How can I achieve it?

Thanks in advance for your support!

Kind regards, Friedrich

I don’t know what algorithm DT uses to compare files to determine similarity, but it seems pretty basic and should not be relied upon except as a general indication. As far as I know, the DT duplicate algorithm does not work by generating hashes of the files or some similar technique.

I am sure it might be possible to script something.

My quick-and-dirty approach would be to generate an MD5 hash of each file (using md5sum on the command line) and store it in the Spotlight comment field. Sorting on the Spotlight column would then tell you which files were true duplicates (because they would have the same hash). From there, a script to replace duplicates with replicants shouldn’t be too hard :laughing: .
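The hash-then-group idea above can be sketched in a few lines of Python (a hypothetical sketch, using `hashlib` in place of `md5sum` and skipping the Spotlight-comment step; `find_exact_duplicates` and the directory argument are names I made up):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(root):
    """Group files under `root` by MD5 hash; byte-identical files share a hash."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Keep only hashes shared by more than one file: those are true duplicates.
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

From each group you would keep one file and replace the others with aliases or replicants, but I'd do that step manually or at least with a dry run first.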

Sorry I just don’t have time right now to have a stab at it. It could be really useful.

Frederiko

It’s actually not that basic, as it’s using the AI to determine duplicate status.
The problem with “duplicate” as a term is that it’s not a byte-for-byte duplicate (whereas a Replicant is). Consider a document with 1000 words in it. If you create a copy, it’s a duplicate in the way people imagine it is. But say you change 10 words in the document: it’s still 99% the same, so it’s still seen as a duplicate. The more you edit the document, the more likely the duplicate status will change.

So just because you have “duplicates”, it doesn’t mean extra copies.
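To make the distinction concrete, here is a toy content-based similarity measure. This is not DEVONthink's actual algorithm (which isn't public), just a hypothetical word-overlap ratio showing how two files that differ in a few words can still score as near-duplicates:

```python
def word_overlap(a, b):
    """Fraction of unique words shared between two texts (Jaccard similarity)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# Two texts differing in one word out of ten still score highly,
# so a content-based detector could flag them as "duplicates"
# even though they are not byte-for-byte identical.
```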

Are you sure? Did you inspect the original data in the original folder(s)?

A script might be “easy” (maybe not), but personally I would never trust a script to clean up 24,000 files in the file system or DEVONthink, or anywhere else. I would carefully examine the root cause and work through the problem methodically.

And make backups before changing anything.

@BLUEFROG

Thanks, that clarifies a lot of the confusion I had about duplicates. I do think though that an ‘exact duplicate’ setting would be useful and would probably be more what most people expect.

Frederiko

False duplicates are frustrating, however… DEVONthink flags documents as duplicates based on content, not on byte-for-byte criteria, as Jim mentioned. As one example, I’d rather be informed that I have duplicate content even when one of the duplicate documents is a PDF, another is RTF, and another is HTML. Flagging only documents that are exact duplicates while excluding documents that are otherwise content duplicates would not be of value to me.