Find Duplicates

sjk · May 23, 2009, 9:55pm

Can you briefly list some or all of those approximation techniques? Thanks!

ntahall · May 23, 2009, 11:15pm

Thanks, I will gather some examples and e-mail them to you.

Could you please explain what the techniques are so that I can understand how much faith I should put into it? It seems already that I can’t trust it for completed PDF forms and possibly some other documents.

annard · May 24, 2009, 6:36am

I’m afraid I can’t do that because I don’t have that knowledge. Christian does but he’s offline to take care of his baby for the next 2 weeks.

In the case of emails it’s either the unique message-id or the SHA-1 hash of the complete message contents.
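As a rough illustration of that rule (a sketch only, not DEVONthink's actual code), an email duplicate key could be derived like this in Python. The function name `email_duplicate_key` is hypothetical, and the order of preference (Message-ID first, SHA-1 of the raw message as a fallback) is an assumption.

```python
import hashlib
from email import message_from_bytes


def email_duplicate_key(raw_message: bytes) -> str:
    """Return a key that is equal for two emails treated as duplicates.

    Assumption: prefer the Message-ID header; if it is missing,
    fall back to the SHA-1 hash of the complete raw message.
    """
    msg = message_from_bytes(raw_message)
    message_id = msg.get("Message-ID")
    if message_id:
        return "id:" + message_id.strip()
    return "sha1:" + hashlib.sha1(raw_message).hexdigest()
```

Two messages with the same Message-ID get the same key even if their bodies differ, which is one way unexpected "duplicates" could appear under such a rule.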

ntahall · June 13, 2009, 9:55pm

Sorry about the delay. 2 examples sent to the e-mail address as requested.
Nick

paljan · December 27, 2012, 2:29pm

hi,

I’m currently testing DEVONthink Pro Office and I’m very terrorized because it identifies files as duplicates when they are not.

This issue was reported on this forum in 2009 and seems to be still present…

Thanks in advance for your reply

Bill_DeVille · December 27, 2012, 9:05pm

I’m not terrorized by the fact that the definition of “duplicate” is different in DEVONthink than in the Finder.

On the contrary, I find it useful in a very important way that DEVONthink sees a PDF and a plain text file containing the text of that PDF as duplicates, although that wouldn’t be the case in the Finder.

DEVONthink examines similarities of the textual information content of files. The Finder does not.

In a decision to mark two documents as duplicates of each other, DEVONthink doesn’t consider their filetypes, Names, Creation Date or other metadata. The Finder does, and so would not consider that PDF and that plain text file to be duplicates.

There are artificial intelligence routines at the kernel of a DEVONthink database, from which much of DEVONthink’s power is derived. These AI routines examine the contextual relationships among words, and their frequencies of use.

When DEVONthink determined that the PDF and the plain text document were duplicates, it did so using AI analysis that determined that the contextual relationships of the text content were very, very similar.

Similarity is not the same as “exactly the same”. So, in DEVONthink small differences in the text content of two documents marked as duplicates may exist. For example, if I were to type the same text into a series of Word, Pages, plain text and rich text documents and collect them into a database, DEVONthink would mark them as duplicates, even in the case that I had made different typos in each, so that their text contents were not exactly the same.

Or suppose I have a collection of filled-out forms, some or all of which differ by only a few words of text content. DEVONthink might mark some or all of them as duplicates.

As the concept of “duplicate” is sometimes fuzzy in DEVONthink (which can be desirable), it may not be wise to blindly delete duplicates.
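To make the idea of fuzzy, text-based duplicates concrete, here is a minimal Python sketch. It is not DEVONthink's algorithm, which isn't documented in this thread; it swaps in a plain bag-of-words cosine similarity over word frequencies. The function names and the 0.9 threshold are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    # Lowercase and count words; file type and metadata play no role.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of the word-frequency vectors of two texts."""
    ca, cb = word_counts(a), word_counts(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def looks_like_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    # Hypothetical cutoff: "very similar" counts as "duplicate".
    return cosine_similarity(a, b) >= threshold
```

With a measure like this, two filled-out forms that differ only in a name or an amount score close to 1.0, and pictures contribute nothing, which matches the kind of false positives described above.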

paljan · December 27, 2012, 10:47pm

hi Bill,

first of all thank you for your prompt response.

If I understood correctly, what DEVONthink considers “duplicates” is, to me, just “similar”.

I never compared it with the Finder. A “similar” function is also very important in a DMS, but maybe the word “duplicates” is a little misleading.

In the attachment you can find two JPGs of two different receipts that DEVONthink considers duplicates. Honestly, I don’t think they are that similar…

You can imagine the difficulties that I could have managing educational sheets where the words are similar but they have different pictures.

Thank you

Edit: attachment deleted as suggested. If needed, I can send it by email.

korm · December 27, 2012, 11:15pm

It’s a bad idea to post your personal data to a public forum. I suggest you delete that file.

That said - if DEVONthink flags documents as duplicates, but you know by looking at the preview, the name, or other attributes that they are not – then ignore DEVONthink. DEVONthink isn’t going to delete duplicates, move them somewhere, or do anything else to the files unless you command it to do so.

paljan · January 7, 2013, 5:44am

Hi Bill,

any news for me?

Is there a way to identify files that are 100% identical?

Bill_DeVille · January 7, 2013, 6:56am

Your example of files that have very similar text content but different pictures is an example of files that may very well be marked as duplicates by DEVONthink. That hasn’t changed.

The likelihood of such identification of documents that are not exactly identical will depend on database content. If this happens, it would be unwise to simply delete duplicates.
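For anyone who wants to confirm that files are byte-for-byte identical before deleting anything, one option is to check them outside DEVONthink. The Python sketch below is an assumption-laden illustration, not a DEVONthink feature: it groups files by the SHA-256 hash of their contents, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def exact_duplicate_groups(paths):
    """Return lists of paths whose contents hash to the same digest."""
    groups = defaultdict(list)
    for p in paths:
        groups[file_digest(Path(p))].append(Path(p))
    return [g for g in groups.values() if len(g) > 1]
```

Only files in the same returned group have the same contents; two receipts with similar text but different scans would land in different groups.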

My databases contain very few duplicates, and all of them that I’ve checked out are either exact copies of files, or “duplicates” consisting of PDFs and the text files corresponding to the text content of those PDFs, that I created to check OCR accuracy or to edit.

For example, in my Main database which contains some 30,000 documents, a search for instances that are duplicates (using the Advanced button in the Search window) produced 22 results. In every case, these were instances of pairs of files that were of different filetypes, but that contained the same text (or very nearly the same text). An example is a scan of a legal document in which a court reporter’s stamp overlaid a couple of words that might be important in searching the content, so I created a plain text conversion file and corrected the OCR errors. In this database, there’s no issue of “false positives” in DEVONthink’s marking them as duplicates.

But in another database, an archive of old emails, every message sent by a colleague is marked as a duplicate. Why? Because that person sent messages that contained only his name (signature) in the message body content. He preferred writing the message as a Word document that was attached to the email. As DEVONthink doesn’t index the word content of attachments, DEVONthink sees all messages from him as duplicates. I’m not going to argue with DEVONthink about that, nor will I arbitrarily delete any of his messages because they are marked as duplicates.