Differing files identified as duplicates

MikeP · November 16, 2016, 8:45pm

Hi,

Please, please, please could DT determine duplicates properly and reliably?

I reported this as a bug ticket (#983249) a long time ago, and the response was that you’ll “try to improve the recognition of duplicates and/or will add an option to apply a strict comparison to future releases”.

Today, over two years on, it still thinks files with a different date, time, and content are duplicates. It seems that a similar enough structure is all it takes. I’ve even had it tell me bank statements from different months are the same.
Just the basics: if the file size is different, it’s probably not the same. Surely it can’t be that difficult?

Thanks,
Mike

BLUEFROG · November 17, 2016, 2:13pm

blog.devontechnologies.com/2014/ … evonthink/

MikeP · November 17, 2016, 2:59pm

I don’t mean to sound sarcastic, but what’s your point?

The blog you posted confirms what I just said:
One last note: A duplicate is not necessarily a byte-for-byte duplicate but can also be a “close match”.

This is precisely the problem.

Interestingly your blog suggests:
“Think about it – if the only difference in two things you’ve read is a single comma or sentence, you’d functionally consider them the same.”

It’s time to eat, grandma.
It’s time to eat grandma.

Yup, functionally the same, just the food that differs.

BLUEFROG · November 17, 2016, 4:30pm

This is a computer we’re talking about, not a person.

“,eat it’s grandma. to time” contains the EXACT same information as “It’s time to eat, grandma.”
The computer is NOT saying to itself, “Well, wait… what does this MEAN?? Hmmm…”

Yes, syntactical order can be used (as evidenced by the BEFORE, AFTER, and NEAR operators), but it still does NOT derive MEANING from the context. Even natural language entry apps, like Fantastical, don’t recognize MEANING.
And, especially with English, recognizing meaning is fraught with difficulties that confound even foreigners (and we are FAR, FAR smarter than computers).

PS: non-alphanumeric characters are not indexed, so your comma is superfluous to the AI and search function.
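
To make the word-bag point concrete, here is a minimal Python sketch. It is an illustration only and not DEVONthink's indexing code; `bag_of_words` is a hypothetical helper that assumes the index keeps lowercase alphanumeric tokens and ignores their order.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase alphanumeric tokens; punctuation and word order are discarded."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


with_comma = bag_of_words("It’s time to eat, grandma.")
without_comma = bag_of_words("It’s time to eat grandma.")
scrambled = bag_of_words(",eat it’s grandma. to time")

# All three sentences reduce to the same multiset of tokens.
all_equal = with_comma == without_comma == scrambled
```

Under these assumptions the comma and the word order leave no trace, so the three sentences cannot be told apart by token counts alone.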

Bill_DeVille · November 17, 2016, 5:18pm

Duplicate in DEVONthink doesn’t mean the same thing as duplicate in the Finder.

Suppose I convert a Formatted Note document first to plain text, then also to rich text. I now have three documents that are identified as duplicates. They have different dates and file sizes. But they all have the same text content so are highly similar.

This concept of similarity is fundamental to the artificial intelligence algorithms that are at the kernel of DEVONthink databases. In the example of the three files above, all three would be treated as the same by the Classify and the See Also assistants, that is, with the same ranking based on words, their frequencies, and their associations.

Similarity is a sliding scale as documents start to diverge in content. If I begin to edit one of the above three documents, it will retain its identification as a duplicate so long as the changes are relatively minor, and will at some point lose that identification as changes continue to be made.
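
The sliding scale can be sketched with a simple word-frequency comparison. This is a hedged illustration, not DEVONthink's algorithm: cosine similarity over word counts is a stand-in technique, and the `threshold` of 0.9 is an arbitrary, hypothetical cut-off.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Lowercase alphanumeric tokens and their frequencies."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-frequency vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def looks_like_duplicate(x: str, y: str, threshold: float = 0.9) -> bool:
    """Hypothetical rule: call two texts duplicates once similarity reaches the threshold."""
    return cosine_similarity(word_counts(x), word_counts(y)) >= threshold
```

With a rule like this, a small edit keeps the score above the threshold, and continued edits eventually push it below.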

I often find it useful to convert a PDF that contains text recognition errors to a plain text document that’s identified as a duplicate of the PDF. If a name was incorrectly recognized, I may correct it in the plain text document, so that a search for that name will work. Assuming that’s the only change, the plain text document’s Info panel will still show the PDF as a duplicate, which is useful.

Quite often, I want to file a document into more than one group in a database. I could do that by duplicating the document, but DEVONthink offers a much better alternative, replicating the document to more than one location. Replicants are multiple instances of the same document, not copies of the document. Edit any instance of the document and all instances of it reflect the change. Moreover, only a few bits of storage overhead result from a new replicant, so this approach conserves file size.

MikeP · November 17, 2016, 5:40pm

I’m sorry, but I don’t see how this relates, either to my original question or to my response.

The quote from your blog suggested that I, not the computer, wouldn’t consider a piece of writing with or without a comma functionally any different. I provided an example to refute that.

A computer can make an ‘is this identical’ comparison far faster than I can, but DT lacks the capability. I would go so far as to argue that a script to automatically delete “probable” duplicates is dangerous. There’s no way this is a duplicate:

I appreciate the ability to match two documents that are similar, and it’s a great feature. But ‘similar’ and ‘identical’ are two different concepts. I would like to identify documents that are the same.

Bill_DeVille · November 17, 2016, 9:29pm

Even Watson is far from matching human understanding of language.

Christian Grunenberg has done a remarkable job of making DEVONthink’s Classify and See Also assistants useful, even though they don’t reach the level of true semantic analysis. In a database with topically designed groups, Classify becomes quite accurate in suggesting appropriate group location(s) for a new document. That saves me time and effort in filing content into some of my databases that contain hundreds of groups. When I’m exploring a concept, See Also suggestions are often valuable. I really value a suggestion that I hadn’t thought of, as that can result in one of those Eureka! moments.

Mike, I understand your frustration. But to those DEVONthink algorithms, the three document files noted in a previous post are duplicates, although they differ in filetype, file size and date. And yes, those two PDFs in your screenshot are duplicates. Convert them to plain text and compare the text files, which should turn out to be highly similar – perhaps even identical, if the difference between the two PDFs was non-text, such as an image included in one but not the other.

You are correct that it could be dangerous to automatically delete files based on their identification as duplicates.

MikeP · November 17, 2016, 11:18pm

OK, I honestly didn’t have any idea the concept could be this misunderstood, so let me spell it out (again, no sarcasm intended):

By duplicate, I understand identical. This is the way the term is used and interpreted in the majority of cases.

If two differing files are flagged up as “duplicate”, this to me is wrong. I have ‘real’ and ‘fake’ duplicates in my DB, and it irks me that I have no way of cleansing the ‘real’ ones without reviewing them by hand.

I absolutely understand the matching assistants, and think they are a superb function and a top feature of DT. However, they serve an entirely different purpose: “close match” is not “identical”.

The matching assistant will help me identify things like duplicate scans, which is a brilliant feature of DT. And I use Classify almost daily; it’s a top-notch tool!
But I also know that I have a bunch of identical files (multiple downloads, or downloaded and emailed, or various other reasons).
And there is no way of filtering so that I can eliminate “real” duplicates without having to open and compare files by hand. We have computers to do the comparing for us, so let’s use them!
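
For illustration, a strict comparison of this kind could look like the following Python sketch. It runs on ordinary files on disk and is not a DEVONthink feature or script; `find_identical_files` and `sha256_of` are hypothetical names. Files are first bucketed by size, and only same-size files are hashed.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file's bytes in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical_files(paths):
    """Return groups (lists of Paths) whose contents are byte-for-byte the same."""
    by_size = defaultdict(list)
    for path in paths:
        path = Path(path)
        by_size[path.stat().st_size].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have an identical twin
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[sha256_of(path)].append(path)
        groups.extend(group for group in by_hash.values() if len(group) > 1)
    return groups
```

The size check is cheap and rules out most candidates before any file is read, which is the "if the file size is different, it's probably not the same" shortcut from the first post.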

MikeP · November 17, 2016, 11:28pm

OK, on reflection, here’s my suggestion:

Make a distinction between ‘probable match’ or ‘content match’, and ‘identical’.

Where a match is not a certainty, allow the user to unlink the duplicates so that they don’t appear in the duplicate lists forevermore, as they currently do.

Gzk · November 17, 2016, 11:56pm

If I may interject with a more practical question: Is there any way to undo/remove the duplicate status on files when they are not actually identical?

Bill_DeVille · November 18, 2016, 12:33am

No.

But there can be indicators of files whose content is really identical. If the filetype, file size and modification date of two files marked as duplicates are the same, they are likely identical.
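
As a sketch of that manual check, the Python below compares the three indicators for two files on disk and then confirms with a byte-for-byte comparison. It is an illustration outside DEVONthink; `metadata_matches` and `confirmed_identical` are hypothetical helper names, and the file extension stands in for the filetype.

```python
import filecmp
from pathlib import Path


def metadata_matches(a, b) -> bool:
    """True if extension, size and modification time (to the second) all agree."""
    stat_a, stat_b = Path(a).stat(), Path(b).stat()
    return (
        Path(a).suffix.lower() == Path(b).suffix.lower()
        and stat_a.st_size == stat_b.st_size
        and int(stat_a.st_mtime) == int(stat_b.st_mtime)
    )


def confirmed_identical(a, b) -> bool:
    """True only if the two files have exactly the same bytes."""
    # shallow=False forces filecmp to read and compare the contents.
    return filecmp.cmp(a, b, shallow=False)
```

Matching metadata only makes identity likely. Two files can share type, size and date and still differ in content, so the content comparison is what settles it.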

If you select a document and choose Data > Duplicate, the duplicate will have “copy” appended to the Name. The creation and modification dates, filetype and file size will be the same. They are identical.

If you select a document marked as a duplicate and press Shift-Command-I to open its Info panel, you will see a field stating the number of duplicates. Click on that and the Names of the other duplicates will be displayed, and clicking on one reveals it in DEVONthink.

MikeP · November 18, 2016, 12:56am

Gzk and I are talking about the exact opposite: Disconnecting ‘duplicates’ that are different.

Real-life example: I have two copies of a form filled in with different content. DT tells me they are duplicates. I want to tell DT to stop flagging them as duplicates. They aren’t. Never will be.

Gzk · November 18, 2016, 9:56am

@Bill_DeVille: Thanks for your swift reply. As MikeP remarked, I also have false duplicates. Actually they couldn’t be more dissimilar. In fact, there is a whole bunch of them (52), in some cases ‘triplicates’, of PDFs made from camera shots that could not be OCR’d in a meaningful way (maps, drawings, etc.). But they share one page or some text that is indeed common to all three, e.g. the TOC of a collection of articles. I need that TOC in all cases.

However, I am not addressing the coding / programming side of the issue. I just want to get rid of the confusing ‘duplicate’ badge on those dissimilar files. In the meantime, I am using the solution proposed in the thread “Invalid duplicates”. But that only helps with cleanup tasks in the Duplicates smart folder.

OogieM · November 21, 2016, 3:13pm

FWIW I’m getting far more false duplicates as well. In my case it’s SQL queries: they are very similar, but I too would like to unlink them as they are not actually duplicates. I agree that there needs to be some distinction between duplicate and similar items.

I have to admit that I never use the AI features of DT to sort, locate or file stuff so for me the AI nature of finding stuff that is similar isn’t really useful. I do however need to clearly know what is an exact copy and find those items. Especially since DT still on occasion duplicates nearly everything itself as part of some sync glitch.

BLUEFROG · November 22, 2016, 12:43am

Have you started a Support Ticket on this? I have seen no report of this.

OogieM · November 22, 2016, 3:12pm

Not recently. I just notice when it occurs and have been clearing the dups manually. It’s so intermittent that the last time I did report it, Support couldn’t do anything. It seems to occur about once every couple of months or so. I will open a support ticket next time if that will help. Based on the timing, I’m getting close to when it’s likely to occur again.

carlcasca · November 27, 2016, 11:03pm

Yikes, and thanks to the OP for pointing out this issue. I had always assumed duplicates meant exact duplicates, not just very similar. I guess there are cases when something as small as an extra space or a comma might not matter, but there are cases where a very tiny difference matters a lot. I use DTPO as a lab notebook. I often store data runs that will have 50000-100000 identical values except perhaps one difference in a single digit. That difference matters a lot. Both versions are equally important. I’d suggest that if it’s too cpu expensive to identify things by whether or not they are true duplicates, then the term should be changed. “Similar,” “Potential Duplicates,” something else? In any case, DTPO shouldn’t use the term duplicate for what it’s doing because that word has an expected meaning to a user, and DTPO is not following that convention. I’m a long-time user and really do appreciate how nice DTPO has become over the years. The OP has an excellent point.
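
The data-run case can be illustrated with a small Python sketch. The values are made up for the example and the word-count cosine score is only a stand-in for a "close match" measure, not DTPO's actual method. Two runs of 50,000 values differ in a single digit.

```python
import hashlib
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-frequency vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical data runs: identical except for one value in the middle.
run_a = ["1.000"] * 50_000
run_b = list(run_a)
run_b[25_000] = "1.001"

text_a = "\n".join(run_a)
text_b = "\n".join(run_b)

# Similarity by token frequency is almost exactly 1.
similarity = cosine_similarity(Counter(text_a.split()), Counter(text_b.split()))

# A content hash still distinguishes the two runs.
digest_a = hashlib.sha256(text_a.encode("utf-8")).hexdigest()
digest_b = hashlib.sha256(text_b.encode("utf-8")).hexdigest()
byte_identical = digest_a == digest_b
```

Any threshold loose enough to tolerate a stray comma would also treat these two runs as the same, while the hash comparison keeps them apart.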

daniel1113 · November 29, 2016, 3:29pm

Agreed. It boggles my mind that this is even up for dispute. As implemented, the feature is confusing because it goes against well entrenched conventions and user expectations. Change the description/name to “near duplicates” or the like and all will be well.

Shoolie · November 30, 2016, 3:59am

I would find it helpful if the DEVON team could explain why DT declares not-identically-duplicate records as duplicates – is it a side effect of some aspect of the AI that is triggered by certain data or threshold of “sameness,” is there some use case where this behavior is useful, is there some other explanation?

OogieM · January 8, 2017, 6:33pm

Just happened again and a support ticket has been sent in.