Differing files identified as duplicates

MikeP · November 17, 2016, 5:40pm

I’m sorry, but I don’t see how this relates, either to my original question or to my response.

The quote from your blog suggested that I, not the computer, wouldn’t consider a piece of writing with or without a comma to be functionally any different. I provided an example to refute that.

A computer can make an ‘is this identical’ comparison far faster than I can, but DT lacks the capability. I would go so far as to argue that a script to automatically delete “probable” duplicates is dangerous. There’s no way this is a duplicate:

I appreciate the ability to match two documents that are similar, and it’s a great feature. But ‘similar’ and ‘identical’ are two different concepts. I would like to identify documents that are the same.

Bill_DeVille · November 17, 2016, 9:29pm

Even Watson is far from matching human understanding of language.

Christian Grunenberg has done a remarkable job of making DEVONthink’s Classify and See Also assistants useful, even though they don’t reach the level of true semantic analysis. In a database with topically designed groups, Classify becomes quite accurate in suggesting appropriate group location(s) for a new document. That saves me time and effort in filing content into some of my databases that contain hundreds of groups. When I’m exploring a concept, See Also suggestions are often valuable. I really value a suggestion that I hadn’t thought of, as that can result in one of those Eureka! moments.

Mike, I understand your frustration. But to those DEVONthink algorithms, the three document files noted in a previous post are duplicates, although they differ in filetype, file size and date. And yes, those two PDFs in your screenshot are duplicates. Convert them to plain text and compare the text files, which should turn out to be highly similar – perhaps even identical, if the difference between the two PDFs was non-text, such as an image included in one but not the other.

You are correct that it could be dangerous to automatically delete files based on their identification as duplicates.

MikeP · November 17, 2016, 11:18pm

OK, I honestly didn’t have any idea the concept could be this misunderstood, so let me spell it out (again, no sarcasm intended):

By duplicate, I understand identical. This is the way the term is used and interpreted in the majority of cases.

If two differing files are flagged up as “duplicate”, this to me is wrong. I have ‘real’ and ‘fake’ duplicates in my DB, and it irks me that I have no way of cleansing the ‘real’ ones without reviewing them by hand.

I absolutely understand the matching assistants, and think they are a superb function and a top feature of DT. However, they serve an entirely different purpose: “close match” is not “identical”.

The matching assistant will help me identify things like duplicate scans, which is a brilliant feature of DT. And I use Classify almost daily; it’s a top-notch tool!
But I also know that I have a bunch of identical files (multiple downloads, or downloaded and emailed, or various other reasons).
And there is no way of filtering so that I can eliminate “real” duplicates without having to open and compare files by hand. We have computers to do the comparing for us, so let’s use them!
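
For files that live on disk, the “real duplicate” check asked for here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how DT detects duplicates: the helper names `file_digest` and `find_identical_files` are made up, and two files count as identical only when their SHA-256 digests match.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_identical_files(folder):
    """Group files under `folder` that share the same content digest.

    Hypothetical helper: returns a list of groups (lists of Paths),
    each holding two or more files with matching SHA-256 digests.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    return [group for group in by_digest.values() if len(group) > 1]
```

Files that differ by a single comma land in different groups, so anything this reports could be cleaned up without opening the files by hand.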

MikeP · November 17, 2016, 11:28pm

OK, on reflection, here’s my suggestion:

Make a distinction between ‘probable match’ or ‘content match’, and ‘identical’.

Where a match is not a certainty, allow the user to unlink the duplicates so that they don’t appear in the duplicate lists for ever more, as they currently do.

Gzk · November 17, 2016, 11:56pm

If I may interfere with a more practical question: Is there any way to undo/remove the duplicate status on files when there is no identity between them?

Bill_DeVille · November 18, 2016, 12:33am

No.

But there can be indicators of files whose content is really identical. If the filetype, file size and modification date of two files marked as duplicates are the same, they are likely identical.

If you select a document and choose Data > Duplicate, the duplicate will have copy appended to the Name. The creation and modification dates, filetype and file size will be the same. They are identical.

If you select a document marked as a duplicate and press Shift-Command-I to open its Info panel, you will see a field stating the number of duplicates. Click on that and the Names of the other duplicates will be displayed, and clicking on one reveals it in DEVONthink.
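
Matching size and date are only indicators. If the two files are available on disk, a byte-for-byte comparison settles the question. A minimal sketch, assuming plain file paths and a hypothetical helper name `is_exact_copy` (this is not a DEVONthink feature):

```python
import filecmp
import os


def is_exact_copy(path_a, path_b):
    """Return True only if both files have identical contents."""
    # Cheap check first: files of different sizes cannot be identical.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    # shallow=False forces a comparison of the actual file contents
    # rather than relying on size and modification time alone.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

The size check is just a shortcut; the content comparison is what decides the result.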

MikeP · November 18, 2016, 12:56am

Gzk and I are talking about the exact opposite: Disconnecting ‘duplicates’ that are different.

Real-life example: I have two copies of a form filled in with different content. DT tells me they are duplicates. I want to tell DT to stop flagging them as duplicates. They aren’t. Never will be.

Gzk · November 18, 2016, 9:56am

@Bill_DeVille: Thanks for your swift reply. As MikeP remarked, I also have false duplicates. Actually they couldn’t be more dissimilar. In fact, it’s a bunch of them (52), in some cases ‘triplicates’ of PDFs made from camera shots that could not be OCR’d in a meaningful way (maps, drawings, etc). But they share one page or some text that indeed is common to all three, i.e. the TOC of a collection of articles. I need that TOC in all cases.

However, I am not addressing the coding / programming side of the issue. I just want to get rid of the confusing ‘duplicate’ badge on those dissimilar files. In the meantime, I am using the solution proposed here: Invalid duplicates. But that only helps with cleanup tasks in the Duplicates smart folder.

OogieM · November 21, 2016, 3:13pm

FWIW I’m getting far more false duplicates as well. In my case it’s SQL queries: they are very similar, but I too would like to unlink them as they are not actually duplicates. I agree that there needs to be some distinction between duplicate and similar items.

I have to admit that I never use the AI features of DT to sort, locate or file stuff so for me the AI nature of finding stuff that is similar isn’t really useful. I do however need to clearly know what is an exact copy and find those items. Especially since DT still on occasion duplicates nearly everything itself as part of some sync glitch.

BLUEFROG · November 22, 2016, 12:43am

Have you started a Support Ticket on this? I have seen no report of this.

OogieM · November 22, 2016, 3:12pm

Not recently, I just notice when it occurs and have been clearing the dups manually. It’s so intermittent that the last time I did report it Support couldn’t do anything. Seems to occur about once every couple of months or so. I will do a support ticket next time if that will help. I’m getting close to when it’s likely to occur again based on timing.

carlcasca · November 27, 2016, 11:03pm

Yikes, and thanks to the OP for pointing out this issue. I had always assumed duplicates meant exact duplicates, not just very similar. I guess there are cases when something as small as an extra space or a comma might not matter, but there are cases where a very tiny difference matters a lot. I use DTPO as a lab notebook. I often store data runs that will have 50000-100000 identical values except perhaps one difference in a single digit. That difference matters a lot. Both versions are equally important. I’d suggest that if it’s too cpu expensive to identify things by whether or not they are true duplicates, then the term should be changed. “Similar,” “Potential Duplicates,” something else? In any case, DTPO shouldn’t use the term duplicate for what it’s doing because that word has an expected meaning to a user, and DTPO is not following that convention. I’m a long-time user and really do appreciate how nice DTPO has become over the years. The OP has an excellent point.
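
The data-run case is easy to reproduce. The sketch below uses made-up values and a crude position-by-position similarity score (an assumption for illustration, not DT’s algorithm) to show two runs that are almost entirely the same yet are not duplicates in the strict sense.

```python
import hashlib

# Two hypothetical data runs of 2000 values that differ in one digit.
run_a = [f"{i * 0.5:.1f}" for i in range(2000)]
run_b = list(run_a)
run_b[1234] = "617.1"  # the original value at this position is "617.0"

# Crude similarity: fraction of positions holding the same value.
matching = sum(a == b for a, b in zip(run_a, run_b))
similarity = matching / len(run_a)
differing = len(run_a) - matching

# Strict identity: compare digests of the serialized runs.
digest_a = hashlib.sha256("\n".join(run_a).encode()).hexdigest()
digest_b = hashlib.sha256("\n".join(run_b).encode()).hexdigest()
identical = digest_a == digest_b

print(f"similarity: {similarity:.4f}, differing values: {differing}")
print(f"identical: {identical}")
```

A similarity-based matcher would reasonably group these two runs together, while an exact comparison keeps them apart, which is the distinction a label such as “Similar” or “Potential Duplicates” would make visible.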

daniel1113 · November 29, 2016, 3:29pm

Agreed. It boggles my mind that this is even up for dispute. As implemented, the feature is confusing because it goes against well entrenched conventions and user expectations. Change the description/name to “near duplicates” or the like and all will be well.

Shoolie · November 30, 2016, 3:59am

I would find it helpful if the DEVON team could explain why DT declares not-identically-duplicate records as duplicates – is it a side effect of some aspect of the AI that is triggered by certain data or threshold of “sameness,” is there some use case where this behavior is useful, is there some other explanation?

OogieM · January 8, 2017, 6:33pm

Just happened again and a support ticket has been sent in.

tgunr · November 30, 2020, 2:10pm

Could not agree more! The dictionary even points it out: duplicate items are NOT “similar”.

du·pli·cate
adjective | ˈd(j)upləkət | [attributive]
1 exactly like something else, especially through having been copied: a duplicate license is issued to replace a valid license which has been lost.
2 technical having two corresponding or identical parts: a duplicate application form.

noun | ˈd(j)upləkət |
1 one of two or more identical things: books may be disposed of if they are duplicates.
• a copy of an original: locksmiths can make duplicates of most keys.
2 short for duplicate bridge.
3 archaic a pawnbroker’s ticket.

verb | ˈd(j)upləˌkeɪt | [with object]
make or be an exact copy of: they have not been able to duplicate his successes | a unique scent, impossible to duplicate or forget.
• make or supply copies of (a document): information sheets had to be typed and duplicated | (as adjective duplicating) : a duplicating machine.
• multiply by two; double: the normal amount of DNA has been duplicated thousands of times.
• do (something) again unnecessarily

sim·i·lar | ˈsɪm(ə)lər |
adjective
resembling without being identical: a soft cheese similar to Brie | northern India and similar areas.
• Geometry (of geometric figures) having the same shape, with the same angles and proportions, though not necessarily of the same size.
noun
1 mainly archaic a person or thing similar to another: he was one of those whose similar you never meet.
2 (usually similars) a substance that produces effects resembling the symptoms of particular diseases (the basis of homeopathic treatment): the principle of treatment by similars.

BLUEFROG · November 30, 2020, 3:20pm

The flexibility of language allows for nuanced meaning in terms. For example, I can say, “I love that woman!” and “Man, I love steak!” and it’s understood I am not romantically interested in the steak but I am in the woman.

We have to choose words that are commonly known, and try not to invent our own terms or lean on contrived terms like “Kinda Similar”.

Also, in Preferences > General, you can enable the Stricter recognition of duplicates option to use the file type and size when detecting duplicates.

konterbande · November 30, 2020, 7:35pm

I love your simile, in any non romantic way.

BLUEFROG · November 30, 2020, 8:24pm

Hahaha!
/blush