Duplicates

DCBerk · November 10, 2015, 12:36am

I check my Duplicates Smart Folder from time to time and eliminate unnecessary redundancy. I first look at both files to check if they really are duplicates, and if so, I check the location in the file path to decide which one to delete.

However, from time to time DT marks something as a duplicate that isn’t – similar, but not really the same. Is there any way to correct this – i.e., unmark it so it isn’t seen as a duplicate anymore?

Thanks,
DCB

BLUEFROG · November 10, 2015, 2:51am

Nope… blog.devontechnologies.com/2014/ … evonthink/

DCBerk · November 10, 2015, 2:59am

Succinct. Quick. Thanks.

Smitty · November 29, 2015, 9:45pm

Hi BLUEFROG. Timely. I tried the suggestions in the link and they did not work.

I have imported a number of canceled checks. They have different check numbers, but DTPO sees them as duplicates. I tried annotating but that didn’t change anything. Would appreciate any suggestions.

Thnx!

Smitty

BLUEFROG · November 30, 2015, 1:39am

This would not be a sufficient difference to unmark them as duplicates. If one word of 100 is different, that’s an apparent duplicate. Similarly, ten of 100 would likely be too.

Annotating doesn’t change the contents of the document.
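To make the "one word of 100" point concrete, here is a minimal sketch of a word-level similarity check. It is not DEVONthink's actual algorithm, which this thread does not describe; the `THRESHOLD` value and the function names are assumptions chosen purely for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical cutoff for illustration only; the real threshold is not
# documented anywhere in this thread.
THRESHOLD = 0.85


def word_similarity(text_a: str, text_b: str) -> float:
    """Return a score from 0 to 1 based on matching runs of words."""
    return SequenceMatcher(None, text_a.split(), text_b.split()).ratio()


def looks_like_duplicate(text_a: str, text_b: str,
                         threshold: float = THRESHOLD) -> bool:
    """Treat two texts as apparent duplicates if they are similar enough."""
    return word_similarity(text_a, text_b) >= threshold


# Build a 100-word document, then variants with 1 and 10 words changed.
words = [f"word{i}" for i in range(100)]
original = " ".join(words)
one_changed = " ".join(
    "changed" if i == 50 else w for i, w in enumerate(words)
)
ten_changed = " ".join(
    f"changed{i}" if i % 10 == 0 else w for i, w in enumerate(words)
)

print(word_similarity(original, one_changed))   # 99 of 100 words match
print(word_similarity(original, ten_changed))   # 90 of 100 words match
print(looks_like_duplicate(original, ten_changed))
```

Under this assumed cutoff, both variants still count as duplicates, which is the behavior described above for two checks that differ only in the check number.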

DCBerk · November 30, 2015, 3:37am

Hi Bluefrog. I understand you are informing us that this is how the software works. However, you seem to be dismissing this as an issue – perhaps the discussion belongs in suggestions.

I have two graphic files that have identical layouts – same logo on each, but one has a solid color cartouche around it. To a human, they are instantly recognized as different – 1 word, or 10 words in a 100 is not a factor.

Perhaps I shouldn’t speak for Smitty, but I think we’d both like to have the option to deselect the duplicate designation when it’s incorrect – basically the ability to tell the DB “no, these are not duplicates, you’ve made a mistake”.

Of course I don’t know what it would take to accomplish this from the DT perspective, but to a user it would seem to be no more than a keystroke or the ability to correct it in the Info panel.

Just sayin’,
June

FROBGOBLIN · November 30, 2015, 4:47am

My guess would be that every re-indexing would find it as a duplicate. It might be best, perhaps as a workaround, to just flag it, tag it, or put something in the name to distinguish it from the others. Change the smart folder settings to filter those out, and they won’t appear.

In theory, it seems fine to let us manually inform the software if something really is or isn’t a replica / duplicate. In practice, though, if it would require a lot of back end work, it seems better to just let it go, and stick to the workarounds for the automation. I suppose only the developers know if the man hours would be a good investment or not.
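The tag-and-filter workaround amounts to the logic below. This is only a sketch of the idea in plain Python, not DEVONthink's smart folder syntax or scripting API; the `Item` class, the `not-duplicate` tag name, and the file names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical tag a user would apply by hand to false positives.
EXCLUDE_TAG = "not-duplicate"


@dataclass
class Item:
    name: str
    is_duplicate: bool            # what the software reports
    tags: set = field(default_factory=set)


def duplicates_to_review(items):
    """Keep reported duplicates, minus anything the user has tagged."""
    return [
        item for item in items
        if item.is_duplicate and EXCLUDE_TAG not in item.tags
    ]


items = [
    Item("check-1001.pdf", True, {EXCLUDE_TAG}),
    Item("check-1002.pdf", True, {EXCLUDE_TAG}),
    Item("report-copy.pdf", True),
    Item("notes.rtf", False),
]

for item in duplicates_to_review(items):
    print(item.name)
```

The software's own duplicate flag is left untouched; the filter simply hides items the user has already judged to be distinct, so a later re-indexing does not undo the decision.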

BLUEFROG · November 30, 2015, 5:33am

Exactly. A computer is decidedly not human. Also, you are talking about a visual appearance. Duplicates are gauged by textual content, not graphic content. These are two very different things.

If I typeset a book in Franklin Gothic, then make a copy - changing the font to Garamond Condensed, the book is still textually a duplicate. Even if I put different cover art on them - they are still textually duplicates.

Duplicate image detection is not text-based and is very different technologically than what DEVONthink is doing.

DCBerk · November 30, 2015, 5:43am

There is some comfort in that.

I have not yet encountered this, but are you saying that if I have a database with several images that are only slightly different, the database will see them all as duplicates?

And what if I wanted to distinguish them? Instead of leaving them as jpgs would I have to put each in an .rtf file and add a caption?

June

Smitty · November 30, 2015, 11:53am

Thanks for the input y’all. I agree, if it were not a development nightmare, having a way to tell DTPO that an item was not a duplicate–and maybe through the AI over time, it would learn that the given type of item was not a duplicate (or could ask)–that would be great. In the meantime, short of any other ideas, I guess I will have to set up a keyword. Thanks again for the support and ideas.

BLUEFROG · November 30, 2015, 1:40pm

Running a quick test by adjusting the color of a PNG file, it appears that pure raster images may be doing a byte comparison.
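For comparison, a byte-level check behaves quite differently from a text-similarity check: any change at all makes the files distinct. The sketch below illustrates that idea with a SHA-256 digest. Whether DEVONthink uses a hash, a direct comparison, or something else for raster images is not established here, and the sample byte values are made-up stand-ins for image data.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


def same_bytes(a: bytes, b: bytes) -> bool:
    """True only if both inputs hash to the same digest."""
    return content_digest(a) == content_digest(b)


# Made-up stand-ins for raw pixel data, not a real PNG.
original = bytes([10, 20, 30, 40])
exact_copy = bytes(original)
recolored = bytes([10, 20, 31, 40])  # a single value nudged

print(same_bytes(original, exact_copy))
print(same_bytes(original, recolored))
```

With this kind of comparison, two images that look nearly identical to a person would not be grouped as duplicates unless their bytes match exactly.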

nishiazabu · December 3, 2015, 3:07am

Perhaps it is more precise to say that “duplicate” in the DEVONthink context refers only to a text-based comparison, where a [somewhat arbitrary] level of textual similarity defines a duplicate. But remember that a DTPO duplicate is NOT necessarily an identical version or a close visual analogue.

DCBerk · December 3, 2015, 6:07am

May I be frank: all of this detail about how the program identifies duplicates is really of minimal interest when it doesn’t go anywhere. From my perspective, a misidentified duplicate that I can’t correct or delete is useless information.

So my solution is to delete the Duplicates Smart Folder. I can reinstate it from time to time when I want to do a little housekeeping and weed out the duplicates that are correctly identified.

June

BLUEFROG · December 3, 2015, 2:39pm

Always.

And this is from your perspective, if I may be frank in return. The human and computer perspectives are not always the same. The computer has to operate within far more concrete parameters than the flexible human mind can. So “misidentified” is a matter of point of view. That being said, we are always looking at ways to bring these perspectives closer when possible.