Invalid duplicates

DCBerk · July 23, 2013, 4:39am

OK, weird. I have two files, in two different folders, that show up in the Duplicates Smart Folder, but they are not actually duplicates. They have two different filenames, and entirely different (but similar) content.

If I search for either filename, it shows up alone as a result.

I exported them, deleted them, deleted from Trash, closed the database, closed the app. Then reopened, imported the two files individually into two different folders. They still show up as duplicates. I ran Verify and Repair; everything seems fine. I could ignore it, but now I’m too curious; any ideas?

Bill_DeVille · July 23, 2013, 12:09pm

In DEVONthink, ‘duplicate’ = ‘high degree of similarity’ in the content of two or more documents. Only the documents’ content is evaluated; names and metadata are not.
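To make the idea concrete, here is a rough sketch of a content-only comparison. It is not DEVONthink's actual algorithm, which is not described in this thread; the `Doc` type, the `looks_duplicate` function and the 0.9 threshold are all made up for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Doc:
    name: str  # hypothetical field; ignored by the comparison
    text: str  # hypothetical field; the only thing that is compared


def looks_duplicate(a: Doc, b: Doc, threshold: float = 0.9) -> bool:
    """Flag two docs when their text content is highly similar.

    The threshold is an arbitrary illustrative value, not a DEVONthink setting.
    """
    ratio = SequenceMatcher(None, a.text, b.text).ratio()
    return ratio >= threshold


first = Doc("budget-draft.rtf", "Quarterly budget summary")
second = Doc("notes-2013.rtf", "Quarterly budget summary")
third = Doc("todo.txt", "Order mulch and prune the hedge")

print(looks_duplicate(first, second))  # same text, different names
print(looks_duplicate(first, third))   # unrelated text
```

Because the names never enter the comparison, two documents with different filenames can still be flagged when their text matches.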

DCBerk · July 23, 2013, 3:38pm

Hi Bill,
If the problem were due to Duplicates judging on a high degree of similarity in content rather than metadata, it would have flagged way more than two files. Actually, it did that momentarily and then corrected itself, leaving only these two.

The fix this time was to export one of the files, delete, delete from Trash, Quit, reopen, and then make a new file with the same name and the same content - it was no longer seen as a duplicate.

Think something is wrong with the db although V&R says no errors. Also having a weird problem with tagging - see my post on that topic. Rebuilt the file but still having problems.

iHuman · July 23, 2013, 9:50pm

I’m having this same issue with several documents. Is there a way to mark them as ‘non-duplicate’?

thanks!

iHuman · July 23, 2013, 9:54pm

OK - I just came up with a workaround for anyone interested.

If you have a problem where a document is showing up as a duplicate - and it is not, add the tag ‘false’ to the document.

Then in your Duplicates smart folder, add the rule:
Tag is not false.
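
In effect, the extra rule is a second filter on top of the duplicate flag. A minimal sketch of that logic, assuming a made-up record layout (the `duplicate` and `tags` keys are hypothetical and not DEVONthink's data model):

```python
# Hypothetical records; only the filtering logic matters here.
records = [
    {"name": "MyDocA", "duplicate": True, "tags": set()},
    {"name": "MyDocB", "duplicate": True, "tags": {"false"}},
    {"name": "MyDocC", "duplicate": False, "tags": set()},
]


def duplicates_smart_folder(items):
    """Return names flagged as duplicates, minus anything tagged 'false'."""
    return [
        item["name"]
        for item in items
        if item["duplicate"] and "false" not in item["tags"]
    ]


print(duplicates_smart_folder(records))
```

The filter only changes what the smart folder lists; the underlying duplicate flag on the tagged document is untouched.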

Hope that helps,
elaine

iHuman · July 23, 2013, 10:00pm

A better workaround …

The workaround I posted above will fix the issue of the document showing up in the Duplicates folder.

BUT… it will still be colored blue - thinking it is a duplicate.

So a better approach would be to open the document that DEVONthink thinks is a duplicate and add an annotation or a note, e.g. a note that says it is not a duplicate. This will change the file size of the document, so DEVONthink will no longer classify it as a duplicate.

Hopefully the programmers will come up with a simple solution to be able to right click and declare it as ‘not a duplicate’ — which would resolve the issue.

FWIW,
elaine

Greg_Jones · July 23, 2013, 11:59pm

I’d anticipate that solution might cause more problems than it would resolve. Would right-clicking on a document and declaring it a duplicate also change the status of the corresponding false positive document to ‘not a duplicate’? What if one or more of the documents are indeed duplicates (three documents total, two of them actual duplicates, and the third a false duplicate)? What if documents were declared ‘not a duplicate’ and then in the future they actually become true duplicates? Should DEVONthink change the status back to ‘duplicate’? If so, does it now show as a false duplicate again to the original, problematic false duplicate document? I’m afraid there are no simple solutions here when it comes to false duplicates.

iHuman · July 24, 2013, 6:24am

Here is an example of what you could do based on adding two additional right-click menu items, ‘not duplicate’ and ‘display all duplicates’:

duplicate smart folder lists document titled MyDocA as a duplicate (current feature in DevonThink)
right click on MyDocA to select ‘display all duplicates’
resulting list shows 4 documents that DEVONthink has defined as duplicates
paths for each document are displayed
MyDocA /path/MyDocA
MyDocB /path/MyDocB
MyDocC /path/MyDocC
MyDocD /path/MyDocD
user compares all 4 instances and confirms that, indeed, MyDocA and MyDocC are duplicates, but the other two are not
right click on MyDocC -> delete (it is a duplicate of MyDocA)
right click on MyDocB -> select ‘not duplicate’
right click on MyDocD -> select ‘not duplicate’
do nothing to MyDocA

FWIW,
elaine

Greg_Jones · July 24, 2013, 12:27pm

It’s becoming less of a simple solution, would you agree?

I can understand how this hypothetical workflow could work, although it still seems complicated and fussy. Also, it doesn’t address some of my original concerns, such as what happens to a document in the future that has been marked ‘not duplicate’, when it actually becomes a duplicate of another document? Perhaps where edits have made MyDocA an actual duplicate of MyDocD?

Perhaps this is indeed a greater concern for some than it is for me, so help me understand better: what do you want to accomplish by using this smart group anyway? I understand that it is created by default in new databases, but I always considered it useful for identifying where a large number of documents have been duplicated and/or re-imported into a database by mistake, rather than a tool to evaluate the list on a document-by-document basis with a goal of reaching ‘Duplicates=0’.

BTW, if you want to use the ‘Tag is not false’ solution, you can turn off coloring of replicants and duplicates by unchecking Preferences > General > ‘Mark duplicates and replicants in color’.

DCBerk · July 24, 2013, 7:31pm

I agree – to me, any file that appears in Duplicates is simply a heads-up that I have a redundant file and need to do some “housekeeping” – it makes it easy to find the copy I don’t want and delete it.

However, the reason I started this thread was that I found files in the Duplicate Smart Folder that were NOT duplicates – only one of each showed up, not two as is normal. When I checked by using Find, only one copy of each file showed up.

Clearly something odd about getting false duplicates and it seems tied to a related problem that is being discussed in another thread.

I recently decided to use the duplicate function as a way of temporarily sorting some files into a new folder to work on them. However, the function duplicated THE CONTENTS as well as the file. Really bizarre. Still don’t have an explanation for that either (and it doesn’t always happen), but suspect there is either something wrong with the command in the latest updates, or my db is corrupted in some way even though Repair & Verify says it is OK.

And, by the way, I found that using Tags to temporarily sort a topic works better than Duplicates in any case. Just make a Tag for whatever the topic is (I put an asterisk before the tagname so it would rise to the top of the hierarchy in the Tag list). Tag all the files you want to sort, and when you’re done, delete the Tag and poof, all gone. Really easy, and no extra files in the db.

Greg_Jones · July 24, 2013, 8:33pm

What do you mean ‘only one of each showed up’? How did you narrow this down using Find? Documents are marked as duplicates by the contents of the documents, not the file name. The most reliable way to see what documents are flagged as a duplicate match is via the Instances field of the Info pane. The See Also & Classify panel is good also, as the duplicates have the highest score in the See Also section.

DCBerk · July 24, 2013, 10:30pm

I had checked the Info Panel before, but did it again. These files were in my Garden db, and the duplicates were two rtf files, each with a single image of a different tree. However, this time I noticed that both had the same two-word caption under the image – the name of the nursery they were from. Bingo; deleted the caption under one and it was no longer a duplicate!

So problem solved; thank you.

However, it’s a bit of a nuisance to have to work around this hyper-sensitivity in the duplicate function. It’s sort of like having two photos, both captioned “Aunt Mary”, or whatever, taken two years apart, put in different folders, and they show up as duplicates.

One solution is to always use a more extensive text caption in a file to make it unique, but I am also finding Tags more useful lately, and maybe I should get in the habit of using those more often instead.

Bill_DeVille · July 24, 2013, 11:00pm

Don’t assume that files marked as duplicates by DEVONthink are identical copies, as would be the case in the Finder. That assumption can lead to lost data.

Example: I have a searchable PDF. I select it and choose Data > Convert > to plain text. A new text file containing only the text content of the PDF is created, and both the PDF and the text file are marked as duplicates, because the text content of both is highly similar.

Example: A colleague used to send emails that contained only his greeting and signature text in the body, and the actual message was contained in a Word attachment document. As the text content of all his messages was identical, all of them were marked as duplicates.
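
The difference between an identical copy and a content duplicate can be sketched as two separate checks. This is only an illustration under assumptions: the `FileRecord` type and both helper functions are hypothetical, the sample bytes are placeholders rather than a real PDF, and exact equality of normalized text stands in for whatever similarity measure DEVONthink really uses.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FileRecord:
    name: str
    raw_bytes: bytes     # what a byte-for-byte (Finder-style) comparison sees
    extracted_text: str  # what a content-based comparison sees


def same_bytes(a: FileRecord, b: FileRecord) -> bool:
    """True only for byte-identical files."""
    return (
        hashlib.sha256(a.raw_bytes).digest()
        == hashlib.sha256(b.raw_bytes).digest()
    )


def same_text(a: FileRecord, b: FileRecord) -> bool:
    """True when the text matches after collapsing whitespace and case."""
    def normalize(text: str) -> str:
        return " ".join(text.split()).lower()

    return normalize(a.extracted_text) == normalize(b.extracted_text)


# Placeholder data: a PDF and the plain text converted from it.
pdf = FileRecord("report.pdf", b"%PDF-1.4 placeholder layout data", "Annual report\n2013")
txt = FileRecord("report.txt", b"Annual report 2013", "Annual report 2013")

print(same_bytes(pdf, txt))  # the files differ on disk
print(same_text(pdf, txt))   # the text content matches
```

Deleting one of two files that only pass the second check throws away whatever the text does not capture, such as the PDF's layout or an email's attachment.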

iHuman · July 24, 2013, 11:46pm

Personally, I still think finding legitimate duplicate files - and not false duplicates - is something that could be implemented in the program by the developer. After all, how do programs like kaleidoscopeapp work? Seems like either the match algorithm isn’t strong enough - or there needs to be user features for choice making.

As to your question about what would happen if a document marked ‘not duplicate’ later became an actual duplicate: my guess is it would be a relational database function. When MyDocA is compared to MyDocB and marked ‘not duplicate’, that marking applies only in relation to MyDocB, not to every document.

for example: database table named false_duplicates
col 1: id (auto increment)
col 2: from
col 3: to

in the above example a row entry would read:

234 MyDocA MyDocB

So the next time DEVONthink determines MyDocA to be a duplicate, it first runs a query against this table. If there is a row indicating the two docs are false duplicates, it ignores the match in the Duplicates smart folder and does not color code the document as a duplicate.
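
A minimal sketch of that table and lookup, using Python's built-in `sqlite3` with an in-memory database. This is a hypothetical design, not anything DEVONthink implements. The columns are named `from_doc` and `to_doc` here because `from` and `to` are SQL keywords, and the `is_false_duplicate` helper is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE false_duplicates (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        from_doc TEXT NOT NULL,
        to_doc TEXT NOT NULL
    )
    """
)

# The user marks MyDocA as 'not duplicate' in relation to MyDocB.
conn.execute(
    "INSERT INTO false_duplicates (from_doc, to_doc) VALUES (?, ?)",
    ("MyDocA", "MyDocB"),
)


def is_false_duplicate(db: sqlite3.Connection, doc_a: str, doc_b: str) -> bool:
    """Check whether this pair was marked 'not duplicate', in either order."""
    row = db.execute(
        """
        SELECT 1 FROM false_duplicates
        WHERE (from_doc = ? AND to_doc = ?)
           OR (from_doc = ? AND to_doc = ?)
        LIMIT 1
        """,
        (doc_a, doc_b, doc_b, doc_a),
    ).fetchone()
    return row is not None


print(is_false_duplicate(conn, "MyDocB", "MyDocA"))  # pair was marked
print(is_false_duplicate(conn, "MyDocA", "MyDocD"))  # pair was never marked
```

Because the exemption is stored per pair, MyDocA would still be reported if it later became a true duplicate of MyDocD.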

FWIW,
elaine

DCBerk · July 25, 2013, 12:53am

Hi Bill:

As you can see from my last post, I now know to be much more thorough when checking the Info Panel of anything that shows up as a duplicate. (Good thing I don’t empty the Trash very often, and usually check everything again before I do.)

I’ve always used the Duplicate Smart Folder as a heads-up to check for redundant files, but its sensitivity fills it with old news, as it were, making any new information less obvious. I have one DB with a lot of images in several versions – jpg, tif, fewer or more pixels, etc., and nearly every file shows up in Duplicates – what good is that?

I think it would be helpful to have the option of disabling a Duplicate classification for a specific file with a Menu command or in the Info Panel if it wasn’t seen as useful. After all, there is still the option of using See Also & Classify, which picks up any content that is even remotely similar and has the added advantage of sorting on the strength of the match.

On occasion, I have been grateful to discover two files in the Duplicates Smart Folder with identical content, one an email and one saved as txt, and I could then decide which one to keep. But depending on the purpose of the DB, it might be helpful to have the option to exclude certain filetypes from duplicate classification as the default (like a jpg and a tiff) – then toggle it on again if you wanted to see what turned up.

korm · November 19, 2016, 12:44am

There is a weird case – that’s related to the false duplicates issue, in my opinion – where frequently the “duplicate” that DEVONthink reports in Show Info is the exact same file that is selected in the database. In other words – there’s no duplicate anywhere else. There’s only one file and DEVONthink falsely flags it as a duplicate. (Duplicate of itself, I suppose.)

Here’s an example: One file in this group. Show Info reports that the duplicate is that file itself. Click on either of these choices in Show Info and you’re taken to that same file.

(Before someone tells me to do V&R and Rebuild the database, I do that frequently already.)