Detecting and resolving Duplicates needs an overhaul

[size=150]False Positives[/size]

If I take a rich text document, duplicate it into the same group, and then add the letter “a” to the beginning or end of the document, DT designates both as unique even though they share almost all of their text. I get that: the “content” is different.

However, I have two small video files (files.me.com/kioarthurdane/4qbtrx) that DT claims are duplicates, even though they have completely different audio tracks. The video tracks share similarities in the middle, but going by the append/prepend demonstration above, there is enough difference to tell the two apart on video alone. I consider these false positives and therefore a “bug”, since the expected behavior is not seen.

In DT, do whole video frames not count in the evaluation of similarity? If so, then DT ignores the idea that “a picture is worth a thousand words” because apparently a single printable character is worth more than a few minutes of video and audio.

If a 2-minute video file can be confused with a 1-minute video file, then so should two text documents that share more than 99% of their content, word count, word frequency, and word order.

Now, I do see the use for a fuzzy evaluation of duplicates because image files can be rendered into different file formats. However, I’ve yet to see DT pick out the same base image saved at two different pixel counts (by the way, images with dpi not equal to 72 are displayed incorrectly in the Width x Height column).

But I have also seen multiple FLV files with entirely different video and audio content flagged as duplicates in the same way. If DT uses metadata and tagged information within a file to determine similarity, what’s the point of having video/audio content at all? Why would the fact that two files are so radically different in file size be ignored when considering the likelihood of similarity?

[size=150]Possible Solution?[/size]

Having “Duplicate” be a binary value of either “Is Duplicate” or “Is Unique” does not give enough information when the comparison algorithm is kept secret or esoteric. Not knowing what DT will or won’t consider a duplicate makes me not want to use the feature at all, since it’s so unpredictable.

Perhaps there should be a searchable column displaying the “Duplicate Confidence” to show when these files are more or less likely to actually be similar. Consider it in the same vein as the “Relevance” column displayed when searching (which is a form of comparison too!).
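To make the “Duplicate Confidence” idea concrete, here is a minimal Python sketch of a graded score for text documents. It is not how DT works internally (that algorithm isn't public); `duplicate_confidence` is a hypothetical name, and the score comes from `difflib.SequenceMatcher` in the Python standard library.

```python
from difflib import SequenceMatcher


def duplicate_confidence(text_a: str, text_b: str) -> float:
    """Return a score from 0.0 (nothing in common) to 1.0 (identical).

    Hypothetical helper for illustration only, not DT's algorithm.
    autojunk=False turns off difflib's "popular element" heuristic,
    which would otherwise ignore frequent characters in longer texts.
    """
    return SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()


original = "The quick brown fox jumps over the lazy dog. " * 20
edited = "a" + original  # the one-letter change described earlier

score = duplicate_confidence(original, edited)
print(f"Duplicate confidence: {score:.4f}")
```

A column like this could be sorted or filtered (say, “confidence above 0.95”), leaving the final call to the user instead of a yes/no flag.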

[size=150]Resolving Duplicates[/size]

Furthermore, when attempting to resolve duplicate issues, I end up spending most of my time asking questions like “Which one do I want to get rid of? Which one has the better resolution/quality? Maybe I want to make a replicant in the other location so I can keep the folder/tag information.”

I would like to request a UI that shows me which files are duplicates of which other files, lets me tell DT that it’s wrong so that one or more files are no longer considered duplicates, and offers a Keep One Copy option (with choices to pick the newest, oldest, largest, smallest, best quality, etc.) plus another option to Replicate the Kept File into the deleted files’ previous locations.
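As a rough sketch of what “Keep One Copy” could do behind such a UI, here is some Python. Everything in it is made up for illustration (the `Record` type, its fields, the policy names and the example paths); it only shows the selection logic, not anything DT actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    location: str     # hypothetical group path of this copy
    size_bytes: int
    modified: float   # modification time, seconds since the epoch


# Each policy maps a record to a sort key; the record with the highest key is kept.
POLICIES = {
    "newest": lambda r: r.modified,
    "oldest": lambda r: -r.modified,
    "largest": lambda r: r.size_bytes,
    "smallest": lambda r: -r.size_bytes,
}


def resolve(duplicates, policy):
    """Return the copy to keep and the locations that would get a replicant of it."""
    keeper = max(duplicates, key=POLICIES[policy])
    replicate_into = [r.location for r in duplicates if r is not keeper]
    return keeper, replicate_into


copies = [
    Record("/Inbox/clip.flv", size_bytes=1_000, modified=200.0),
    Record("/Videos/Cartoons/clip.flv", size_bytes=2_000, modified=100.0),
]
keeper, replicate_into = resolve(copies, "largest")
print(keeper.location, replicate_into)
```

A “best quality” policy would need format-specific information (resolution, bitrate), so it is left out here.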

[size=150]Replicants and Tagging[/size]

Similarly, it would be nice to have a way of identifying Replicants, especially when trying to work out whether a file is the “original” or just a replicant. On a few occasions, when trying to remove a tag from a file by deleting the file from its group under the Tags group, I have inadvertently removed the file altogether. I didn’t mean to delete the file, just clear out a given Tag.

I think having a UI for Replicants like the one I described above for Duplicates would help on those occasions where replicants exist in many subgroups and sub-subgroups of a single, deep group. Inherited tags should be “collapsible” so that replicants in parent Tags/Groups are minimized.

[size=150]Conclusion[/size]

I love the hierarchical and inherited tagging system that DevonThink has come up with, but I think it can be polished a bit more with some basic file handling mechanisms.

The detection of duplicate videos is simply based on the URL (assuming the video was downloaded from the Internet) and/or the thumbnail. So just remove the thumbnail, or set any frame as the thumbnail yourself (via the contextual menu of the video view).

Yes, that does work… But if I have hundreds of video clips I’m trying to organize, wouldn’t it be nice if DevonThink looked at data OTHER than the thumbnail and/or URL? Why trouble the user with patching the behavior of a black-box duplicate-detecting “AI” when that algorithm should be patched from the developer end? A false positive means the program is not working “as expected”, at least from the perspective of the End User.

Also, I’ve had many video files with different or no URL AND different thumbnails (admittedly, the ones DT set automatically) show up as “duplicates”. Had DT looked at, say, the total time of the videos, the audio tracks, the file sizes,

FLV files seem to be popular offenders with this behavior. I’ve had files from YouTube and off YouTube being marked as “duplicates” of each other despite different URLs, file sizes, video resolution, and different thumbnails (cartoon vs live action… for example).
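For comparison, a check that looks at the actual file data rather than the thumbnail or URL might go like this in Python: compare file sizes first (cheap), then a SHA-256 hash of the contents. This is only a sketch for exact copies, with the function names `file_digest` and `same_content` invented here. It says nothing about how DT works, and it would not catch the same video re-encoded at a different resolution.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large videos don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(a: Path, b: Path) -> bool:
    """True only if both files are byte-for-byte identical."""
    if a.stat().st_size != b.stat().st_size:
        return False  # different sizes can never be exact duplicates
    return file_digest(a) == file_digest(b)


# Tiny stand-in files instead of real FLV clips.
with tempfile.TemporaryDirectory() as tmp:
    cartoon = Path(tmp, "cartoon.flv")
    cartoon_copy = Path(tmp, "cartoon_copy.flv")
    live_action = Path(tmp, "live_action.flv")
    cartoon.write_bytes(b"cartoon frames" * 1000)
    cartoon_copy.write_bytes(b"cartoon frames" * 1000)
    live_action.write_bytes(b"live action frames" * 1500)

    copy_is_duplicate = same_content(cartoon, cartoon_copy)
    other_is_duplicate = same_content(cartoon, live_action)

print(copy_is_duplicate, other_is_duplicate)
```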

I must agree with kiodane on this!
I have experimented a lot with my databases all day.
I knew I had some duplicate movie files taking up a lot of space on my hard drive, and I wanted to delete them.

I removed all thumbnails from the movie files, knowing that I had some duplicates on my hard drive. Then I created a smart group to detect duplicates among movie files over 20 MB, to see whether the duplicate function does its job. Here comes the nasty shock.

I had 4 clips (2 imported / 2 indexed) with the same name, size and content but without any thumbnail. I searched for duplicates and didn’t find any files at all, although I had 4 files with the same name, size and content :open_mouth:

So I’m afraid that, for now, DT 2.0.1 doesn’t look at a movie clip’s file name and/or size. It only checks whether the thumbnail is the same.

I think this behaviour could apply to other kinds of files too, but I haven’t tested that yet.
Edit: I have now tried some image files, and the duplicate function behaves the same way there. For now, it’s very hard to find duplicates without any thumbnails.

I really love Devonthink, with all the strength and potential it has, but this kind of false positive behaviour in the duplicate-finding function in smart groups doesn’t feel great :frowning:

So please, devontech team, correct this false positive behaviour :smiley:

Thinking about this duplicate-finding function gave me an idea.

Would it be possible to implement a duplicates function in smart groups where the user can choose which file attributes Devonthink compares, like name, size, date, URL, thumbnail and so on?
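A user-configurable comparison like that could be as simple as grouping files by whichever attributes are ticked. Here is a minimal Python sketch, with files represented as plain dictionaries. The attribute names, the paths and the `find_duplicates` function are all assumptions made up for this example.

```python
from collections import defaultdict


def find_duplicates(files, attributes):
    """Group files whose chosen attributes all match; return groups of 2 or more."""
    groups = defaultdict(list)
    for f in files:
        key = tuple(f.get(attr) for attr in attributes)
        groups[key].append(f["path"])
    return [paths for paths in groups.values() if len(paths) > 1]


# Hypothetical records: same clip imported and indexed, plus one unrelated file.
files = [
    {"path": "/Imported/clip.mov", "name": "clip.mov", "size": 25_000_000, "url": None},
    {"path": "/Indexed/clip.mov", "name": "clip.mov", "size": 25_000_000, "url": None},
    {"path": "/Imported/other.mov", "name": "other.mov", "size": 31_000_000, "url": None},
]

by_name_and_size = find_duplicates(files, ["name", "size"])
by_url_only = find_duplicates(files, ["url"])
print(by_name_and_size)
print(by_url_only)
```

Comparing on `url` alone lumps all three files together because none of them has a URL. That shows why the choice of attributes matters, and why a missing value probably shouldn't count as a match in a real implementation.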

Another way to find more duplicates among video files (without these added attributes in smart groups) would be an option to choose where in the clip Devonthink creates the thumbnail.

For most clips, the movie may begin with a black screen, and many of the movie files I have tried to get thumbnails from don’t give me any thumbnail at all, so I cannot find any duplicates at the moment.

Edit: After hours of trial and error, I think I have finally solved this thumbnail creation mystery.
After I installed Perian 1.2, I get nice thumbnails for all my video files when I choose to create thumbnails in DT.

I get only a few movie clips with black thumbnails, and I found a nice solution for that too on another forum.

Just open the file in QuickTime Player, go to the frame you like and press Command + C. Then go back to DT, choose Info for the movie file and click on the little rectangle in the upper left of the Info window.
Then press Command + V and you will see the new frame that you copied from the movie clip.

So now that I have created thumbnails for all movie clips, I see only a dozen duplicate files, which really are duplicates, and no more false alarms. Yippee, I’m so happy now :mrgreen:

Or just go to the desired frame in DEVONthink and choose “Set As Thumbnail” in the contextual menu of the movie view.

I would have to agree. There are quite a few cases where documents were scanned twice. With 2 scans, you’ll always have a couple of characters that differ.

What if you had ‘duplicate’ and ‘most likely duplicate’ or ‘pretty duplicate’? :wink:

Or as someone else suggested: configurable duplicate detection (filename, size, date, % of identical content, etc.)

Cheers
DJ