[size=150]False Positives[/size]
If I take a rich text document, duplicate it into the same group, and then add the letter “a” to the beginning or end of the document, DT designates both as unique despite the fact that they share a significant amount of unaltered text. I get that: the “content” is different.
However, I have two small video files (files.me.com/kioarthurdane/4qbtrx) that DT claims are duplicates, even though they have completely different audio tracks. The video tracks do share similarities in the middle, but by the logic of the append/prepend demonstration above, there is enough difference to tell the two apart on video alone. I consider these false positives, and therefore a “bug”, since the expected behavior is not seen.
In DT, do whole video frames not count in the evaluation of similarity? If so, then DT ignores the idea that “a picture is worth a thousand words” because apparently a single printable character is worth more than a few minutes of video and audio.
If a 2 minute video file can be confused with a 1 minute video file, then so too should two text documents that share more than 99% of their content, word count, word frequency, and word order.
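To put a rough number on “more than 99%”, here is a minimal sketch using Python's standard `difflib` module. I am not claiming this is how DT compares documents (I don't know what it does); the function name `text_similarity` and the sample text are made up purely for illustration.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 (nothing shared) and 1.0 (identical)."""
    # autojunk=False turns off difflib's "popular element" heuristic, which
    # would otherwise ignore characters that occur very often in long texts.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# A stand-in for a rich text document, and a copy with one letter prepended.
original = "The quick brown fox jumps over the lazy dog. " * 20
modified = "a" + original

score = text_similarity(original, modified)
print(f"similarity: {score:.4f}")
```

With a single character added to 900 characters of text, the ratio comes out above 0.99, which is the kind of pair I would expect a fuzzy comparison to treat as near-identical.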
Now, I do see the use for a fuzzy evaluation of duplicates because image files can be rendered into different file formats. However, I’ve yet to see DT pick out the same base image saved at two different pixel counts (by the way, images with dpi not equal to 72 are displayed incorrectly in the Width x Height column).
I have also seen multiple FLV files with entirely different video and audio content flagged as duplicates in the same way. If DT uses metadata and tagged information within a file to determine similarity, what’s the point of having video/audio content at all? Why would the fact that two files are so radically different in file size be ignored when considering the likelihood of similarity?
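For exact duplicates at least, checking file size first and then a content hash is cheap and never produces a false positive in practice. The following is only a sketch of that general technique, not a description of what DT does; `sha256_of` and `is_exact_duplicate` are names I made up, and the temporary `.flv` files are dummy data standing in for real videos.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos are never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_exact_duplicate(path_a: str, path_b: str) -> bool:
    """Two files are byte-for-byte duplicates only if size and hash both match."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False  # different sizes: cannot be identical, skip hashing
    return sha256_of(path_a) == sha256_of(path_b)


# Demo with dummy data: a file, a true copy, and a file twice as long.
with tempfile.TemporaryDirectory() as tmp:
    one = os.path.join(tmp, "one.flv")
    copy = os.path.join(tmp, "copy.flv")
    longer = os.path.join(tmp, "longer.flv")
    for path, data in ((one, b"x" * 1000), (copy, b"x" * 1000), (longer, b"x" * 2000)):
        with open(path, "wb") as f:
            f.write(data)
    same = is_exact_duplicate(one, copy)
    different = is_exact_duplicate(one, longer)

print(same, different)  # True False
```

This obviously cannot catch the same image saved in two formats, so it would complement a fuzzy comparison rather than replace it.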
[size=150]Possible Solution?[/size]
Having “Duplicate” be a binary value of either “Is Duplicate” or “Is unique” does not give enough information when the comparison algorithm is kept secret or esoteric. Not knowing what DT will consider a duplicate or not makes me not want to use the feature at all since it’s so unpredictable.
Perhaps there should be a searchable column displaying the “Duplicate Confidence” to show when these files are more or less likely to actually be similar. Consider it in the same vein as the “Relevance” column displayed when searching (which is a form of comparison too!).
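As a very crude illustration of what such a column could hold, here is a sketch that scales a content similarity score (however it is obtained) by how close the two file sizes are. The function `duplicate_confidence` and the weighting are entirely my own assumptions, not anything DT exposes, and scaling by size would clearly be wrong for cases like the same image saved at two resolutions.

```python
def duplicate_confidence(size_a: int, size_b: int, content_similarity: float) -> float:
    """Scale a 0.0-1.0 content similarity by the ratio of the two file sizes."""
    if size_a <= 0 or size_b <= 0:
        return 0.0
    size_ratio = min(size_a, size_b) / max(size_a, size_b)
    return content_similarity * size_ratio


# Hypothetical numbers: one file is half the size of the other, and the
# content comparison (whatever it is) reports 0.9 similarity.
confidence = duplicate_confidence(10_000_000, 20_000_000, 0.9)
print(f"{confidence:.2f}")
```

Even a rough figure like this, shown in a sortable column, would let me decide for myself where to draw the line between “duplicate” and “merely similar”.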
[size=150]Resolving Duplicates[/size]
Furthermore, when attempting to resolve duplicate issues, I end up spending most of my time asking questions like “Which one do I want to get rid of? Which one has the better resolution/quality? Maybe I want to make a replicant in the other location so I can keep the folder/tag information.”
I would like to request a UI that shows me which files are duplicates of which other files, lets me tell DT that it’s wrong and should not consider one file or many to be duplicates, and offers a Keep One Copy option (pick the newest, oldest, largest, smallest, best quality, etc.) along with a further option to Replicate the Kept File into the deleted files’ previous locations.
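To make the Keep One Copy idea concrete, here is a small sketch of the selection logic. Everything in it is hypothetical: the `Item` record, the `POLICIES` table, and the `resolve` function are names I invented for illustration, and none of this reflects DT's actual data model or scripting interface.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Item:
    name: str
    location: str  # group or tag path holding this copy
    size: int      # bytes
    modified: datetime


# Each policy picks the one copy to keep out of a set of duplicates.
POLICIES = {
    "newest": lambda items: max(items, key=lambda i: i.modified),
    "oldest": lambda items: min(items, key=lambda i: i.modified),
    "largest": lambda items: max(items, key=lambda i: i.size),
    "smallest": lambda items: min(items, key=lambda i: i.size),
}


def resolve(items, policy, replicate=False):
    """Return the kept item and the locations it should be replicated into."""
    keep = POLICIES[policy](items)
    freed = [i.location for i in items if i is not keep]
    return keep, (freed if replicate else [])


copies = [
    Item("clip.flv", "/Inbox", 4_000_000, datetime(2010, 3, 1)),
    Item("clip.flv", "/Videos/Trips", 9_000_000, datetime(2010, 1, 15)),
    Item("clip.flv", "/Tags/holiday", 4_000_000, datetime(2009, 12, 24)),
]
kept, replicate_into = resolve(copies, "largest", replicate=True)
print(kept.location, replicate_into)
```

Here the largest copy in `/Videos/Trips` is kept, and the two freed locations are returned so the kept file can be replicated there and the folder/tag information survives.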
[size=150]Replicants and Tagging[/size]
Similarly, it would be nice to see a way of identifying Replicants in a similar manner, especially when trying to resolve whether a file is the “original” or just a replicant. There have been a few occasions where, when trying to eliminate tags on a file by deleting the file from its group under the Tags group, I have inadvertently removed the file altogether. I didn’t mean to delete the file, just clear out a given Tag.
I think having a UI for Replicants like the one I described above for Duplicates would help on those occasions where replicants exist in many nested subgroups of a single, deep group. Inherited tags should be “collapsible” so that replicants in parent Tags/Groups are minimized.
[size=150]Conclusion[/size]
I love the hierarchical and inherited tagging system that DevonThink has come up with, but I think it can be polished a bit more with some basic file handling mechanisms.