Duplicate content detection?

JohnAtl · January 11, 2021, 10:20pm

Would it be possible to add duplicate detection based on contents, rather than type+size?

It would be nice if this included metadata as well. For example, if I have two files and one has a URL and the other doesn’t, they wouldn’t be considered duplicates.

BLUEFROG · January 12, 2021, 2:50am

Duplicate detection doesn’t consider size and type unless you’ve enabled stricter duplicate detection in Preferences > General.

Metadata isn’t part of the content of a file so it wouldn’t factor into a contents-based duplicate detection. Development would have to assess extensions to the detection mechanism.

JohnAtl · January 12, 2021, 3:18am

I must not understand how duplicate detection works then. I have four files, all different, and they are being marked as duplicates.

The content of each is:

[bookends](bookends://sonnysoftware.com/77131)

and the number at the end varies, depending on the reference.

BLUEFROG · January 12, 2021, 3:42am

Check the Instances dropdown in the Info inspector to see where the other duplicates of each file are.

rfog · January 12, 2021, 9:18am

I have a small number of PDF files that are completely different inside, yet they are marked as duplicates when “strict” checking is disabled. However, I’m interested in this “soft” way of checking for duplicates because it can find very similar files that really are duplicates I want to get rid of.

My solution is to have a tag called .false_match and then modify the duplicates smart group to ignore duplicates with that tag.

(BTW, I start the tag name with a period to indicate that it is a “system” tag and not a normal one.)
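
As a rough illustration of the idea only (this is not DEVONthink's smart group syntax), the filter amounts to dropping any flagged item from the list of suspected duplicates. The records, file names, and the `real_duplicates` helper below are all hypothetical:

```python
# Hypothetical records standing in for items that a duplicates smart group matched.
items = [
    {"name": "report-a.pdf", "tags": [".false_match"]},
    {"name": "report-b.pdf", "tags": []},
    {"name": "invoice.pdf", "tags": ["finance"]},
]

# Tag used to mark items that were wrongly reported as duplicates.
EXCLUDE_TAG = ".false_match"


def real_duplicates(candidates):
    """Keep only the candidates that do not carry the exclusion tag."""
    return [item for item in candidates if EXCLUDE_TAG not in item["tags"]]


remaining = [item["name"] for item in real_duplicates(items)]
print(remaining)
```

Once an item is tagged, it simply stops showing up in the filtered list, while untagged near-duplicates are still reported.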

JohnAtl · January 12, 2021, 5:12pm

I know where they are. The content is different. That is the issue.

BLUEFROG · January 12, 2021, 5:19pm

The rendered content is not different in this case. Both documents contain a single word: bookends. If the source of the document was not showing, it would certainly appear the files have the same content.
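
A minimal sketch of why this happens, assuming the comparison is made on the rendered text rather than the raw Markdown source. The regex-based `rendered_text` helper is hypothetical and is not DEVONthink's actual mechanism, and the second reference number is made up for illustration:

```python
import re

# Matches inline Markdown links of the form [text](url) and captures the visible text.
LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")


def rendered_text(markdown: str) -> str:
    """Reduce inline links to their visible text, roughly what a reader sees."""
    return LINK.sub(r"\1", markdown).strip()


# Two sources that differ only in the reference number inside the URL.
a = "[bookends](bookends://sonnysoftware.com/77131)"
b = "[bookends](bookends://sonnysoftware.com/77132)"  # hypothetical second reference

same_rendered = rendered_text(a) == rendered_text(b)
same_source = a == b
print(rendered_text(a), same_rendered, same_source)
```

Comparing the raw source tells the two files apart, while comparing the rendered text does not, because the URL never appears in the visible output.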

Development would have to assess modifying this behavior.

JohnAtl · September 5, 2022, 12:54pm

I still think this is ridiculous, and is one of the reasons I stopped using DEVONthink.

BLUEFROG · September 5, 2022, 1:38pm

We have users who expect this specific behavior, so I wouldn’t call it ridiculous on their behalf.

In Help > Appendix > Hidden Preferences, click the On link for IndexRawMarkdownSource, then run File > Rebuild Database.