Duplicate content detection?

Would it be possible to add duplicate detection based on contents, rather than type+size?

It would be nice if this included metadata as well. For example, if I have two files and one has a URL and the other doesn’t, they wouldn’t be considered duplicates.

Duplicate detection doesn’t consider size and type unless you’ve enabled stricter duplicate detection in Preferences > General.

Metadata isn’t part of the content of a file so it wouldn’t factor into a contents-based duplicate detection. Development would have to assess extensions to the detection mechanism.
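To make the distinction concrete, here is a rough Python sketch of a contents-only fingerprint next to one that also folds in metadata. It is not how DEVONthink detects duplicates; the function names and the `url` metadata field are made up purely for illustration.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Fingerprint built from the file's bytes only; metadata is ignored."""
    return hashlib.sha256(data).hexdigest()


def content_and_metadata_key(data: bytes, metadata: dict) -> str:
    """Fingerprint built from the file's bytes plus its metadata fields.

    `metadata` is a hypothetical mapping of string names to string values,
    e.g. {"url": "..."}. Keys are sorted so insertion order does not matter.
    """
    h = hashlib.sha256(data)
    for name in sorted(metadata):
        # A NUL byte separates fields so adjacent values cannot run together.
        h.update(b"\x00" + name.encode("utf-8"))
        h.update(b"\x00" + metadata[name].encode("utf-8"))
    return h.hexdigest()
```

With the first function, two files with identical bytes always match, whether or not one of them has a URL. With the second, a file that has a URL and one that doesn't would get different keys.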

I must not understand how duplicate detection works then. I have four files, all different, and they are being marked as duplicates.

The content of each is:

[bookends](bookends://sonnysoftware.com/77131)

and the number at the end varies, depending on the reference.

Check the Instances dropdown in the Info inspector to see where the other duplicates of each file are.

I have a small number of PDF files that are completely different inside, yet they are marked as duplicates when “strict” checking is disabled. However, I still want this “soft” duplicate check, because it finds very similar files that really are duplicates I want to get rid of.

My solution is to have a tag called .false_match and then modify the duplicate smart group to ignore duplicates with that tag.

(BTW, I start the tag name with a dot to indicate that it is a “system” tag and not a normal one.)

I know where they are. The content is different. That is the issue.

The rendered content is not different in this case. Both documents contain a single word: bookends. If the source of the document were not showing, it would certainly appear that the files have the same content.
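As a rough illustration only, and not DEVONthink's actual algorithm, the Python sketch below reduces a Markdown link to its visible text before comparing. The `rendered_text` helper and the second link ID (`12345`) are hypothetical; the first link is the one quoted earlier in the thread.

```python
import re

# Matches an inline Markdown link: [visible text](target)
LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")


def rendered_text(markdown_source: str) -> str:
    """Replace each inline link with its visible text, dropping the target."""
    return LINK.sub(r"\1", markdown_source).strip()


a = "[bookends](bookends://sonnysoftware.com/77131)"
b = "[bookends](bookends://sonnysoftware.com/12345)"  # hypothetical second ID

print(a == b)                                # the sources differ
print(rendered_text(a) == rendered_text(b))  # the visible text is the same
```

A comparison on the visible text alone treats these two files as identical, while a comparison on the raw source would not.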

Development would have to assess modifying this behavior.