Duplicate not marked as duplicate

Is there a definition somewhere of what counts as a duplicate in non-strict mode, and then in strict mode? I.e. what the criteria are? I can’t seem to find that listed in the manual and it’d be really useful.

For images (or PDF documents without text), the thumbnails have to be identical; for PDF documents, the page count has to match as well.

For text documents (including PDF+Text), the indexed text has to be the same (case-insensitive, alphanumeric characters only), so different file formats containing the same text, for example, are marked as duplicates. Strict mode additionally compares the file size and type.
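To make the text-document rule concrete, here is a rough sketch in Python. It is not the application's actual code; `TextDoc`, `normalized_text`, and `is_duplicate` are hypothetical names, and the normalization is only my reading of "case insensitive, only alphanumeric characters".

```python
from dataclasses import dataclass


@dataclass
class TextDoc:
    """Hypothetical stand-in for an indexed text document."""
    text: str       # the indexed text
    size: int       # file size in bytes
    file_type: str  # e.g. "pdf", "rtf", "txt"


def normalized_text(text: str) -> str:
    # Assumption: ignore case and drop everything that is not alphanumeric.
    return "".join(ch for ch in text.casefold() if ch.isalnum())


def is_duplicate(a: TextDoc, b: TextDoc, strict: bool = False) -> bool:
    # Non-strict: only the normalized indexed text has to match.
    if normalized_text(a.text) != normalized_text(b.text):
        return False
    # Strict: file size and type have to match as well.
    if strict:
        return a.size == b.size and a.file_type == b.file_type
    return True
```

Under these assumptions, a `.txt` and an `.rtf` file with the same wording but different punctuation would count as duplicates in non-strict mode, but not in strict mode.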

I’m having an intermittent issue with duplicates not being marked. Here is a screenshot from today of two different files, both PDFs: one gets marked as a duplicate and the other does not. I have not touched either of these files, and both were imported in exactly the same way.

Happy to send the files.

FYI, Adobe Acrobat (and some websites) provide tools to show the differences between two PDF files. Perhaps compare these two files and see what the results are?

Thanks for this, but they are both exactly the same file, as in I have accidentally downloaded multiple copies. Note the times. I have diff tools here that show the files to be identical.
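For anyone who wants to check the same thing without a diff tool, comparing checksums is one option. This is a minimal sketch using only the Python standard library; `sha256_of` and `same_bytes` are hypothetical helper names, and the paths in the usage note are placeholders.

```python
import hashlib


def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a: str, path_b: str) -> bool:
    # Matching digests mean the two files are byte-for-byte identical
    # for all practical purposes.
    return sha256_of(path_a) == sha256_of(path_b)
```

Calling `same_bytes("copy1.pdf", "copy2.pdf")` with the two downloaded copies would return `True` if they really are the same file.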

Rebuilding the database fixes this pretty much every time, but I’m having to do it far too often.

Are you able to reproduce this by importing certain files? Then a copy of the files would be great, thanks.