Too many false duplicate PDF files

I appreciate that DEVONthink doesn’t naively base its duplicate recognition on modification dates, file sizes or checksums. However, the duplicate detection AI is way too stupid when it comes to completely obvious file differences.

I have lots of PDF files that are, semantically speaking, image files: they mainly consist of PDF objects (vectors) and/or raster graphics. Often they also contain key numbers (those circled numbers that point to a certain image element), i.e. the only text in the files is the key numbers. (See screenshot)

The problem, as you may have guessed, is that DT declares all files that contain, for example, exactly the key numbers “1” and “2” as duplicates, completely regardless of file size, PDF page size or PDF object / raster graphic content. As a result, DT presents me with two dozen files that have absolutely nothing in common (except for their “text” content) as duplicates. This makes the duplicate detection useless, or at least very dangerous, when doing a clean-up.

There really should be some AI logic that takes into account not only the text content, but also other essential criteria such as PDF object content or PDF page size.

At least there should be a possibility to unmark duplicates manually if the DT “AI” fails.

Screen Shot 2012-07-16 at 15.29.04.png

DEVONthink compares either the text or a thumbnail (if there’s no text) and the number of pages. The thumbnail might be identical in this case. Could you please send some examples to cgrunenberg - at - devon-technologies.com? Thanks in advance!
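
The comparison described above (text, or a thumbnail when there is no text, plus the number of pages) can be sketched roughly as follows. This is only an illustration of the idea, not DEVONthink's actual implementation: the `Doc` record, the `content_key` function and the use of a SHA-256 hash for the thumbnail are all assumptions made for the example.

```python
from dataclasses import dataclass
import hashlib


@dataclass
class Doc:
    """Hypothetical stand-in for a document record; not a DEVONthink type."""
    name: str
    text: str         # extracted text, empty if the file has none
    thumbnail: bytes  # raw thumbnail data (placeholder bytes here)
    pages: int


def content_key(doc: Doc) -> tuple:
    """Build a comparison key from the text, or the thumbnail if there is no text,
    plus the page count. File size, page size and graphics are ignored."""
    words = doc.text.split()
    if words:
        # Whitespace is normalized so that only the words themselves count.
        basis = ("text", " ".join(words))
    else:
        basis = ("thumb", hashlib.sha256(doc.thumbnail).hexdigest())
    return basis + (doc.pages,)


# Two unrelated drawings whose only text is the key numbers "1" and "2".
a = Doc("engine.pdf", "1 2", b"engine drawing", 1)
b = Doc("gearbox.pdf", "1  2\n", b"gearbox drawing", 1)
# A file without any text falls back to the thumbnail.
c = Doc("scan.pdf", "", b"scan pixels", 1)

same_ab = content_key(a) == content_key(b)
same_ac = content_key(a) == content_key(c)
```

With a key like this, `same_ab` is true even though the two thumbnails differ, because the text is compared first and the graphics never enter the key. That is the kind of false duplicate reported in this thread.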

I’ve sent you 3 sets of “duplicates”. They differ in content, PDF page size, PDF metadata (PDF producer, PDF version, …), thumbnails, …. If you need more, I have plenty of them :wink:

As already mentioned, a quick fix (and a very useful one) would be the option to manually unmark files that DT has flagged as duplicates.

(I know, I could add some invisible text to each file to make them distinct for DT, but this is too time-consuming.)

Thank you for the examples, the next release will fix this.

After a first look, it seems to work properly now with the new version (2.4).

Thank you!

EDIT:

I have to correct myself.

The samples I sent you last time are now correctly recognized as non-duplicates. That’s fine.

But I just created a new database for some client projects (layout/design work), and I got false PDF duplicates again. However, the case is different this time:

The false duplicates

  • do have identical text content
  • do have identical image content
  • do have the same number of pages and same page size
  • do not have identical file sizes
  • do not have the same fonts, font sizes and text arrangement (!)

I don’t know whether you consider ignoring different fonts and text arrangement to be working as designed (WAD). In fact, this behavior may be OK for strictly technical documents where pure content is the only thing that really counts. But in this case (design work), the different fonts and text arrangement in the PDFs are the crucial criterion that makes the files different.

It would be very useful if DT were able to recognize these things as different. At least in the use-case of design / layout work.
(If you think that font and text arrangement differences should not be sufficient to declare files as different, you could make the behavior user selectable via preferences.)
Screen Shot 2012-08-09 at 18.10.58.png
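
The user-selectable behavior suggested in this post could look roughly like the sketch below. Everything in it is hypothetical: `PdfInfo`, `duplicate_key` and the `compare_layout` switch are names invented for the illustration, and it assumes the page size and font names have already been extracted from the PDF by some other means. Text arrangement is not modeled at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdfInfo:
    """Hypothetical summary of a PDF; field names are illustrative only."""
    text: str
    pages: int
    page_size: tuple   # (width, height) in points
    fonts: frozenset   # names of the fonts used in the file


def duplicate_key(info: PdfInfo, compare_layout: bool = False) -> tuple:
    """Content-only key by default; optionally add page size and fonts."""
    key = (" ".join(info.text.split()), info.pages)
    if compare_layout:
        # Stricter mode for design work: layout properties must match too.
        key += (info.page_size, info.fonts)
    return key


# Two layout drafts with identical text but different fonts.
draft_a = PdfInfo("Spring Sale", 1, (595, 842), frozenset({"Helvetica"}))
draft_b = PdfInfo("Spring Sale", 1, (595, 842), frozenset({"Garamond"}))

loose_match = duplicate_key(draft_a) == duplicate_key(draft_b)
strict_match = duplicate_key(draft_a, True) == duplicate_key(draft_b, True)
```

In the default mode the two drafts count as duplicates (`loose_match` is true). With `compare_layout` switched on they do not (`strict_match` is false), which is the distinction that matters for layout work.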

Only contents (image, text, number of pages) are important and indexed but…

…these properties do not matter actually. Right now it’s not a recognition of duplicate files but of duplicate contents.

I have some YouTube videos saved as web archives, and they show up as duplicates because they’re the same size and have the same thumbnails (i.e. black).
Is there a way to remedy this?

What exactly does the web archive contain? Only the video? Or the complete page? In that case it should probably not be marked as a duplicate.

Just the video.
For YouTube videos, I add “_popup” to the address in the URL field, so I get something like youtube.com/watch_popup?v=oHg5SJYRHA0 as a URL, and then file it via the bookmarklet.
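
The URL change described above can also be scripted. The following is a minimal sketch that assumes a full URL including the scheme (the `https://www.` prefix in the example is an assumption, the post only shows the bare address). `to_popup_url` is a name made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def to_popup_url(url: str) -> str:
    """Rewrite a YouTube /watch URL to its /watch_popup form.

    URLs with any other path are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.path == "/watch":
        parts = parts._replace(path="/watch_popup")
    return urlunsplit(parts)


popup = to_popup_url("https://www.youtube.com/watch?v=oHg5SJYRHA0")
```

The query string (`v=oHg5SJYRHA0`) is carried over untouched. Only the path component changes.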

One workaround would be to add some dummy text (e.g. the title) to the web archives. Alternatively, you could store just the video file instead of a web archive, e.g. by downloading it with DEVONagent Pro:

  1. Open the page in DEVONagent Pro
  2. Open the “Objects” pane (see Web menu)
  3. Select the “Tubes” scanner
  4. Option-click on “videoplayback”

The titles are all different anyway, but the video download works. I’ll keep that in mind, thanks a lot.