Running into quality issues after OCR

Hi all,

I’ve gotten up and running with DEVONthink Pro on the Mac in recent weeks and overall things are working great. However, I’ve run into two issues that prompted an adjustment, and that adjustment now appears to be causing a new issue. Here is what’s happening:

I get PDFs to review and annotate (almost entirely I’m doing highlighting) from lots of different sources for my work. I started noticing two issues with a few of the PDFs that I’ve imported into DEVONthink on the Mac:

  1. On a few PDFs, the annotations would come through (i.e. highlights show up in all the right places) but the actual text that shows up in the annotation window (or if I attempt to copy/paste the text from the PDF) is gibberish characters – despite appearing correct on screen inside DEVONthink and the PDF file itself.

  2. In some of these cases, I’m also unable to edit the author and title fields under “properties” for the PDF document in DEVONthink.

I saw a reference to at least one of the issues above on the forum here and the advice was to re-do OCR (even though they were OCR’d previously by another source) inside of DEVONthink. So, I did this.

Great news – in all cases, it resolves both problems above.

Bad news – in all cases, it appears to create a new problem.

The newly OCR’d PDF text now appears slightly fuzzy (i.e. the quality of the text is noticeably reduced). It wouldn’t be a huge issue if I were just keeping these files for archival purposes, but between the amount of reading I do on iPad and 40+ year old eyes, it’s a noticeable problem. =)

I have 300 dpi selected in my OCR settings and have attempted OCR both with and without compression. Both results appear equally fuzzy after the DEVONthink OCR (see attached examples). The original PDF is 2.3 MB; the newly OCR’d version is 162.4 MB with compression and 108 MB without (also odd that the uncompressed version is smaller).

Guessing I’m missing something obvious since I don’t see any references here to anybody else having quality issues after OCRing inside DEVONthink. Suggestions on what to try?

Related issue – assuming we get the fuzzy issue addressed, is there a way to do this without massively increasing file sizes?

The tough variable here is that I get PDFs from all different sources that I don’t have control over. Almost always these are computer generated (i.e. not scanned) but the software that did the original OCR/creation could be anything since I work with lots of different stakeholders.

Thank you in advance for any insight!

Dave

Wasn’t able to attach the other two examples above, so here’s after with no compression…

And after with compression…

FWIW I’ve noticed the same. I’m assuming it’s something to do with downsampling.
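To put rough numbers on that guess, here is a minimal sketch. It assumes the OCR step re-renders each page as a full-page image with a text layer on top, which nobody in this thread has confirmed, and it assumes US Letter pages in 24-bit colour. The function name `raster_page_bytes` is made up for illustration.

```python
def raster_page_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Uncompressed size in bytes of one page rendered as an image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Assumed page size: US Letter (8.5 x 11 in), 24-bit RGB (3 bytes per pixel).
for dpi in (150, 300, 600):
    size_mb = raster_page_bytes(8.5, 11, dpi) / 1_000_000
    print(f"{dpi} dpi: {size_mb:.1f} MB per page before compression")
```

If that assumption holds, the file size grows with page count and resolution rather than with the amount of text, which would be consistent with a 2.3 MB original growing past 100 MB. It would also fit the fuzziness: text drawn from fonts stays sharp at any zoom level, while a page image does not.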

Same - it’s a real pain. I really hope Apple fixes this. :persevere:

Welcome @Dave614

The issue with the garbled text is a problem with certain fonts and Apple’s PDFKit framework.

I’m curious as the file should have been opened as read-only. Did you apply the hidden preference to open all PDFs as read-only?

Thanks for the help!

Hidden preference inside of DEVONthink? Not sounding familiar, so probably not. How do I find/attempt? :pray:

See page 247 of the “DEVONthink Handbook” Version 3.8.3, section “Hidden Preferences” for the parameter “ForceEditablePDFs” which is explained there.

Is the original problematic PDF, i.e., without annotations, publicly available from somewhere?

Thanks @rmschne for the tip.

I went ahead and enabled this and can’t seem to find the issue any longer on any of the documents I’ve got in the library, so this might have resolved it. I’ll keep an eye on it as I start importing more. Big thanks! :pray:

Thanks for the follow-up on this, @BLUEFROG – sadly, no. However, it looks like the ForceEditablePDFs preference may have resolved it. I’ll keep an eye on it and, if it didn’t, find a document that I can share. Grateful for your help!

Realizing that I have one other related question on this, @BLUEFROG

Is my assumption correct that DEVONthink only runs OCR on import (if enabled in settings) if the PDF document being imported doesn’t already have OCR? This appears to be the case in my tests, but I just wanted to check before enabling OCR on import for everything. Thanks! :pray:

Afaik, DT never runs OCR on its own. You must somehow tell it to do so, for example with a smart rule or by triggering OCR manually.

Ahhh - so does the “convert incoming scans” in the OCR preferences do something else then?

The “convert incoming scans” preference will OCR things if they are scanned directly into DT by a supported scanner like the ScanSnap. But it won’t convert things if the scans are just imported as PDF files.

Got it! Makes sense - thanks a ton for getting me up to speed on this. :pray:

As @chrillek mentioned, DEVONthink doesn’t do OCR on its own - but that’s during a non-scanning import, e.g., dragging and dropping a file into the Global Inbox.

And @alanshutko is correct: The Preferences > OCR > Convert Incoming Scans is used when receiving input from an external scanning application, like ScanSnap Home.
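The behaviour described in the last few replies can be summarised as a small decision function. This is only a model of what was said in this thread, not DEVONthink code; the names `Source`, `ocr_runs`, and `user_triggered` are made up for illustration.

```python
from enum import Enum


class Source(Enum):
    SCANNER = "scanner"          # received from a scanner or scanning app
    FILE_IMPORT = "file_import"  # e.g. drag and drop into the Global Inbox


def ocr_runs(source, convert_incoming_scans, user_triggered=False):
    """Return True if OCR would run, per the replies in this thread.

    user_triggered covers a manual OCR command or a smart rule.
    """
    if user_triggered:
        return True
    # The preference only applies to documents arriving from a scanner.
    return source is Source.SCANNER and convert_incoming_scans


print(ocr_runs(Source.FILE_IMPORT, convert_incoming_scans=True))
print(ocr_runs(Source.SCANNER, convert_incoming_scans=True))
```

Under this model, turning on the preference does nothing for PDFs that are simply imported as files; those still need a manual OCR run or a smart rule.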

Brilliant – thank you all! :pray:

You’re welcome :slight_smile: