Help needed: Highlighting text in PDFs seems to mess up OCR'd text

Have you actually OCR’d these files (as suggested by your use of the term “OCR’d text”), or are you working with files that came to you with a text layer already in place?

Hi @Blanc, thanks for helping.

I didn’t OCR the document myself. It already had a text layer when I first received it, but I could tell it was a scanned and OCR’d document because of the image quality.

Some PDF writers have a setting that prevents changes but doesn’t ask for a password on open. So the lack of a password prompt on open could mean nothing.

Maybe “re-print” the PDF with your own software (Preview, DEVONthink, etc.).

Maybe your experience is similar to what is described in this thread: PDF highlighting and file size

In which case, you are running into a known issue for which there are, as of yet, no elegant solutions. :confused:

Thanks for the tip. I tried the following steps:

  • Open the original PDF in Preview
  • In Preview, go to File > Print… > Save as PDF
  • Open the reprinted PDF and verify the integrity of the text layer

And immediately the text layer was already messed up. :sweat_smile:
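For anyone who wants the "verify the integrity of the text layer" step to be less of an eyeball test, here is a minimal sketch. It assumes you have already copied the same passage out of the original PDF and out of the reprinted (or annotated) one and have both as plain strings; it does not read PDFs itself. The function names and the 0.9 threshold are my own illustrative choices, not anything from DEVONthink or Preview.

```python
import difflib


def text_layer_similarity(before: str, after: str) -> float:
    """Return a 0..1 similarity ratio between two text extractions.

    Whitespace is normalized first, because copying from a PDF viewer
    often changes line breaks even when the text layer is intact.
    """
    norm_before = " ".join(before.split())
    norm_after = " ".join(after.split())
    # autojunk=False keeps the ratio reliable on longer passages,
    # at the cost of speed on very long inputs.
    matcher = difflib.SequenceMatcher(None, norm_before, norm_after, autojunk=False)
    return matcher.ratio()


def looks_damaged(before: str, after: str, threshold: float = 0.9) -> bool:
    """Hypothetical rule of thumb: flag the copy if similarity is below threshold."""
    return text_layer_similarity(before, after) < threshold
```

A ratio close to 1.0 suggests the text layer survived; a sharp drop after reprinting or highlighting is the kind of damage described above. The threshold is only a guess and would need tuning on real documents.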

Thanks for directing me to that post! While I did not encounter the issue where PDF size increases drastically, I do suspect I’m experiencing a similar issue, especially this reply from Christian:

I guess I’ll avoid using PDFKit for now by annotating my OCR’d PDFs in PDF Expert for macOS or DEVONthink To Go, as noted in your reply in the other thread (thanks!).

The best solution for that is to re-OCR inside DT or with another tool. I have Abbyy for both Windows and macOS (the Windows version works waaaay better and has a zillion more options than the macOS one).

Well, since all that had nothing to do with DEVONthink, I can only assume the PDF is messed up (technical term), other software on your machine is messed up, or the original creator of the PDF didn’t want you to do this.

If this issue is indeed caused by the PDFKit framework (as pointed out in grosson’s reply above), doesn’t that mean it doesn’t matter which application handles OCR as long as I end up annotating the PDF inside DT?

That’s news to me. I’ve always heard good things about the precision of Abbyy’s OCR but never knew about the lack of feature parity.

I think we can safely exclude the last possibility, since I was able to annotate this specific PDF in PDF Expert without messing up the text layer. As noted in grosson’s reply above, perhaps the culprit is Apple’s PDFKit framework.

The issue is caused by PDFKit because the original PDF had security crap (DRM) enabled. Once you run a normal OCR over the document, all that crap is gone.
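If you want a rough way to check whether a PDF carries that kind of protection before blaming it, here is a small sketch. It relies on the fact that an encrypted PDF has an `/Encrypt` entry in its trailer dictionary, and simply looks for that name in the raw bytes. This is a heuristic and an assumption on my part about what "security enabled" means for this particular file: it can give a false positive if the string happens to appear elsewhere in the file, and it says nothing about which permissions are restricted. The function names are made up for illustration.

```python
from pathlib import Path


def has_encrypt_marker(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes contain an /Encrypt key.

    This does not parse the PDF; it only searches for the name that an
    encrypted document's trailer dictionary is expected to contain.
    """
    return b"/Encrypt" in pdf_bytes


def file_has_encrypt_marker(path: str) -> bool:
    """Convenience wrapper that reads the whole file from disk first."""
    return has_encrypt_marker(Path(path).read_bytes())
```

Running `file_has_encrypt_marker` on the original and on the re-OCR’d copy would at least show whether the marker disappears after OCR.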

Not only fewer options, but it’s buggier too. For example, the macOS version lacks the option to select which version of PDF/A you want, or to add the extra-fine font tuning (I don’t remember the name now) over MRC. For recognition, it has more options, like detection of TOC.

And just now I’m OCRing the first volume of the 11th edition of the Britannica. The macOS version crashed, and before crashing it complained about internal errors and poor scan quality, but the Windows version is getting through it with only 3 warnings about language recognition.

Sometimes the macOS version asks for a higher-resolution scan, say 600 DPI. I do that and then it complains about too much resolution; I do it at 300 and it complains about too little; do it at 600…

Maybe if you have a colleague who has Windows and Adobe Acrobat (or similar), you could try to get a clean OCR’d version there. Or there are Mac PDF products (but they may also depend on PDFKit). Or Linux tools (but I have forgotten more than I ever knew about Linux infrastructure). That stretches to the end of my list of ideas.

If it’s neither an illegal document nor a private one, I can do the OCR for you (send it to me privately), but perhaps you can try from DT itself: right-click -> OCR -> to searchable PDF.

Ah thanks for the explanation, now I totally see why it could potentially solve the issue. I’ll re-apply OCR to it and see if it works!

That’s quite a handful of ideas! I guess I’ll start with the basics (re-apply OCR in DT) and work my way up your list of ideas :smiley:

I had a huge PDF of a United Nations document that I tried over and over to OCR. I think it was images of pages instead of a real scan. On other docs like that, DEVONthink and/or Preview got me through the trouble. But the UN document and one other had something in them that caused all tools to consume 22 GB of memory (on my 40 GB iMac) before simply stopping. Gave up. Reported it to the vendors. Whoever made the PDF at the UN needs some remedial computer training, I think.

True, sounds like a horrible ordeal.

I cannot tell you how many documents like that I have had to work with. File arrives in HQ by mail, is printed out, stamped, scanned to PDF as image, sent on.

On the original topic (and I’m sorry I’ve joined the party so late), I’m pretty sure reapplying OCR will solve the problem (which was why I asked the question I did further up there somewhere). @xurc let us know whether you are successful :slight_smile:

Apologies for the delayed reply. I tried the following steps and they did indeed solve the issue:

  • Import the original PDF into DT again
  • Re-apply OCR to the PDF
  • Open the PDF produced by OCR and add some highlights to the text
  • Close the PDF and open it again
  • Verify that
    • I can still copy the content and paste it correctly elsewhere
    • The Annotations section of the Inspection pane can still display the highlighted content correctly

The only caveat is the size of the PDF produced by OCR, which went from 4.4 MB to 110 MB. I’ve encountered this issue multiple times and never figured out how to solve it, which is why I always shy away from OCR in DT whenever possible.
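To put a number on that growth, here is a trivial sketch of the arithmetic. The function name is just illustrative, and the sizes are the two figures quoted above.

```python
def growth_factor(before_mb: float, after_mb: float) -> float:
    """Return how many times larger the file became after OCR."""
    if before_mb <= 0:
        raise ValueError("before_mb must be positive")
    return after_mb / before_mb


# Sizes reported above: 4.4 MB before OCR, 110 MB after.
print(f"{growth_factor(4.4, 110):.1f}x")  # prints 25.0x
```

So the OCR’d copy is roughly 25 times the size of the original, which is worth mentioning when reporting this.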

Hold the Option key and choose Help > Report bug to start a support ticket. Zip and attach the original PDF for us to inspect. Thanks.