Help needed: Highlighting text in PDFs seems to mess up OCR'd text

xurc · October 25, 2020, 8:16am

Hello,

Recently I started annotating PDFs in DEVONthink. I just noticed today that the OCR’d text became messed up:

Notice that in the screenshot,

The Content column in the inspection pane displays all highlighted text as question marks. When I copy the text from the PDF and paste it elsewhere, the result is messed up too: 􏱊􏱨􏱈􏰙􏱈􏰚
The word count of the document is only 36 words, which belong to the comment I added while annotating
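
The pasted characters above appear to be Unicode private-use code points, which can indicate that the mapping from the PDF's glyphs back to real characters was lost. A minimal sketch in Python for flagging such text, assuming you already have the copied or extracted text as a string. The function names and the 0.5 threshold are made up for illustration and are not anything DEVONthink itself does:

```python
import unicodedata


def private_use_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are private-use code points."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # Unicode general category "Co" marks private-use characters.
    private = sum(1 for c in chars if unicodedata.category(c) == "Co")
    return private / len(chars)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Heuristic (assumed threshold): flag text that is mostly private-use code points."""
    return private_use_ratio(text) >= threshold


if __name__ == "__main__":
    good = "The PDF was correctly OCR'd."
    # Hypothetical sample built from private-use code points in plane 16.
    bad = "".join(chr(cp) for cp in (0x10FC4A, 0x10FC68, 0x10FC48))
    print(looks_garbled(good))
    print(looks_garbled(bad))
```

Running this on text copied from the original file and from the annotated copy would give a quick before/after comparison, alongside the word count.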

I opened the original PDF (stored outside of DT) and didn’t have those issues,

The PDF was correctly OCR’d. When I copy the text and paste elsewhere, it was displayed correctly.
I imported this PDF into DT again and the word count was around 18,000.

Next, I tried opening the original PDF in PDF Expert for macOS. After adding highlights and comments, I closed the application and opened it again. I could still copy & paste the text correctly.

After those tests, I decided the issues were probably either caused by how DT handled PDF annotations or how I operated DT, so I came here for help. Why is the OCR’d PDF messed up in DT and how do I avoid it in the future? Thanks in advance

rfog · October 25, 2020, 9:34am

I’m not sure if it is related, but sometimes I’ve found that some well-recognized OCR PDFs go berserk once you split/add pages, or sometimes on annotation if they have some kind of “security” enabled (password for copy or print, etc.) and the program you use does not honor those things.

I think DT honors those things, so the problem was most likely generated outside of it.

xurc · October 25, 2020, 9:47am

@rfog - thanks for sharing your experience!

It sure looked like the OCR’d text layer was completely messed up, although the only two things I did to the PDF were adding highlights and adding comments.

The PDF I was working on didn’t ask for a password when I opened it and afaik there’s no security measures in place.

rfog · October 25, 2020, 10:48am

xurc wrote: “The PDF I was working on didn’t ask for a password when I opened it and afaik there’s no security measures in place.”

It does not need to ask for anything. If the program does not honor the restrictions, it asks nothing and does whatever you tell it to do, ignoring all security, but then the certificate is broken and the text becomes garbage.

xurc · October 25, 2020, 11:08am

Oh I see. I don’t think I’ve ever worked with protected PDFs and have little idea how they work. Could it be that this specific PDF is protected but I’m not aware of it, and DT happens to not honor the security certificate? I have no idea

Blanc · October 25, 2020, 11:21am

Have you actually OCR’d these files (as suggested by your use of the term “OCR’d text”), or are you working with files which have come to you with a text layer already in place?

xurc · October 25, 2020, 11:37am

Hi @Blanc , thanks for helping.

I didn’t OCR the document myself. It already had a text layer when I first received it, but I could tell it was a scanned and OCR’d document because of the image quality.

rmschne · October 25, 2020, 11:53am

Some PDF writers have a setting to prevent changes but don’t ask for a password on open, so the lack of a password prompt on open could mean nothing.
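
To illustrate the point: a PDF can be encrypted with only an owner (permissions) password, in which case viewers open it without prompting. Encrypted PDFs carry an /Encrypt entry in the file trailer, so one crude check is to look for that key in the raw bytes. A rough Python sketch, assuming the key appears as plain bytes in the file; `has_encrypt_entry` and `check_file` are hypothetical helper names, and a dedicated PDF library would be more reliable than this byte search:

```python
import sys
from pathlib import Path


def has_encrypt_entry(pdf_bytes: bytes) -> bool:
    """Crude heuristic: True if the raw PDF bytes contain an /Encrypt key."""
    return b"/Encrypt" in pdf_bytes


def check_file(path: str) -> bool:
    """Read a PDF from disk and apply the heuristic above."""
    return has_encrypt_entry(Path(path).read_bytes())


if __name__ == "__main__":
    # Usage (file names are whatever you pass on the command line):
    #   python check_encrypt.py some.pdf other.pdf
    for name in sys.argv[1:]:
        status = "has an /Encrypt entry" if check_file(name) else "no /Encrypt entry found"
        print(f"{name}: {status}")
```

A hit only says the file declares some form of encryption. It does not say which permissions are restricted, and it does not prove that the restrictions caused the garbled text.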

Maybe “re-print” the PDF with your own software (Preview, DEVONthink, etc.).

grosson · October 25, 2020, 12:21pm

Maybe your experience is similar to what is described in this thread: PDF highlighting and file size

In which case, you are running into a known issue for which there are, as of yet, no elegant solutions.

xurc · October 25, 2020, 12:35pm

Thanks for the tip. I tried the following steps:

Open the original PDF in Preview
In Preview, go to File > Print… > Save as PDF
Open the reprinted PDF and verify the integrity of the text layer

And immediately the text layer was already messed up.

xurc · October 25, 2020, 12:47pm

Thanks for directing me to that post! While I did not encounter the issue where PDF size increases drastically, I do suspect I’m experiencing a similar issue, especially this reply from Christian:

I guess I’ll avoid using PDFKit for now by annotating my OCR’d PDFs in PDF Expert for macOS or DEVONthink To Go, as noted by your reply in the other thread (thanks!).

rfog · October 25, 2020, 12:47pm

The best solution for that is to re-OCR inside DT or with another tool. I have Abbyy for both Windows and macOS (the Windows version works waaaay better and has a zillion more options than the macOS one).

rmschne · October 25, 2020, 12:48pm

Well, since all that had nothing to do with DEVONthink, I can only assume the PDF is messed up (technical term), other software on your machine is messed up, or the original creator of the PDF didn’t want you to do this.

xurc · October 25, 2020, 12:53pm

If this issue is indeed caused by the PDFKit framework (as pointed out in grosson’s reply above), doesn’t that mean it doesn’t matter which application handles OCR as long as I end up annotating the PDF inside DT?

That’s news to me. I’ve always heard good things about the precision of Abbyy’s OCR but never knew about the lack of feature parity.

xurc · October 25, 2020, 12:57pm

I think we can safely exclude the last possibility, since I was able to annotate this specific PDF in PDF Expert without messing up the text layer. As noted in grosson’s reply above, perhaps the culprit is Apple’s PDFKit framework.

rfog · October 25, 2020, 1:10pm

The issue is caused by PDFKit because the original PDF had security crap (DRM) enabled. Once you run a normal OCR over the document, all that crap is gone.

Not only fewer options, but also more bugs. For example, the macOS version lacks the option to select which version of PDF/A you want, or to add the extra fine font tuning (I don’t remember the name now) over MRC. For recognition, the Windows version has more options, like detection of TOC.

And just now I’m OCRing the first volume of the 11th edition of the Britannica. The macOS version crashed, and before the crash it complained about internal errors and about poor scan quality, but the Windows version is doing it with only 3 warnings about language recognition.

Sometimes the macOS version asks for a higher-resolution scan, say 600 DPI. I do it and then it complains about too much resolution, so I do it at 300, then it complains about too little resolution, so I do it at 600…

rmschne · October 25, 2020, 1:10pm

Maybe if you have a colleague who has Windows and Adobe Acrobat (or similar), you could try there to get a clean OCR’ed version. Or there are Mac PDF products (but may depend also on PDFKit). Or Linux tools (but have forgotten more than I ever knew about Linux infrastructure). Stretched to end of my list of ideas.

rfog · October 25, 2020, 1:12pm

If it is neither an illegal document nor a private one, I can do the OCR for you (send it to me privately), but perhaps you can try from DT itself: right click -> OCR -> to searchable PDF.

xurc · October 25, 2020, 1:23pm

Ah thanks for the explanation, now I totally see why it could potentially solve the issue. I’ll re-apply OCR to it and see if it works!

xurc · October 25, 2020, 1:25pm

That’s quite a handful of ideas! I guess I’ll start with the basics (re-apply OCR in DT) and work my way up your list of ideas.