Help needed: Highlighting text in PDFs seems to mess up OCR'd text

rmschne · October 25, 2020, 1:39pm

I had a huge United Nations PDF that I tried over and over to OCR. I think it consisted of images of pages rather than a real scan. With other documents like that, DEVONthink and/or Preview got me through the trouble. But this one and one other document had something in them that caused all the tools to consume 22 GB of memory (on my 40 GB iMac) before simply stopping. I gave up and reported it to the vendors. Whoever made the PDF at the UN needs some remedial computer training, I think.

xurc · October 25, 2020, 1:44pm

True, sounds like a horrible ordeal.

Blanc · October 25, 2020, 6:57pm

I cannot tell you how many documents like that I have had to work with. File arrives in HQ by mail, is printed out, stamped, scanned to PDF as image, sent on.

On the original topic (and I’m sorry I’ve joined the party so late), I’m pretty sure reapplying OCR will solve the problem (which was why I asked the question I did further up there somewhere). @xurc let us know whether you are successful.

xurc · October 26, 2020, 2:21am

Apologies for the delayed reply. I tried the following steps, and they did indeed solve the issue:

1. Import the original PDF into DT again
2. Re-apply OCR to the PDF
3. Open the PDF produced by OCR and add some highlights to the text
4. Close the PDF and open it again
5. Verify that
- I can still copy the content and paste it correctly elsewhere
- The Annotations section of the Inspection pane can still display the highlighted content correctly

The only caveat is the size of the PDF produced by OCR, which went from 4.4 MB to 110 MB (roughly 25 times larger). I’ve encountered this issue multiple times and never figured out how to solve it, which is why I always shy away from OCR in DT whenever possible.

BLUEFROG · October 26, 2020, 2:59am

Hold the Option key and choose Help > Report bug to start a support ticket. Zip and attach the original PDF for us to inspect. Thanks.

xurc · October 26, 2020, 3:10am

Will do. Thanks!