System Dialog "Print to PDF" yields random OCR text

Jeff839 · May 19, 2024, 11:43pm

When I display a document (say from Bank of America) in my Safari window and then use the system “Print to PDF” option and the “Save PDF to DEVONthink 3” sub-option, the OCR is garbage (even though I can copy and paste correct text from the Safari display of the page).

This seems to be more likely with Bank of America sites. What am I doing wrong?

BLUEFROG · May 20, 2024, 1:55am

We have no control over the output of printing to PDF from any application. DEVONthink only receives the file, nothing more.

cgrunenberg · May 20, 2024, 5:13am

Did you actually perform OCR? A PDF printed to another app should rarely need OCR. Or do you just mean that the text layer is garbage? This is not controlled by DEVONthink, as the system generates the PDF and then sends it to DEVONthink.

Jeff839 · May 20, 2024, 6:38pm

What I’m experiencing seems somewhat random, in that some documents “printed” from BoA have correct OCR. Is it possible to instruct DT to OCR every document that shows up in its inbox, so there’s a consistent text layer?

Sincerely,

Jeff

BLUEFROG · May 20, 2024, 6:48pm

This is a misunderstanding or misapplication of terminology.

OCR is the process of recognizing shapes in images as letterforms and creating text from them, very commonly as a text layer on a PDF.

A PDF printed from a text-based original, say a rich text file, a web page, or another PDF, already includes text. There is no need to run OCR on such documents, and doing so can produce a less accurate document.

Open a support ticket and attach a problematic PDF.
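
For anyone who wants to tell apart the PDFs with a usable text layer from the ones with a garbled one before filing a ticket, here is a rough sketch of a heuristic check. It assumes you have already copied or extracted the text from the PDF by some other means; the function names and the 0.6 threshold are made up for illustration and are not part of DEVONthink or macOS.

```python
import re

VOWELS = set("aeiouyAEIOUY")
# Loose pattern for amounts, dates, and similar numeric tokens.
NUMERIC = re.compile(r"[$]?\d[\d,.\-/]*%?")


def plausible_word_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look like
    ordinary words (letters with at least one vowel) or numeric values."""
    tokens = re.findall(r"\S+", text)
    if not tokens:
        return 0.0
    plausible = 0
    for token in tokens:
        # Drop common surrounding punctuation before judging the token.
        core = token.strip(".,;:!?()[]\"'")
        if core.isalpha() and any(ch in VOWELS for ch in core):
            plausible += 1
        elif NUMERIC.fullmatch(core):
            plausible += 1
    return plausible / len(tokens)


def looks_garbled(text: str, threshold: float = 0.6) -> bool:
    """Flag text whose plausible-token ratio falls below the threshold.
    Empty text is also flagged, since it means there is no usable text layer."""
    return plausible_word_ratio(text) < threshold
```

This is only a crude filter for English-like text. A document it flags still has to be inspected by hand, and the threshold would likely need tuning for statements that are mostly tables or account numbers.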