OCR wild file size and a clue?

OCR at 150 dpi typically bloats a 2 MB file to 75 MB, and even a 450 KB file to 6 MB. This happens with all PDF files from any source, and it’s been happening for years. The huge OCR’d file is also a lot less clear.

In another post, I saw you requested the OCR.plist file – but there’s no such file in Application Support for ABBYY. There’s only languages.plist and DTOCRHelper.app. Is that the problem? Something else?

Thanks.

That file is not usually present. DT offers it for download to enable OCR logging. Alan mentions the procedure in numerous posts; here, for example.

Ah, I misread that post. Thanks, will do.

Select the 75 MB file in DEVONthink and choose Tools > Inspectors > (Document) Properties from the menu. Does the Producer entry at the bottom of the list say “Abbyy FineReader xx” or “… Quartz PDF Context”?

The originals say Quartz or Calibre, but those don’t work well, so I redo OCR in DT and the file gets bigger. If the problem is OCR’ing a PDF that has already been OCR’d differently, is it possible to strip the old OCR out of the PDF? Or is there a better way to do this with an existing PDF?

No, redoing OCR isn’t the issue. There won’t be multiple text layers. The text layer of the original is discarded, and each page’s image is processed and parsed, then reassembled into the PDF.

Thanks, Jim. So… why does the size of the PDF get multiplied by about 10? Yikes!

What are your OCR preferences?

Here are my OCR preferences:

What does the Producer entry say for the OCR’d document? If it says “… Quartz PDF Context”, the document has been resaved after OCR. This usually happens in order to transfer annotations or metadata to the new document. The resave is done using Apple’s PDFKit, which does not provide the same level of compression as ABBYY.
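For anyone who would rather check the Producer entry with a script than through the inspector, here is a rough sketch in Python. It is not anything DEVONthink provides: `find_producer` and `was_resaved` are hypothetical helper names, and the approach assumes the PDF is unencrypted and stores its Info dictionary uncompressed with the Producer written as a plain literal string. Hex or UTF-16 strings and compressed object streams are not handled, so a `None` result does not prove the entry is missing.

```python
def find_producer(pdf_bytes: bytes):
    """Return the first literal /Producer string in raw PDF bytes, or None.

    Simplified on purpose: handles balanced parentheses and backslash
    escapes, but keeps escaped characters literally instead of decoding
    sequences such as \\n or octal codes.
    """
    key = b"/Producer"
    start = pdf_bytes.find(key)
    if start == -1:
        return None
    i = start + len(key)
    # Skip whitespace between the key and its value.
    while i < len(pdf_bytes) and pdf_bytes[i:i + 1].isspace():
        i += 1
    if pdf_bytes[i:i + 1] != b"(":
        return None  # hex string or indirect reference: not handled here
    depth = 0
    out = bytearray()
    while i < len(pdf_bytes):
        ch = pdf_bytes[i:i + 1]
        if ch == b"\\":
            out += pdf_bytes[i + 1:i + 2]  # keep the escaped byte as-is
            i += 2
            continue
        if ch == b"(":
            depth += 1
            if depth > 1:
                out += ch  # nested parenthesis belongs to the string
        elif ch == b")":
            depth -= 1
            if depth == 0:
                return out.decode("latin-1")
            out += ch
        else:
            out += ch
        i += 1
    return None  # unterminated string


def was_resaved(producer):
    """Heuristic from this thread: a Quartz producer suggests a PDFKit resave."""
    return producer is not None and "quartz" in producer.lower()
```

To try it on a real file, something like `was_resaved(find_producer(Path("scan.pdf").read_bytes()))` should work, where `scan.pdf` is a placeholder name and `Path` comes from `pathlib`.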

Is it possible to share the OCR’d document? I will send you a direct message with my email address to send it to.

The OCR’d document’s Producer is Quartz PDF Context. I’ll send you the two before-and-after documents now.

I also tried running OCR on the original on my iPad to see if there was a difference:

Original: 1 MB
DT3: 21 MB
DTTG: 70 MB
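To put the figures quoted in this thread side by side, here is a small Python sketch that works out the growth factor for each before-and-after pair. The sizes are the rounded ones posted above, and the 450 KB figure is converted assuming 1 MB = 1000 KB, so the results are only approximate.

```python
# (before, after) sizes in MB, as quoted in this thread.
reports = {
    "2 MB file": (2.0, 75.0),
    "450 KB file": (0.45, 6.0),  # assumes 1 MB = 1000 KB
    "DT3": (1.0, 21.0),
    "DTTG": (1.0, 70.0),
}


def growth_factor(before_mb, after_mb):
    """How many times larger the OCR'd file is than the original."""
    return after_mb / before_mb


factors = {name: growth_factor(b, a) for name, (b, a) in reports.items()}

for name, factor in factors.items():
    print(f"{name}: about {factor:.1f}x larger")
```

On these numbers the growth ranges from roughly 13x to 70x, so "about 10" is, if anything, an understatement.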

Note that DEVONthink and DEVONthink To Go use different OCR engines and have their own settings as well.

Hold the Option key and choose Help > Report bug to start a support ticket. Please compress and attach the original PDF too. Thanks!

Will do. I’ve already sent the PDFs to Alan as well via email.