Fix blurry bloated OCR PDF?

I often have to work with old papers, which usually come either as a scan with a poor transparent text layer over the image or as a bare scan. OCR in DT is OK for text and sometimes produces a better text layer than the existing one.

The problem is that OCR reduces image quality a lot. What was a crisp scan becomes a blurry mush.

Before:

After:

What’s more, it doesn’t even reduce file size. Original (before) is 365.1 KB, OCR is 971.4 KB.

Increasing resolution to 300 dpi produces better images but results in a 2.9 MB file. 8x the size! It’s not a complete dealbreaker, but it is a little unreasonable for a 6-page paper. When it comes to bigger papers or books it gets somewhat ridiculous.

Is there a way to keep original scans in place and only add text over them?
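For reference, the size increases quoted above can be checked with a few lines of Python. This is only a quick sketch: the function name `size_ratio` is made up for illustration, and it assumes 1 MB = 1000 KB (with 1024 the last figure comes out slightly above 8x instead of slightly below).

```python
def size_ratio(new_kb: float, original_kb: float) -> float:
    """Return how many times larger the new file is than the original."""
    return new_kb / original_kb


original_kb = 365.1           # original scan
ocr_default_kb = 971.4        # after OCR at the default resolution
ocr_300dpi_kb = 2.9 * 1000    # after OCR at 300 dpi; assumes 1 MB = 1000 KB

print(f"default OCR: {size_ratio(ocr_default_kb, original_kb):.1f}x")  # 2.7x
print(f"300 dpi OCR: {size_ratio(ocr_300dpi_kb, original_kb):.1f}x")   # 7.9x
```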

No, this is not possible with the current OCR engine. Development would have to weigh in on this.

+1. OCR works well but this issue is a real frustration.

Have you looked into how the more specialised PDF apps (Acrobat, PDF Pro, and there are more) can help with this?

I haven’t yet, though asking here was worth it: the integration is nice, and if there were a way to make it work in DT, that would be the ideal solution.

I guess if you tried it, you could spot the differences and then post that information here. Those differences might not have been noticed yet, and DEVONthink might appreciate hearing about them.

I don’t quite follow how my experiments with third-party software can help here.

From Jim’s reply it looked to me like they understand the issue and the limitations of the current implementation.

OK. Just thought it would be helpful.

It’s worth noting that this is how PDFPen does OCR - it adds an editable text layer over the image in the PDF file. It also can remove existing text layers if you want to re-OCR a page or document in case the original OCR had problems.

If you use Setapp, PDFPen is included.

Editing the OCR layer is a feature of PDFPen Pro. Setapp mentions only PDFPen, without the Pro, yet it says that one can edit the OCR layer. Confusing.

I can confirm that the PDFpen included in Setapp includes OCR capabilities.

It is normal for OCR software to add a text layer to the PDF, but quite a few such apps cannot preserve the scanned images as-is. This is a serious limitation that many OCR apps for the Mac suffer from, including the one bundled with DEVONthink. Often it comes from being based on Apple’s PDFKit software framework, which can read but apparently cannot write the bi-level JBIG2 and CCITT Group 4 fax compression codecs that are typically used for “black text on white background” scans.

The expensive Acrobat PDF software is said to do this properly, and OCRKit (http://ocrkit.com) can do it, too, at least it did work with the documents I processed with it in autumn 2020.
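If you want to check which codecs a given PDF uses before and after OCR, a rough sketch in Python follows. It is not a real PDF parser: it only scans the raw bytes for `/Filter` entries, so it counts filters on all streams (not just images) and will miss anything stored inside compressed object streams. The function name `count_stream_filters` is made up for illustration; the filter names it reports (`JBIG2Decode`, `CCITTFaxDecode`, `DCTDecode`, `FlateDecode`) are the standard names from the PDF specification.

```python
import re
import sys
from collections import Counter

# A /Filter entry is followed by either a single name or an array of names.
FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")
NAME_RE = re.compile(rb"/([A-Za-z0-9]+)")


def count_stream_filters(pdf_bytes: bytes) -> Counter:
    """Count filter names found in raw PDF bytes (heuristic, not a parser)."""
    counts = Counter()
    for match in FILTER_RE.finditer(pdf_bytes):
        for name in NAME_RE.findall(match.group(1)):
            counts[name.decode("ascii")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python filters.py original.pdf ocr.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, dict(count_stream_filters(handle.read())))
```

A bi-level scan would typically show `JBIG2Decode` or `CCITTFaxDecode`. If the OCR'd copy shows `DCTDecode` (JPEG) instead, the images were probably re-encoded, which would fit both the blur and the size increase described in this thread.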

I too have this issue. However, an earlier thread titled “After OCR, PDF reduces size a lot” indicates the problem has been fixed. How do we get the fixed code?

Welcome @streborg

Did you update to 3.6.2?

Yes, I have tested this on 3.6.2. Problem persists. Here is a small portion of the original PDF.
Original PDF

Here is a small portion of the OCR’d PDF.
PDF after OCR

Hello. Have there been any updates on this topic? I have DT 3.7.2 and see a definite reduction in crispness after having the built-in OCR engine add the text layer, even with DPI set to 300.

Thanks

We are currently waiting for the next update from ABBYY.

Do you have an ETA, or an interim workaround? From this thread, I see you have been waiting for at least 8 months for this ABBYY update.

My wife is not happy with the fuzzy results of her OCRed documents. I notice it myself, but her eyes are much more sensitive to it, and it bugs her much more.