Fix blurry bloated OCR PDF?

PointlessOne · February 7, 2021, 8:14pm

I often have to work with old papers, and they usually come either as a scan with a bad transparent text layer over the image or as a bare scan. OCR in DT is OK for text and sometimes produces better text than the existing transparent layer.

The problem is that OCR reduces image quality a lot. What was a crisp scan becomes a blurry mush.

Before:

After:

What’s more, it doesn’t even reduce file size. The original (before) is 365.1 KB; the OCR’d version is 971.4 KB.

Increasing resolution to 300 dpi produces better images but results in a 2.9 MB file. 8x the size! It’s not a complete dealbreaker but is a little unreasonable for a 6 page paper. When it comes to bigger papers or books it gets somewhat ridiculous.

Is there a way to keep original scans in place and only add text over them?
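For reference, the size growth quoted above can be checked with a quick calculation. This is only a sketch: the helper name `growth_factor` is made up for illustration, and it assumes 1 MB = 1000 KB, which the figures above do not specify.

```python
def growth_factor(original_kb: float, new_kb: float) -> float:
    """Return how many times larger the new file is than the original."""
    if original_kb <= 0:
        raise ValueError("original size must be positive")
    return new_kb / original_kb


ORIGINAL_KB = 365.1          # original scan
OCR_FIRST_KB = 971.4         # after the first OCR pass
OCR_300DPI_KB = 2.9 * 1000   # after OCR at 300 dpi (assumes 1 MB = 1000 KB)

print(f"first OCR pass: {growth_factor(ORIGINAL_KB, OCR_FIRST_KB):.1f}x")
print(f"300 dpi OCR:    {growth_factor(ORIGINAL_KB, OCR_300DPI_KB):.1f}x")
```

With these numbers the first pass comes out at roughly 2.7x and the 300 dpi pass at roughly 7.9x, which matches the "8x" estimate.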

BLUEFROG · February 7, 2021, 10:29pm

No, this is not possible with the current OCR engine. Development would have to weigh in on this.

Hankk2 · February 7, 2021, 11:52pm

+1. OCR works well but this issue is a real frustration.

rmschne · February 8, 2021, 9:12am

Have you looked into how the more specialised PDF apps (Acrobat, PDF Pro, and there are more) can help with this?

PointlessOne · February 8, 2021, 9:50am

I haven’t yet. Still, asking here was worth it, since the integration is nice, and if there were a way to make it work it’d be an ideal solution.

rmschne · February 8, 2021, 10:15am

I guess if you tried them, you could spot the differences and then post that information here. Those differences might not have been noticed, and DEVONthink might appreciate hearing about them.

PointlessOne · February 8, 2021, 10:36am

I don’t quite follow how my experiments with third-party software can help here.

From Jim’s reply it looked to me like they understand the issue and the limitations of the current implementation.

rmschne · February 8, 2021, 10:47am

OK. Just thought it would be helpful.

dmlounsbury · February 9, 2021, 2:47pm

It’s worth noting that this is how PDFPen does OCR - it adds an editable text layer over the image in the PDF file. It also can remove existing text layers if you want to re-OCR a page or document in case the original OCR had problems.

amalis · February 9, 2021, 3:02pm

If you use Setapp, PDFPen is included.

chrillek · February 9, 2021, 3:07pm

Editing the OCR layer is a feature of PDFPen Pro. Setapp mentions only PDFPen, without the Pro, yet it also says that one can edit the OCR layer. Confusing.

amalis · February 9, 2021, 3:17pm

I can confirm that the PDFpen included in Setapp includes OCR capabilities.

Macster · February 10, 2021, 1:05am

It is normal for OCR software to add a text layer to the PDF, but quite a few such apps cannot preserve the scanned images as-is. That is a serious limitation of many OCR apps for the Mac, including the one bundled with DEVONthink. Often it comes from being based on Apple’s PDFKit framework, which can read but apparently cannot write the bi-level JBIG2 and CCITT Group 4 fax compression codecs typically used for “black text on white background” scans.

The expensive Acrobat PDF software is said to do this properly, and OCRKit (http://ocrkit.com) can do it too; at least it worked with the documents I processed with it in autumn 2020.
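To see whether an OCR pass has re-encoded the page images, one rough check is to look for the compression filter names in the PDF before and after. The sketch below is a heuristic, not a real PDF parser: it simply searches the raw bytes for the standard filter names (`/JBIG2Decode`, `/CCITTFaxDecode`, `/DCTDecode`, `/JPXDecode`), so it can miss images whose dictionaries sit inside compressed object streams and may occasionally count a chance match inside binary data. The function names and the file names in the usage note are made up for illustration.

```python
import re
from collections import Counter
from pathlib import Path

# Standard PDF stream filter names and the codecs they stand for.
IMAGE_FILTERS = {
    b"JBIG2Decode": "JBIG2 (bi-level)",
    b"CCITTFaxDecode": "CCITT Group 3/4 fax (bi-level)",
    b"DCTDecode": "JPEG",
    b"JPXDecode": "JPEG 2000",
}

# Matches any of the filter names above when written as a PDF name (/Name).
_PATTERN = re.compile(rb"/(" + b"|".join(IMAGE_FILTERS) + rb")\b")


def count_image_filters(pdf_bytes: bytes) -> Counter:
    """Count occurrences of known image filter names in raw PDF bytes."""
    counts = Counter()
    for match in _PATTERN.finditer(pdf_bytes):
        counts[IMAGE_FILTERS[match.group(1)]] += 1
    return counts


def report(path: str) -> None:
    """Print the filter counts for the PDF file at the given path."""
    counts = count_image_filters(Path(path).read_bytes())
    for codec, n in counts.most_common():
        print(f"{codec}: {n}")
```

Calling `report("before.pdf")` and `report("after.pdf")` (hypothetical file names) on the original and the OCR’d file would show which codecs each one uses. If the original lists JBIG2 or CCITT and the OCR’d copy lists only JPEG, that would be consistent with the images having been re-encoded.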

streborg · February 10, 2021, 11:10pm

I too have this issue. However, an earlier thread titled “After OCR, PDF reduces size a lot” indicates the problem has been fixed. How do we get the fixed code?

BLUEFROG · February 10, 2021, 11:18pm

Welcome @streborg

Did you update to 3.6.2?

streborg · February 11, 2021, 1:40am

Yes, I have tested this on 3.6.2. Problem persists. Here is a small portion of the original PDF.

streborg · February 11, 2021, 1:41am

Here is a small portion of the OCR’d PDF.

nathan · October 5, 2021, 2:56am

Hello. Have there been any updates on this topic? I have DT 3.7.2 and see a definite reduction of crispness after having the built-in OCR engine add the text layer, even with DPI set to 300.

Thanks

aedwards · October 5, 2021, 1:19pm

We are currently waiting for the next update from ABBYY.

nathan · October 6, 2021, 3:26pm

Do you have an ETA, or an interim workaround? From this thread, I see you have been waiting for at least 8 months for this ABBYY update.

My wife is not happy with the fuzzy results of her OCRed documents. I notice it myself, but her eyes are much more sensitive to it, and it bugs her much more.