I often have to work with old papers, and they often come either with bad transparent text over the image or as just a scan. OCR in DT is OK for text and sometimes produces better text than the existing transparent text.
The problem is that OCR reduces image quality a lot. What was a crisp scan becomes a blurry mush.
What’s more, it doesn’t even reduce file size. Original (before) is 365.1 KB, OCR is 971.4 KB.
Increasing the resolution to 300 dpi produces better images but results in a 2.9 MB file. 8x the size of the original! It’s not a complete dealbreaker but is a little unreasonable for a 6-page paper. When it comes to bigger papers or books it gets somewhat ridiculous.
Is there a way to keep original scans in place and only add text over them?
I guess if you did it you could spot the differences and then post that information here. Those differences might not have been detected yet, and DEVONthink might appreciate hearing about them.
It’s worth noting that this is how PDFPen does OCR - it adds an editable text layer over the image in the PDF file. It also can remove existing text layers if you want to re-OCR a page or document in case the original OCR had problems.
Editing the OCR layer is a feature of PDFPen Pro. Setapp mentions only PDFPen, without the Pro, yet Setapp also says that one can edit the OCR layer. Confusing.
It is normal for OCR software to add a text layer to the PDF, but quite a few such apps are unable to preserve the scanned images as-is. That is a serious limitation that a lot of OCR apps for the Mac suffer from, including what’s bundled with DEVONthink. Often this comes from being based on Apple’s PDFKit software framework, which can read but apparently cannot write the bi-level JBIG2 and CCITT Group 4 fax compression codecs that are typically used for “black text on white background” scans.
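If anyone wants to check which codec their scans use before and after OCR, here is a rough Python sketch. It is my own illustration, not anything taken from DEVONthink or PDFKit. It assumes the filter names appear literally in the raw file, which should hold for typical image streams but is only a heuristic and not a real PDF parse. The function names and file names are made up.

```python
import re
from collections import Counter
from pathlib import Path

# Filter names the PDF format uses for image compression.
# JBIG2Decode and CCITTFaxDecode are the bi-level (black-and-white) codecs;
# DCTDecode is JPEG and JPXDecode is JPEG 2000.
IMAGE_FILTERS = (b"JBIG2Decode", b"CCITTFaxDecode", b"DCTDecode", b"JPXDecode")


def count_image_filters(pdf_bytes: bytes) -> Counter:
    """Count how often each image filter name occurs in raw PDF bytes.

    Crude heuristic: it does not parse the PDF, it only searches for the names.
    """
    counts = Counter()
    for name in IMAGE_FILTERS:
        counts[name.decode()] = len(re.findall(rb"/" + name + rb"\b", pdf_bytes))
    return counts


def report(path: str) -> None:
    """Print the file size and filter counts for one PDF on disk."""
    data = Path(path).read_bytes()
    print(path, f"{len(data) / 1000:.1f} KB", dict(count_image_filters(data)))


# Hypothetical usage with placeholder file names:
# report("before.pdf")
# report("after-ocr.pdf")
```

If the original shows CCITTFaxDecode or JBIG2Decode and the OCRed copy shows only DCTDecode, the scan was most likely re-encoded as JPEG, which would fit both the blur and the larger file.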
The expensive Acrobat PDF software is said to do this properly, and OCRKit (http://ocrkit.com) can do it, too. At least it worked with the documents I processed with it in autumn 2020.
I too have this issue. However, an earlier thread titled “After OCR, PDF reduces size a lot” indicates the problem has been fixed. How do we get the fixed code?
Hello. Have there been any updates on this topic? I have DT 3.7.2 and see a definite reduction of crispness after having the built-in OCR engine add the text layer, even with DPI set to 300.
Do you have an ETA, or an interim workaround? From this thread, I see you have been waiting for at least 8 months for this ABBYY update.
My wife is not happy with the fuzzy results of her OCRed documents. I notice it myself, but her eyes are much more sensitive to it, and it bugs her much more.