Fix blurry bloated OCR PDF?

Here is a small portion of the OCR’d PDF.
[Image: PDF after OCR]

Hello. Have there been any updates on this topic? I have DT 3.7.2 and see a definite reduction in crispness after the built-in OCR engine adds the text layer, even with DPI set to 300.

Thanks

We are currently waiting for the next update from ABBYY.

Do you have an ETA, or an interim workaround? From this thread, I see you have been waiting for at least 8 months for this ABBYY update.

My wife is not happy with the fuzzy results of her OCRed documents. I notice it myself, but her eyes are much more sensitive to it, and it bugs her much more.

We cannot give you an ETA as we are waiting on ABBYY’s development. We have no say or control in that.

Alternatively, you could try a third-party OCR application to see if it produces the desired results, then import the files after OCR.

I see that my DT is using ABBYY FineReader Engine 11.x, yet ABBYY has released 12.x. (See System requirements and specifications - ABBYY FineReader Engine) There must be something you are still waiting for from them before migrating DT to the latest ABBYY Engine 12.x. I am not a software developer and am in no way trying to second-guess your decisions. I just want to know whether I should start looking for an alternative OCR tool (potentially $$) or whether an improvement to the ABBYY module in DT is imminent.

Thanks

Are there any updates on this topic?

Related question: Is it still true that DT OCR converts a black and white scan to greyscale? My b/w scans become blurred after being OCR’d by DT. It looks like they are no longer b/w but greyscale. Is there any way to keep crisp black and white scans after DT OCR?

We are currently waiting for a major update from ABBYY that will include native Apple Silicon support. We do not have any release dates from ABBYY for this at present.

If the input document is black and white, it shouldn’t be converted to greyscale. I will check and add a fix if needed.

Hey @aedwards, thanks for your quick reply.

I guess I am facing the issue described in this post: Create small size high quality black&white OCR'ed PDFs from scans

Is there any way to implement the described workaround in DT Pro to keep the high-quality 1-bit black and white scans even after OCRing them with DT?

It doesn’t feel good to lose image quality on b/w scans and even increase the file size of each document.

We are looking at options to improve this. The issue is due to a workaround for a problem in the current release of ABBYY’s OCR.

Alright thank you! Would implementing the workaround I linked to above be an option (i.e. internally converting PDFs to tiffs before OCRing)?

DEVONthink V2 used to convert PDFs to TIFFs. However, there were various issues with that approach, so it is unlikely that we would choose that option.


I’m sad to see that even with all the great improvements in DT 3.8.1, you are still bundling the ABBYY FineReader 11.x engine, even though ABBYY already has the 12.x engine available for Mac.

I spent quite a bit of money on DT licenses for our several Mac computers and mobile devices, plus a local WebDAV server, to get away from the Evernote ecosystem. But now my SO refuses to use DT because of the blurry PDFs that result from the ABBYY OCR engine. The blurriness doesn’t bother me much, but I am just a nerdy engineer and she is a fine artist.

Are there any updates on this front?

Thanks

My OCR’d PDFs are not blurry. So some more detail might be helpful: where do the files come from, what are their properties (dpi etc.), and what are your OCR settings?

And is there any indication that ABBYY’s engine in version 11 generally produces blurry PDFs whereas version 12 does not?


I don’t think your source materials are of the right resolution. I’ve never had a problem with blurry PDFs unless they were less than 144 dpi, at which point the problem is the source materials. Upgrading to ABBYY 12 won’t solve your problems. We have it running on an office Mac with a sheet-fed scanner and it doesn’t fix bad source files.

The effect is definitely there but it is mostly obvious if the source PDF contains a 1-bit black and white image. This may sound like an uncommon picture format but it is not if you use one of the great and officially recommended ScanSnap document scanners. By default they analyze the scan and intelligently use the picture format with the least number of colors to minimize the file size. For true black and white text without any pictures, logos etc. this results in exactly this PDF with a 1-bit black and white image.
Unfortunately, the ABBYY OCR engine needs to recreate the picture layer but doesn’t support a 1-bit black and white picture format. It therefore converts the image to grayscale and applies anti-aliasing. Together with JPEG artifacts, this blurs the text significantly and at the same time increases the file size.
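
To get a rough feel for why the grayscale conversion inflates files, here is a small sketch comparing uncompressed raster sizes at 1 bit and 8 bits per pixel. The page dimensions (A4 at 300 dpi, roughly 2480 x 3508 pixels) are my assumption, and real PDFs are compressed (typically CCITT Group 4 for 1-bit images and JPEG for grayscale), so actual file sizes will differ. Only the starting ratio is the point.

```python
def raster_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Uncompressed size of a raster image, with each row padded to whole bytes."""
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Assumed page size: A4 scanned at 300 dpi (about 2480 x 3508 pixels).
WIDTH, HEIGHT = 2480, 3508

bilevel = raster_bytes(WIDTH, HEIGHT, 1)    # 1-bit black and white
grayscale = raster_bytes(WIDTH, HEIGHT, 8)  # 8-bit grayscale

print(f"1-bit: {bilevel:,} bytes")
print(f"8-bit: {grayscale:,} bytes")
print(f"ratio: {grayscale / bilevel:.0f}x")
```

Before any compression, the grayscale page carries eight times as much pixel data as the 1-bit page, which is consistent with the larger files seen after OCR.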

I have little hope that this will be fixed in DT with an update of the ABBYY FineReader engine. I have a license for ABBYY FineReader PDF for Mac, and the latest version, 15.2.0, shows the exact same issue. But I’d be happy to be proven wrong.

A working but slightly more complex solution is to configure the ScanSnap Home app to scan without OCR into a folder somewhere on your Mac. Then create a Hazel rule that checks this folder for new PDFs without OCR (I simply test that the contents do not contain the letter "a"). The rule then runs an AppleScript that OCRs the PDF via PDFPen Pro, feeds the result into DT, deletes the original PDF, and quits PDFPen Pro unless it has other documents open.
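
For anyone without Hazel, the "no text layer yet" check could be sketched roughly as follows. This is not the Hazel rule itself: it assumes the `pdftotext` command-line tool from Poppler is installed, and the function names are made up for illustration. It only mirrors the same crude test (no letter "a" in the extracted text means the PDF probably has not been OCR'd).

```python
import subprocess
from pathlib import Path


def extract_text(pdf_path: Path) -> str:
    """Return the text layer of a PDF via the pdftotext CLI (assumed installed)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" sends the text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def needs_ocr(extracted_text: str) -> bool:
    """Mirror the crude test above: no letter 'a' means no usable text layer."""
    return "a" not in extracted_text.lower()


def pdfs_needing_ocr(folder: Path) -> list[Path]:
    """List the PDFs in a watched folder that still appear to lack OCR text."""
    return [p for p in sorted(folder.glob("*.pdf")) if needs_ocr(extract_text(p))]
```

The test is deliberately simple. A scanned page in a language that rarely uses "a", or a page with only numbers, would be misclassified, so treat it as a starting point.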

Here is a short comparison.


After OCR with PDFPen Pro (= exactly the same resolution as the original scan). File size of the whole document: 107.2 KB


After OCR with DT via the ABBYY OCR engine. File size of the whole document: 284.2 KB

So the result is clearly less sharp and the file is roughly 2.7x larger.


I am using a Canon Scan Front 300 business-class scanner set to 300 dpi, PDF output, and internal OCR disabled. I actually have two of these scanners, with the same settings, and the DT/ABBYY OCR output is fuzzy from both.

And I am sad to hear from Pete248 that the latest ABBYY engine won’t improve things. I don’t have Hazel or PDFPen Pro. I see that PDFPen Pro costs $129 and I would be willing to buy it to solve this problem. Is it possible to have DT itself watch a folder and call PDFPen OCR? Or would Hazel be required?

Hello Nathan,

Have a look at my solution (which was already posted earlier in this thread). You will need Hazel for it but no additional OCR tool. OCR will be performed by DEVONthink.

Probably you can even circumvent the usage of Hazel by using macOS folder actions and an Automator script…

Hi. Thanks for the reminder! I had read your posting a while ago but thought it wouldn’t apply in my case since my files are a mix of color, B&W, etc. and not limited to 1-bit depth.

What does your TIFF technique do for color scans that need OCR?

Thanks

The script uses Ghostscript (gs) to convert to TIFF with the tiffg4 output device. This device converts everything to 1-bit black and white TIFFs, even color PDFs.
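
As a rough illustration of that conversion step (not the actual script from the linked post), a Ghostscript call using the `tiffg4` output device could be wrapped like this. The function names, the 300 dpi default, and the file paths are my assumptions. The `gs` flags used (`-dNOPAUSE`, `-dBATCH`, `-sDEVICE`, `-r`, `-sOutputFile`) are standard Ghostscript options.

```python
import subprocess
from pathlib import Path


def build_gs_command(pdf_path: Path, tiff_path: Path, dpi: int = 300) -> list[str]:
    """Build a Ghostscript command that renders a PDF to a 1-bit G4 TIFF."""
    return [
        "gs",
        "-q",                         # suppress startup messages
        "-dNOPAUSE",                  # do not pause between pages
        "-dBATCH",                    # exit when the input is processed
        "-sDEVICE=tiffg4",            # 1-bit black and white, CCITT Group 4
        f"-r{dpi}",                   # rendering resolution (assumed 300 dpi)
        f"-sOutputFile={tiff_path}",
        str(pdf_path),
    ]


def pdf_to_bilevel_tiff(pdf_path: Path, tiff_path: Path, dpi: int = 300) -> None:
    """Run Ghostscript; raises CalledProcessError if gs fails."""
    subprocess.run(build_gs_command(pdf_path, tiff_path, dpi), check=True)
```

Setting the resolution to match the scanner setting avoids resampling the scan during conversion.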

If I am scanning color pages and I need the color, I use a different scan setting in my scanner that places the scanned file in a “color” folder. PDFs inside this folder are not converted to TIFFs. Instead, they are forwarded directly to DT (using an identical AppleScript) to be OCR’d there.

If I have a mixture of BW and color pages inside a PDF, I sometimes scan twice and merge the BW and color pages into a new PDF manually. Not ideal, but most of my scans are pure BW, so for me this is not an issue.
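
The folder-based routing described here boils down to a very small decision. A minimal sketch, assuming the color scans land in a folder literally named `color` (the folder name, function name, and return labels are hypothetical):

```python
from pathlib import Path

COLOR_FOLDER_NAME = "color"  # hypothetical name of the folder for color scans


def plan_for(pdf_path: Path) -> str:
    """Decide how a scanned PDF is handled, based only on its parent folder."""
    if pdf_path.parent.name.lower() == COLOR_FOLDER_NAME:
        # Color is needed: skip the 1-bit conversion and OCR the PDF as-is.
        return "ocr-directly"
    # Everything else is treated as black and white text.
    return "convert-to-tiff-then-ocr"
```

Because the decision depends only on where the scanner drops the file, the choice between color and black and white is made at scan time through the scanner profile.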