OCR compression

Hello,

Using DT3b4 here. I notice that PDFs converted with OCR to PDF+Text consistently end up with much larger file sizes. The latest example is a two-page NYT article that is 353 KB as downloaded from their website and weighs 2.5 MB after conversion. Surely the added (hidden) text doesn’t represent that much data compared to the image layer of the PDF?
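One quick way to check how much data the hidden text layer actually adds is to extract it and compare it against the file size. A rough sketch with PyMuPDF (just something I’d try; the file name is a placeholder):

```python
# Rough sanity check: how many bytes does the recognised text itself amount to?
# Uses PyMuPDF (pip install pymupdf); the file name is a placeholder.
import os
import fitz  # PyMuPDF

path = "nyt_article_ocr.pdf"
doc = fitz.open(path)
text_bytes = sum(len(page.get_text().encode("utf-8")) for page in doc)
print(f"file size: {os.path.getsize(path) / 1024:.0f} KB")
print(f"extracted text: {text_bytes / 1024:.1f} KB")
```

For a two-page article the extracted text typically comes to a few kilobytes, so the growth must come from how the image layer is re-encoded.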

Thanks.

Unfortunately it’s not possible to just add text to the original document due to limitations of the ABBYY engine; the PDF document is completely recreated.

Is this still true for DT3b7? I saw that the ABBYY engine was updated after I installed b7.

And if the PDF is still completely recreated during OCR in DT3, would you agree that, at least for ScanSnap Home users, it is better to disable OCR within DT3 and instead do the OCR within ScanSnap Home? As each recreation of the PDF applies lossy compression to the picture layer, the scanned image in the PDF gets worse with every recreation.
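To illustrate the generational-loss concern: every lossy re-encode of the same image shifts the pixels a little further from the original. A toy sketch with Pillow (purely illustrative; this is not what DT or ScanSnap Home actually do internally, and the input file is a placeholder):

```python
# Toy demonstration of generational loss: re-encode the same image as JPEG
# ten times and measure how far the pixels have drifted from the original.
# Uses Pillow (pip install pillow); "scan.jpg" is a placeholder file.
import io
from PIL import Image, ImageChops, ImageStat

img = Image.open("scan.jpg").convert("RGB")
first = img.copy()

for generation in range(10):
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=75)  # one lossy re-encode
    buf.seek(0)
    img = Image.open(buf).convert("RGB")

# Mean absolute per-channel pixel difference after ten generations
diff = ImageChops.difference(first, img)
print(ImageStat.Stat(diff).mean)
```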

Nothing has changed with the basic OCR functions of the ABBYY engine.

You can do the OCR in whatever application you see fit. I personally wouldn’t necessarily advocate one over the other.

As each recreation of the PDF applies lossy compression to the picture layer, the scanned image in the PDF gets worse with every recreation.

This would only be true if you were doing OCR, converting the file back to images, then doing OCR again… repeatedly.

For what it’s worth, I also run ABBYY FineReader as a standalone app (12.1), and this business of increased file size is still an issue there. It does offer slightly more opportunities for tweaking your scan output, e.g. you can set compression and image quality with a slider. I’ve always wondered, actually, whether there’s much difference between the ABBYY engine that ships with DTP and the standalone app in terms of managing file size or OCR accuracy (but then again, never wondered long enough to actually run some tests!).

Same here :slight_smile:

I’m seeing some wild numbers with OCR: some 1-2 MB files are coming out at over 100 MB. Unfortunately it appears random, or at least I can’t see a pattern yet. My database is going to blow out pretty quickly at this rate.
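If it helps to track down what’s ballooning: listing every embedded image with its pixel dimensions and stream size usually points at the culprit (for example a page re-encoded as full colour instead of staying compressed black-and-white). A quick sketch with PyMuPDF; the path is a placeholder:

```python
# List every embedded image in a PDF with its pixel size and stream size,
# to spot which one ballooned after OCR. The path is a placeholder.
import fitz  # PyMuPDF

doc = fitz.open("after_ocr.pdf")
for page_number, page in enumerate(doc, start=1):
    for img in page.get_images(full=True):
        xref = img[0]  # cross-reference number of the image object
        info = doc.extract_image(xref)
        kb = len(info["image"]) / 1024
        print(f"page {page_number}: {info['width']}x{info['height']} "
              f"{info['ext']} image, {kb:.0f} KB")
```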

I don’t have time to do more detailed tests, but here’s something I tried quickly: doing OCR from within DT and, separately, in FineReader, and comparing the results. The first four files are from the FDR Library (typewritten text from the 1940s); they have already been OCRed, but often with subpar results, so it never hurts to run them through a modern engine again. The last file is from the NYT archives, which isn’t OCRed at all when downloaded.

As mentioned in my original post, these are the PDFs I’m having trouble with (from the NYT), and the quick tests below confirm it: all files are reduced in size when OCRed by DT and increase in size when going through FR (default settings). However, the NYT file increased in size in both cases, although I can’t understand how adding text to an image can result in such a difference.

So, from this simple test, there doesn’t seem to be any advantage to using FR instead of DT as far as file size is concerned.

Just my 2c.

EDIT: the “FR OCR” column for “pc0183” should be green (22 MB is less than 25 MB). Should’ve done this with a formula :slight_smile:
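For anyone who wants to repeat the comparison, a few lines of Python can tabulate the before/after sizes instead of a hand-coloured table (the folder names and layout are made up for the sketch):

```python
# Compare file sizes between an "originals" folder and an "ocr_output" folder
# containing the same file names after OCR. Purely illustrative layout.
import os

ORIGINALS = "originals"
OCRED = "ocr_output"

for name in sorted(os.listdir(ORIGINALS)):
    if not name.lower().endswith(".pdf"):
        continue
    after_path = os.path.join(OCRED, name)
    if not os.path.exists(after_path):
        continue
    before = os.path.getsize(os.path.join(ORIGINALS, name))
    after = os.path.getsize(after_path)
    change = (after - before) / before * 100
    print(f"{name}: {before/1024:.0f} KB -> {after/1024:.0f} KB ({change:+.0f}%)")
```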

Welcome @jbp

Are these scans of your own?

No, they are PDF versions of academic books. I’m also having regular crashes in the middle of OCR jobs. Happy to send logs.

Hold the Option key and choose Help > Report Bug to start a support ticket. Thanks.