OCR Settings: Compress PDF, Deskew, Page Orientation

Pete248 · October 16, 2019, 11:59am

I still don’t understand what the compress PDF checkbox in the OCR settings really does.

IMO a PDF without OCR from a ScanSnap scanner already contains a compressed JPEG picture layer.
During OCR in DT3 a kind of text layer is added plus optionally metadata and keywords.

If I enable Compress PDF, does this recompress the original JPEG picture layer, or does it compress the whole PDF including the added metadata and keywords? And is this compression lossy or not?

What influence do the deskew and page orientation checkboxes have in this context? Does enabling either of these force a recreation of the picture layer, or is it recreated anyway during the OCR process?

Reason for my question:

In DT2 I’ve seen both inflation of the file size (which seems to be fixed in DT3) and significantly increased JPEG compression artefacts when doing OCR in DEVONthink. For that reason I primarily used the old ScanSnap Manager software for OCR instead of DT2.

That said, I like the “enter metadata after text recognition” dialog in DT3. When OCR for incoming scans is enabled in DT3, photos of documents sent to DT3 get OCRed automatically; otherwise I have to do this manually. Thus I’d like to use OCR in DT3 if quality and file size are comparable to what I get when doing OCR in ScanSnap Home.

aedwards · October 18, 2019, 8:56am

The option to compress PDFs affects the output file in two ways:

If this option is unchecked, the ABBYY OCR will export the PDF using the option for best quality, which may result in a slightly larger file size and take more time to produce. With the option checked, the OCR will use a balanced approach between the quality of the resulting file, its size and the processing time.
If metadata is added to the OCR’d file (this can be new data entered via the metadata dialog or transferred from the original file), we need to re-save the file after adding the metadata. In this case, if the option is checked we use compression; otherwise the file is saved without compression.
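
To make the two effects easier to see side by side, here is a small decision-table sketch in Python. It only restates the description above; the names `OcrOutcome` and `describe_compress_option` are hypothetical and are not part of DEVONthink or the ABBYY engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrOutcome:
    """Hypothetical summary of what the Compress PDF checkbox changes."""
    export_mode: str         # "best quality" or "balanced"
    resaved: bool            # file is saved again after metadata was added
    resave_compressed: bool  # that second save uses compression


def describe_compress_option(compress_pdf: bool, metadata_added: bool) -> OcrOutcome:
    # Effect 1: the OCR export mode depends only on the checkbox.
    export_mode = "balanced" if compress_pdf else "best quality"
    # Effect 2: a re-save happens only when metadata was added,
    # and it is compressed only when the checkbox is checked.
    resaved = metadata_added
    return OcrOutcome(export_mode, resaved, resaved and compress_pdf)


if __name__ == "__main__":
    for compress in (False, True):
        for metadata in (False, True):
            print(compress, metadata, describe_compress_option(compress, metadata))
```

The sketch says nothing about how strong the compression is or whether the picture layer is re-encoded; it only shows which of the two steps the checkbox touches.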

Pete248 · October 23, 2019, 12:38pm

Thanks for the explanation.

I meanwhile did some tests:

Sent documents scanned with a ScanSnap ix500 as PDFs to DT3 and did OCR within DT3 with and without compression enabled.
The resulting OCRed PDF was always significantly smaller than the original PDF even though the OCR text layer has been added. Thus DT3 is always recompressing the picture layer in the PDF during OCR. I can clearly see more JPEG compression artefacts in the OCRed PDF.
The resulting file is about 29% of the original file size with compression enabled and about 38% with compression disabled.
The reduction in file size is less if the original PDF was saved with more compression within the ScanSnap Home app. Seems DT3 then has to deal with more JPEG compression artefacts in the original file which hampers further recompression.
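
For anyone who wants to repeat this comparison, the percentages can be computed with a few lines of Python. This is a minimal sketch; the function name `size_ratio_percent` and the file names in the usage comment are made up for illustration.

```python
import os


def size_ratio_percent(original_path: str, processed_path: str) -> float:
    """Return the processed file's size as a percentage of the original's."""
    original_size = os.path.getsize(original_path)
    processed_size = os.path.getsize(processed_path)
    if original_size == 0:
        raise ValueError("original file is empty")
    return 100.0 * processed_size / original_size


# Usage with hypothetical file names:
#   ratio = size_ratio_percent("scan_original.pdf", "scan_ocr.pdf")
#   print(f"{ratio:.0f}% of the original size")
```

Running it once on the file OCRed with compression enabled and once on the file OCRed with compression disabled gives the two figures to compare.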

As a result of my test, I’ve now changed my settings in ScanSnap Home for picture quality from auto to best and for compression from medium to low, and disabled OCR. This generates a significantly larger intermediate PDF in the ScanSnap Home app, which is automatically sent to DT3. Because of the recompression during OCR in DT3, the file size of the OCRed PDF is still smaller than the PDF I get with a lower quality scan and OCR done in ScanSnap Home. The picture layer within the PDF is sharper as well. The result is better quality with smaller files compared to doing the OCR in ScanSnap Home.

While I do not see much difference in the precision of the OCR between DT3 and ScanSnap Home, text blocks are easier to select in the PDF OCRed in DT3. In the PDF OCRed in ScanSnap Home I often get text selected all over the page not related to the area I drag the mouse cursor over. One additional reason I prefer to do the OCR in DT3.

So while I found a workflow that fits my needs quite well, the whole OCR recompression process in DT3 is a bit of black magic and not easy to understand. Personally I’d prefer to have a setting in DT3 where I can set the recompression rate applied during OCR in maybe 3 steps, with the choice of no recompression at all in case I have already fine-tuned the quality in the scanning app. But I’m afraid this might be beyond your control and the ABBYY OCR engine you’ve licensed might handle this in a closed environment.

ehbarnet · October 23, 2019, 4:24pm

I wanted to add a quick note to observe that I am able to compress almost every PDF I download from academic journals using my standalone version of ABBYY FineReader. I export as “Text under page image,” compress using MRC, and set Image quality to low. This typically reduces the file size by half or two thirds and does not produce a noticeable difference in image quality, unless the original PDF contains images or had a poor resolution to begin with. If I were to use DTP3’s OCR functions more systematically, I’d like to have access to these configurations.

Silverstone · October 23, 2019, 6:40pm

If you want to stick with FineReader (and there are still many reasons to), here is a Smart Rule script I’ve written to automate the workflow