PDF encoding after OCR

Ryuji · October 10, 2007, 7:33pm

The files DTPO creates after OCR are much bigger than those created by Acrobat at comparable image quality. Acrobat offers Advanced -> PDF Optimizer, where I can set the PDF version to the latest (which includes a more effective compressed data format for b&w images) and apply noise-cleaning filters that greatly reduce the file size. In Acrobat I use very light filtering, select maximum image quality, and then OCR, and I consistently get smaller files at comparable image quality than DTPO produces with its OCR functions.

In the last update, resolution and image quality sliders appeared in the DTPO Preferences. I would like to see filtering and compression options there similar to those in Acrobat’s Advanced -> PDF Optimizer. Otherwise I need 10x more disk space, and I’m talking about 100s of gigabytes of storage, since I’m digitizing my whole file cabinet.

Thanks

Ryuji

Ryuji · October 21, 2007, 8:37pm

To be more specific, DTPO saves the OCR-processed PDF files in PDF 1.4 format, while PDF 1.5 and newer formats allow more powerful data compression. I suspect that the PDF file is actually produced by the IRIS module.
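For anyone who wants to check which version a given file declares, here is a minimal Python sketch that reads the `%PDF-x.y` header at the start of the file. The function names are my own, not part of DTPO or Acrobat, and this is only a first check: a PDF can also carry a `/Version` entry in its document catalog that overrides the header, which this sketch does not look at.

```python
import re


def pdf_header_version(data: bytes):
    """Return the version from a '%PDF-x.y' header, or None if absent.

    Hypothetical helper for illustration; it inspects the header only
    and ignores any /Version entry in the document catalog.
    """
    # The header is expected near the start of the file, so only the
    # first 1 KiB is searched.
    match = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return match.group(1).decode("ascii") if match else None


def read_pdf_version(path):
    """Read the first 1 KiB of a file and return its header version."""
    with open(path, "rb") as f:
        return pdf_header_version(f.read(1024))
```

Calling `read_pdf_version` on a file saved by each program should show whether the two really declare different versions.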

In addition, I believe that OCR-processed PDF files, especially B&W ones, should be encoded with the JBIG2 algorithm inside the PDF file to greatly reduce file size (usually to 1/4 to 1/2 of the original size). Adobe Acrobat uses JBIG2 when PDF optimization is applied to scanned images, and even with the quality slider set to the maximum, I get a drastically smaller file with no perceptible loss of image quality.

I run Acrobat from Adobe CS2, Preview, and DTPO on Tiger and have no problem handling JBIG2-compressed PDF files in PDF 1.5 or PDF 1.6 format. Unlike with some other compression schemes, these files don’t take noticeably longer to open than files that don’t use JBIG2.

I request that DTPO and IRIS OCR work out a solution together that uses these technologies to produce more compact, equally high-quality searchable scanned-image PDF files.

radarseven · December 6, 2010, 2:00pm

This is an old post, but still relevant three years later.

I’m using the latest version of DTPO (2.0.6), which I believe now uses ABBYY for OCR. Searchable PDFs created with my ScanSnap directly into DTPO seem way too big. Taking the resulting file and passing it through Acrobat’s PDF Optimizer almost always reduces the file size by 50%, with no loss in clarity.

Along the lines of the original post, it would still be nice to see compression options to reduce these file sizes.

korm · December 6, 2010, 4:17pm

Have you worked with the compression options on your ScanSnap, and the options in DTPO > Preferences > OCR > Resolution / Quality?

Ryuji · December 6, 2010, 6:02pm

Yes, but that optimizer just lets you play with the JPEG compression parameters. JBIG2 is a whole different level of compression: you can keep a high-res image and still compress it into a much smaller file. Don’t confuse the two. JPEG and JBIG2 are different compression formats (and all modern PDF viewers, including the ones on the iPad, work with JBIG2 compression). I have basically abandoned the OCR functions of DTPO and just use Acrobat Pro for everything.
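To see which of the two a given file actually uses, one rough approach is to count the stream filter names in the raw PDF bytes: JPEG image data is stored under the `/DCTDecode` filter and JBIG2 under `/JBIG2Decode`. The Python sketch below is only a heuristic, with a function name of my own choosing. It does not parse the PDF, so it can miscount if a filter name is stored indirectly or happens to appear inside other data, and `/FlateDecode` is also used for plenty of non-image streams.

```python
import re
from collections import Counter

# Standard PDF stream filter names that commonly appear on scanned images.
IMAGE_FILTERS = (
    b"DCTDecode",       # JPEG
    b"JBIG2Decode",     # JBIG2 (bilevel images)
    b"CCITTFaxDecode",  # fax Group 3/4 (bilevel images)
    b"JPXDecode",       # JPEG 2000
    b"FlateDecode",     # zlib/deflate, also used for non-image streams
)


def count_filters(data: bytes) -> Counter:
    """Count occurrences of each filter name in raw PDF bytes.

    Hypothetical helper for illustration: a byte-level scan, not a
    real PDF parser, so treat the counts as a hint only.
    """
    counts = Counter()
    for name in IMAGE_FILTERS:
        pattern = rb"/" + name + rb"\b"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts


def count_filters_in_file(path) -> Counter:
    """Run the same scan over a file on disk."""
    with open(path, "rb") as f:
        return count_filters(f.read())
```

If the DTPO output shows only `DCTDecode` hits while the Acrobat-optimized copy of the same scan shows `JBIG2Decode`, that would be consistent with the size difference described in this thread.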