DTPO OCR increases PDF size by a factor of 9.5

xor · November 24, 2011, 11:10am

Hi,

I found that OCRing an existing PDF in DTPO 2.3.1.1 made its size go up from its original 60kB to 570kB, which is an increase by a factor of 9.5. The original PDF is a 2 page 200DPI B/W document scanned outside of DTPO.

Is there a way to avoid this huge increase in size?

Thanks,
Olaf

Bill_DeVille · November 24, 2011, 4:10pm

Of course. Don’t check the option to retain the resolution of the original scan (DTPO Preferences > OCR).

The default settings in Preferences > OCR are 150 dpi resolution and 50% image quality, for the searchable PDFs saved after OCR. Those settings are a compromise between file size and view/print quality. They will result in a searchable PDF that’s approximately 1.3 to 1.5 times the size of the original scan image. If the original scanned image has good white balance and contrast (the ScanSnap does an especially good job for scans at 300 dpi), I find these searchable PDFs quite acceptable for onscreen reading.

Most of my scans are done at 130 dpi and 50% image quality, resulting in searchable PDFs of about the same size as the image-only PDFs produced by the ScanSnap. The searchable PDFs are better than FAX quality.

When I’m scanning a batch of short documents such as receipts, invoices and the like for tax purposes, I set Preferences > OCR to 96 dpi and 50% image quality, resulting in searchable PDFs that are significantly smaller than the PDFs produced by the ScanSnap.

xor · November 24, 2011, 4:47pm

Thanks Bill,

I did not notice these settings, very helpful!

However, with 130ppi/50% the resulting document is still 6 times as large as the original PDF. Even with the lowest possible settings 75ppi/1% the PDF is still twice as large as the original at 200ppi and also unreadable of course.

I wonder why DTPO inflates the file size that much. Does it maybe always encode to greyscale/color even though the original file is black and white? And why does DTPO re-encode the image at all? Wouldn’t it be possible to just add the text information and leave the image as it is?

Bill_DeVille · November 25, 2011, 4:11pm

I can’t understand how you got a large increase in file size at 130 dpi resolution of the image layer created after OCR.

The PDFs sent from my scanner to DTPO have a resolution of 300 dpi.

How were your PDFs created?

Yes, it would be nice if the original image layer were retained after OCR - but it doesn’t work that way.

xor · November 25, 2011, 8:47pm

Thanks, Bill, for your reply. Answers to your questions are below:

I did a few more tests and this is what I think is going on: the original PDF is a 200DPI/PPI black and white (!) document (I scanned it in Windows with a Plustek PS286 Scanner in B/W mode). During OCR, DTPO obviously converts this into a greyscale/color document, which consumes much more space.

I did a test with “Adobe Acrobat 8 Standard” and OCR-ing the same black and white document will still result in a black and white PDF, retaining the small file size.

So, the bottom line seems to be: during OCR, DTPO converts black and white PDFs to greyscale PDFs, thereby increasing the file size by a factor of about 9.5 (if you compare B/W = 1 bit to greyscale = 8 bit the factor seems to make sense).
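To make the 1-bit vs. 8-bit comparison concrete, here is a rough sketch of the uncompressed raster size of one page at 200 DPI. The US Letter page size (8.5 x 11 in) and the helper name `raw_image_bytes` are assumptions for illustration only; the thread does not state the paper size, and the real files are compressed, so the actual ratio need not be exactly 8.

```python
def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed raster size in bytes for a single page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumption: US Letter page (8.5 x 11 in); the thread does not give the paper size.
bw_bytes = raw_image_bytes(8.5, 11, 200, bits_per_pixel=1)    # black and white
gray_bytes = raw_image_bytes(8.5, 11, 200, bits_per_pixel=8)  # greyscale

print(f"B/W:       {bw_bytes / 1000:.0f} kB per page")
print(f"Greyscale: {gray_bytes / 1000:.0f} kB per page")
print(f"Ratio:     {gray_bytes / bw_bytes:.1f}x")
```

Before compression the greyscale page is 8 times the size of the B/W page, which is in the same range as the observed factor of 9.5. How much each version shrinks afterwards depends on the compression used for the image layer, so this is only a plausibility check, not an explanation of the exact number.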

I just wish it were possible for DTPO to keep B/W PDFs as B/W after OCR.

alanshutko · November 26, 2011, 2:41pm

Apple’s PDF routines create larger PDFs than Acrobat does. This has been discussed before in these forums. I think the upshot is that Apple’s PDFKit doesn’t apply certain allowable forms of compression in the PDFs it creates, which makes the files bigger but also faster for Preview to render.

xor · November 26, 2011, 4:24pm

Thanks, interesting post, though it is a bit disappointing that Devon does not consider improving the PDF engine.

Just to check alternatives, I did some test runs and this is what I found:

Original scanned PDF, 200DPI BW: 50kB

After OCR using Adobe Acrobat 8 Standard (Windows): 45kB
After OCR using PDFPenPro 5.6.1 (Mac): 130kB
After OCR using DevonThink Pro Office (keep DPI, 50%): 620kB
After OCR using DevonThink Pro Office (200 DPI, 50%): 350kB
After OCR using DevonThink Pro Office (150 DPI, 50%): 250kB
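For easier comparison, the sizes above can be turned into factors relative to the 50kB original. This is just arithmetic on the numbers listed in this post; the function name `size_ratios` is made up for the example.

```python
def size_ratios(original_kb, results_kb):
    """Return each result's size as a multiple of the original, rounded to 1 decimal."""
    return {tool: round(size / original_kb, 1) for tool, size in results_kb.items()}


original_kb = 50  # original scanned PDF, 200 DPI B/W

results_kb = {
    "Adobe Acrobat 8 Standard (Windows)": 45,
    "PDFPenPro 5.6.1 (Mac)": 130,
    "DevonThink Pro Office (keep DPI, 50%)": 620,
    "DevonThink Pro Office (200 DPI, 50%)": 350,
    "DevonThink Pro Office (150 DPI, 50%)": 250,
}

ratios = size_ratios(original_kb, results_kb)
for tool, ratio in ratios.items():
    print(f"{tool}: {ratio}x")
```

In this test Acrobat ends up slightly below the original size (0.9x), PDFPenPro at 2.6x, and DTPO between 5x and 12.4x depending on the OCR settings.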

Unfortunately, it is not possible to automatically integrate Acrobat (Windows) into my workflow, so I may use PDFPenPro for the OCR instead. I will also check PDF Shrink, as mentioned in the other thread.