OCRd Doc Sizes

I’d prefer to use the built-in OCR in DTPO, but I am seeing enormous discrepancies in document sizes.

Working from the same original PDF, I get a 221 KB file using Adobe Acrobat and a 3.3 MB file using DTPO.

The discrepancy is enormous. Any ideas on why there is such a huge difference?

See viewtopic.php?f=20&t=10045 for example. However, the final release will improve this.

Christian,

I’ve noticed a different problem that I hope gets fixed. Sometimes, when scanning, the automatic page detection spits out a blank page anyway. If I delete the selected page in DTPO, the file size increases (!).

A one-page PDF shouldn’t be larger than the two-page PDF it came from… so something is different that adds a bunch of empty space to the file.

During the OCR process the original scan image is re-rasterized. During the public beta period all scan images are reconverted as 300 dpi color images, even for black & white scans.

The default settings of 150 dpi and 50% image quality in DTPO Preferences > OCR provide a compromise between view/print quality and file size.
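To give a rough sense of why resolution and color depth matter so much, here is a minimal sketch of the uncompressed raster size of one page. It assumes a US Letter page (8.5 x 11 in), which is not stated in this thread, and the function name is only illustrative. Real PDFs compress the image, so actual file sizes will be far smaller; only the ratios are the point.

```python
def raw_raster_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one page image at the given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

# Assumption: US Letter page; the thread does not give a page size.
PAGE = (8.5, 11.0)

color_300 = raw_raster_bytes(*PAGE, dpi=300, bits_per_pixel=24)  # 24-bit color
color_150 = raw_raster_bytes(*PAGE, dpi=150, bits_per_pixel=24)  # default OCR dpi
bw_300 = raw_raster_bytes(*PAGE, dpi=300, bits_per_pixel=1)      # 1-bit black & white

print(f"300 dpi color: {color_300:,} bytes")
print(f"150 dpi color: {color_150:,} bytes")
print(f"300 dpi b&w:   {bw_300:,} bytes")
```

Before compression, halving the resolution cuts the pixel data to a quarter, and 24-bit color holds 24 times the data of a 1-bit black & white image at the same resolution.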

The release available on 24 February, 2010 makes some changes that will generally reduce file sizes of the searchable PDFs, especially for scans made in black & white. There will also be an option to save the searchable PDF in the original scan quality, but of course with a large file size (which could be reduced afterwards by a utility such as PDF Shrink).

I do most of my scans with a ScanSnap set for black & white at Best quality. The resulting PDFs at default OCR Preference settings (or perhaps tweaked to 200 dpi and 75% image quality) have a reasonable file size and are quite readable/printable.

I understand the trade-offs between compression, resolution and quality.

I have a question about manipulating a PDF+Text within DTPO. Actions that should reduce file size, such as deleting a blank page, have the opposite result. Nearly every file I’ve deleted a page from ends up larger, sometimes dramatically: one 153 KB two-page file jumped to 445 KB after I deleted a page.

This can’t be a case of the OCR process re-rasterizing the file, as that was not the action performed on the file. Rather, in the split-pane three-column view, I right-clicked on a page of a PDF that had already been OCR’d and selected “delete selected page” from the contextual menu.

Why would deleting a page from a file that is already PDF+Text cause it to increase in size?

That’s because the edited PDF is saved using Apple’s PDFKit code, which isn’t very efficient when it comes to file size. Yes, it’s counter-intuitive that removing a page from a PDF increases the file size rather than reducing it. Let’s hope that Apple will get around to optimizing PDF file size in the future.