OCR file size bloat

I found some old discussions, but the ones most pertinent to this question are old and unanswered. So I’m trying a new topic here.

I’m moving some of the heaps of paper around here into DTPO, and one of the goals is being able to find things again, hence everything gets OCRed. The problem I’m having right now is that the PDF files bloat insanely in size during the OCR process. I’ve had this issue before with ABBYY FineReader OCR as bundled with other software, and never found a resolution other than using a different product, but thought I’d have one more go at figuring out whether I’m missing the point of some setting before I write ABBYY off as fatally flawed.

Original scan output by ScanSnap Manager using Better quality, auto color detect, and default compression (3).

OCRed by ABBYY in DTPO with resolution set to Same as scan and 100% quality, which I would really hope would leave the image layer alone (which is what I want).

OCRed by current version of PDFPen Pro.

Original: 863,227 bytes, 200 dpi, 8 bits per component
DTPO/ABBYY: 6,390,281 bytes, 300 dpi, 8 bits per component
PDFpenPro: 962,848 bytes, 200 dpi, 8 bits per component

If I’m reading the diagnostics in PDFpenPro correctly, that means that despite the “Same as scan” setting, ABBYY has resampled the image from 200 dpi to 300 dpi and bloated its size to some 6 or 7 times larger than it has any business being. Even in these days of 4 TB drives, that’s unacceptable once you’re talking about tens of thousands of pages of documents.
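For what it’s worth, the bloat factor is easy to check from the byte counts quoted above (a quick sketch in Python):

```python
# File sizes reported above (bytes).
original = 863_227      # ScanSnap output, 200 dpi
abbyy    = 6_390_281    # after DTPO/ABBYY OCR, resampled to 300 dpi
pdfpen   = 962_848      # after PDFpenPro OCR, 200 dpi preserved

print(f"ABBYY bloat:    {abbyy / original:.1f}x")   # ~7.4x
print(f"PDFpen growth:  {pdfpen / original:.2f}x")  # ~1.12x
```

So PDFpenPro adds about 12% (roughly the cost of the text layer), while ABBYY multiplies the file more than sevenfold.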

What does it take to actually have ABBYY leave the image layer alone? Or is that not possible, and am I going to have to script some other OCR engine as part of my workflow?

I agree that the “bloat” seems strange and is certainly unwanted, but the information missing here is: what are you scanning? What is the content of the original paper document (images, text, text + images), and how many pages? Did you happen to test DEVONthink’s OCR on the same document with different preferences? On different documents? A sample of one might not be dispositive; further testing might be needed.

I’m a touch confused as to what some of those questions have to do with the image layer being resampled in bizarre ways, unless we’re actually at the point of shipping this problem off to ABBYY and they’re trying to pin down why they do this some of the time and not all of the time.

But in case it helps: 2 pages, all text plus one logo, a bit of background shading, a little bit of scribbling by me in red ink, and a blue/red date-received stamp (enough to have my ScanSnap auto-recognize it as a color image), somewhere in the vicinity of 11k characters recognized during OCR. In other words, a typical invoice with lots of stock verbiage and medium-sized print on the back.

Scanning at different resolutions in ScanSnap Manager results in proportional changes to the size; in other words, the bloat seems to be percentage-wise, not a fixed amount.

Another sample point, force b&w in ScanSnap Manager:

Original scan: 425,886 bytes, 400 dpi, 1 bit per component
ABBYY: 5,301,978 bytes, 300 dpi, 8 bits per component !!!
PDFpenPro: 542,236 bytes, 400 dpi, 1 bit per component

That’s just so horrible as to actually be somewhat amusing. And yes, Resolution Same as scan is still checked in the DTPO preferences.
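Part of the b&w case is explicable as a raw-data effect: promoting a 1-bit image to 8 bits per component multiplies the uncompressed data by 8, and resampling 400 dpi down to 300 dpi only partially offsets that. A rough sketch (ignoring compression; 1-bit images also compress far better, e.g. with CCITT G4, than 8-bit JPEG does, so the real gap is larger still):

```python
# Raw-data estimate for the black & white sample above, before compression.
bits_factor  = 8 / 1              # 1 bit per pixel promoted to 8 bits
pixel_factor = (300 / 400) ** 2   # resampled 400 dpi -> 300 dpi
print(f"raw data factor: {bits_factor * pixel_factor:.2f}x")  # 4.50x

observed = 5_301_978 / 425_886
print(f"observed bloat:  {observed:.1f}x")  # ~12.4x
```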

The suggestions by korm were ones I would have made.

What happens when OCR is performed is that a new PDF is created, consisting of a rasterized image of the scanner output file plus a text layer. The procedures used to copy the image from the scanner output file into the new searchable PDF are those available in OS X.

I do a lot of scanning with a ScanSnap. I usually choose the Best quality setting, for this reason: with automatic color detection set, copy containing color is scanned at only half the resolution of black & white copy. Your Better setting results in 400 dpi for black & white copy but only 200 dpi for copy containing color, and we recommend 300 dpi or better resolution for good accuracy in text recognition. The Best setting in ScanSnap Manager results in 300 dpi for copy that contains color; that’s why I choose it. The Excellent image quality setting would be overkill for OCR purposes, and I don’t recommend it, as the file sizes are very large.
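The scans measured earlier in the thread (200 dpi color / 400 dpi b&w at Better) suggest the standard ScanSnap Manager resolution table, which can be sketched as a lookup. The Normal and Excellent values are assumed from ScanSnap documentation, not from this thread:

```python
# ScanSnap Manager quality presets -> effective dpi as (b&w, color).
# Better/Best match the scans reported in this thread; Normal and
# Excellent are assumptions from ScanSnap documentation.
SCANSNAP_DPI = {
    "Normal":    (300, 150),
    "Better":    (400, 200),
    "Best":      (600, 300),
    "Excellent": (1200, 600),
}

for setting, (bw, color) in SCANSNAP_DPI.items():
    print(f"{setting:<9} {bw:>4} dpi b&w, {color:>3} dpi color")
```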

Now let’s look at DEVONthink Pro Office Preferences > OCR. You checked the option to retain the resolution of the original scan. Yes, that will result in large files. I don’t check that option. For most documents (receipts, invoices, contracts, letters), I’m satisfied with searchable PDFs that are easy to read and that produce readable results when printed.

I check the options for 130 dpi and 50% image quality. That’s better than FAX quality, and because the ScanSnap produces images with good sharpness and contrast, I’m satisfied with the resulting searchable PDFs for most purposes. (But if I needed to include a PDF page image in a publication, I would choose it from the original scanner output file, which I’ve told Preferences > OCR to send to the System Trash following text recognition.)

As a result, most of the searchable PDFs in my databases are of reasonable file size. If the original paper copy was black & white, my searchable PDF will usually be smaller than the scanner output PDF. Images in the paper copy can considerably increase file size. I’m usually satisfied with 50% quality for the JPEG images (note that JPEG compression applies only to color images).
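As a rough sanity check on why lowering the dpi helps so much: pixel count, and hence uncompressed image data, scales with the square of the resolution. A minimal sketch for the 300 dpi to 130 dpi case described above:

```python
# Downsampling 300 dpi scans to 130 dpi keeps only ~19% of the pixels;
# the 50% JPEG quality setting then compresses what remains further.
ratio = (130 / 300) ** 2
print(f"pixel data kept: {ratio:.0%}")  # ~19%
```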

I’ve tested just about every OCR application for Mac. Adobe’s Acrobat Pro can produce smaller searchable PDFs because Adobe uses proprietary code to create the image in the searchable PDF that’s more space-efficient than Apple’s code. (And since Adobe bloated its sale price for Acrobat to a ridiculous level, I won’t support that business model.)

But recognition accuracy is the most important feature of OCR apps, to me. I’ve got Acrobat Pro (pre-bloat price version), IRIS, PDFpenPro, OCRKit, ABBYY FineReader for ScanSnap, FineReader 12.0.4, FineReader 10 (for Windows), ABBYY OCR in DEVONthink Pro Office, and a couple of others that have faded in development and I don’t bother to install on my working Macs. In my tests, the ABBYY engines (all of the versions) are the most accurate. Acrobat Pro (pre-price-bloat version) is not quite as accurate; next comes IRIS, and the rest of the pack I don’t find quite good enough.

I use an Xcanex portable book and document scanner that currently only runs under Windows, and comes with ABBYY FineReader 10. I save the PDF files of book scans at maximum quality, as that lets me retain good resolution for alternatives to OCR-to-PDF, e.g., OCR to Word or HTML. At maximum quality, those searchable PDFs are huge files. For those I want to keep at a more practical file size, I use the Web page conversion setting of PDFShrink, which results in a 90% reduction in file size and still good onscreen appearance and print quality.

Of course, there’s another approach to attaining small PDF sizes after OCR, one allowed by many OCR applications (but not DEVONthink Pro Office OCR, currently). Instead of retaining an image of the text in the paper copy, the PDF image layer is reconstructed from the converted text. Here’s the rub: no OCR software is 100% accurate, and any marks or smudges in the paper copy will cause recognition errors. In my opinion, that makes the resulting PDF untrustworthy. Of what use is a PDF copy of a receipt if the price is garbled, say the original $118.73 recognized as %71B.13?

Urhm…thank you for the long explanation.

I’ll point out a couple of things:

  1. No matter what manipulations the OCR program feels a need to subject the image to as part of the OCR process, there is no need to save its horribly sub-optimal results as the image layer of the resulting PDF. It should be possible to turn that off…

  2. Apple’s PDF library is somewhat…what’s a nice way of putting this…weak, yeah that’s it weak. So, if you’re dependent on it inside ABBYY, well, yes, you’re pretty much hosed it appears. Incidentally, that code is just as proprietary as that from Adobe, no? (Don’t get me started on what happens if you combine PDF documents containing both Roman and Cyrillic characters using Apple code, that was last year’s trauma.)

Thanks; I’m off to automate things with PDFpen, which, while not perfect, at least allows that I might actually want my PDF image left just the way I created it in the first place.

Of course Apple’s code is also proprietary. As I noted, Adobe code in Acrobat Pro can produce smaller file sizes after OCR than the corresponding OS X code used by most Mac OCR apps. One can play with Quartz settings in OS X, but I find it more convenient to use an app like PDFShrink, which has preconfigured setting options instead, if I need to shrink a PDF.

I wish I were as satisfied with the accuracy of text recognition by PDFpenPro as by the ABBYY apps. I don’t use PDFpenPro for OCR, but I like it for other purposes, such as the ability to create form fields.

There are no perfect OCR apps. Characteristics such as accuracy of text conversion, file size and screen view/print quality can be balanced, depending on one’s needs and preferences.

A question for jradel: why don’t you just do the OCR as part of the original scan with ScanSnap and import or index the OCRed document into DTPO? That’s my workflow. Just checked with an invoice, and the PDF OCRed by ScanSnap is 20% bigger than the PDF without OCR. Just curious.

I use Adobe Acrobat Pro to OCR scanned articles, books, and papers. It works pretty well and, as mentioned above, actually shrinks the file size a bit. If you are working with Asian characters, DTPO’s OCR is no use, I am afraid. As far as I know, it doesn’t work with Chinese or Japanese. Too bad.

I typically scan materials into PDF form, index them with DTPO, and batch OCR them when I get around to it (usually when I am plugged in and can let the computer spend some time chewing through them).

One cool thing about DTPO is that it tells you if something is a PDF with / without text, so you know what you need to OCR. Nice feature!

You apparently are still missing the actual point of my original question. I have a beautiful image layer. Lovely quality. Compressed very well. Is there any way to keep the OCR program sold by Devon Technologies from trashing this part of the PDF? There is no technical justification that I can think of for it to resample the image and save the result when you don’t want it to.

Because I was hoping that paying to move from DTP to DTPO wasn’t a waste of time and money? I now have a workflow using a different OCR program that doesn’t mess with the image layer produced by ScanSnap, which is of extremely high quality almost all the time. AppleScript has its uses.

So I read most of the thread and got maybe 60% of the meaning. You guys know a lot more about this than me. Is there a DTPO/OCR guidelines summary?
My intention was to understand why a 9 MB ebook, OCRed at 100% quality with Recognition set to Automatic, grew to 456 MB.
That’s just nuts to me.
Why OCR at all if the size grows that much?
All I wanted was a keyword-searchable ebook that I can load on a Kindle. Am I missing something?

You don’t need the quality at 100%. This creates essentially uncompressed images in the document. I would start at 75% Quality.

Don’t use Automatic unless YOU scanned the document. You don’t know what resolution it was scanned at. If it was scanned at 600 dpi or higher, you’re ending up with a ton of excess/unused data. You can manually set it to 72 or 96 dpi for onscreen reading (it is an ebook, after all).

This is also compounded by the number of pages in the document. Too high a setting with multiple pages yields a document with excess data in it.
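To put the numbers in perspective, here’s a quick sketch of the bloat factor reported above, and of how much pixel data a 96 dpi rendering would keep if the ebook’s pages really were scanned at 600 dpi (an assumption, per the advice above):

```python
# Reported ebook bloat: 9 MB -> 456 MB after OCR at 100% quality.
print(f"bloat: {456 / 9:.0f}x")  # ~51x

# Re-rendering a 600 dpi source at 96 dpi (pixel count scales with dpi^2):
print(f"data kept at 96 dpi: {(96 / 600) ** 2:.1%}")  # 2.6%
```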

Thanks for the feedback and guidelines. I’ll give them a try.