The suggestions by korm were ones I would have made.
When OCR is performed, a new searchable PDF file is created that consists of a rasterized image of the scanner output file plus a text layer. The procedures used to copy the image from the scanner output file to the new searchable PDF file are those available in OS X.
I do a lot of scanning with a ScanSnap. I usually choose the Best setting for this reason: If automatic color detection is set, copy containing color is scanned at only half the resolution of black & white copy. Your Better setting will result in 300 dpi resolution for black & white copy, but only 150 dpi resolution for copy containing color. We recommend 300 dpi or better resolution for good accuracy in text recognition. The Excellent image quality setting in ScanSnap Manager would be overkill for OCR purposes, and I don’t recommend it as the file sizes are very large. The Best setting in ScanSnap Manager will result in 300 dpi resolution for copy that contains color. That’s why I choose that setting.
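To make the resolution arithmetic concrete, here is a small Python sketch. It only encodes the figures quoted above. The 600 dpi black & white value for Best is my inference from the "color is half of black & white" rule rather than a number stated above, and the function names are made up for illustration.

```python
# Black & white resolution per ScanSnap image quality setting when
# automatic color detection is on. Color copy is scanned at half of this.
# ASSUMPTION: 600 dpi for "Best" is inferred from the halving rule
# (Best gives 300 dpi for color), not quoted from ScanSnap documentation.
BW_DPI = {"Better": 300, "Best": 600}

MIN_OCR_DPI = 300  # recommended minimum for good text recognition


def effective_dpi(setting: str, has_color: bool) -> int:
    """Resolution actually used for a page under the given setting."""
    bw = BW_DPI[setting]
    return bw // 2 if has_color else bw


def good_for_ocr(setting: str, has_color: bool) -> bool:
    """True if the page is scanned at the recommended 300 dpi or better."""
    return effective_dpi(setting, has_color) >= MIN_OCR_DPI


for setting in BW_DPI:
    for has_color in (False, True):
        kind = "color" if has_color else "b&w"
        print(setting, kind, effective_dpi(setting, has_color),
              good_for_ocr(setting, has_color))
```

The only combination that falls below the recommended resolution is Better with color copy, which is the case the Best setting avoids.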
Now let’s look at DEVONthink Pro Office Preferences > OCR. You checked the option to retain the resolution of the original scan. Yes, that will result in large files. I don’t check that option. For most documents (receipts, invoices, contracts, letters), I’m satisfied with searchable PDFs that are easy to read and that produce readable results when printed.
I check the options for 130 dpi and 50% image quality. That’s better than FAX quality, and because the ScanSnap produces images with good sharpness and contrast, I’m satisfied with the resulting searchable PDFs for most purposes. (But if I needed to include a PDF page image in a publication, I would choose it from the original scanner output file, which I’ve told Preferences > OCR to send to the System Trash following text recognition.)
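For a rough sense of what downsampling from a 300 dpi scan to 130 dpi means, this Python sketch counts pixels for a single page. The US Letter page size (8.5 x 11 inches) is my assumption for illustration, and pixel count is only a proxy: the actual file size also depends on the JPEG quality setting and on the page content.

```python
def page_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Number of pixels in a page image at the given resolution.

    ASSUMPTION: US Letter page size by default; pass other dimensions
    (in inches) for other paper sizes.
    """
    return round(width_in * dpi) * round(height_in * dpi)


scan = page_pixels(300)    # resolution of the original scan
stored = page_pixels(130)  # resolution kept in the searchable PDF

print(scan, stored, round(stored / scan, 3))
```

The ratio is (130/300) squared, so the stored page image holds a bit under one fifth of the pixels of the original scan before any JPEG compression is applied.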
As a result, most of the searchable PDFs in my databases are of reasonable file size. If the original paper copy was black & white, my searchable PDF will usually be smaller than the scanner output PDF file. Images in the paper copy can considerably increase file size. I’m usually satisfied with 50% image quality of the JPEG images (note that compression applies only for colored images).
I’ve tested just about every OCR application for Mac. Adobe’s Acrobat Pro can produce smaller searchable PDFs because Adobe uses proprietary code to create the image in the searchable PDF that’s more space-efficient than Apple’s code. (And since Adobe bloated its sale price for Acrobat to a ridiculous level, I won’t support that business model.)
But recognition accuracy is the most important feature of OCR apps, to me. I’ve got Acrobat Pro (pre-bloat price version), IRIS, PDFpenPro, OCRKit, ABBYY FineReader for ScanSnap, FineReader 12.0.4, FineReader 10 (for Windows), ABBYY OCR in DEVONthink Pro Office and a couple of others that have faded in development and that I don’t bother to install on my working Macs. In my tests, ABBYY OCR (in all of its versions) is best in accuracy. Acrobat Pro (pre-bloat price version) is not quite as accurate, next comes IRIS, and the rest of the pack I don’t find quite good enough.
I use a Xcanex portable book and document scanner that currently runs only under Windows and comes with ABBYY FineReader 10. I save the PDF files of book scans at maximum quality, as that lets me retain good resolution for alternatives to OCR to PDF, e.g., OCR to Windows or HTML. At maximum quality, those searchable PDFs are huge files. For those I want to keep at a more practical file size, I use the Web page conversion setting of PDFShrink, which results in a 90% reduction in file size while still giving good onscreen appearance and print quality.
Of course, there’s another approach to attaining small PDF sizes after OCR, one that many OCR applications allow (but not DEVONthink Pro Office OCR, currently). Instead of retaining an image of the text in the paper copy, the PDF image layer is constructed from the converted text. Here’s the rub: no OCR software is 100% accurate, and any marks or smudges in the paper copy will result in errors of recognition. In my opinion, that makes the resulting PDF untrustworthy. Of what use is a PDF copy of a receipt if the price is reproduced in garbled form, such as an original $118.73 recognized as %71B.13?