Again, OCR with Scansnap, DTPO, and Finereader

pvonk · November 13, 2014, 10:04pm

I’ve been reading through the (more recent) posts about file sizes when OCR is performed on a PDF and have tested a one-page B&W page (which has a mix of bold and not bold black print - don’t know if that is important). I’ve used a number of combinations using my new iX500 (which means the old S500M goes downstream to the kids) using latest DTPO, Scansnap and Finereader versions. One very unusual result is the following…

Scansnap to file (w/o OCR) results in 388KB. (Best, auto color det., compress=3)
If I then use Finereader to OCR that file, the result is 113K - Huh???

FYI:
I experimented with Scansnap w/o OCR to DTPO which then OCRs (result=782K) and Scansnap w OCR to DTPO w/o OCR (result=408K).
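
To put the four results side by side, here is a minimal sketch that expresses each size as a percentage of the un-OCRed Scansnap file. The sizes are the ones reported above; the dictionary labels and the `relative_sizes` helper are made up for illustration.

```python
# Sizes in KB as reported in the post above; labels are illustrative only.
sizes_kb = {
    "Scansnap, no OCR": 388,
    "Scansnap no OCR, then Finereader OCR": 113,
    "Scansnap no OCR, then DTPO OCR": 782,
    "Scansnap OCR, DTPO no OCR": 408,
}


def relative_sizes(sizes, baseline_key):
    """Return each size as a whole-number percentage of the baseline size."""
    base = sizes[baseline_key]
    return {name: round(100 * kb / base) for name, kb in sizes.items()}


for name, pct in relative_sizes(sizes_kb, "Scansnap, no OCR").items():
    print(f"{name}: {pct}% of the un-OCRed scan")
```

By this measure the Finereader result is roughly 29% of the original scan, while the DTPO OCR result is roughly double it.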

How would Finereader cut the size of a non-OCRed PDF?? I’ve moved the OCRed file (113K) into DTPO and used the convert to plain text and verified that the OCR was good. I also printed out the file and it looks good, not grainy.

I’d hate to introduce Finereader to the workflow when scanning many documents in one session - a bit more work involved. Scansnap w OCR to DTPO is an easier workflow.

korm · November 13, 2014, 10:55pm

You could experiment with different quality / compression settings in the three OCR software engines with a larger sample of documents. I’ve done that test with a representative sample off and on over the years, and found that I get better results on average by scanning with my S1500M to a folder, batch OCRing the whole folder with Acrobat Pro, then importing to DEVONthink. I’ve rarely experienced the highest quality from DEVONthink’s ABBYY engine.

This just happens to be what works for me – the material I scan is low priority, archival material for the most part. My higher priority documents are always generated by software and are machine-readable from inception.

pvonk · November 14, 2014, 12:13am

Yes, I should experiment with more documents; my primary surprise was that Finereader actually cut the file size after performing OCR. Could it be some form of downsampling of the original file? If so, the result is still very good when printed.

You mention Acrobat - I have a number of older Acrobat Pro editions as well as the current CC version. However, since Adobe went to the subscription model, I am moving away from their software as fast as I can. Yes, I do subscribe to CC, and at the academic $9.95 per year I certainly shouldn’t complain. But as I am about to retire soon, I do not plan to continue with their service.

My primary desired workflow is to get PDFs into a Finder folder, where they either arrive already converted (via Scansnap) or are converted by some other software (but not PDFpen; I’m not impressed with the quality of its conversion). From there they move to an intermediate folder where Hazel renames them according to a set of rules and drops them into a final indexed folder, where they become part of the DTPO universe. Currently, after the Scansnap step w/o OCR, I use Finereader (the best OCR of the ones I could use, not counting Acrobat), and finally rely on Hazel for the last step.
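
For anyone without Hazel, the rename-and-move step can be roughly approximated with a short script. This is only a sketch of the idea, not how Hazel works internally: the naming rule (prefixing the file's modification date) and the function names `file_scan` and `process_folder` are assumptions chosen for illustration.

```python
import shutil
from datetime import datetime
from pathlib import Path


def file_scan(pdf: Path, final_dir: Path) -> Path:
    """Rename a PDF to '<YYYY-MM-DD> <original name>.pdf' and move it to final_dir."""
    # Assumed rule: use the file's modification date as the prefix.
    stamp = datetime.fromtimestamp(pdf.stat().st_mtime).strftime("%Y-%m-%d")
    final_dir.mkdir(parents=True, exist_ok=True)
    target = final_dir / f"{stamp} {pdf.stem}.pdf"
    # Avoid overwriting an existing file with the same name.
    n = 1
    while target.exists():
        target = final_dir / f"{stamp} {pdf.stem} ({n}).pdf"
        n += 1
    shutil.move(str(pdf), str(target))
    return target


def process_folder(intermediate: Path, final_dir: Path) -> list[Path]:
    """Apply file_scan to every PDF currently in the intermediate folder."""
    return [file_scan(p, final_dir) for p in sorted(intermediate.glob("*.pdf"))]
```

Calling `process_folder(Path("Intermediate"), Path("Indexed"))` (both folder names hypothetical) would empty the intermediate folder into the indexed one, which DTPO could then pick up.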

gg378 · November 14, 2014, 3:00am

Two possibilities in my mind:

1. Indeed, for certain files (depending on the content), there can be significant differences in size with almost no perceivable quality difference.
2. There can be a significant amount of background data in a PDF. I had situations where the thumbnail previews took up more space than the actual content! Different programs deal with that metadata differently.

In Acrobat Pro, open a PDF, then use “Save optimized”. The resulting save dialog has a button in the upper right corner (at least in Acrobat 10) labeled “audit space” (or similar); pressing it gives you an overview of how much space is used by what. You could compare the two files. Maybe there is something interesting there.
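
Without Acrobat, a crude first look is possible by counting the compression filter names that appear in each file. The sketch below is a rough heuristic, not a real PDF parser: it only searches the raw bytes, so it will miss filters declared inside compressed object streams and could in principle match stray bytes. `DCTDecode` is JPEG, while `JBIG2Decode` and `CCITTFaxDecode` are bilevel (pure black and white) schemes. If the smaller file shows a bilevel filter where the larger one shows JPEG, that would be one possible explanation for the size drop. The function names and file names are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# Standard PDF stream filter names; matched in the raw bytes of the file.
FILTER_RE = re.compile(
    rb"/(DCTDecode|JPXDecode|JBIG2Decode|CCITTFaxDecode"
    rb"|FlateDecode|LZWDecode|RunLengthDecode)\b"
)


def filter_counts(data: bytes) -> Counter:
    """Count how often each known filter name occurs in the raw PDF bytes."""
    return Counter(m.group(1).decode("ascii") for m in FILTER_RE.finditer(data))


def compare(path_a: str, path_b: str) -> None:
    """Print total size and filter counts for two PDFs side by side."""
    for path in (path_a, path_b):
        data = Path(path).read_bytes()
        print(f"{path}: {len(data) / 1024:.0f} KB, filters: {dict(filter_counts(data))}")


# Example call (file names are placeholders):
# compare("scansnap_no_ocr.pdf", "finereader_ocr.pdf")
```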