Using DT3b4 here. I notice that PDFs converted with OCR to PDF+Text consistently end up with much larger file sizes. The latest example is a two-page NYT article which is 353 KB as downloaded from their website and weighs 2.5 MB once converted. Surely the added (hidden) text doesn’t represent that much data, compared to the image element of the PDF?
Is this still true for DT3b7? I saw that the ABBYY engine was updated after I installed b7.
And if the PDF is still completely recreated during OCR in DT3, would you agree that, at least for ScanSnap Home users, it is better to disable OCR within DT3 and do the OCR within ScanSnap Home instead? Since each recreation of the PDF applies lossy compression to the picture layer, the scanned image in the PDF gets worse with every recreation.
For what it’s worth, I also run ABBYY FineReader as a standalone app (12.1), and this business of increased file size is still an issue there. There are slightly more opportunities for tweaking your scan output, e.g. you can set compression and image quality by way of a slider. I’ve always wondered, actually, whether there’s much difference between the ABBYY engine that ships with DTP and the standalone app in terms of managing file size or OCR accuracy (but then again, I never wondered long enough to actually run some tests!).
I have some wild numbers with OCR: some 1-2 MB files are coming out at over 100 MB. Unfortunately it appears random, or at least I can’t see a pattern yet. My database is going to blow out pretty quickly at this rate.
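One way to look for a pattern would be to keep copies of the files before OCR and compare sizes afterwards. Here is a minimal Python sketch of that idea. It assumes the originals and the OCRed copies sit in two separate folders under the same file names; the function name `size_growth` and the folder layout are just for illustration and have nothing to do with DT itself.

```python
from pathlib import Path


def size_growth(before_dir, after_dir, pattern="*.pdf"):
    """Compare file sizes for PDFs that exist in both folders.

    Returns a list of (name, bytes_before, bytes_after, ratio) tuples,
    sorted so the files that grew the most come first.
    """
    before_dir, after_dir = Path(before_dir), Path(after_dir)
    rows = []
    for original in sorted(before_dir.glob(pattern)):
        converted = after_dir / original.name
        if not converted.is_file():
            continue  # no OCRed counterpart, skip it
        before = original.stat().st_size
        after = converted.stat().st_size
        # Guard against empty originals to avoid dividing by zero.
        ratio = after / before if before else float("inf")
        rows.append((original.name, before, after, ratio))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


def format_report(rows):
    """Turn the rows into one readable line per file."""
    return [
        f"{name}: {before / 1e6:.2f} MB -> {after / 1e6:.2f} MB ({ratio:.1f}x)"
        for name, before, after, ratio in rows
    ]
```

Calling `print("\n".join(format_report(size_growth("originals", "ocred"))))` (folder names are placeholders) lists the worst offenders at the top, which might make it easier to see whether the blow-ups share a source, page count, or scan type.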
I don’t have time to do more detailed tests, but here’s something I tried quickly: doing OCR from within DT and, separately, in FineReader, and comparing the results. The first four files are from the FDR Library (typewritten text from the 1940s); they have already been OCRed, but often with subpar results, so it never hurts to run them through a modern engine again. The last file is from the NYT archives, which isn’t OCRed at all when downloaded.
If you read my original post, these are the PDFs I’m having trouble with (from NYT), and this is confirmed by the random tests below: the FDR files are reduced in size when OCRed by DT and increase in size when going through FR (default settings). The NYT file, however, increased in size in both cases, although I can’t understand how adding text to an image can result in such a difference.
So from this simple test there doesn’t seem to be any advantage to using FR instead of DT as far as file size goes.