OCR in DTPro 2.7.6 increases file size dramatically

I have a database with about 1,000 PDF files, most of them scans without any text layer, and I used the OCR feature of DTPro Office 2.7.6 to add the text content.

I used the settings “same as scan” and 95% quality, and ended up with dramatically larger files:

For example, a PDF that was 1.5 MB before OCR came out at 18.8 MB!

How is that possible? I’d like to keep the original quality, but the text layer itself can’t possibly need 17.3 MB, can it?

  • Could somebody please explain what’s happening here and how to get a better result?
  • Is the image being re-encoded and losing its compression?
  • Is there any post-processing I can do on the OCRed files (a command-line batch process preferred) to reduce the file size while preserving the text layer?

As I said I’d like to keep the image quality as it is and just add the text layer.

Martin

It is not the text layer. Depending on the initial quality of your document, the 95% setting might make the JPEGs unnecessarily large. However, I doubt that alone would create such a blowup in size.
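
Just to illustrate how much the JPEG quality knob matters on scan-like content, here is a tiny self-contained sketch (an illustration only, assuming Python with Pillow installed; the synthetic noise page merely stands in for scanner grain and has nothing to do with DTPO’s actual pipeline):

```python
# Illustration: JPEG size vs. quality on noisy, scan-like content.
# The synthetic noise page is a stand-in for scanner grain (assumption);
# real scans behave similarly, since JPEG spends bits preserving noise.
import io
from PIL import Image

# A 300 dpi A4-sized page filled with mild Gaussian noise.
page = Image.effect_noise((2480, 3508), 32).convert("RGB")

for quality in (50, 75, 95):
    buf = io.BytesIO()
    page.save(buf, "JPEG", quality=quality)
    print(f"quality={quality}: {buf.tell() / 1e6:.1f} MB")
```

On content like this, the jump from 75% to 95% typically multiplies the file size, which is why a 95% setting is rarely worth it for scans.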

I was originally also astonished by the large files produced. I used 300 dpi as the output resolution and 75% JPEG quality, yet the files came out much larger than those from Acrobat Pro 10 at the same 300 dpi. As a result, I switched to Acrobat.

As an aside: I recently found out about Acrobat’s “ClearScan” mode (available up to version 10, I think; it has since been replaced by something better), in which actual vector fonts are generated and substituted for the image layer. For suitable documents this can reduce the file size drastically (10x), at the cost of no longer seeing the original page, so if something is recognized wrongly, you won’t be able to judge that later on. It works amazingly well, but the fonts are unfortunately really ugly. For “utility” docs this could be fine; for longer texts, I don’t like it.

I finally purchased ABBYY FineReader, which has a similar mode but substitutes crisp, “real” fonts (I believe the newer Acrobat versions do that, too).

Here is my experience so far:

  1. DTPO’s built-in ABBYY engine simply makes the docs too big (I’m sensitive to that, because most of them get synced to my iOS devices, and 70 MB docs bog down the sync).

  2. Acrobat Pro does an excellent job in two ways: (i) the file size is quite small (even when keeping the image layer), and (ii) little tweaking is needed, i.e. you throw a document at it and it just gets the job done.

  3. The full ABBYY FineReader 12 seems more finicky to me than Acrobat. If I hand it a document, it will complain on many pages about a “wrong DPI setting” etc. If you then tune the settings accordingly (often on a per-page basis, though), you can get amazing results, especially when not keeping an image layer; in that case the files can be tiny, and the text crisp and fully scalable.
    (Maybe I have not discovered all the features yet, but I wonder why it has no “auto” mode that simply resets the DPI to whatever it suggests to me.)

Hi, thanks a lot for your detailed reply!

I still hope there is a way to do it with DTPro, as I do not want to invest in another software…

In the meantime I found this excellent article, macworld.com/article/2043857 … g-ocr.html, which claims that scanning at 150 dpi and then OCRing at 150 dpi/50% should produce good results.

However, as I have already OCRed all the PDFs, I’m not only wondering how to do it better next time, but above all how to fix (automatically, as a batch process) the ones I already have: it seems I have to recompress the images?!

Try it and see how it works for you. In my case, if I scan some arbitrary work memo that came on paper, I don’t really care how it looks, as long as it is searchable and readable.

But then there is the other type of document: something I captured from paper that I want to read and enjoy, and which may also be somewhat longer. Doing intense reading in the presence of JPEG artifacts or fuzzy fonts bugs me. For those you need either at least 300 dpi or the font-substitution techniques mentioned in the post above.

If you have existing OCR’ed docs that are too large, you can run them through a utility like PdfShrink, which can recompress all the images. It generally works beautifully (though I have not tried it on OCR’ed files so far). Make sure not to strip metadata from the file, so that the text layer doesn’t vanish, and test thoroughly before putting it into production.
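
If you’d rather script this yourself (Martin asked for a command-line batch process), one free route might be Ghostscript, whose pdfwrite device re-encodes and downsamples the images and, in my understanding, normally carries the OCR text layer through untouched. A minimal sketch, assuming gs is on your PATH and with hypothetical folder names:

```python
# Hypothetical batch-recompression sketch using Ghostscript (gs).
# pdfwrite re-encodes/downsamples images; the OCR text layer normally
# survives, but verify on copies before trusting it with your library.
import pathlib
import subprocess

SRC = pathlib.Path("ocr_output")   # folder with the oversized PDFs (assumption)
DST = pathlib.Path("ocr_shrunk")   # recompressed copies land here (assumption)
DST.mkdir(exist_ok=True)

for pdf in sorted(SRC.glob("*.pdf")):
    out = DST / pdf.name
    subprocess.run(
        [
            "gs",
            "-sDEVICE=pdfwrite",
            "-dPDFSETTINGS=/ebook",   # ~150 dpi images; /printer keeps ~300 dpi
            "-dNOPAUSE", "-dBATCH", "-dQUIET",
            f"-sOutputFile={out}",
            str(pdf),
        ],
        check=True,
    )
    print(f"{pdf.name}: {pdf.stat().st_size / 1e6:.1f} MB -> "
          f"{out.stat().st_size / 1e6:.1f} MB")
```

Afterwards, spot-check that the text is still selectable and searchable in a few of the output files before replacing anything.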

Downside: PdfShrink is not free. But to be honest, for those of us who deal with large numbers of such documents, the “time is money” adage holds, and I believe it is simply necessary to spend the money on a suite of tools: DTPO, Acrobat Pro, possibly ABBYY, PdfShrink. There’s no way around it. It bothers me when some people complain about the price tag of DTTG. For anyone who is making half-decent use of it, it’s a fantastic deal, even if it is 15x more expensive than the average fart app.

While I appreciate having OCR built into DTPO, I have rarely found it practical; usually the opposite. Recently I had to OCR a large number of scans for a course pack I’ll be using in a course I’m teaching. The 175-page document became unreasonably huge, and reducing the DPI and increasing the compression degraded the quality unbearably.

I ended up OCR’ing it in PDFpen Pro, and while the accuracy isn’t top-notch, the file size and quality stayed essentially the same. Since the scans were good to begin with, accuracy isn’t as big an issue; it just needs to be good enough for the text to be reasonably searchable and selectable for students to highlight (for the latter, accuracy barely matters).

Acrobat’s ClearScan (or its successor) is by far the best in terms of both visual quality and file size. Normally this would be the route I’d go if I could justify the Acrobat subscription (I’d need to do a whole lot more OCR’ing to justify that!)

It’s a bit disappointing that I find the OCR’ing generally unusable in DTPO.

Just a quick follow-up:
While I have complained here and elsewhere about the resulting size of PDFs OCR’d in DTPO, I have to admit the accuracy of the OCR is hard to beat (I think Adobe ClearScan is really the cat’s pyjamas in terms of accuracy and file size… but Adobe :frowning: ).

I use large government documents in one aspect of my research. These documents tend to be (for reasons I can’t explain) scans of paper, i.e. images of text. The scans are reasonably decent in terms of DPI, but they can be a bit noisy and otherwise rather rough. They range from 150 to 2,000 pages. Recently, on a smaller document (175 pages), I thought I’d test DTPO against PDFpen Pro to see:

  1. Which was faster,
  2. Which produced the smaller file, and
  3. Which produced the better-looking file.

PDFpen produced a smaller, better-looking file, and did so a fair bit faster than DTPO. The quality settings I have in DTPO did not bring the size down to a comparable level, and they produced an uglier PDF.

Since the PDFpen file was done first, I began searching through it, trying to select text here and there, and so on. While in general it did okay, it was clear that the less-than-ideal quality of the scans was a bit of a challenge: lots of poorly aligned elements in the OCR layer, many inaccurate or incomplete words, and so on. Not a big deal in itself, but I generally need to search these documents for specific words, so accuracy is somewhat critical in this case. I don’t want to track down a keyword in a 500-page document manually because the OCR failed to detect it.

Once the DTPO version was done, I tried some of the same text-based things, searching, selecting words, using the “lookup” feature in Mac OS, and found that in many places where PDFpen Pro struggled to OCR, DTPO succeeded like a champ.

So, the moral of the story is that OCR is about tradeoffs. In this specific use case I am burdened with a massive PDF no matter what; whether the file is 60 MB or 90 MB isn’t terribly important (though of course I would prefer the smaller size), because it doesn’t interfere with my ability to use them. Here, accuracy matters more than file size. And since it isn’t a lengthy monograph, compression artefacts are not the end of the world either (though I certainly don’t like them). In this case DTPO is the clear choice.

In other instances, where the scan quality is better, PDFpen may do a better job, with a smaller file size and marginally better image quality. That matters more where file size counts, such as for things I store or need to access on a mobile device.

So, I think the gist is, the suitability of DTPO as an OCR tool depends a great deal on your needs. It doesn’t always work for my needs, but its high level of accuracy really does win out in this specific case for me.