OCR file size issue

reb2012 · August 10, 2020, 2:05pm

I am using DTP3, mostly processing PDFs that I generated from JPGs using DTP2 before the upgrade came out. I recently found a folder I had missed, so naturally used the OCR in DTP3. The result is PDFs that are typically around 4 times as large. I thought a way to overcome this was to reinstall DTP2 and to use it just for the OCR, but the problem is the licence. (I converted enough files with DTP2 in demo mode to verify that file sizes were consistently much smaller, but rapidly hit the limit on the number I could process.)

The response may be that hard disk space is cheap and I could get a bigger hard disk. However, my problem is not disk space but that the larger PDFs take significantly longer to load, so working through a folder of 900 PDFs is a lot more tedious. (My files start out as single page documents, generated from photographs, which I then have to merge into multi-page documents.) At first I thought something must have gone wrong with the computer, so sluggish was it, and I tried rebooting, before I realised that the issue was (or presumably is) the larger files generated by the new OCR software.

Of course, maybe the answer is that there are some settings I could change to reduce the PDF file sizes, or to speed up loading of files, as I am just using the default settings, but I cannot find any. If anyone knows of settings I could change to improve things, I would be glad to know about them.

I thought about using an external program to compress the files (I have IRISCompressor, but it does not handle multiple files well) but that is not an ideal solution.

I should add that in other ways I think DTP3 is excellent, and this is the only issue I have found with the upgrade.

chrillek · August 10, 2020, 2:38pm

There are already several threads concerning this topic. If I remember correctly, the Peruvian seems to be in the Abby OCR library and the next version of DT should fish a workaround.

reb2012 · August 10, 2020, 2:51pm

Thanks. I thought there might be, but I could not find any threads this afternoon, and I did not know a workaround was planned. I take it that autocorrect has made some interesting corrections to your reply!

Blanc · August 10, 2020, 3:10pm

See this post of mine, for example.

And @chrillek, I’m not sure keeping Peruvians in the ABBYY OCR library is acceptable; it’s certainly something DEVONtech should be actively committed to stopping, rather than hoping for the next version to rumble around trying to solve it itself.

chrillek · August 10, 2020, 3:11pm

Indeed. I was in fact planning to go to Peru later this year, but that’s no longer possible. If and how that influenced autocorrect … no idea.

I wanted to write “problem”. But since going to Peru is one now, the correction seems appropriate.

reb2012 · August 13, 2020, 2:30pm

The issue seems to be fixed in the latest upgrade, which appears to give even smaller file sizes than DTP2 used to create. Maybe it is my imagination, but it seems to work pretty quickly too.

BLUEFROG · August 13, 2020, 2:57pm

Glad to hear it, imagined or not

And yes, there should be some improvements seen.

reb2012 · August 13, 2020, 3:15pm

I am re-processing a folder (sorry Group) with several hundred files that I thought too large, and it seems to be shrinking them by a factor of around 8.

BLUEFROG · August 13, 2020, 3:19pm

“a folder (sorry Group)”

No worries. We try not to be too pedantic about this unless there’s the potential for confusion in a discussion.

What resolution are you setting and are you using compression?

reb2012 · August 13, 2020, 3:20pm

I am using the default settings, and always have done. I had not found where I could change the settings.

BLUEFROG · August 13, 2020, 3:43pm

strmd · August 14, 2020, 12:18pm

I may still be seeing some issues related to this, even after holding out for 3.5.2.

If I set the resolution to 200 dpi (which seems perfectly reasonable, that’s typically what I scan at, though sometimes 300), a single A4-page PDF from my ScanSnap (lowest JPEG compression) weighing in at 625 kB balloons to 24 MB and reports a size of 58.5 x 82.5 centimeters. The process freezes up the application and shows the spinwheel for a few seconds on my quad-core i7 MBP.

Unchecking “Compress PDF” brings it down to 510 kB (though with the same reported print size). Setting the resolution at 150 dpi with compression on results in a 1.6 MB file.

Related to this: May I ask why more options related to PDF compression aren’t exposed to the user? Is it a question of licensing fees or more about not drowning us in details? The FineReader SDK is quite powerful and I’d love to have more control over the settings, like separate profiles I could choose from for different kinds of documents.
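Editorial aside on the 58.5 x 82.5 cm print size reported above: the thread does not confirm the cause, but the numbers are roughly what you would get if an A4 page scanned at 200 dpi were written out with its pixel dimensions interpreted at 72 dpi. The sketch below only checks that arithmetic. The 72 dpi figure is an assumption made for illustration, not something stated by the poster or by DEVONtechnologies.

```python
CM_PER_INCH = 2.54

def pixels(length_cm, dpi):
    """Pixels along one edge when a page of this length is scanned at `dpi`."""
    return round(length_cm / CM_PER_INCH * dpi)

def reported_cm(px, assumed_dpi):
    """Physical length a viewer reports if it assumes `assumed_dpi` for the image."""
    return px / assumed_dpi * CM_PER_INCH

A4_CM = (21.0, 29.7)   # nominal A4 width and height
scan_dpi = 200         # resolution mentioned in the post
assumed_dpi = 72       # ASSUMPTION: resolution the output PDF is interpreted at

px = tuple(pixels(side, scan_dpi) for side in A4_CM)
size = tuple(round(reported_cm(p, assumed_dpi), 1) for p in px)

print(px, size)  # (1654, 2339) (58.3, 82.5)
```

The result is within a couple of millimetres of the reported 58.5 x 82.5 cm, which fits a resolution-metadata mismatch but does not prove one.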

MauriceK · August 14, 2020, 2:25pm

I still have problems with the file size. The same original file (300 dpi, 144 kB) is 172 kB after OCR with 1.1.0, 723 kB after OCR with 1.1.12, and 761 kB after OCR with 1.1.13. The image is also better and closer to the original after OCR with 1.1.0 than after OCR with the newer versions.
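To put the sizes quoted above in proportion, here is a short sketch that turns them into growth factors relative to the original file. The version labels and sizes are taken from the post as written. The variable names are only illustrative.

```python
original_kb = 144  # size of the source file before OCR, as quoted in the post

# Size after OCR for each version mentioned in the post
after_ocr_kb = {"1.1.0": 172, "1.1.12": 723, "1.1.13": 761}

# Growth factor = size after OCR divided by the original size
growth = {version: round(size / original_kb, 2)
          for version, size in after_ocr_kb.items()}

for version, factor in growth.items():
    print(f"{version}: {factor}x the original size")
```

On these figures the oldest version adds roughly a fifth to the file size, while the two newer ones produce files about five times the size of the original.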