OCR problem with files longer than 30 pages

taglia · February 4, 2007, 11:59pm

Hi,

I have a problem / suggestion. Even though I’m on a Dual Core 2 GHz with 2 GB of RAM, I cannot manage to use the OCR on documents with more than something like 30 pages. I used it successfully on a 15-page article, but not on a 35-page one: the hard disk starts spinning, the available memory drops to zero and the swap file jumps up to 2 GB. I waited for 15 minutes, but the process was stuck at the initial window (the one before the IRIS popup). When I stop the OCR process, the memory allocation goes back to normal. Reading the forum, others do not seem to have this issue. Do you think I should wait longer? Is the massive memory allocation normal?

And now one suggestion: it would be handy to have the possibility to limit the OCR conversion to the first N pages (even for a long article the topic should be pretty clear from let’s say the first 20 pages). This could be a general option, in the preferences.

Thanks for the great product.

Best,
Cesare

Bill_DeVille · February 5, 2007, 4:18am

Hi, Cesare. I’ve successfully run many 50-page scans to OCR to database on my MacBook Pro dual core 2.0 GHz with 2 GB RAM and on my Power Mac G5 2.3 GHz dual core with 5 GB RAM.

OCR uses a lot of computer resources, when you think about what’s going on. Especially if there’s not much free RAM available when you start scanning, a long document may take a while, although there are some things you can control.

I use a little preference pane called MenuMeters (Google it). That lets me monitor the activity of both CPU cores and the amount of free RAM available.

Free RAM is important. Although Apple’s Virtual Memory will let a memory-intensive operation proceed to completion, it does so by swapping data between disk and RAM, using VM swap files. Compared with memory operations in physical RAM, manipulating memory from disk is far slower.

If I’ve got little free RAM available, I know that a memory-intensive procedure that I’m about to start (such as scanning and OCR) may prove slower than I’d like. So I can quit some other applications I’m not using at the moment, and quit and relaunch DTPO to free up some more memory. If I’ve accumulated large VM swap files, I may simply reboot before doing scan/OCR.
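For anyone who would rather script the free-RAM check than watch a menu bar meter, here is a minimal Python sketch. It assumes the text comes from the macOS `vm_stat` command-line tool, which reports memory in pages. The function names and the 500 MB threshold in the usage note are purely illustrative and have nothing to do with DTPO or the OCR plugin.

```python
import re

def parse_vm_stat(output: str) -> dict:
    """Turn text printed by macOS `vm_stat` into {label: bytes}."""
    # The header line states the page size; fall back to 4096 (an assumption)
    # if it cannot be found.
    header = re.search(r"page size of (\d+) bytes", output)
    page_size = int(header.group(1)) if header else 4096
    stats = {}
    for line in output.splitlines():
        # Lines of interest look like "Pages free:     25600."
        match = re.match(r"^(.+?):\s+(\d+)\.?\s*$", line)
        if match and match.group(1).startswith("Pages"):
            stats[match.group(1)] = int(match.group(2)) * page_size
    return stats

def free_megabytes(output: str) -> float:
    """Free physical memory in MB according to the 'Pages free' line."""
    return parse_vm_stat(output).get("Pages free", 0) / (1024 * 1024)

def enough_free_ram(output: str, needed_mb: float) -> bool:
    """True if at least `needed_mb` MB is reported free (threshold is the caller's guess)."""
    return free_megabytes(output) >= needed_mb

# Made-up sample in the vm_stat layout, so the sketch can be tried on any machine.
SAMPLE = """Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                          25600.
Pages active:                       200000.
Pages inactive:                     100000.
Pages wired down:                    50000.
"""
```

On a Mac, the real text could be obtained with `subprocess.run(["vm_stat"], capture_output=True, text=True).stdout` and passed to `enough_free_ram(text, 500)` before starting a long scan/OCR job. If it returns False, quitting other applications first is probably worth the trouble.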

If you have a large DTPO database, it may use a lot of memory once loaded. You may find it efficient to create a new, empty database when you are going to scan many paper documents. This will free up your computer resources (especially important if you have limited RAM) so that the scan/OCR process will be noticeably faster. Later, you can move the new OCR’d files to their appropriate database.

Option to OCR only the first x pages of a PDF? Personally, I wouldn’t be satisfied with that. I’ve got many long PDFs in which the information in the last y pages may be more important for my searching and analysis needs than the information in the first x pages. It’s not just what the document is about; it’s about the information that it actually contains.

I’m not even certain that such an option could be implemented, given the OCR plugin we are using.

Any others want to comment on that suggested option?

taglia · February 10, 2007, 2:52am

Hi Bill,

Sorry for the delay, it was a busy week. Thanks for your help. It actually works: I had just assumed it had hung and stopped it before it could finish the job.

I asked for the additional option mainly because I thought there could be a problem with long texts. As it works fine, I agree with you: much better to have the full content!

Best,
Cesare