OCR speed on Macs in general: anyone with a Mac Pro?

Hi,

I do a large amount of OCR, typically in Acrobat and sometimes in DTPro. I am using a recent iMac with a 3.4 GHz i7 and 24 GB of RAM. I am debating upgrading to a Mac Pro, with OCR as the main purpose. To give you an idea, I have a set of documents totaling 6 million pages that need OCR, and that means literally weeks of computing time. I am trying to get some idea of how much faster a relatively low-end Mac Pro would be than the machines I am using now. If anyone has thoughts, or a machine they could test to see how long a given PDF takes to OCR, I would be very interested in discussing. Thanks, all.

chris

With 6 million pages to process, you might want to look into a server-based high-volume OCR solution – ABBYY, Iris, etc.

DEVONthink and Acrobat on the desktop are only going to give you a single-threaded approach – one document at a time, page by page. You can’t run multiple concurrent instances of the same OCR engine with those products. Even if you could manage 1 second per page on average – which is highly dependent on the quality of the scan – you’d need to push that thread non-stop for over two months. :open_mouth:
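For anyone who wants to check that back-of-the-envelope figure, here is a small sketch in Python. The function name is made up for illustration, and the 1 second per page rate is the optimistic assumption from the post above, not a measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def days_of_ocr(pages, seconds_per_page):
    """Days of non-stop, single-threaded OCR for a given page count."""
    return pages * seconds_per_page / SECONDS_PER_DAY

# 6 million pages at an assumed 1 second per page
print(round(days_of_ocr(6_000_000, 1.0), 1))  # 69.4
```

Poorer scans would push the seconds-per-page figure up, and the total scales linearly with it.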

Christian has mentioned that a future release will upgrade ABBYY to support multiple cores, but there’s no published timeline on when that will happen.

PDFPen uses the OmniPage engine and does support multiple cores. But six million pages? YIKES!

Thanks for the thoughts. I am curious to try PDFPen. Does it do batch OCR? I guess I could look.

Yeah, I have thought about a server option, but then I am getting a PC?? Plus spending lots of money. Honestly, letting it run for weeks is not particularly a big problem and 6 million pages is a real number, but definitely at the high end of what I deal with.

PDFPenPro does not do batch OCR. Personally, I like the results from Acrobat the best and find it the easiest engine to work with.

Thanks. But that leaves me asking the question: how much would a Mac Pro speed up Acrobat’s OCR engine… ahhhh.

I looked at ABBYY Recognition Server; it seems that it’s at least $1,500 for the software, plus buying a PC.

The following scheme might be crazy, and outright unworkable, but just as a thought, I’ll throw this at the experts who can probably debunk this quickly:

Like Korm I like Acrobat’s OCR facilities. I have not explored this in detail, but the “Action Wizard” in Acrobat Pro (I have version 10) offers OCR, so batch-OCR’ing should be possible.

So far so good. Here is the crazy and possibly wrong idea: while the OCR engine in Acrobat is, as far as I know, single-threaded, it is generally possible to launch the same application multiple times in OS X, using

open -n /Applications/my.app/

If (and that’s a big if) this works with Acrobat, one could put N instances of Acrobat on N cores and run a batch action in each, working through files in parallel. That could potentially allow pseudo-multi-core OCR operation.
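To make the idea concrete, here is a minimal Python sketch of the bookkeeping only: dealing a file list into N batches and building one `open -n` command per instance. The function names and file names are hypothetical, the app path is the placeholder from the command above, and nothing here is known to work with Acrobat specifically. The commands are only built and printed, not run.

```python
def split_round_robin(files, n_instances):
    """Deal files into n_instances batches, one batch per app instance."""
    if n_instances < 1:
        raise ValueError("need at least one instance")
    batches = [[] for _ in range(n_instances)]
    for i, name in enumerate(sorted(files)):
        batches[i % n_instances].append(name)
    return batches

def launch_commands(app_path, n_instances):
    """Build one `open -n` command per instance (not executed here)."""
    return [["open", "-n", app_path] for _ in range(n_instances)]

# Hypothetical file names, for illustration only
files = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"]
for batch in split_round_robin(files, 2):
    print(batch)
for cmd in launch_commands("/Applications/my.app/", 2):
    print(" ".join(cmd))
```

Each batch would be moved into its own folder, and each instance would then be pointed at one folder by hand. On a Mac, a command list like these could be passed to `subprocess.run` to do the actual launch.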

Chances are that the licence of Acrobat would not allow this.

Is there a way to force it to run on another core? I guess it would just default to another core? Interesting idea for sure.

When I run a single instance of OCR in Acrobat, it grabs pretty much 100% of one core (actually one hyperthread). So yes, other instances would simply grab another one. I think that part should be no problem whatsoever (though you might want to turn “App Nap” off for Acrobat, as only one instance would be in the foreground).

I suspect that this will not work because Acrobat has something in it that will prevent starting multiple instances.

UPDATE: Just tried it, and I was indeed able to start two copies of Acrobat on OS X 10.9!

Not sure this was made clear above:

In Acrobat Pro XI, Tools > Text Recognition > In Multiple Files is what I use for batch OCR. The command seems to move around between versions of Acrobat Pro.

cool, i will try it out.

Bummer! I must have seen the “multiple files” button hundreds of times. So Korm is of course correct: no need for the Action Wizard. I have never used bulk conversion, as I only convert old files (scanned scientific papers) whenever I stumble upon one I believe I will need in the future, and then I check each one individually to confirm that the conversion succeeded. On rare occasions the single-file procedure stalls because Acrobat reports a problem, and I wonder how that is handled in the multi-file conversion. If an “all-night” queue stalled at 1am on average, it would be annoying. The multi-file routine had better not stop with a popup panel; it should skip that file and make a log entry, so that those rare cases can be worked through individually the next day.
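The skip-and-log behaviour wished for here looks roughly like the following Python sketch. It is a generic pattern, not a description of how Acrobat's multi-file routine actually behaves; `run_queue` and `fake_ocr` are hypothetical names, and `fake_ocr` merely stands in for a real OCR call.

```python
def run_queue(files, ocr_one):
    """Run ocr_one on every file; record failures instead of stopping."""
    done, failed = [], []
    for name in files:
        try:
            ocr_one(name)
            done.append(name)
        except Exception as err:
            # Any failure: note it and move on to the next file
            failed.append((name, str(err)))
    return done, failed

def fake_ocr(name):
    """Hypothetical stand-in for a real OCR call; fails on one file."""
    if name == "bad.pdf":
        raise RuntimeError("could not recognize text")

done, failed = run_queue(["a.pdf", "bad.pdf", "c.pdf"], fake_ocr)
print(done)    # files that went through
print(failed)  # files to work through by hand the next day
```

The point of the design is that one bad file costs one log entry rather than the rest of the night.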

So I think the pieces are in place: Multi-file conversion in Acrobat is easy, and running multiple instances on a many-core Mac should work.

Gelbin: If you try this, please report how it went!

update:

I have two instances of Acrobat Pro running OCR on about 5000 documents each. It is working, though all 8 logical cores of my iMac’s i7 are pretty well maxing out. Guess that means no Logic audio work tonight!

So far, I have proven that you can double your parallel pipeline, use all your cores, and slow your machine down. The question that still remains, though, is the important one: will it actually OCR the PDFs quicker than if I had used one instance and not bogged the machine down as much? I will report back when I have some more data…

You might want to try utilizing only 4 cores. The additional 4 are hyper-threaded and might actually slow down the entire process.

bosie - not sure how I would limit it? It seems that Acrobat uses 4 cores for one instance (contrary to what others thought it might do). When I added the other instance, the other 4 cores kicked in but bounce around a lot; however, all 8 are at 50+ percent a good bit of the time (making the machine not ideal to use for other things).

Update - I did a quick calculation after about 2 hours of running and I am getting about 0.8 pages OCR’d per second with 2 instances running. That would mean that my 6 million pages would take about 2 months on this one machine. Though that’s a long time, it’s not completely out of whack.

The original inquiry was focused on whether it made sense to purchase a Mac Pro to do this task. I still have no clue of the answer. I could get 2 decently spec’d iMacs for roughly the same price and have them both cranking on documents, or…?
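The two-iMac option can be put into rough numbers. This is a minimal Python sketch, assuming the 0.8 pages per second measured above holds steady and that a second identical machine simply doubles throughput. Both are assumptions, and nothing here says how a Mac Pro would compare. The function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def days_needed(pages, pages_per_second, machines=1):
    """Days of round-the-clock OCR, assuming throughput adds up across machines."""
    return pages / (pages_per_second * machines) / SECONDS_PER_DAY

for machines in (1, 2):
    print(machines, round(days_needed(6_000_000, 0.8, machines), 1))
```

On those assumptions one machine needs about 87 days, a little more than the 2 months estimated above, and two machines about 43. Any time lost to QC, stalls or restarts comes on top of that.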

Still looking for additional thoughts and suggestions; all are welcome.

Just a little whack. :laughing:

Don’t forget – you need to do some QC, fix problems, restart when the power fails, etc.

Just to be clear, I am not saying it does not suck. This is the kind of thing I wish Xgrid would help with, but…

So yes, it sucks. For anyone who has had to maneuver tens of thousands of PDFs, just opening folders and selecting and copying files can shut your system down… so yes, it’s whack, but just how whack is the issue! :mrgreen:

Sorry if this has already been suggested, but have you considered a dedicated server solution for this? Seems like there are commercial solutions for this very type of problem.

Not that I don’t want OCR to be faster too!

Like a PC server? Yeah. But that, plus running dedicated software, seems like it would be a lot of money. The software I saw was also limited to 100,000 pages per year, which is crazy, and still cost $1,500. I don’t do this all the time, basically just at the start of a big case.

Remember that with multiple machines, you can run at most 2 copies of one purchased licence (this applies to Creative Suite 6 and earlier, not sure what their new licence says for the “rented” version).

Maybe Korm knows more about using the multi-file OCR, but my use has been restricted to single file use. And my statement about “single-thread” refers to that. So maybe your observation shows that for multi-file OCR, Acrobat can use more than one core. Then your problem would be solved without multiple instances!

I do not understand exactly what you are after. First you want to OCR a zillion files, which basically means that your machine needs to go full tilt night after night on this. Then you find that Acrobat uses all the cores, and then you complain “no Logic audio tonight” and that your machine “slows down”. Well yes, it’s doing a lot of OCR. If you want to do other things on the side, then you should go with one instance of Acrobat, using the 4 cores (or HT?), and use the other ones for your other work. Of course, you’ll most likely lose speed. So you have to make up your mind. I thought you could run this machine overnight in dedicated OCR mode.

If you are thinking about getting a new machine for this, it seems to me that a cheaper PC box with Acrobat for Windows might be most cost-effective. And then you would run nothing but the OCR on that machine. All the fanciness of the Mac Pro is not really geared towards this rather mundane, CPU-intensive job. Unless you’d like to have a new Mac Pro anyway. You could possibly dig out some not-too-old PCs somewhere to set this up. The cost would then be a copy of Acrobat for Windows. With the new scheme, you could get a “rental” only for the period you do this OCR project.