Is it possible to parallelize OCR processes?

Currently, I need to OCR several hundred document scans that are stored in PDF format. This seems to happen in a single-threaded process; see the attached screenshots.

Would it be possible to have OCR of many documents run in as many parallel threads as there are CPU cores (perhaps using a low process priority / high “niceness”)?

I understand that the OCR module is a licensed 3rd-party product, so this may be impossible due to licensing issues. But it would be great to be able to use the many CPU cores that are idling currently :wink:

Our licence does not allow more than one instance of the ABBYY OCR engine to be running at any one time, so we cannot process multiple documents in parallel. The ABBYY OCR engine will use up to 4 cores when processing a document; however, how the load is shared across those cores is internal to the ABBYY engine.

Have you thought about using a free OCR library and writing your own threading logic? It’s really not that difficult, as most languages support multi-threading.

That’s what I do.

Check out Tesseract OCR.
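
For anyone wanting to try this, here is a rough sketch of what "your own threading logic" could look like in Python. It assumes the `tesseract` command-line tool is installed and that the scans have already been exported as page images, since Tesseract reads image files rather than PDFs. The function names, the `workers` parameter, and the use of `nice` for a low process priority are illustrative choices of mine, not anything from DEVONthink or ABBYY.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(image: Path, out_dir: Path, niceness: int = 10) -> list[str]:
    """Build the command line for one page image.

    `nice -n` lowers the process priority; the trailing `pdf` asks
    tesseract to write a searchable PDF next to the given output base.
    """
    out_base = out_dir / image.stem  # tesseract appends ".pdf" itself
    return ["nice", "-n", str(niceness), "tesseract", str(image), str(out_base), "pdf"]


def ocr_batch(images, out_dir, workers=None, runner=subprocess.run):
    """Run one tesseract process per image, at most `workers` at a time.

    Threads are enough here because the heavy work happens in the child
    processes. `runner` is injectable so the logic can be tried without
    tesseract installed. Returns (image, return code) pairs in input order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    workers = workers or os.cpu_count() or 1

    def ocr_one(image):
        result = runner(build_command(Path(image), out_dir))
        return image, result.returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_one, images))
```

Calling `ocr_batch(sorted(Path("scans").glob("*.png")), "ocr_out")` would then keep roughly one OCR process per core busy; the `scans` and `ocr_out` folder names are placeholders.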

I thought so… unfortunately, the library doesn’t seem to see any potential for parallelization on my workload. But thanks for the info nonetheless!

We are aware of Tesseract and we use it for OCR in DEVONthink to Go. The ABBYY OCR currently provides more accurate results; however, we are continually assessing the available options to improve the OCR.

Just as a small addition: I’m now past 20 hours of converting and almost halfway through the PDF pile I need to OCR. I’ve never seen the helper process use more than a single core. So, when you’re talking to your contacts at ABBYY next time, it would be great if you could bring this up – it’s so boring to have to wait for all those conversions while there’s so much more CPU power available.

If you own more than one Mac, I guess you could split the task between them and sync afterwards.

In the CPU usage view in Activity Monitor I can see activity spread across multiple cores; however, there is one core that has higher activity than the others. If you are running on an Apple Silicon Mac, ABBYY are working on adding native support, which may bring speed improvements.