Is it possible to parallelize OCR processes?

jstarek · September 17, 2021, 8:55am

Currently, I need to OCR several hundred document scans that are stored in PDF format. This seems to happen in a single-threaded process; see the attached screenshots.

Would it be possible to have OCR of many documents run in as many parallel threads as there are CPU cores (perhaps using a low process priority / high “niceness”)?

I understand that the OCR module is a licensed 3rd-party product, so this may be impossible due to licensing issues. But it would be great to be able to use the many CPU cores that are currently idling.

aedwards · September 17, 2021, 9:19am

Our licence does not allow more than one instance of the ABBYY OCR engine to run at any one time, so we cannot process multiple documents in parallel. The ABBYY OCR engine will use up to 4 cores when processing a document; however, how the load is shared across those cores is internal to the engine.

palendrome · September 17, 2021, 11:39am

Have you thought about just using a free OCR library and just writing your own threading logic? It’s really not that difficult as most languages support multi-threading.

That’s what I do.

Check out Tesseract OCR.
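
For anyone who wants to try this, here is a rough Python sketch of the idea, not a tested production setup. It assumes the `tesseract` command-line tool is installed and that the PDFs have already been rasterized to page images (Tesseract does not read PDFs directly). The function names `tesseract_command`, `run_one` and `ocr_all` are made up for illustration. Each job runs as its own low-priority process via `nice`, and `OMP_THREAD_LIMIT=1` keeps each OCR process to a single thread so the jobs don't compete with each other.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def tesseract_command(image: Path, out_dir: Path) -> list[str]:
    # `tesseract <input> <output base> pdf` writes <output base>.pdf.
    # `nice -n 19` lowers the priority so the machine stays responsive.
    return ["nice", "-n", "19", "tesseract",
            str(image), str(out_dir / image.stem), "pdf"]


def run_one(cmd: list[str]) -> int:
    # Limit each OCR process to one OpenMP thread; parallelism comes
    # from running many processes side by side instead.
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    return subprocess.run(cmd, env=env, capture_output=True, text=True).returncode


def ocr_all(images, out_dir: Path, build_command=tesseract_command, workers=None):
    """Run one OCR process per image, at most `workers` at a time.

    Returns a dict mapping each input path to its process exit code.
    """
    images = list(images)
    out_dir.mkdir(parents=True, exist_ok=True)
    workers = workers or os.cpu_count() or 1
    # Threads are enough here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(lambda img: run_one(build_command(img, out_dir)), images))
    return dict(zip(images, codes))
```

Something like `ocr_all(sorted(Path("scans").glob("*.png")), Path("ocr_out"))` would then process a folder of page images (the folder names are placeholders). Any non-zero exit code in the result means that page needs a second look.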

jstarek · September 17, 2021, 11:45am

I thought so… unfortunately, the library doesn’t seem to see any potential for parallelization on my workload. But thanks for the info nonetheless!

aedwards · September 17, 2021, 12:11pm

We are aware of Tesseract and we use it for OCR in DEVONthink to Go. The ABBYY OCR currently provides more accurate results; however, we are continually assessing the available options to improve the OCR.

jstarek · October 1, 2021, 9:21am

Just as a small addition: I’m now past 20 hours of converting and almost halfway through the PDF pile I need to OCR. I’ve never seen the helper process use more than a single core. So, when you’re talking to your contacts at ABBYY next time, it would be great if you could bring this up – it’s so boring to have to wait for all those conversions while there’s so much more CPU power available.

pete31 · October 1, 2021, 9:36am

In case you own more than one Mac, I guess you could split the task and sync afterwards.

aedwards · October 1, 2021, 12:54pm

In the CPU usage view in Activity Monitor, I can see activity spread across multiple cores, although one core shows higher activity than the others. If you are running on an Apple Silicon Mac, ABBYY are working on adding native support, which may bring speed improvements.