DT3 OCR Multithreading

alanshutko · July 31, 2019, 7:21pm

I seem to recall discussions in the distant past that part of the DT3 upgrade and the new ABBYY framework would allow us to do multithreaded OCR. Is that still the case? My tests seem to be doing one thread only.

cgrunenberg · August 1, 2019, 7:31am

Multiple documents are not yet processed concurrently but multiple pages of a document are.

alanshutko · August 1, 2019, 2:37pm

OK, I’ll keep an eye out. I’ve tried OCR on a 98-page PDF and it was running on one core the entire time.

alanshutko · November 7, 2019, 2:14pm

Based on my recent conversations with support, the ABBYY SDK is licensed to “use” up to four cores but multithreading is only supported on Windows. I interpret that as meaning that the ABBYY SDK is single-threaded on macOS and does nothing concurrently. Specifically, it does not process multiple pages of a document concurrently.

This matches the testing that I have done on my machine, where DTOCRHelper seems to be using a single thread and at most 100% of one CPU. That load gets spread across the cores, as it does for any single-threaded app, and processing takes the same amount of time as it did in DTPO 2.
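For anyone who wants to repeat this kind of measurement, here is a minimal sketch in Python. It assumes you already know the PID of DTOCRHelper (for example from Activity Monitor) and that `ps -o %cpu= -p PID` is available, as it is on macOS and Linux. The function names and the 10% tolerance are my own choices for illustration, and the `ps` figure is an average, so treat the result as a rough indication only.

```python
import subprocess
import time


def parse_cpu_percent(ps_output):
    """Parse the output of `ps -o %cpu= -p PID` into a float."""
    # Some locales print a decimal comma instead of a point.
    return float(ps_output.strip().replace(",", "."))


def sample_cpu(pid, samples=5, interval=1.0):
    """Read the CPU percentage of one process several times via ps."""
    readings = []
    for _ in range(samples):
        out = subprocess.run(
            ["ps", "-o", "%cpu=", "-p", str(pid)],
            capture_output=True, text=True, check=True,
        ).stdout
        readings.append(parse_cpu_percent(out))
        time.sleep(interval)
    return readings


def looks_single_threaded(readings, tolerance=10.0):
    """True if no reading exceeds roughly one full core (100%)."""
    return max(readings) <= 100.0 + tolerance
```

A process that really spreads work over several cores should report well above 100% while it is busy; readings that hover near 100% during a long OCR job are consistent with a single busy thread.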

If I am wrong, and ABBYY should be processing things concurrently and taking advantage of multiple cores, please let me know. As far as I can get with support, things seem to be behaving as intended.

aedwards · November 7, 2019, 2:28pm

That is incorrect. The multithreaded option I referred to, which is available on Windows, is, I believe, for processing multiple documents concurrently. The ABBYY SDK on macOS processes a single document on multiple cores using multiple threads.

alanshutko · November 7, 2019, 9:44pm

Oh, that makes a lot more sense.

I’m still not convinced that there’s any threading there, but I will remain hopeful.

cornchip · January 15, 2022, 3:47am

Following up on this: can something be done, either by the user or by DEVON, to increase the parallel performance of the ABBYY SDK in DEVONthink? I am seeing low CPU utilization (M1 Max). This was captured during OCR of about 150 documents totaling perhaps 1500 pages, and this particular capture was taken during OCR of a long PDF, when multiple cores theoretically should have been utilized based on what @cgrunenberg and @aedwards said.

I understand that ABBYY is using Rosetta, and am not concerned about the lower performance per core–I just want to use a lot more of the cores to speed through long OCR queues. If DT has been holding back on OCR speed to preserve battery life and prevent fan noise on Intel Macs, hold back no longer!

If the best option is a secret Terminal command to speed up ABBYY, or a plist adjustment, I’m happy to do it.

BLUEFROG · January 15, 2022, 7:30pm

No, this is not possible to do.
There is a limit to the number of cores we can access and the engine itself distributes work to the cores as it sees fit. You cannot modify this behavior on your own.

cornchip · January 15, 2022, 7:31pm

I see. Thanks, Jim. If there is ever an opportunity to buy a more expensive edition of DT that allows you all to pay ABBYY for a more permissive license, I’d be interested.

BLUEFROG · January 15, 2022, 7:32pm

You’re welcome and the suggestion is noted.

Blanc · January 15, 2022, 7:45pm

+1 on that

Jayboux · January 15, 2022, 8:07pm

Hello!

I am rather new in my DTP3 journey, but this may help you in speeding up your OCR queue.

In my previous workflow I utilized OCRmyPDF.

I simply had a folder that ran ocrmypdf on any PDF dragged into it. It also gives you some more control over OCR settings.

And to address your main point:

OCRmyPDF will use all available CPU cores.

Once it chugs through your documents, they are good to drag into DTP3.
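A rough sketch of that kind of folder workflow in Python, assuming the `ocrmypdf` command is installed and on your PATH. The folder layout and function names here are hypothetical, not my exact setup. The `--jobs` option sets how many CPU cores OCRmyPDF may use; it is passed explicitly here only to make the setting visible.

```python
import os
import subprocess
from pathlib import Path


def build_ocr_command(src, dst, jobs=None):
    """Return the ocrmypdf command line for one PDF."""
    if jobs is None:
        # Default to every core the machine reports.
        jobs = os.cpu_count() or 1
    return ["ocrmypdf", "--jobs", str(jobs), str(src), str(dst)]


def ocr_folder(inbox, outbox, jobs=None):
    """Run ocrmypdf on every PDF in inbox and write results to outbox."""
    inbox, outbox = Path(inbox), Path(outbox)
    outbox.mkdir(parents=True, exist_ok=True)
    done = []
    for src in sorted(inbox.glob("*.pdf")):
        dst = outbox / src.name
        # check=True stops at the first file ocrmypdf fails on.
        subprocess.run(build_ocr_command(src, dst, jobs), check=True)
        done.append(dst)
    return done
```

Calling `ocr_folder("inbox", "ocr-done")` would process each PDF in turn, and the files in the output folder can then be imported.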

Hope that helps

alanshutko · April 13, 2024, 7:22pm

Once again, I noticed that OCR was single-threaded, now on an M3 Max, and in researching it I came upon this thread again. What’s the current status of this? Does anyone see OCR using multiple cores, even just two?