OCR and PDF saving performance

I would like the OCR engine to be multithreaded to take advantage of the new 4-core and 8-core Mac Pro models, and I would also like to see PDF files saved after OCR using the JBIG2 compression algorithm (which makes files considerably smaller than JPEG encoding without losing quality, especially for B&W documents). Right now Acrobat 8 or even Acrobat 7 would do a better job…

Regarding multithreading, it doesn’t have to be a complex algorithm. Each processor can work on one page at a time.
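
Just to illustrate the shape of the idea, here is a minimal sketch in Python, using pytesseract as a stand-in engine (the real IRIS/ABBYY engine is a licensed black box, so this is only the concept, not their code):

```python
# Minimal sketch of page-level parallelism; pytesseract stands in for the real engine.
from concurrent.futures import ProcessPoolExecutor
from PIL import Image
import pytesseract

def ocr_page(image_path):
    """OCR a single page image; each call is independent, so pages can run in parallel."""
    return pytesseract.image_to_string(Image.open(image_path))

def ocr_document(page_image_paths, workers=4):
    # One worker process per core, each working on one page at a time.
    # map() preserves input order, so the text comes back page by page.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_page, page_image_paths))
```

The only inherently sequential step would be stitching the recognized pages back into one PDF at the end.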

Thanks,

Ryuji

No argument here. :)

DEVONtechnologies is licensing the IRIS OCR engine.

Like Bill said, the engine is licensed and a black box for us. That said, being multi-processor aware may not be a panacea. I don’t know their algorithms, but looking at the memory usage for some documents, the problem may not be purely CPU-bound. In that case, throwing a dozen cores at it will not make a difference if your machine is swapping itself to death.

As to the JPEG compression: I did some tests with it and found the display of PDFs to be excruciatingly slow on Tiger (we’re talking seconds to display a page). Just scrolling through a multi-page PDF was pure agony, both in DTPO and in Preview. I will keep an eye on it, but for now this is the best compromise I could find.

I’ve also monitored memory usage during OCR, and it’s not a problem. I can run Lightroom, Photoshop, and DTPO with OCR simultaneously. I also routinely run two OCRs, one in DTPO and one in Acrobat 8, since both are so slow and neither uses more than one processor. OCR may use a lot of memory, but I see no evidence that two OCRs on two cores on my wimpy PowerMac and MacBook are limited by memory, paging, or I/O. CPU usage is very high.

If I scan many documents in a row, and if my machine has N cores, it would be very useful to set a maximum number M (where M < N) in the preferences panel, so that multiple documents, up to that maximum, are OCR’ed simultaneously. The same could be done when I select multiple files and convert them to searchable PDFs.
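
In pseudocode terms, that preference would amount to something like this (a hypothetical sketch, with ocr_document standing in for the real single-threaded engine):

```python
# Hypothetical sketch: OCR up to M documents at once, where M < N cores.
import os
from concurrent.futures import ProcessPoolExecutor

def ocr_document(pdf_path):
    """Stand-in for running the single-threaded OCR engine on one whole document."""
    ...

def batch_convert(pdf_paths, max_simultaneous=None):
    n_cores = os.cpu_count() or 1
    m = max_simultaneous or max(1, n_cores - 1)   # the M < N set in the preferences panel
    with ProcessPoolExecutor(max_workers=m) as pool:
        list(pool.map(ocr_document, pdf_paths))
```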

I also compared the display speed of JBIG2- and JPEG-encoded images in PDF files on Tiger and Leopard. The PDF files of both types were generated with Acrobat 7 and 8. In Tiger, the JBIG2-based PDF was only slightly slower to show up at first (despite the much smaller file size), but it wasn’t that slow, and it was still usable for large documents (50+ pages). In Leopard, both formats render very fast and the difference is unnoticeable.

I have 100+ GB of scanned documents on my hard drive, and with JBIG2 that could be cut to roughly a tenth (while keeping the documents at a higher DPI resolution). Although disk space is cheap for desktop machines, this makes a huge difference on a MacBook, and I carry my DT databases on both machines. My libraries keep expanding, and soon they won’t fit on the 320 GB drive in my MacBook.

I’ve been scanning to file and then batch OCRing with Acrobat 8, since it produces OCR of comparable quality and much smaller PDF files (even at a higher DPI resolution).

If DEVONthink can’t incorporate the JBIG2 compression algorithm quickly, could it at least implement a function similar to Acrobat’s “make searchable PDF (exact)”? That is, the original image stays untouched, without downsampling or JPEG re-encoding, and the file size grows only by a small fraction due to the invisible text elements.
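
To show what I mean by “exact”, here is a rough sketch using PyMuPDF (purely an illustration of the concept, not how Acrobat or DEVONthink actually implement it; the word positions are assumed to come from the OCR engine):

```python
# Sketch: overlay an invisible text layer on the untouched scanned pages.
import fitz  # PyMuPDF

def add_text_layer(pdf_path, ocr_words, out_path):
    """ocr_words: iterable of (page_index, x, y, word) tuples in page coordinates."""
    doc = fitz.open(pdf_path)
    for page_index, x, y, word in ocr_words:
        # render_mode=3 writes the text invisibly, so the page image is never re-encoded.
        doc[page_index].insert_text((x, y), word, render_mode=3)
    doc.save(out_path)
```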

Thanks!

Hi, resurrecting this post, as I have a related question on OCR and PDF performance.

I have hundreds of scanned files that need to be OCR’d and imported into DEVONthink. However, when I trialled this on a few files (multi-page, admittedly, sometimes 50 pages long), it took hours and hours! I’ve tried altering the settings, but to no avail. I’ve tried on my Snow Leopard 2006 Core Duo MacBook as well as on a MacBook Air, and both are very slow (the latter faster, as one would expect). I’ve since added an SSD to the Core Duo MacBook, so maybe that will help (yet to test).

Is this just the way it is with DT, or have I missed a critical setting? If I were to buy a new Mac, which specification would give me the most bang for my buck specifically for this import and OCR process? (Yes, I am that desperate!)

Many thanks,

Rich.

@richyo, you didn’t explain the content of these PDFs – if they were scanned from printed documents, images, book scans, etc., then OCR will be very slow depending on the quality of the image, its skew, and so on. Memory, etc., will help. You’d probably be better off launching the OCR on one of your machines dedicated to the project, with as much free memory as possible, and not running any other app. Even DEVONthink has one of the largest (and growing) memory footprints of any app I run, so a dedicated OCR engine, such as Acrobat (which can also do batch mode), might be the way to go. But the quality of the original determines everything.

From memory, my experience with a 2011 Mac Mini is that OCRing a 50-page document takes between eight and fifteen minutes: what seems to take the most time is the collation of the pages, and this tends to suck up CPU effort rather than RAM.

This effort does tend to discourage simultaneous use of the computer for other CPU-intensive purposes. However, what one can do is create a folder outside DEVONthink, attach the folder action “Devonthink - Import, OCR and Delete” to that folder, then dump one’s files in that folder and leave DEVONthink to get on with it in any downtime.

P.S. Make sure that in DEVONthink, Preferences > Import > Destination > Select Group is not selected.

Hi, thanks for both replies and suggestions. All of the files are scanned paper docs passed through a ScanSnap, and the quality is pretty good, I think. I may give Acrobat a try to see if it’s any faster, and I intend to batch process overnight, so I’ll give that a shot. I’m maxed out on memory already! Looks like I may need a CPU-boosted Mac model next. Excuse for a new iMac, perhaps?! What processor configuration would actually benefit most: more cores, etc.?

Thanks again

At this time, the OCR engine in DTPO does not use multiple cores. It’s been submitted as a feature request, and I know Bill has mentioned they keep having discussions with ABBYY about licensing the engine for multiple cores, but it hasn’t happened yet.

So right now, the best you can do is get a high clock speed or use something with Turbo Boost, which speeds things up for single-processor workflows.