Script for sending to Adobe Acrobat for OCR

Cameron · January 18, 2010, 6:43pm

I’ve found that OCRing a document through Adobe Acrobat using “Searchable Image (Exact)” is far superior in quality to Devonthink Pro Office’s built-in OCR feature, and the output is virtually the same size as the original document. It really is that much better in quality.

Does anyone have a script that will take a document in your Devonthink database, send it to Acrobat for OCRing and then re-import into Devonthink, essentially bypassing the built-in OCR mechanism?

Thanks for any help! Just trying to retain the best quality for my documents in Devonthink!

Cameron

cgrunenberg · January 19, 2010, 1:41pm

Is “Exact” or “Fast” in the OCR preferences pane selected? And what quality/resolution settings are you using?

bcarpenter · January 21, 2010, 11:34am

Cameron, I agree with you. Unfortunately, DevonThink don’t. They have repeatedly said in other forum threads that OCR to them is a “black hole”, i.e., they pass the PDF to Abbyy & take whatever is spit out.

I don’t have a way to do what you want exactly, but I have a workflow that uses my ScanSnap to scan into a temporary folder. I then have a batch process in Acrobat 8 Professional that processes these by doing an OCR (Exact) then downsampling to a lower resolution at higher compression. Most importantly, bitmapped (1-bit black & white) images stay bitmapped & are usually less than 50 KB/page at good quality, whereas DTPO changes these to greyscale, decreasing the image quality & increasing the file size. (Note that this is an Abbyy problem, not DTPO, but Devonthink don’t want to know about it.)

So, the batch process spits out files into a separate folder & I drag these processed files into DTPO.

I have tried to AppleScript the whole process, but Acrobat 8 Professional is not fully scriptable. I have read various reports that this is intentional, to prevent taking work away from Adobe’s more expensive products, which are specifically geared for fully scripted activities.
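
The manual hand-off in the workflow above (Acrobat's batch process writes into an output folder, then the files are dragged into DTPO) could in principle be automated outside Acrobat. Below is a minimal Python sketch of only that hand-off step, not of the OCR itself. It assumes the batch output lands in one folder and that DTPO picks files up from some other folder; both paths are hypothetical placeholders, and `collect_finished_pdfs` is an illustrative name, not part of any Acrobat or DEVONthink API. Files modified very recently are skipped on the assumption that Acrobat may still be writing them.

```python
import shutil
import time
from pathlib import Path


def collect_finished_pdfs(output_dir, import_dir, min_age_seconds=30.0, now=None):
    """Move PDFs that look finished from output_dir into import_dir.

    A file counts as finished when it has not been modified for at least
    min_age_seconds. Returns the list of destination paths.
    """
    output_dir = Path(output_dir)
    import_dir = Path(import_dir)
    import_dir.mkdir(parents=True, exist_ok=True)
    now = time.time() if now is None else now

    moved = []
    for pdf in sorted(output_dir.glob("*.pdf")):
        if now - pdf.stat().st_mtime < min_age_seconds:
            continue  # possibly still being written by the batch process

        # Avoid overwriting an earlier file with the same name.
        target = import_dir / pdf.name
        counter = 1
        while target.exists():
            target = import_dir / f"{pdf.stem}-{counter}{pdf.suffix}"
            counter += 1

        shutil.move(str(pdf), str(target))
        moved.append(target)
    return moved


if __name__ == "__main__":
    # Hypothetical folder names; adjust to your own setup.
    done = collect_finished_pdfs(
        Path.home() / "Scans" / "acrobat-output",
        Path.home() / "Scans" / "to-devonthink",
    )
    for path in done:
        print("moved", path)
```

Run on a schedule, something like this would replace only the drag-and-drop step; the Acrobat batch sequence still has to be started by hand.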

Cameron · January 21, 2010, 5:06pm

After some testing with DT, I typically would use 200dpi at 75% quality. I found it to be the best compromise between file size and quality.

viewtopic.php?t=8897

But, after I experimented with Acrobat, I was surprised that the OCR’d document from it was indistinguishable from the original even when zoomed in to an excessive level, and the file size was almost identical.

cgrunenberg, I am unsure about the Exact or Fast setting. In Acrobat, I have “Searchable Image (Exact)” selected, but do not see a Fast setting so I may be in the wrong place to answer your question.

Bill_DeVille · January 21, 2010, 11:21pm

Most of my scans are done using a ScanSnap S500M.

I’ve experimented with a variety of copy, including color content and a variety of fonts and font sizes. I’ve run the same original PDFs through OCR with both Acrobat and DTPO’s ABBYY OCR.

Overall, I find the text recognition accuracy of ABBYY OCR better than that of Acrobat OCR, sometimes significantly better. I’ve also tried ReadIRIS Pro 12, and find that ABBYY produces consistently more accurate results.

It is true that Acrobat uses its own proprietary code to produce the image layer of searchable PDFs, whereas ABBYY uses the code built into OS X. Because OS X rasterization isn’t as file-size efficient as Adobe’s, the resulting file size is larger for searchable PDFs produced by ABBYY. (Perhaps one of these days Apple will optimize image layer production.)

DTPO Preferences > OCR provides user-modifiable options to control the final resolution and image quality of ABBYY-OCR’d searchable PDFs. The default setting is a compromise between view/print quality and file size, 150 dpi and 50% image quality. I usually set these at 150 to 200 dpi and 75% image quality and find the results quite readable and with acceptable print quality.
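
To get a feel for how resolution and bit depth drive the amount of image data, here is a small Python sketch that computes the raw, uncompressed size of one page image. It assumes a US Letter page (8.5 x 11 inches) and rows padded to whole bytes; `raw_page_bytes` is just an illustrative helper. Actual PDF sizes will be far smaller and depend on the compression used, so treat the numbers only as a relative comparison between 1-bit black & white and 8-bit greyscale at the resolutions mentioned in this thread.

```python
def raw_page_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Uncompressed size in bytes of one page image (rows padded to whole bytes)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px


if __name__ == "__main__":
    for dpi in (150, 200, 300):
        bw = raw_page_bytes(dpi, 1)    # 1-bit black & white
        grey = raw_page_bytes(dpi, 8)  # 8-bit greyscale
        print(f"{dpi} dpi: B&W {bw:,} bytes, greyscale {grey:,} bytes")
```

Before compression, a greyscale page holds roughly eight times the data of a 1-bit page at the same resolution, and going from 150 to 200 dpi increases the pixel count by about 78%.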

For copy that’s primarily text, I usually use Black & White scanner settings. The Automatic Color detection setting of the ScanSnap treats pages without color content as B & W, and shifts to a lower-resolution mode including color if a page has color content. The Best setting for the ScanSnap provides 300 dpi resolution for color, with high contrast in the text areas.

I also have a CanoScan LIDE 500F flatbed scanner, which I use to copy bound material (books and journals). I prefer the ExactScan Capture mode to Image Capture for this scanner, and get high-contrast, easily readable text with the settings I use. When I scan a book page (with white background), if in the scan preview the background is not white but gray, I’ll adjust scanner settings to produce a white background. That’s quite easily done in ExactScan Capture, but for Image Capture it’s best to calibrate the scanner, or switch to the Text mode.

bcarpenter · January 22, 2010, 12:12am

Yes, I should have mentioned what Bill just said about OCR quality as well. Even though I use Acrobat rather than DTPO/Abbyy, the OCR recognition itself is not as good in Acrobat. In particular, it doesn’t do white text on dark background well at all.