New OCR engine in DTPO 2

According to the announcement, DEVONthink Pro Office 2 will have a new and different OCR engine. I must confess I am a little uneasy about this news, as I have just bought DTPO precisely for its OCR capabilities.

So far, DTPO has used the ReadIRIS OCR engine. I suppose DTPO uses the latest version, ReadIRIS Pro 11, but I am not sure this is correct. Results so far have been good: character recognition is precise, provided the PDF is well scanned. However, I have noticed that the DTPO OCR engine is slower than the OmniPage engine found in Microsoft Office OneNote 2007 (for Windows).

However, it seems that DTPO will be using the ABBYY OCR engine from now on. On ABBYY's website, I could see that FineReader Engine 8.0 is available for Mac OS. The current version of ABBYY FineReader for Windows is 9.0, so there could be some differences between the two.

VersionTracker announces the ABBYY OCR engine as faster and more accurate than the IRIS engine.

I wonder how these two engines really compare. Which is faster? Which is more accurate? And what are the advantages and disadvantages of switching engines?

I would also be interested, just out of curiosity, in hearing what advantages DT's developers/gurus see in moving to the ABBYY engine (faster OCR? better memory use? better background processing of multiple documents?). I'm looking forward to what I fully trust will be improvements in scanning as we move forward with new beta versions, and on into the next production release. For me, version 2 has been a wonderful improvement on an already great application, except for scanning.

After doing a lot of analysis on the merits of keeping physical books (storage costs, use of space, dust, costs of moving, lack of search capability vs. ease of reading, portability, etc.), I decided to start scanning all my books. I prefer reading on screen (I can highlight in Skim without changing the original, can search, etc.).

It's straightforward: have an office supply store (like Kinko's) cut off the binding, run the pages through the ScanSnap, save as PDF, then OCR. (For what it's worth, I tried using the hugely expensive brand-new color copier and scanner in my department; the ScanSnap actually works much better: no jams, it just works.) A 300-page book takes maybe 15 minutes to scan, and the resulting file is between 30 and 60 MB. But scanning a book in DT beta 1 caused it to fail. So I tried scanning and saving to PDF as a file, then importing and using DT for OCR, but the OCR step also caused beta 1 to fail. I was advised that the OCR engine in beta 1 was "re-rasterizing" the image, so the files would get large. Well, OK: but how large? I have a MacBook Pro with 4 GB of RAM and 200 GB of free hard drive space, so my working assumption is that my machine has adequate resources…

Of course, it’s beta software, so things aren’t perfect yet.

BUT I would love it if DT worked at least as well as Adobe Acrobat Professional. At the moment that's how I'm doing OCR: I ask Acrobat to perform OCR in batch mode, dump the scanned files in a folder, set Acrobat loose on the folder, and in the morning I have a dozen books recognized with searchable text. But that workflow is cumbersome, Acrobat crashes not infrequently, and mainly I would just prefer to use DT.
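The overnight batch workflow above can be sketched roughly like this (a minimal illustration only, not Acrobat's or DT's actual API; `run_ocr` is a hypothetical stand-in for whatever OCR command you actually use):

```python
import subprocess
from pathlib import Path

def run_ocr(pdf: Path) -> None:
    """Hypothetical OCR step; swap in your real tool (Acrobat batch
    sequence, a command-line OCR utility, etc.). Here it is just a
    placeholder shell call so the sketch runs."""
    subprocess.run(["echo", f"OCR {pdf.name}"], check=True)

def batch_ocr(inbox: Path) -> list[Path]:
    """Run OCR over every PDF dropped into the inbox folder and
    return the list of files processed."""
    done = []
    for pdf in sorted(inbox.glob("*.pdf")):
        run_ocr(pdf)
        done.append(pdf)
    return done
```

The point is just the shape of the workflow: dump scans into one folder, let the loop grind through them unattended, collect searchable PDFs in the morning.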

So here's hoping that scanning and OCR in the final production version of 2.0 will be an improvement on the other offerings on the market, and that this aspect of DT2 will work as well as everything else!

The ABBYY OCR engine produces equal or better accuracy and smaller searchable PDFs, and it recognizes more languages, including Hebrew.

Back in Classic days, ABBYY FineReader was head and shoulders better than other Mac OCR applications (I bought all of them). I used it heavily.

ABBYY was slow to rewrite their software for OS X, and when they did, for a long time it ran only on Intel Macs. Not until about six months ago did the engine available for use by other applications become compatible with PPC Macs. The DEVONtechnologies developers did not want to limit DTPO2 to Intel Macs, as some users still have PPC computers.

But is it true that DT 2 is still using version 8 of ABBYY Fine Reader? The most recent version is 9.

DTPO2 uses the current version of ABBYY's OCR engine as supplied to developers; the coding of the interface and controls is done by DEVONtechnologies. It will run on Intel and PPC Macs.

According to the information I could retrieve from the ABBYY website, the ABBYY FineReader Engine is "an SDK for integrating optical character recognition (OCR), barcode recognition and PDF conversion technologies into applications". According to the description on the website, this should be the engine used in the forthcoming version of DTPO.

The website also states that the latest version of the FineReader Engine available for Mac OS is 8.0. So, unless I am missing something, this should be the version of ABBYY's OCR engine supplied to developers. Version 9.0, however, is available for Windows only, and it appears to have some significant improvements (such as "re-creation of document logical structure and formatting attributes including headers, footers, page numbers, fonts and styles and more"). DTPO will use the current version of the ABBYY engine, but I think that would mean the latest version available for Mac OS (8.0), not the latest version overall (9.0, which is Windows-only). Could someone clarify these points? Does anybody know if ABBYY is going to update its Mac OS engine for DTPO?

Well, answering my own question, this page on the ABBYY website lists the features of the FineReader Engine for Mac OS:

In comparison, these are the features offered by the Windows version of the ABBYY engine (9.0):

And these are the features offered by the previous DTPO OCR engine (ReadIRIS): … tures.aspx

If you want OCR enabled again, it's easy: just manually set your Mac OS clock to a date before January 31. If you do, DEVONthink Pro Office beta 1 will run again! Yes, an easy way to turn back time…

But… did DEVONtechnologies make the jump from IRIS to ABBYY simply because they think ABBYY is better, or perhaps because there were problems of some kind in the relationship with IRIS? In other words, was it a free choice or (given that there are not many alternatives on the market) a forced one?

The change was chosen, not forced. But the option was not available prior to the release of the current ABBYY OCR SDK about six months ago.

OCR had to be pulled from DTPO pb2 because it wasn't performing consistently on all Macs, and wouldn't run at all on my MacBook.

Progress is coming along. This morning I successfully converted image-only PDFs in my database, imported image-only PDFs with OCR and sent searchable PDFs from my ScanSnap to my databases.

That’s great! I can’t wait to get the next beta with OCR re-enabled. I have a stack of paper waiting with no other OCR available. Thanks for the info.

Me too… this lack of OCR is turning out to be very disruptive for me personally. ScanSnap can do the OCR itself, but then it can't move documents into DT, and for other things that I receive electronically and that are already in PDF, converting them to searchable form is even more disruptive. So I now have a big queue of items that aren't in DT, which makes DT's database incomplete. Glad to see the new engine is working.

Any updates on the status of OCR? An estimated release date perhaps?

Please see my blog :slight_smile:

Just downloaded the new PB3 with the new OCR engine. I'm so glad it's back.

From what I've seen so far, it seems to be taking a LOT longer to process my scans. I scanned about half a dozen single-page documents and receipts, and it churned for a long time to get them done. With the old engine, I emptied a file cabinet, some documents being 40-50 pages, and it seemed to go through it much more quickly. Not sure if this will be an issue, and this is not a scientific analysis. Was there any benchmarking done for old vs. new?
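A rough engine-vs-engine timing comparison could be scripted along these lines (a sketch only; `ocr_with_engine` is a hypothetical wrapper around whichever engine is being timed, not an actual DTPO or ABBYY API):

```python
import time
from pathlib import Path

def ocr_with_engine(engine: str, pdf: Path) -> None:
    """Hypothetical stand-in for running one OCR engine on one file.
    The sleep is a placeholder for the real recognition work."""
    time.sleep(0.01)

def benchmark(engine: str, pdfs: list[Path]) -> float:
    """Return total wall-clock seconds to OCR all files with one engine."""
    start = time.perf_counter()
    for pdf in pdfs:
        ocr_with_engine(engine, pdf)
    return time.perf_counter() - start
```

Running the same fixed set of test scans through both engines and comparing the two totals would turn the "it feels slower" impression into a number.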


My first try with the new OCR engine was, well, uh … truly horrible. I took an extremely clean scan of a 10-page book chapter - single column, no unusual formatting of any kind - and imported it. Two problems:

  1. It took at least 10 minutes to process the 10 pages. (I didn't time it, but it was bizarrely long.)

  2. Many missing words. A little garble here & there I’m used to with OCR, and it’s usually relatively easy to correct. But simple one-syllable words dropping out?

Hopefully these issues will get fixed. I don't own DT at the moment, but I really liked what I'd seen in the previous betas of 2.0 and was looking forward to buying it once it got more stable. If the OCR stays at this level it would be a deal-killer, but I can't imagine it will; there will be far too many complaints.

That doesn’t sound good. I didn’t check the accuracy of the OCR… but it was very slow. The original OCR was very fast and very accurate.

… off to check what I just OCRd.

Note that ABBYY is configured in DTPO2 to verify word recognition against a dictionary rather than report garbage. ABBYY will drop unrecognized words, leaving a double space as a marker.

This morning I edited the RTF conversion of a PDF OCRed using ABBYY. I used Find to locate all instances of double spaces and replaced them with a FLAG marker. That made it easy to edit, using the PDF as the master. Comparing the OCR accuracy of this RTF to the RTF produced from the same document OCRed by IRIS in DTPO 1.5.4, the ABBYY result contained fewer errors.
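That find-and-replace pass is easy to script; for example (a sketch of the same idea, with `[FLAG]` standing in for whatever marker you prefer):

```python
import re

def flag_dropped_words(text: str, marker: str = "[FLAG]") -> str:
    """Replace every run of two or more spaces (the gap ABBYY leaves
    where it dropped an unrecognized word) with a visible marker."""
    return re.sub(r" {2,}", f" {marker} ", text)
```

Each marker then points to a spot that needs to be checked against the PDF master.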

As a writer I'd much rather have the garbage: it is more quickly recognizable, and therefore fixable by a human, than a double space, which forces me back to the original source in every instance.

I also wonder about this “dictionary” that words are validated against: what does the software do when it comes up against a word that is not in its dictionary, e.g. a technical term, a neologism, an archaic term, an alternate spelling, etc.?

In short, if there is a way to change this configuration to keep the garbage, and to throw away the dictionary approach altogether, that would be my preference by light-years.

I totally agree with Usable_Thought. Most of my scanned material in DTPO is technical articles which have many words not likely to be in the OCR dictionary. Those unknown words are also the words I’m probably going to be most likely to search for in the future. So it just doesn’t work to omit any words.

I also find that “garbage” words are still making it through, which seems to contradict Bill’s explanation. For example, the OCR output text is giving me: recenr (for recent), rcsolurion (for resolution) and imporrant (for important), none of which should be in a dictionary.