DT Office's OCR: solid, reliable and intelligent, or … … ?

Today I scanned a whole book with the latest version of Readiris (11.5.6), which I bought a very short time before DT Office was launched (if only I had known …), and which should be more or less identical to the version contained in DT. It was not a pleasure; on the contrary, at times it was a true ordeal. The main problem was (and is) that during the recognition of scanned pages (“interactive learning”), Readiris frequently freezes, or even crashes completely. Throwing away Readiris’ preferences brings some relief, but only for a short period of time.
Moreover, I’m not impressed by Readiris’ learning capabilities, which are praised to the skies in the user manual. The book, which was scanned with an Epson Perfection 3200, is very neatly printed; so that can’t be the problem. But even after hours of correcting misreadings and saving the corrections in the correction dictionary, Readiris continues to make the same elementary mistakes: it continues to read “96” as “%”, “che” as “ohe”, “guerra” as “guena”, “alle” as “aUe”, etc. etc., no matter how many times these mistakes are corrected and saved in the correction dictionary.
Judging from what I’ve seen today, I must conclude that Readiris is neither stable nor precise nor intelligent. Does this sound familiar to those who have worked with DT Office, or do they have a completely different experience?

Like any software that tries to do things “intelligently”, your mileage may vary. The next time you scan a whole book with the intention of running OCR on it, I would play with the brightness and contrast settings to help optimise the results. According to IRIS, 300 dpi and colour are optimal. After a few pages you can see which settings work best, and then do the whole thing.

As to the results of using IRIS in DTPO: at least you don’t have to agonise about the learning not working, because we don’t support it. It is a fire-and-forget experience. We also reset the engine between each job. Bill has written numerous posts about his experience with “typos” and according to him it is not something that bothers him in actual use of the resulting materials. Our AI is an advantage here since it doesn’t just index on words like Spotlight does.

Also, if IRIS crashes, send them the crash reports with the document that caused the problem. They are very responsive when we contact them, so I believe they do want to improve their engine over time.

So, after all this, I recommend our solution for people with straightforward scanning and OCR needs. If you need customisation to optimise your OCR results, you’re probably better off buying Pro and ReadIRIS separately.

My 2 cents about ReadIRIS: The company is horrible, the web site is horrible, the software is horrible. Keep your hands off it, and try to get your money back if you have already bought it.

I bought the Western and the Asian package more than 2 years ago. The results have always been fit only for the waste-paper basket. I wanted to try version 11 because Bill said it was much better, but the download turned out to be impossible. Though I get spam-like mail from them on every occasion, there has been no reply to my question by email. Of course, I never got replies to earlier questions about other problems either. Well, I did get a reply once, to a certain question, when they admitted that there was indeed a problem and that they would come out with a solution. Of course, there was no solution, and no answer to my subsequent mails.

I decided not to upgrade, under any circumstances. Particularly since I realised with DTPO that their OCR cannot be trusted anyway (my results in DTPO are of such quality that I prefer not to rely on them and just ignore them). Now that I hear from Timotheus, who has had the same experience (no learning effect, etc.), I feel relieved. Thanks for your report!

Maria, still upset

Thanks Annard, thanks Maria. I must confess that my experiences with version 9 of Readiris were negative too, yet I upgraded, hoping the shortcomings of version 9 would have been eliminated in version 11, because I really need a reliable OCR program.

But I’ll present my problems to the company; we’ll see what their reaction is.

Great,
if you manage to make real contact, please tell them that other users are still waiting for answers from 2004, 2005 and 2006. Maybe they’ll check their email.

Maria

Hi, Maria and Timotheus.

As I’ve used every OCR application for Mac since the first – Xerox, I think – I probably have more patience with the current ones than others may. Those early OCR apps were really horrible!

I would rate the OCR engine in IRIS 11 better than anything else currently available for Mac OS X. It’s faster and more accurate than OmniPage or Acrobat 7, although I haven’t tested Acrobat 8.

I don’t like the interface of the ReadIRIS 11 application, and was frustrated the only time I tried to do editing or training after OCR. But I found the OCR was quite accurate, and faster than the older ReadIRIS 9 application.

I don’t expect completely error-free OCR. My expectation is that the OCR should be accurate enough that the PDFs I store in my database can be searched and found, and that criterion is met by the engine in DTPO. I really don’t expect complete accuracy for fine print, which often includes footnotes. I usually retrieve PDFs and read or print them, which of course presents error-free text in the image layer. Even in cases where OCR errors are frequent, such as with blemished paper originals, I’m happy that I can throw away the original paper, and if necessary I type something in the Comment field so that I can find even scanned handwritten notes.

Only occasionally do I need to copy text from a PDF, e.g. to quote text in another document. If the original is clean and doesn’t contain really small or unusual fonts, I usually don’t need to clean up OCR errors. But sometimes I do. That’s most likely to happen if the original paper had blemishes, or handwritten underlining, highlighting or notes. Having the PDF file available for comparison when I do have to edit captured text is important. I wouldn’t recommend trying to scan/OCR directly to text for that reason.

Overall, I get very few OCR errors from my Fujitsu ScanSnap, but perhaps a few more from my CanoScan LIDE 500F. So the scanner being used may be a variable in OCR accuracy.

Maria, I do almost entirely scanning of English language documents, most of which use standard fonts. I’m wondering if one or both of those variables (language/font) might be important as reasons for your lack of satisfaction.

I’m delighted by reduction of paper. I’m a happy camper even with occasional OCR errors. I’m slowly nibbling away at boxes of papers in my study, and the ability to actually find stuff in my database makes those documents much more useful than they had been while in boxes. :slight_smile:

My biggest gripe is probably the 50-pages per document scanning limit imposed by IRIS’s license restrictions. Once in a while I have to scan/OCR lengthy documents. That requires me to separate the document into stacks containing less than 50 pages, scan each stack, then ‘glue’ them back into a single document. (I’m using Acrobat for that.)
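Working out where to split the stack can at least be done mechanically. A minimal sketch of computing the batch boundaries, assuming a flat 1-based page count and a fixed per-document cap (the function name and the cap handling are my own illustration, not any IRIS or DEVONthink API):

```python
# Split a page count into batches that respect a per-document OCR limit.
# The 50-page limit matches the IRIS license restriction described above;
# everything else here is an assumption for illustration.

BATCH_LIMIT = 50

def batch_ranges(page_count, limit=BATCH_LIMIT):
    """Return 1-based (first, last) page ranges, each at most `limit` pages."""
    return [(start, min(start + limit - 1, page_count))
            for start in range(1, page_count + 1, limit)]

# A 120-page document needs three batches:
print(batch_ranges(120))  # [(1, 50), (51, 100), (101, 120)]
```

After scanning each batch, the resulting PDFs still have to be glued back together, e.g. with Acrobat or the Automator workflow mentioned later in the thread.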

Bill,

thanks for this elaborate answer.

I would be delighted by a reduction of paper as well, and I could live with occasional errors if they occurred in words like “this”, “And”, etc. But when errors like “haml are” and “hamt axc” instead of “hand axe” occur frequently and – as Timotheus experienced as well – the application does not learn, it is annoying. This is exactly what happens with ReadIRIS.

Of course, I did not exclusively use English documents, but also Japanese and German ones; I have paid for these language packages (quite a hefty additional fee for Japanese, by the way), and they should work.

More than that, the company’s policy of ignoring customers’ questions and problems is unacceptable.

So I was interested in what DT would make of it, and was willing to pay again to get the OCR engine with your company’s excellent service. But the way the engine is implemented (no control over what is OCR’d, no control over language, no Asian languages) is unfortunately useless as far as my special case is concerned.

Best,
Maria

Hi, Maria. That’s a bummer about Asian languages, but I suspect IRIS would charge DEVONtechnologies an additional hefty license fee for including more languages.

Just for grins, I copied your post into a new rich text document in my database, deleted “hand axe” and then did a search for “hand axe” using fuzzy search. DT found your post, based on either (or both) “haml are” or “hamt axc”.
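DEVONtechnologies doesn’t publish how its fuzzy search scores matches, but the general idea can be illustrated with plain string similarity from Python’s standard library. A minimal sketch using difflib (the 0.6 cutoff is difflib’s default, not anything from DTPO):

```python
from difflib import SequenceMatcher, get_close_matches

def similarity(a, b):
    """Ratio of matching characters: 0.0 (nothing shared) to 1.0 (identical)."""
    return SequenceMatcher(None, a, b).ratio()

# The OCR misreadings quoted above are still quite close to the real phrase:
print(similarity("hand axe", "haml are"))   # 0.625
print(similarity("hand axe", "hamt axc"))   # 0.625

# So a fuzzy lookup can recover them, at the price of some false positives:
candidates = ["haml are", "hamt axc", "hand axe", "and", "handle"]
print(get_close_matches("hand axe", candidates, n=5, cutoff=0.6))
```

Note the trade-off Maria reports further down: the same tolerance that recovers “haml are” also lets short common words like “and” flood the result list when the cutoff is effectively lower or the query is short.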

Score one for DTPO! :slight_smile:

The government agency I retired from made a real document management mistake. They scanned millions of pages of the agency’s existing files, then threw away the original files.

Just two problems. They didn’t do OCR, but relied on minimum-wage clerical workers hired by a contractor to enter attributes for each PDF: Title, Subject and Keywords. But those workers didn’t know anything about the terminology and seem to have entered Subject and Keywords almost randomly. And they scanned at such low resolution that many of the images are impossible to read.

Result: many of the agency’s historical documents cannot be found by searching, and those that are found are often illegible.

The IRIS engine would have been fantastically better. Even in the worst case, the PDFs are easy to read.

When you feel glum, remember what Voltaire’s Pangloss said: “This is the best of all possible worlds.”

Bill,

I like the way you get along with the shortcomings of this world.

But I am a scientist (if one may call an archaeologist a scientist …), and as long as I cannot be 100% sure about my tools, I won’t use them.

I did the same as you with my example, deleting two of the three spellings of “hand axe” from the duplicated file, and did a fuzzy search. As a result I got too many results (including “and”, quite a frequent word), but the related files with the other spellings were NOT FOUND. This search does not work in my case.

This is not meant against DTPO; it just happens that it is not useful for me personally, but maybe others are happy with this best among lesser solutions.

My rant was against the other company, and I think it is most justified.

Best,
Maria

Bill, you know you can append PDFs to each other via Automator, right? Not that it matters much, since you’ve shelled out the big bucks for Acrobat already, but the rest of us poor serfs can use Automator.

Hi, Maria: I was trained as a scientist too, and published research in biochemistry, physiology and molecular biology many years ago.

One of the first things I learned was that all measurements (and the tools that we use either to observe or conduct measurements) have a degree of uncertainty. So we often have to use statistical tools to evaluate the likelihood that reported measurements fall within an “acceptable” range of error.

And one learns that in the very small world of electrons, for example, measurements that attempt to determine the location of an electron are inherently probabilistic. The location of an electron cannot be described as a specific location, but rather as a ‘cloud’ of probabilities that it may be at various locations, some more probable than others.

The error rate in my OCR’d PDFs is often considerably lower than the error rates of methodologies and instruments used in the physical and biological sciences. Perhaps that’s why I tend to seem forgiving. :slight_smile:

The question of whether that error rate is “acceptable” depends on what one does with the PDFs. I don’t worry about a few errors if I can search a PDF with a high likelihood that it will be found by a search query, and that’s generally true, especially if the document contains a reasonable number of words, which usually increases the likelihood that my query term will be rendered correctly at least once. Once I’ve found what I searched for, I will read or print the image layer, so OCR errors will be invisible.
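That point about document length can be made precise with a little arithmetic: if each occurrence of a term is misrecognized independently with probability p, the chance that at least one of n occurrences is rendered correctly is 1 − pⁿ. A rough sketch (the 5% per-occurrence error rate is an illustrative assumption, not a measured figure for the IRIS engine):

```python
def chance_findable(error_rate, occurrences):
    """Probability that at least one occurrence of a term is OCR'd correctly,
    assuming each occurrence fails independently with `error_rate`."""
    return 1 - error_rate ** occurrences

# Even a sloppy 5% per-occurrence error rate leaves a term that appears
# five times in a document almost certainly findable by a plain search:
print(chance_findable(0.05, 1))  # ≈ 0.95
print(chance_findable(0.05, 5))  # ≈ 0.99999969
```

The independence assumption is generous, of course: as noted above, ReadIRIS tends to repeat the *same* misreading every time, so in practice errors on a given term are correlated and the real odds are worse than this model suggests.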

But if I were capturing text to be quoted in an article, I would proofread that text very carefully. Note that I’ve carefully proofread the text of a couple of OCR’d PDFs that contained more typos in the original than OCR errors. And sometimes, depending on how ‘clean’ the original paper copy is, or the quality of the PDF image (contrast and resolution), or the font used, or the font size, there may be many OCR errors.

I don’t know of any application that would let me correct OCR errors in the text layer of a PDF, without affecting the image layer. I’d love to have something like that.

Dan: Acrobat 7 came free with my ScanSnap (and I’ve had Acrobat Pro versions for years, anyway, because I needed the app). But I’m glad Annard has included the Automator workflow.

Thanks, Bill, Maria, and others. I presented my stability problem to the Readiris company. They answered within a couple of hours, recommending that I uninstall completely, then reinstall, and send them a crash log if this doesn’t solve the problem. Not a surprising answer (anyone could have given similar advice), but at least it’s an answer. We’ll see if it works.