OCR issues

Ian_Hocking · August 24, 2008, 2:50pm

Hi there

Apologies ahead of time if I sound a bit grumpy, but I’ve spent a good few hours today trying to get the following to work. (It’s something I’ve been trying to accomplish, periodically, for the past several months.)

[Devonthink Pro Office 1.5.3 on a MacBook Pro Intel Core Duo running Leopard 10.5.4]

I’ve also checked the forums to see if this has been addressed, but my searches only seem to throw up irrelevant articles on OCR, in between errors on the forum page reading “Sorry but you cannot use search at this time. Please try again in a few minutes.”

OK.

Here’s what I’m trying to do: I’m writing a novel (actually, I’m just about to complete it) set in an historical period. There is an absolutely key text that I’ve scanned using my Epson V500 Perfection scanner. Basically, I want to have this book in my DevonThink database as a searchable PDF.

Try as I might, I just can’t do it.

It’s difficult to be precise in describing the usage scenario that leads to the problems. Basically, they occur when I try to OCR either (i) the multi-page PDF (which is very large at 500MB, but conforms to the OCR recommendation of 600dpi, colour, etc.) or (ii) individual pages scanned as JPEGs and TIFFs.

The errors I get are many and various. When trying to OCR the multi-page PDF, I get a dialogue reading “Opening [file]” and nothing happens, even after several hours. It’s important to be able to have an OCR’d multi-page PDF because the alternative - individual pages - means that I can’t move forward and backwards through the book from the point where DT finds a match.

With regard to the errors I get when OCRing individual pages (my plan B), the OCR engine either stops, maxing the CPU, or gives me an error like “OCR couldn’t do the OCRing - do you want to skip or continue?”, or, rarely, it works fine. Occasionally, DT itself crashes.

It’s the non-reliable errors that are most frustrating. The OCR engine seems to be behaving almost randomly. I’ve varied everything I can think of: DPI, file format, the OCR settings in DT itself. Nothing seems to provide a stable scenario in which I can get this bloody book OCR’d.

Any ideas? I’m pretty sure it’s not my Mac, which is perfectly stable in all other aspects. The ABBYY FineReader OCR software I got with the scanner (which runs in Rosetta) works perfectly and about 4x as fast as the OCR engine in DT - the problem with FR is that it’s a lite version that requires several clicks per image, and won’t do multi-page files.

Any advice appreciated. This is driving me up the wall.

Best
Ian

Novelist in the UK

annard · August 25, 2008, 1:38pm

I don’t know why it doesn’t work because we’re only licensees of the IRIS OCR engine. You could send an email to support@devon-technologies.com with your crash report(s) and a sample document (but not if it is 500MB, because that won’t work with Mail). With your permission, we could also ask IRIS to contact you directly.

If your sample document is 500MB then I would suggest cutting it up into pieces and gluing them back together later. Doing a chapter at a time might be a good way to do this. You could group these chapters together in the database, for instance; it may help with searching. But if you want to glue them together, Preview or Automator allow you to do this in Leopard. It may be that such a huge document exceeds some internal limit of the OCR engine (I’m not aware of one, so this is only a technical “hunch”).
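
As a rough sketch of the cut-it-up idea, assuming fixed-size pieces rather than real chapter boundaries, the page ranges to export could be worked out as below. The function name and the page counts are hypothetical; the actual splitting and rejoining would still be done in Preview or Automator.

```python
def page_ranges(total_pages, pages_per_chunk):
    """Return 1-based, inclusive (first, last) page ranges covering a document."""
    if total_pages < 0 or pages_per_chunk < 1:
        raise ValueError("total_pages must be >= 0 and pages_per_chunk >= 1")
    ranges = []
    first = 1
    while first <= total_pages:
        # The final chunk may be shorter than the others.
        last = min(first + pages_per_chunk - 1, total_pages)
        ranges.append((first, last))
        first = last + 1
    return ranges


# Hypothetical 250-page scan, split into pieces of at most 100 pages.
for first, last in page_ranges(250, 100):
    print(f"pages {first}-{last}")
```

OCRing each range separately also makes it easier to see which pages, if any, trigger the failure.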

We will send the OCR crash reports to IRIS and they will analyse them. I’m interested in the crashes you reported for DTPO because of course our own code shouldn’t crash at all.

Ian_Hocking · August 25, 2008, 3:21pm

Thanks for the prompt reply, Annard.

When I have some time, I’ll try to do some more systematic testing of this and get some cleaner data.

In the meantime, I’ve looked at a trial version of Acrobat Pro for the Mac and this did a perfect job of the OCR for the PDF, which I’ve now imported into DT. If I can stump up the cash (or, more likely, get a licence via my uni) this would be a good workaround.

WritingStudio · August 28, 2008, 7:28pm

I am doing what you’re doing - writing a historical novel (though I’m not as close to finishing my current one as I’d like to be) - and have imported quite a few scanned books.

Normally, my process is what you later discovered: using Adobe Acrobat to compile and OCR the PDF. One big advantage of this process is that Acrobat will reduce the file size significantly. For instance, individually scanned files for an old book of 303 pages totaled nearly 120MB. Once imported into Acrobat, the single PDF of the book is just over 15MB.

Note that unless a specific page is in color, I scan in either B&W (when just text) or greyscale (printed images, usually), at 300 dpi, and scan to TIFF. Unless the book is from the 18th century (and I have a few), the OCR works fine with my setup. (Naturally, OCR does not like the long ‘s’ at all.)

I believe that if one is scanning for archival purposes, then, yes, 600 dpi and color are the best choices. But I am scanning for access to the information, and the above settings work best for me with that objective.
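
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch of the uncompressed size of one scanned page. The 6 x 9 inch page size is an assumption for illustration only, and real TIFF, JPEG or PDF files will be smaller once compressed.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page (ignores row padding)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page size: 6 x 9 inches (hypothetical, not taken from the thread).
settings = [
    ("600 dpi, 24-bit colour", 600, 24),
    ("300 dpi, 8-bit greyscale", 300, 8),
    ("300 dpi, 1-bit B&W", 300, 1),
]
for label, dpi, bits in settings:
    megabytes = raw_scan_bytes(6, 9, dpi, bits) / 1_000_000
    print(f"{label}: {megabytes:.2f} MB per page before compression")
```

Under these assumptions a 600 dpi colour page carries 12 times the raw data of a 300 dpi greyscale page and 96 times that of a 300 dpi B&W page, which goes some way to explaining why archival-quality scans of a whole book get so large.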

(The above ref’d book was scanned with an Opticbook 3600 scanner to a Win Vista laptop (the software is limited to Windows), and I created the PDF on the Vista machine. The PDF was then transferred to my Mac and OCR’d with Acrobat, then indexed into DTPO as PDF + text with no problems.)

For articles and anything in single sheets, I use my Fujitsu ScanSnap S510M and either scan directly to DTPO or scan to TIFF and use the built-in OCR (ABBYY FineReader). Here I use the auto-detect for what mode to use (color, etc.) so that varies depending on the source.

I don’t think you need to have the optimum settings to get a good scan + OCR for DTPO. Personally, I find very large files to be problematic so try to avoid them when I can.

Good luck.

bstadelman · August 28, 2008, 8:01pm

I’m not writing a novel like you guys - but I HAVE noticed some weird behavior.
Bear with me, I know it’s probably an Apple problem, but you guys know PDFs pretty well, from what I’ve seen here, and you actually answer questions!

If I scan (using my Fujitsu S300M, for example) something into a PDF, either into a folder, or to print and then save as PDF, or to print and then open in Acrobat, it works great.

If I then OCR it in Acrobat 8.1.2, it still looks ok, but the file size is HUGE.

If I then do “Reduce In Size”, that is when the weirdness happens.

If I open the resulting smaller file in Acrobat, all is well. If I open it in Preview, the print looks terrible - blurred and gray. Same if I open it in DTPO - but since both of those use the same PDFkit, that’s why I think it’s an Apple problem. I’m using 10.5.4 on a Mac Pro.

Any ideas?

Thanks,

Bill.

Bill_DeVille · August 29, 2008, 1:24am

As you saw, there are some incompatibilities with how Apple renders PDFs and the default PDF version you saved from Acrobat 8. OS X 10.5.4 is probably still “stuck” at compatibility with Acrobat 7.x – which by default saves files as PDF 1.6, as I recall.

When you save a file from Acrobat 8.x, try saving it as version 1.6, 1.5 or 1.4. Note that in OS X, when you “print” a file as PDF, it is saved as PDF 1.4.

I’ve got Acrobat 7 and haven’t upgraded yet.

bstadelman · August 29, 2008, 1:59am

Bill:

Thanks ever so much for the reply!

I ran some tests and you have to go all the way back to Version 5.0 and higher compatibility to have it work correctly. You can also go to 4.0 and higher and have it work right.

When I chose 6.0 and higher, or 7.0 and higher, the same problem was still there.
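
For anyone who wants to check what a given file was saved as, here is a minimal sketch that reads the version from the PDF header and maps it to the Acrobat compatibility setting. The function name is made up, and it only looks at the `%PDF-x.y` header line near the start of the file; a PDF can also declare a newer version inside its document catalog, which this does not inspect.

```python
import re

# PDF version -> the "Acrobat N and higher" compatibility setting it matches.
ACROBAT_COMPATIBILITY = {
    "1.3": "Acrobat 4.0",
    "1.4": "Acrobat 5.0",
    "1.5": "Acrobat 6.0",
    "1.6": "Acrobat 7.0",
    "1.7": "Acrobat 8.0",
}


def pdf_header_version(data):
    """Return the version string from a PDF header such as b'%PDF-1.4', or None."""
    match = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return match.group(1).decode("ascii") if match else None


# Example with an in-memory header; for a real file, pass the first 1024 bytes
# read from it in binary mode.
version = pdf_header_version(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n")
print(version, "->", ACROBAT_COMPATIBILITY.get(version, "unknown"))
```

In those terms, the files that displayed correctly in my tests were the ones saved as PDF 1.4 or 1.3, and the ones saved as 1.5 or 1.6 still had the problem.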

Again, thank you VERY much! See? This is why I asked you guys!

Bill.

brisance · September 1, 2008, 6:54pm

Hi folks, I just downloaded a trial version of DT Pro Office and am suitably impressed. However, I’ve experienced the same ballooning PDF file size. Is something going to be done about this (perhaps in a future patch), or will we have to live with it and start buying shares in hard disk manufacturers?

Bill_DeVille · September 2, 2008, 12:46am

DEVONtechnologies doesn’t control the re-rasterization of the PDF image layer during OCR – that’s done by the IRIS OCR engine. All DEVONtechnologies can do about this is provide user preferences to reduce the dpi resolution and the image quality of the resulting PDFs.

I have two long-standing wishes for PDFs and OCR: 1) that the original PDF image would be retained (without rasterizing the image again) during OCR and 2) that OCR errors in the text layer of a searchable PDF could be edited without changing the image layer. I’m still waiting.