New OCR engine in DTPO 2

I also agree with Usable_Thought. I would especially like to know what is in this dictionary and have the ability to switch it off. From my point of view, garbage is far preferable to the double-space solution.

More generally, my observations on the new OCR capability thus far are:

  1. It is glacial: converting a three-page PDF (1,327 words) took 15 minutes.
  2. It hogs the CPU: in the above example a process called DEVONscribbler took up between 62% and 91% of the CPU (most of the time at the higher end).
  3. It is not as flexible as ReadIris: it refused to convert five documents scanned at 225 dpi. I then opened DTPO 1.5.4, which quickly converted the files to PDF+Text with no problem (coincidentally producing files about one quarter the size of the original scans by DTPO 2).

Yes,

please give us back the old ReadIris OCR routine!
The number of missing words is horrible - sometimes even easy words are missing - and I would prefer “POSSIBLE” garbage instead of missing words!!!
The ABBYY routine is extremely slow compared to before.

The only benefit I see currently is the smaller PDF file size - but in the end that was no problem for me either, as I could easily reduce the size outside of DEVONthink using PDF Shrink: “Open PDF with…” and then shrink the PDF with “my default” PDF Shrink setup. I purchased PDF Shrink especially for this use.

THE STEP FROM ReadIris TO ABBYY OCR IS TWO STEPS BACK (today)…

After some investigation of the same sample newspaper article (three columns, one header, one image), which initially gave a large number of missing words and a long conversion time (grayscale scan at 300 dpi):

I have now set the quality in Options/OCR to 600 ppi and very large file size (best quality). After scanning, the OCR recognition was noticeably faster - and the number of missing words went to zero! But the file size of this approximately A4-size PDF was 3.3 MB! Running PDF Shrink over it finally reduced the size to 280 KB at reasonably good visual quality - I think I have found my workflow… if DEVON keeps ABBYY.

@lemuba: Would you mind telling me which PDF Shrink settings you are using?

Thanks, Clemens

I tried this with my book scan and got the exact same result - the exact same missing words, etc. And it took 25+ minutes for the 10 pages.

FYI, my comparison point is tesseract, the Google OCR engine - I have an AppleScript that makes a shell call to it. It can’t handle more than one column of text and can’t handle sans-serif at all, but for one-column serif it’s great - very fast, very accurate.

I produce my PDFs via a similar AppleScript that makes calls to unpaper (a shell utility for cleaning up images prior to OCR) and shrinks PDF size with JPG compression. The only drawback is this approach doesn’t allow embedding the OCR in the PDF page itself - I can make a separate text file and/or attach notes to the PDF pages, but that’s it.
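For anyone curious, a minimal sketch of such an unpaper + tesseract pipeline is below. This is my approximation of the idea, not the poster’s actual AppleScript: file names are illustrative, and it assumes both tools are installed and on the PATH.

```shell
#!/bin/sh
# Minimal sketch of the unpaper -> tesseract pipeline described above.
# File names are illustrative; both tools must be on the PATH.
IN=scan.pgm          # unpaper works on PNM images
CLEAN=clean.pgm      # deskewed, de-noised copy
TXT=clean            # tesseract appends .txt to this name itself

if command -v unpaper >/dev/null 2>&1 && command -v tesseract >/dev/null 2>&1; then
    unpaper "$IN" "$CLEAN"      # straighten the page, strip borders and noise
    tesseract "$CLEAN" "$TXT"   # OCR to a sidecar clean.txt
else
    echo "unpaper/tesseract not installed; skipping" >&2
fi
```

The resulting text file can then be kept alongside the PDF or attached to its pages as a note, as described above.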

@ionos/Clemens,

Hello Clemens, so…

Grayscale/Colour Images: 150 dpi, JPEG 2000, compression - medium

Monochrome: 96 dpi, CCITT Group 4

Fonts: everything on, remove metadata: off

File Output: Override Original

Mark this setup in PDF Shrink as your default setup.

From DEVONthink, “Open file with…” → “PDF Shrink” will immediately shrink your PDF and overwrite it with the new, shrunken PDF under the same filename - so make sure beforehand that you are happy with the quality, or try it first on a copy of your file…

Best regards,

Matthias

Hi,

I’d like to chime in: this is the first time I have tried DTPO, so I cannot compare speed and such. My point concerns another aspect: I don’t own a speed scanner with automatic paper intake, but a fine Canon 8600F. When trying today to scan (and OCR) a book spread – two pages – I could not find an option to define two frames (first page, then second) so as to scan the two pages in separately. The OCR engine would not accept this. So in the end I get an OCRed PDF file where the sentences run across the two pages: a line starts and then continues in a totally different sentence.

When I compare this with Acrobat Pro CS, at least there I could manually select first one page/frame, then the other. Acrobat Pro CS4 has now learned to scan both frames automatically, one after the other, in a single pass. I wonder how this would work with a speed scanner anyway, but this clearly affects DTPO’s ability to properly analyze the document, and the user’s ability to use it. I have not yet tried whether I’d get better results by calling the scanner software itself.

As you probably noticed, I own Acrobat Pro CS4, so DTPO is probably of no necessity to me; still, I wanted to share these observations.

regards, Rolf

I, too, have found the new OCR to be slow, but I have not had issues with it dropping words. In conjunction with OCRing some documents, I have noticed that the search function doesn’t seem to pick up words that may not be in a standard dictionary. In my case, I have several old newspaper archives with words like toots, blowers, steerers, etc. Search does not find those in the database. It will find them if I enter those terms into a new text document.

More from my experience:

  1. The OCR seems quite a bit faster working with JPGs at 300 DPI than with existing (non-OCR’d) PDFs at the same resolution.

  2. It is possible the OCR is also dropping fewer words when working with the JPGs than the PDFs, but I have to do more cross-comparisons to see if this is really true.

  3. One unpleasant aspect of importing JPGs for OCR: even when the initial JPG is a very crisp, contrasty image, the OCR process seems to result in a very dingy, gray PDF, even with maximum quality (big file size) in Preferences.

I can confirm the OCR problems. It tends to ignore words or even whole lines at the beginning or end of a block. It usually refuses to read URLs or reference numbers.
It also has difficulties reading small text.

I myself would prefer garbage to so many missing words (about 10% on small blocks). On about 50% of the blocks, the last word is missing.

As for the speed, it’s slower but it’s not an issue for me and memory usage seems much better.

Otherwise, for what it does read, it seems more reliable than ReadIris: it can read clear text on a dark background.

Alas, word-dropping is just as common with JPG import as with PDF. Speed is faster, but that’s the only improvement.

And the words being dropped are neither obscure nor typographically complex: examples are “tell” (several times on one page) and “night” and “1945”.

Any word from the DevonTech team on this? Is this what we are going to have going forward, or is this beta going to be tweaked? I’m really conflicted, because I was so happy with PB1 and haven’t been happy since. I would pay (more) money to get that version back, as it worked so well. I’m now back to a stack of documents that I’m not scanning in until I hear more about where this is going.

Of course we are looking at ways to improve this; we are asking ABBYY to assist us with these issues. The next release should already work better. The good thing compared to version 1.5 is that we have more possibilities; the bad thing is that we have more possibilities.

I wasn’t referring to version 1.5 when I said I’d love to have the old version back… I was referring to public beta 1, which is what I originally received when I bought the product, and it worked wonderfully. What we have now seems a bit of a mess. Some guidance from DevonTech on a suitable workflow to attain performance similar to public beta 1 would be very welcome. I am sitting here with documents piling up on my desk that I don’t want to scan, because I am highly dependent on OCR/search for my ability to locate the documents in the future. The version in public beta 1 indexed very well and was very fast; the current version does not index well, and is slow. Please provide some guidance on the best settings to achieve similar speed, file size, and percentage of properly indexed words. I have to believe that before the engine was switched, some testing was done on this stuff, right?

Ouch. Scanned a 4-page document using VueScan (300 DPI, black-and-white text); it weighed in at about 300 KB. It was a 2-page front-and-back credit card bill. OCR’d it with Acrobat Pro 8 using default settings at 600 DPI; the document was about 160 KB after OCR.

Took that same document (before the Adobe OCR) and imported it into DTPro. OCR’d it using the defaults (600 DPI/100%). The resultant file was 6.6 MB. I went back in and changed the settings to 300 DPI and 75% quality, and the file size shrunk to 2.2 MB. There was a noticeable difference in visual quality between the 600 DPI version and the 300 DPI version (as could be expected), and the file size difference was pretty big. The actual OCR that DTPro ended up with did seem a little more accurate than Adobe’s in picking out words.

300DPI, 100% quality yielded a 6.3MB file. I could shrink this to a slightly more manageable 1.5MB using a custom filter in Preview (Reduce File Size).
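Preview’s “Reduce File Size” Quartz filter isn’t easily driven from a script, but a similar shrink can be done on the command line with Ghostscript - assuming it is installed; it is not part of DTPro, and the file names below are illustrative. A sketch:

```shell
#!/bin/sh
# Shrink a PDF by re-writing it and downsampling its images with Ghostscript.
# /ebook targets roughly 150 dpi images; /screen is smaller still, /printer larger.
# File names are illustrative.
IN=scanned.pdf
OUT=scanned-small.pdf

if command -v gs >/dev/null 2>&1; then
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
       -dPDFSETTINGS=/ebook \
       -sOutputFile="$OUT" "$IN"
else
    echo "Ghostscript (gs) not installed; skipping" >&2
fi
```

Like PDF Shrink or the Preview filter, this is lossy, so check the result before replacing the original.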

I’m really hoping the Automator actions and AppleScript support come through so I can customize my workflow to lighten the file sizes a bit. I like the quality of the OCR and the actual image quality at 600DPI/100%; I just wish the file size could come down a bit.

I’m not sure what exactly is causing the bloat, but I’ve OCR’d about 10 files all ranging in size from about 8kb up to about 500kb. After OCR in DTPro I get sizes from 700kb up to 10.8MB. It seems like the bloated files all have tables in them (these are billing statements, receipts, tax forms, etc.)

Also, can we get a status bar at the bottom of the app, similar to the kind found in many other applications? Maybe I’m overlooking it, but it’s not an option noted in the Help file or in the View or Window menus. I’m on a laptop much of the time, so I’d rather not have a bunch of separate floating windows that lose focus behind the app.

Also, I read the recent post on the blog about customizing toolbars, but the command I use most often is “Convert to Searchable PDF”. It doesn’t appear that we can put that in the toolbar or assign it a shortcut key. Having both as options would be nice, since DTPro doesn’t appear to be automatically converting incoming scans at the moment (though that would solve the issue too).
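In the meantime, macOS lets you attach a keyboard shortcut to any menu item yourself, either in System Preferences > Keyboard > Keyboard Shortcuts or from the shell via the per-application NSUserKeyEquivalents mechanism. A sketch follows; the bundle identifier is an assumption for the DTPro 2 beta, so verify it against the app’s Info.plist before running:

```shell
#!/bin/sh
# Bind Cmd-Opt-P to the "Convert to Searchable PDF" menu item.
# APP_ID is an *assumed* bundle identifier - check the app's Info.plist.
# In NSUserKeyEquivalents notation, "@" = Command and "~" = Option.
APP_ID="com.devon-technologies.thinkpro2"

if command -v defaults >/dev/null 2>&1; then
    defaults write "$APP_ID" NSUserKeyEquivalents \
        -dict-add "Convert to Searchable PDF" "@~p"
else
    echo "defaults(1) not available (not macOS); skipping" >&2
fi
```

The menu title must match exactly, and the app usually needs a restart to pick up the new shortcut.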

Overall I’m pleased with the program and the beta.

I too am seeing problems with the new OCR engine (beta 3). It misses words or phrases that contain a slash (e.g., …input/output…), with no hint that a word was missed. Perhaps there could be a way to have the engine report when its confidence level is less than 100%? Like others, I can report that it is slow (and I’m on an 8-core Mac Pro) and creates enormous files (a 3-page fixed-font typed document weighs in at 6 MB; if I just grab the text off the page, it’s about 160 KB). I guess you can distill this message down to: please work on the OCR engine, or consider putting IRIS back in there…

I have a networked Canon multi-function machine. Readiris Pro has no difficulty finding it and doing its stuff. The new ABBYY engine, however, does not find the scanner (though Image Capture does). I note that ABBYY offers several versions, of which only one supports networked scanners. Is DTPO 2 therefore unlikely to support such scanners?

Charles

Once Apple explains to their third-party developers, in clear-to-read API documentation, how to use networked scanners, we will support them in our Image Capture scanning plugin. So far I have not heard anything useful on the scanning front from Cupertino, despite having paid for technical support.

What Canon is it? Based on my experience, there might (maybe, possibly) be a fix to get ABBYY to see it.

I have a Canon 4400f which I bought because it was (a) dirt-cheap, and (b) supposedly Mac friendly. However the OCR software that came with it was OmniPage and was virtually unusable, so I looked for alternatives.

One that I tried was VueScan - but VueScan could not see the Canon’s TWAIN driver. I fixed this by using Image Capture > Browse Devices to set the Canon driver to “Shared.” You might try that and see if it helps ABBYY as well.

I ended up not going with VueScan, but cobbled some scripts together to use Google’s Tesseract and OCRopus command-line utilities along with Image Capture. Not perfect, but free and better than OmniPage - and at the moment, also better than the version of ABBYY that comes with the DT 2.0 beta!

I understand that it’s not easy. I’m a VueScan user and have been testing for its author, Ed Hamrick, for several months now. He has finally cracked the problem - the latest version (8.5.0.4) sees my scanner - but there’s been a lot of frustration along the way.

Charles