Some Feedback on OCR Accuracy

Some feedback on the OCR accuracy: for me, a key concern in making DT my ‘trusted’ source is being able to reliably find what’s in the documents I hold in the database. So I tested the OCR accuracy, and there’s a pretty wide divergence.

METHOD - I imported/OCRed the same PDF document in three programs (DTP 1.5, DTP2 Beta and Adobe Acrobat) at various settings and compared the results. The document was in Spanish.

RESULTS - Each method produced a different word count. DTP2 produced fewer nonsense words but missed some significant blocks of text and numbers.

Here are the word counts:
  • Acrobat: 714 words (some nonsense)
  • DTP 1.5: 727 words (some nonsense; all numbers accurate)
  • DTP2: 626 words (no nonsense, but text, numbers and dates missing)

Examples -

Here’s a screenshot from the DTP1.5 import. Scroll all the way to the left of the image to see the Word list and Word counts. The number ‘160’ is in the document and correctly identified in the word list. There are also some obvious nonsense words in the word list -
DTPnonbeta.tiff (319 KB)

And here’s the result from the DTP2b3 import. Again, scroll to the left to see the word count info. The number ‘160’ is nowhere to be found in the word list. There are 100 fewer words; admittedly some of those were nonsense words, but a good deal of text, dates and numbers is missing. There are no nonsense words, probably because the word list is checked against a dictionary.
DTP2Beta.tiff (355 KB)

The bottom line is that Acrobat and DTP1.5 give more data, albeit requiring some further analysis, and DTP2 loses significant amounts of data.

Something needs to be done to improve the inclusiveness of DTP2’s OCR, or at the very least to give an indication of what is missing.

I must be one of the few who haven’t had any difficulty with OCRed images. Are the rest of you importing the scan and having it OCRed automatically, or importing a PDF and then converting it to an OCRed version? Just so I know what not to do. :smiley:

Personally I normally import directly from a Fujitsu ScanSnap. I’m not saying that I’m unhappy with the OCR. The cross-referencing and AI features of DEVONthink are sufficiently good that I can be pretty sure of recovering most things even if the OCR is not 100% accurate.

I was just trying to compare the current OCR capabilities empirically with other versions and software. I don’t feel it’s as good as it could be, but it may well be sufficient for many. When you say you haven’t had any problems, have you actually checked the OCRed results against the original?

Yes, and I have not found any loss of data. Now, I don’t do massive scanning and importing in huge batches, so perhaps that’s where the problem arises. I don’t know, but fingers crossed, for me it hasn’t yet been a problem. However, I know other users have had the same issue you’re experiencing.

What I like in the present OCR solution is that virtually no nonsense is produced (as far as I’m aware).

What I vehemently dislike is the fact that any word forms or numbers the OCR dictionary does not recognize are simply ignored. This includes many author names, many ordinary words, and virtually all specialized / archaic / idiomatic expressions. This poses a big problem, of course, because oftentimes the ignored, not-so-common words are exactly what I would use in a search. As things stand, I cannot trust that a search by author name, or by a rare word, will work, because I know that my scans are omitting such words; so I always have to stick to common words in searches, which makes them much less effective.

It seems to me that, if the present approach is to be maintained, then users should be given the possibility of adding words to a User Dictionary (much like the user dictionaries for spellchecking in word processors) that the OCR process would then use.
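To make the problem concrete, here is a purely illustrative Python sketch of the difference between a dictionary-filtered word list and one that also consults a user dictionary. This is not how ABBYY or DEVONthink actually works internally, and the words and names below are invented for the example:

```python
# Conceptual sketch only -- not DEVONthink's or ABBYY's actual behaviour.
# It shows why filtering OCR output against a fixed dictionary drops exactly
# the terms (author names, rare words, numbers) people search for, and how a
# user dictionary could mitigate that.

BUILTIN_DICTIONARY = {"the", "annual", "report", "quick", "brown", "fox"}

def filter_against_dictionary(tokens, user_dictionary=frozenset()):
    """Keep only tokens found in a dictionary; everything else is silently dropped."""
    known = BUILTIN_DICTIONARY | set(user_dictionary)
    return [t for t in tokens if t.lower() in known]

ocr_tokens = ["Annual", "report", "by", "Goytisolo", "160", "quick1y"]

print(filter_against_dictionary(ocr_tokens))
# -> ['Annual', 'report']   (author name, number and misread word are all lost)

print(filter_against_dictionary(ocr_tokens, user_dictionary={"by", "goytisolo"}))
# -> ['Annual', 'report', 'by', 'Goytisolo']   (a user dictionary recovers the name)
```

The user-dictionary parameter is exactly the kind of escape hatch being asked for here: the engine stays conservative by default, but specialist vocabulary can be whitelisted.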

Bottom line: Users should be allowed to make this OCR solution less lexically ignorant if it is ever going to meet academic needs. Otherwise, I guess I will sorely miss the Iris OCR in the long run.

We’re still working with Abbyy to improve the situation for certain reported cases. That’s all I can say for now.

Fine Annard.

I’m sure you realise more than anyone how important accurate OCR is to the success of the application…

We made a change to the OCR engine settings in Pro Office 2 public beta 4. Now the results will include words that are not found in a relevant dictionary so hopefully that will improve your perception of the usability of the OCR engine.

Thanks Annard - I think this may be a useful adjustment. At least this way users have the option of using some of the data in a slightly misinterpreted word - e.g. “quick1y” would be recovered by a partial search for “quick”, whereas before it would have been completely discarded by the dictionary-based approach.
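To illustrate why keeping the misread token matters, here is a minimal sketch using plain prefix matching. This is not DEVONthink’s actual search implementation, and the word lists are made up for the example:

```python
# Minimal sketch: a misread word such as "quick1y" can still be found by a
# partial search, whereas a word dropped by a dictionary filter cannot.

def partial_search(query, word_list):
    """Return every indexed word that begins with the query string."""
    q = query.lower()
    return [w for w in word_list if w.lower().startswith(q)]

kept_as_recognised = ["quick1y", "brown", "fox", "160"]   # beta 4 behaviour: keep unknowns
dictionary_filtered = ["brown", "fox"]                    # earlier behaviour: unknowns dropped

print(partial_search("quick", kept_as_recognised))   # -> ['quick1y']  (document is found)
print(partial_search("quick", dictionary_filtered))  # -> []           (document is missed)
```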

Perhaps there could be an option within DTP to toggle a real-word dictionary on and off when one is looking at the word breakdown? I don’t mean in the OCR process itself, but in DTP’s own word analytics.

Okay - I just ran the same document through the DTPb4 release and things are, indeed, looking better. Not perfect but better.

DTPb4 now finds 689 words in the same document, up from 626 in the earlier beta, and there are still significantly fewer nonsense words than with the other methods. This time, as you can see, the software picks up the instance of the number “160” that it completely failed to recognise earlier.

DTPb4.tiff (585 KB)

Things are not completely perfect. Some significant data is still missing.
Here is a small graphic at the bottom of the document. DTPb4 crucially fails to pick up the date 20/3/2009 just below the graphic box. Acrobat picks this up in its OCR process.

DTPb4v2.tiff (45.4 KB)

Overall I’d say this is still not perfect but a great improvement. Hope this is helpful.

Have you fiddled with the OCR accuracy parameter in the OCR pref pane? There is not a lot more we can customise without building our own OCR application from the ground up. And that isn’t going to happen.

Hi Annard - I’ll run it through again later tonight with the higher accuracy setting and let you know…

Okay, I ran the same PDF through the process again, but this time using maximum Resolution, Quality and Recognition settings. This time the word count rose by a further 5 words to 694 (up from 689 words on the default settings). File size increased from 353 KB to 2.3 MB.
The date at the bottom was still not recognised…

Characters in close proximity to a graphic tend to confuse OCR applications.

Handwritten underlining, notes and signatures in proximity to text often cause recognition problems, as well.

File size increased from 353 KB to 2.3 MB.

I think scanning and OCR is really the killer app for DT. But I’m having a hard time with v2 because of the increase in file size.

Currently I am not sure I understand why DT’s engine increases file size so much. I’m really curious about this. I’ve seen the comment before: DT’s engine has to “re-rasterize”. That may be, but as an explanation it is not helping me.

My experience with the betas so far: DT increases file size in the ballpark of an order of magnitude. And I think that causes DT to fail when performing OCR on large book-size scans (100+ pages, ~30 MB files).

What makes me so curious is that, in my experience, Acrobat reduces file size compared with the original scan: if I use ScanTango to scan to a file, Acrobat reduces the file size when it does OCR. Not by much, but it does reduce it.

Maybe it’s not that important in the larger scheme of things (disk space is relatively inexpensive, a MacBook Pro has ample processor power, and DT really is convenient), but it would help me personally not to resent the fact that after OCR in DT my files are huge.

  • Is there a benefit to re-rasterizing a file? I plan on keeping my files for a long time, so if this is beneficial in some way, I would find it helpful to understand why.

  • Should I be concerned about what seem to me to be large files after OCR? Will this impact DT’s performance? (It seems to impact my machine’s performance when I have to open a 200 MB file.)

Robert, I’ve got DTPO2 Preferences > OCR set for 150 dpi resolution and 50% image quality.

In almost every case, the file size of the searchable PDF stored in my database is less (usually significantly less) than the size of the image-only PDF produced by my scanner.

Thanks for the suggestions. I would appreciate additional information about the image quality and dpi settings. (I can’t find it in the help for the new version, but perhaps I’m overlooking it; understandably, it may not be written yet.)

I experimented a bit:
starting file: 660 KB, reasonably good quality (image only)
DT OCR @ 300 dpi / 100% quality: 24 MB
DT OCR @ 150 dpi / 100% quality: ~8 MB, but with degradation in image quality

Acrobat: 6.7 MB (set to not downsample below 600 dpi); the file is still as readable on screen as it was before

Of the three, Acrobat is the best in terms of preserving the original image.
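As a rough sanity check on those figures, here is a back-of-the-envelope sketch. It assumes the OCR engine re-renders each page as a single full-page image at the configured dpi (an assumption, since the engine is a black box) and takes the page size to be US Letter:

```python
# Back-of-the-envelope estimate of how the dpi setting drives file size,
# assuming each page is re-rasterized as one full-page image at that dpi.
# Page dimensions are assumed to be US Letter; compression is ignored.

PAGE_W_IN, PAGE_H_IN = 8.5, 11.0   # inches

def pixels_per_page(dpi):
    return int(PAGE_W_IN * dpi) * int(PAGE_H_IN * dpi)

for dpi in (150, 300):
    print(f"{dpi} dpi -> {pixels_per_page(dpi):,} pixels per page")
# 150 dpi -> 2,103,750 pixels per page
# 300 dpi -> 8,415,000 pixels per page
```

Halving the dpi cuts the pixel count to roughly a quarter, which is in the same ballpark as the 24 MB versus ~8 MB results above; the remaining difference presumably comes down to how aggressively each program compresses the rasterized images.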

What settings would you recommend I use if I want to:
a) preserve as much as possible the same image quality as the original scan, and
b) produce the OCRed file at a size that is competitive with other products on the market? (e.g., I can live with a file size the same as Acrobat’s)

I imagine the development and refinement of OCR in v2 continues apace, and hopefully I can learn how to get optimal results from the new OCR engine.

btw, you can get the file I’m using; it’s the testimony of Joanne Less from the FDA before the House Energy and Commerce Committee:
energycommerce.house.gov/index.p … ew&id=1552

Try 300 dpi and 80% quality. Lowering the quality setting within the 80 to 100 percent range usually results in a smaller file size without much visible degradation of the image.
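If you want to see the size side of that trade-off directly, here is a small, self-contained Python sketch using Pillow with a synthetic page-sized image. The exact figures will differ from a real scan, and this says nothing about what the OCR engine itself does; it only shows how the JPEG quality setting alone affects file size:

```python
# Illustrative only: compress the same (synthetic) page image at several JPEG
# quality settings and compare the resulting sizes. Requires Pillow.

import io
import random
from PIL import Image

random.seed(0)
# A noisy light-grey "page" roughly the size of a 150 dpi US Letter scan.
page = Image.new("L", (1275, 1650))
page.putdata([random.randint(180, 255) for _ in range(1275 * 1650)])

for quality in (100, 80, 50):
    buf = io.BytesIO()
    page.save(buf, format="JPEG", quality=quality)
    print(f"quality={quality:3d}: {buf.tell() / 1024:.0f} KB")
```

On real scans, dropping from 100 down towards 80 typically shrinks the file noticeably while the page remains perfectly readable, which is why the 80% suggestion above is a sensible starting point.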

I would welcome an option to leave images in PDF files unchanged. I am happy with the file sizes that the scanner (ScanSnap) creates and just need the OCR to add the text, rather than changing the image (if that is technically possible).

I downloaded that file, which is 660 KB for 16 pages. I ran ABBYY OCR on it under DTPO2 pb4, with Preferences > OCR set for 150 dpi and 50% image quality. The resulting searchable PDF was 2.1 MB.

In a previous post, I was talking about scanning paper copy to result in searchable PDF. So, for grins, I printed the 16 page PDF and then scanned the printout with my ScanSnap. The resulting image-only PDF was 3.6 MB in file size. I then ran that PDF through ABBYY OCR with the same result, a searchable PDF with file size of 2.1 MB, so a significant reduction in storage space.

Using Data > Convert > to Rich Text, the RTF file of the text content of this PDF has a file size of 30.2 KB. The accuracy of the ABBYY OCR in DTPO2 pb4 was excellent, with no dropouts or misspelled words.

By contrast, the IRIS OCR engine used in DTPO 1.x usually resulted in a searchable PDF that was larger than the scanned, image-only PDF produced by my ScanSnap, using the same DTPO preferences of 150 dpi and 50% image quality for the stored PDF. Had I used the IRIS OCR engine, the searchable PDF resulting from my 16-page scan would have run more than 4 MB in file size.

Unfortunately, OCR software doesn’t “keep” the original PDF image layer, but rasterizes it (creates it) again, often with a significant increase in file size. This appears to be true of all OCR software. As DEVONtechnologies licenses the OCR engine, the internal rasterization process is a ‘black box’ over which we have little or no control.

I agree that there is some degradation of the image layer at 150 dpi and 50% image quality. That’s a compromise with file size. But for most documents I find it acceptable for viewing and printing the searchable PDF, certainly for mundane things such as bills, receipts, contracts, and letters. For some of my reference materials, especially those with images, I may up the resolution to 200 dpi and perhaps the image quality to 65%. But I have no difficulty in reading a book-length PDF at 150 dpi (assuming the scanner had produced a clear image; not all scanners are equal).

The advantage of importing searchable PDFs into my databases is the ease of finding content and, of course, adding to the information content of my databases, which contain documents of various file types.

If it were sufficiently important, perhaps if I were going to use a PDF image in a printed publication, I would probably keep the original scan PDF for that purpose. Or, as I have Acrobat, I could tweak upwards the dpi and quality settings in DTPO2 and then use Acrobat to reduce the file size – a lengthy procedure which, if not saved using a PDF version of 1.5 or lower, could create problems for some viewers in OS X and other operating systems. (Most publishers of PDF files use PDF version 1.3 or 1.4 for that reason.)