OCR examples, results of various methods

I have been clipping newspaper articles from Newspapers.com using several methods and I thought you may be interested in the differences in file sizes and accuracy.

This is a screenshot of an 1881 newspaper article I used for my test:

I used four different methods to create a PDF+Text file:

#1 - A Mac OS X .pdf screenshot imported to DTPO (DEVONthink Pro Office) and processed to create a PDF+Text file.

#2 - A Mac OS X .png screenshot imported to DTPO and processed to create a PDF+Text file.

#3 - A Newspapers.com generated .pdf file (no OCR text) imported to DTPO and processed to create a PDF+Text file.

#4 - The PDF+Text file generated using method #1 and then processed with PDFShrink (online version).

My DTPO OCR settings: Resolution same as scan, Quality at 85%, Recognition - accurate.

Here is a comparison of the plain text files generated using the methods described above:

I am still trying to decide which method produces the ideal combination of effort vs OCR quality. I would be interested in hearing your thoughts.
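One rough way to put a number on the OCR quality side of that trade-off is to compare each method's plain text output against a hand-corrected transcription of the article. The sketch below uses Python's standard `difflib` for that. The function name `ocr_similarity` and the sample strings are made up for illustration, and a similarity ratio is only a crude stand-in for a real character error rate.

```python
from difflib import SequenceMatcher


def ocr_similarity(reference: str, ocr_text: str) -> float:
    """Return a 0..1 similarity ratio between a reference transcription
    and OCR output, after collapsing runs of whitespace."""
    ref = " ".join(reference.split())
    ocr = " ".join(ocr_text.split())
    return SequenceMatcher(None, ref, ocr).ratio()


if __name__ == "__main__":
    # Made-up sample strings, not taken from the actual test article.
    reference = "The quick brown fox jumps over the lazy dog"
    method_output = "Tbe qu1ck brown fox jumps ovcr the lazy dog"
    print(f"similarity: {ocr_similarity(reference, method_output):.3f}")
```

Running this once per method against the same reference text gives one score per method that can be set next to the file sizes.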

This is important information.

Would be great to have a chart comparing the sizes.
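A size chart like that could be put together with a few lines of Python. This is only a sketch: the helper name `size_table` and the file names in the usage example are hypothetical, and it assumes the output of each method has been exported to a folder on disk.

```python
import os


def size_table(paths):
    """Return one text row per file: name, size in kB, and size
    relative to the smallest file, sorted from smallest to largest."""
    sizes = {p: os.path.getsize(p) for p in paths}
    smallest = min(sizes.values())
    rows = []
    for path, size in sorted(sizes.items(), key=lambda kv: kv[1]):
        # Guard against a zero-byte file being the smallest one.
        ratio = size / smallest if smallest else float("nan")
        name = os.path.basename(path)
        rows.append(f"{name:<30}{size / 1000:>10.1f} kB{ratio:>8.2f}x")
    return rows


if __name__ == "__main__":
    # Hypothetical file names, one per method; only existing files are listed.
    candidates = ["method1.pdf", "method2.pdf", "method3.pdf", "method4.pdf"]
    existing = [p for p in candidates if os.path.exists(p)]
    if existing:
        print("\n".join(size_table(existing)))
```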

I did some tests with my ScanSnap: I took the same original, scanned it with ScanSnap Manager/ABBYY to the desktop, and compared that with scanning it straight into DEVONthink with its built-in ABBYY FineReader.

With the same settings, my files tend to be two to four times larger when using DT's built-in ABBYY FineReader compared to the external one.

I now use ABBYY FineReader and do the OCR BEFORE importing the files into DT. I figure I save some 50-75% of disk space that way.
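The 50-75% figure follows from the size ratio: if one file is k times larger than the other, the smaller one saves 1 - 1/k of the space. A minimal check, with `savings_fraction` as a made-up helper name:

```python
def savings_fraction(size_ratio: float) -> float:
    """Fraction of disk space saved by the smaller file when the
    larger file is `size_ratio` times its size."""
    if size_ratio <= 0:
        raise ValueError("size_ratio must be positive")
    return 1.0 - 1.0 / size_ratio


if __name__ == "__main__":
    for ratio in (2, 4):
        print(f"{ratio}x larger -> {savings_fraction(ratio):.0%} saved")
```

So files that are two to four times larger correspond to savings of 50% to 75%.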

I know I read something about it here in the forum when searching, and I think it was called OCR Engine. But again, I am a real amateur and am just using trial and error.

Comment: I’ve got several OCR applications, including FineReader Pro 12.x, ReadIRIS, PDFpenPro, OCRKit and an older version of Acrobat Pro (which doesn’t run under Yosemite).

Most of the time, I stick with the OCR module built into DEVONthink Pro Office, because of the convenience of being able to OCR items already captured into a database, and its accuracy remains very good.

I don’t check the option to retain the resolution of the original scan in Preferences > OCR. If checked, that can result in large PDF file size. Instead, for most scans I’m satisfied with the view/print quality of images rasterized at 130 dpi and 50% image quality (especially for scans produced by a ScanSnap scanner). That doesn’t affect OCR accuracy, and the image size of the searchable PDF is often less than the size of the original scanner output file.
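To see why the resolution setting matters so much for file size, note that the pixel count of the page image grows with the square of the dpi. The sketch below assumes a US Letter page (8.5 x 11 inches) and a 300 dpi original scan, neither of which is stated above. Actual PDF sizes also depend on image compression, so treat the ratio as a rough guide only.

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page of the given size rasterized at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def pixel_ratio(new_dpi: int, old_dpi: int) -> float:
    """Ratio of total pixel counts when going from `old_dpi` to `new_dpi`."""
    return (new_dpi / old_dpi) ** 2


if __name__ == "__main__":
    # Assumed page size and original scan resolution (not from the thread).
    for dpi in (300, 130):
        w, h = page_pixels(8.5, 11, dpi)
        print(f"{dpi} dpi: {w} x {h} = {w * h:,} pixels")
    print(f"130 dpi keeps {pixel_ratio(130, 300):.1%} of the pixels of 300 dpi")
```

Under those assumptions, the 130 dpi image holds a bit under a fifth of the pixels of the 300 dpi one, before the 50% image quality setting reduces the size further.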

I retract my previous statement!

I tried that setting and now I get 50-95% of the file size using the built-in OCR engine.

I had the quality set to 100% before, but 50% works just as well for the OCR.

I have not tried scanning documents that are partly pictures, though; I normally use it for text.

Thanks Bill and Basil, I’ll give that a try and will post the results when I get a chance.

I did ten random scans, and with Bill's settings the files are always smaller than when doing the OCR separately.

Much more convenient not to have to go via the desktop as well.

I did get some files earlier that were over 3.5 MB each, but I suppose this new setting is better to use. I looked into the manual and help files a month ago but could not really find any comparison charts. The average now seems to be 50-250 kB for a one-page PDF+Text, and that is perfect!

I have some 17,000 files, my database is growing every day, and I need to be able to bring it with me on my iPad/iPhone, so size is important.

Thanks for sharing the results of your tests. Here are the results of some tests I did with Asian languages, which DT’s OCR does not support (as far as I know).
christopher-mayo.com/?p=98