OCR performance

Jones · December 30, 2008, 10:33pm

DTPO 2 OCR is not performing well for me. Perhaps someone here can help.

I’m starting with a simple editorial in Nature magazine (Nature vol. 456, no. 7221, p. 421): straight text, clean font. I tried importing per the DTPO manual using Image Capture (p. 290) with a CanoScan LiDE 90; the resulting image file was surprisingly large (>30 MB) with low contrast (background a bit dark). The OCR function took a while and produced a file of about 90 MB, with no evident text.

Tried again, using Image Capture directly, outside DTPO. Adjusted the image to drop out the background. (Image Capture standalone includes controls not available within DTPO.) Captured a clean image at 300 dpi to a 9.2 MB PDF file.

Brought that PDF into DTPO and applied OCR using Convert to Searchable, which resulted in a 33 MB “PDF + Text” file. (Seems large for fewer than 1000 words.)

To view the hidden text layer, in the Viewer I dragged the cursor over the text to select the hidden text, copied, and pasted into an external text editor (using paste and match style to avoid hidden attributes).

Resulting text is incomprehensible bits of words from the document plus random characters which appear to be bad OCR artifacts.

So, I am getting very poor OCR of very high quality text. What am I doing wrong?

Bill_DeVille · December 31, 2008, 6:06am

What option settings did you use in Image Capture?

From DTPO 2, I just scanned a two-page document, which included several type fonts and sizes. I used a CanoScan LiDE 500F scanner, which is rather similar to yours.

File > Import > Document (via Image Capture) > Capture. I clicked on the options button and chose text (black & white) at 300 dpi. Scanned both pages with OCR into my database. The resulting 2-page PDF+Text (using the default preferences for resolution and image quality) was 796 KB in size. The OCR accuracy was very good.
File > Import > Document (from ExactScan). The ExactScan application opens, but without a window. Chose File > New Scan. I chose black & white at 300 dpi. The resulting 2-page PDF+Text was 622 KB in size. The OCR accuracy was very good.

In general, I prefer using ExactScan Capture, as the image quality seems a bit better and there are more options available.
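For a rough sense of why the colour mode matters so much to file size, here is a back-of-the-envelope sketch in Python. It assumes a US Letter page (8.5 x 11 in) and uncompressed pixel data; neither assumption comes from the scanner software, the function name is made up for illustration, and real PDFs are compressed, so actual files will differ.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate uncompressed size, in bytes, of a scanned page.

    Ignores row padding, file headers, and any compression.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page size: US Letter, 8.5 x 11 inches (adjust to your page).
modes = [
    ("color (24-bit)", 24),
    ("grayscale (8-bit)", 8),
    ("black & white (1-bit)", 1),
]
for name, bits in modes:
    size_mb = raw_scan_bytes(8.5, 11, 300, bits) / 1_000_000
    print(f"{name}: {size_mb:.1f} MB")
```

Under these assumptions a 300 dpi colour page comes to roughly 25 MB of raw pixels, against about 1 MB in black & white, which is in line with the large files from a colour scan and the much smaller ones from the text (black & white) setting.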

Jones · December 31, 2008, 8:12am

I’m getting much improved results by following your detailed instructions, thank you. Two pages of Nature magazine are stored at 3.1 MB with very good OCR results. ExactScan seems to work a bit better than Image Capture for me also.

The stored image resolution is reduced in post-OCR processing a bit too aggressively for me (the result is difficult to read). The initial settings on the Preferences OCR tab were Image Resolution: 150 and Image Quality: 50%. I changed these to 200 and 70%, guessing this would increase the post-OCR compressed image quality at the expense of a slightly larger file size. After re-scanning, the final image quality did improve, but the resulting 2-page file shrank to 2.2 MB (from 3.1 MB). Weird, but I’m not complaining.
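As a rough check on that expectation: pixel count scales with the square of the resolution, so a page stored at 200 dpi holds noticeably more pixels than the same page at 150 dpi. A minimal Python sketch follows; pixel_ratio is a made-up helper, not anything in DTPO, and it ignores compression entirely, so it cannot account for the smaller file reported here.

```python
def pixel_ratio(old_dpi, new_dpi):
    """How many times more pixels the same page holds at new_dpi vs old_dpi."""
    return (new_dpi / old_dpi) ** 2


# Stored image resolution raised from 150 to 200 dpi.
print(f"150 -> 200 dpi: {pixel_ratio(150, 200):.2f}x the pixels")

# Downsampling a 300 dpi scan to 150 dpi for storage.
print(f"300 -> 150 dpi: {pixel_ratio(300, 150):.2f}x the pixels")
```

So before compression, the 200 dpi setting stores a bit under twice the pixels of the 150 dpi default, while a 300 dpi scan stored at 150 dpi keeps only a quarter of its pixels.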

Will experiment more later. Thanks for pointing me in the right direction.