OCR problems - not recognizing spaces & repeated use of

milhouse · February 28, 2007, 4:06am

Hi,

In performing a concordance on the database I noticed more than a few rather strange words.

For example, the word “Abiosocialview” appears near the top of my list. When I search the actual document, I find that the phrase “a biosocial view” is present. The spaces seem rather clear to me, but the OCR’d document contains numerous similar errors. The most common is the omission of spaces.

A secondary issue is that a single word shows up in more than one form. For example, the single word “altruism” appears as both “altru” and “altruism” in the word list for one particular document.

This certainly impacts the effectiveness of the AI as well as other more common searches that involve OCR’d documents.

Is there a way to “fix” or eliminate this issue? Is it some sort of bug?

thanks

Bill_DeVille · February 28, 2007, 5:38am

It’s not a bug in the Concordance. Those “accreted” words are in the text layer underlying the image. You can verify that by searching for a document in which the run-together word is found, then using Data > Convert to make a plain or rich text version of the PDF.

I tease some of my friends by commenting on the frequency of run-together words in German.

Run-together words can occur in the OCR process, using any of the OCR applications I’ve ever used, on the Mac or in Windows.

But I’ve seen run-together words in PDFs that were produced by a word processor. This example, “widthdemandthanthestrictlyunicast” occurs in such a PDF prepared for publication in a computer science journal. That particular text string occurs in only one document in my database. I did a search on the string and found that document. Yes, there are obvious spaces between the words in the image layer. But in the underlying text layer, it is a run-together string. Why? I don’t know. The original document was done in MS Word for Windows and converted to PDF. I assume that the error didn’t exist in the Word document but was an artifact of conversion to PDF.

That sentence should read “One possible cause is that ODMRP is a
multicast algorithm and has a more stringent band-
width demand than the strictly unicast protocols.”
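For anyone who wants to experiment, here is a minimal sketch of how a run-together string like the one above could be split back into words with a dictionary lookup. It is an illustration only, not something DEVONthink does: the `segment` function and the tiny `VOCAB` word list are hypothetical, and a real cleanup would need a full dictionary.

```python
def segment(text, vocabulary):
    """Split a run-together string into known words, or return None if no split exists."""
    text = text.lower()
    # best[i] holds a list of words that exactly covers text[:i], or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    max_len = max(map(len, vocabulary))
    for end in range(1, len(text) + 1):
        # Try the longest candidate word ending at `end` first.
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                best[end] = best[start] + [word]
                break
    return best[len(text)]


# Hypothetical mini word list, just enough for the two examples in this thread.
VOCAB = {"a", "biosocial", "view", "width", "demand",
         "than", "the", "strictly", "unicast"}

print(segment("widthdemandthanthestrictlyunicast", VOCAB))
print(segment("Abiosocialview", VOCAB))
```

With a realistic dictionary some strings will have more than one valid split, so the output is best treated as a suggestion to review rather than an automatic fix.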

Split/truncated words can result from hyphenation of a word at the end of a line. They can also result from OCR errors.
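As a rough sketch of the hyphenation case, assuming you have already exported the text layer to plain text, a regular expression can rejoin words that were split at a line end. The `rejoin_hyphenated` name is made up for this example.

```python
import re


def rejoin_hyphenated(text):
    """Join words that were split by a hyphen at the end of a line."""
    # A letter, a hyphen, a line break, then a lowercase letter:
    # drop the hyphen and the line break so the two halves become one word.
    return re.sub(r"(?<=[A-Za-z])-\n(?=[a-z])", "", text)


sample = "has a more stringent band-\nwidth demand than the strictly unicast protocols."
print(rejoin_hyphenated(sample))
```

Note that this also joins genuinely hyphenated compounds that happen to break at a line end (“well-” / “known” becomes “wellknown”), so it trades one kind of glitch for a rarer one.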

Yes, such glitches can cause problems for AI routines. Fortunately, they are relatively rare in my database, and there’s usually enough redundancy in documents that searches work well enough.

I’m still waiting for someone to release an application that would let me clean up such glitches in the text layer without affecting the image layer.

parlar · March 1, 2007, 3:45pm

One interesting thing I’ve run into: I’m starting to notice that a lot of my PDFs that were already searchable when I downloaded them have some bad OCR mistakes. So I simply have DT run its own OCR on the PDF, and I get a much better result than what I had before!

annard · March 1, 2007, 4:02pm

That is another good use of the Convert feature!