If you look at the workflow current OCR applications present for checking conversion accuracy before PDF conversion, I think you’ll agree that it’s not practical.
But if you’ve captured an important document and have reason to believe that there are OCR errors that affect searching and analysis, the procedure below is much, much faster.
Select that PDF, then Data > Convert > to plain text. This will create a new document in the database with the same name as the PDF, except that it has a .txt extension.
Spell-check the document. Compare names, dates, and other important elements to the original PDF and correct them if necessary. This fixes the OCR errors in the text version. Leave that text file in your database alongside the PDF version.
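If you prefer to produce the text ‘twin’ outside the application, a command-line extractor works too. Here’s a minimal sketch in Python, assuming poppler’s pdftotext tool is installed (the guard skips extraction if it isn’t); the filename in the example is hypothetical:

```python
from pathlib import Path
import shutil
import subprocess

def twin_path(pdf: Path) -> Path:
    """The text twin keeps the PDF's name, with a .txt extension."""
    return pdf.with_suffix(".txt")

def make_twin(pdf: Path) -> Path:
    """Extract the PDF's text layer into its twin (requires poppler's pdftotext)."""
    txt = twin_path(pdf)
    if shutil.which("pdftotext"):
        subprocess.run(["pdftotext", str(pdf), str(txt)], check=True)
    return txt

print(twin_path(Path("Death Certificate.pdf")))  # Death Certificate.txt
```

You would still spell-check and correct the resulting .txt by hand; the script only automates the naming convention and the extraction step.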
Now, when you search for a person’s name, for example, it will be found in that text version even if it was ‘hidden’ by an OCR error in the PDF+Text file.
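The way a name gets ‘hidden’ can be sketched in a few lines of Python; the strings below are hypothetical stand-ins for a PDF’s garbled OCR text layer and its corrected twin:

```python
# An exact-match search misses the OCR-garbled text layer,
# but finds the spell-checked plain-text twin.
pdf_text_layer = "Report on the Eljiah Sornerwell death certificate"  # OCR-garbled
corrected_twin = "Report on the Elijah Somerwell death certificate"   # corrected twin

query = "Elijah Somerwell"

print(query in pdf_text_layer)  # False: the OCR error 'hides' the name
print(query in corrected_twin)  # True: the twin surfaces the document
```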
Does it sound like I’m creating lots of duplicate documents in my database? No, that’s not the case. In reality, I’ve got fewer than ten ‘twinned’ files to handle OCR errors resulting from the poor quality of the original paper copy.
There are two reasons for that:
[1] OCR errors have no impact on the text you read on-screen or from a printout of your PDF documents. What you read is an image of the original paper copy. In fact, you can read PDFs containing my handwritten notes (that is, if you can read my handwriting; sometimes, I can’t, either).
[2] OCR’d documents from reasonably good copy have few errors, and most documents have sufficient redundancy that a query will find the document anyway. In cases where that isn’t true, such as an important name to be searched for, one can enter the correct spelling of the name in the document’s Comment field, or create a plain text twin with OCR errors corrected.
As a practical matter, there’s no need to be obsessive about correcting each and every OCR error. But if you are working from an old, yellowed, smudged piece of paper, you may want to make a text file ‘twin’ to check for critical conversion errors.
Example: I scanned and OCR’d a magazine article that contained references to Superfund hazardous waste sites, some of which I had investigated. Later, I did a database search and that document didn’t come up in the results, although I remembered it.
On checking, I found that the magazine article itself had misspelled the Superfund site’s name. Duh! OCR is unlikely to correct typos in the original. So I just inserted a note in the Comment field for that document, including the correct spelling of the site’s name.
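If you ever script your own searches, fuzzy matching is another way to catch this kind of near-miss. A minimal sketch using Python’s standard difflib, with a hypothetical site name and a threshold chosen for illustration:

```python
from difflib import SequenceMatcher

def likely_ocr_variant(query: str, candidate: str, threshold: float = 0.8) -> bool:
    """Return True if candidate is probably an OCR or typo variant of query."""
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio() >= threshold

# Hypothetical example: a site name with a digit-for-letter OCR error.
print(likely_ocr_variant("Love Canal", "Love Cana1"))  # True
```

The 0.8 threshold is a guess to tune for your material: lower it and you catch more garbled spellings at the cost of more false matches.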
If you are scanning a historical document, such as the death certificate of Elijah Somerwell (whoever that may be), and the paper is old and yellowed, it’s prudent to do a Find in the PDF just to make sure that a search for Elijah Somerwell finds those terms.
But perfection is the enemy of the good. Don’t waste time.
On the other hand, I’ll never forget an amusing incident. I was packaging up print-ready copy of a book to send to the printer for publication. We had used a very experienced editor to help prepare the material, which was one in a series of bibliographies on science and technology policy. As I was closing the box, I took one final look at the cover page copy. The editor’s name was misspelled! She had done a great job on the project, but the cover page was done late in the project; she glanced at it and missed the error in her own name. So did the rest of us.