Two questions about OCR and startup database

Two questions I couldn’t find answers to in the manual:

a- how can I tell DEVONthink Pro Office to open a database when I launch the application?
b- how can I correct the mistakes OCR makes after a scan? I’m able to copy the text from the PDF into a new text document, but I cannot correct the OCR’s mistakes in the text hidden in the PDF!

Thanks for your answers.

f.

a - In your most frequently used database, select File > Database Properties. Near the top of the Database Properties panel check the option to make this the default database. Now, whenever DT Pro is launched it will open this database.
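(A side note, in case you like scripting: a specific database can also be opened on demand from the command line or Python. This is only a sketch, assuming DEVONthink’s AppleScript ‘open database’ command and a made-up database path; adjust both for your setup.)

```python
import subprocess

# Hypothetical path -- point this at your own .dtBase2 database file.
DB_PATH = "/Users/me/Databases/Main.dtBase2"

# Ask DEVONthink Pro Office, via AppleScript, to open that database.
# Assumes the app's scripting dictionary offers an 'open database' command.
script = f'tell application "DEVONthink Pro Office" to open database "{DB_PATH}"'
subprocess.run(["osascript", "-e", script], check=True)
```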

b - Sorry, this can’t be done inside DTPO, nor is there a totally satisfactory way to do it in other applications, including Adobe Acrobat. Even in Acrobat correction of OCR errors is somewhat limited and clumsy.

Of course, when one views or prints the PDF file it appears error-free, as one is viewing an image of the original paper copy. But sometimes the searchable text ‘underlying’ the image has errors, such as when the OCR engine couldn’t ‘read’ small print or there was a blemish on the original page that caused an error.
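(If you want to see those hidden errors for yourself, you can dump the text layer outside DTPO. Here’s a minimal Python sketch using the third-party pypdf library; the file name is just a placeholder:)

```python
from pypdf import PdfReader  # third-party: pip install pypdf

reader = PdfReader("scan.pdf")  # placeholder: any OCR'd (PDF+Text) file

# What you *see* is the page image; extract_text() returns the searchable
# OCR layer underneath it -- including any recognition errors.
for number, page in enumerate(reader.pages, start=1):
    print(f"--- page {number} ---")
    print(page.extract_text())
```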

If scanning has been done with a high enough resolution there should be very few if any OCR errors for scans of ‘clean’ paper documents. I normally use an effective resolution of 300 dpi when scanning. For more critical (or more difficult) document scans I’ll use a setting of 600 dpi.

Note that resolution alone doesn’t tell the whole story. If one has ‘fiddled’ with brightness or contrast settings it’s possible to degrade OCR performance. I’ve been very pleased with the quality of images produced by the ScanSnap; they result in very few OCR errors.

For most documents I don’t worry about a few OCR errors, as I can search for terms in the document anyway.

Thank you for the answers.
About the OCR: yes, I do see some errors when converting a PDF to text. That’s normal with OCR scanning, but the errors inside the PDF cause problems with searching, obviously.
Why is correction not possible in DTPO? Is it a technical limit or a DEVON choice?

Thanks again.

f.

It’s a technical limitation of Apple’s PDFKit. Editing of the text “behind” the PDF image isn’t available.

Scans of ‘clean’ paper documents that use standard fonts produce very good OCR accuracy.

Example: I scanned and OCR’d a 154-page record of a court hearing. There were two OCR errors, both trivial. One was a misreading of the court reporter’s stamp on the cover page, and the other was a glitch caused by the image of a signature on the last page. There were no errors at all in the body of the text. In this particular document, 11-point Courier was used throughout.

But some documents may have errors. Causes include unusual fonts, very small print, and blemishes that make text recognition difficult (e.g. dark highlighting or handwritten annotations). Errors are likely in scans of paper copies of low-resolution faxes.

I haven’t been satisfied with any software I’ve tried for correcting the text layer ‘behind’ the PDF image. If anyone has found a good tool or technique, let me know. :slight_smile:

Mh… I understand. But: when DTPO scans a document, it gets the image, then DTPO runs the OCR process and gets the text, and then it puts the image and the text into a single PDF. Why can’t I edit the text BEFORE DTPO puts it into the PDF? I know this is not possible now, but an additional step to correct the text after the OCR and before the PDF is assembled could be a simple way to get a clean, searchable PDF… or am I missing something?

f.

You might consider converting the PDF to plain text, which would allow you to edit it, and then making a WikiLink to the original PDF document. It’s not a perfect solution, obviously, since the two are still in separate files, but at least if the plain text version shows up in a search, you can get to the original with just a click.

Or copy that plain text into the comment of the PDF. I’m not sure how this would affect the performance of DT, especially if you did this for a few hundred documents (which would be lame anyway).

Yes, but that seems to me quite a complicated workaround. If I want a clean PDF, I have to scan the paper with OCR, convert it to text, correct the text, delete the PDF, scan again without OCR, and link the two.
Correcting the text before it is put into the PDF could be a better solution, I think.

But I do not know whether that is possible.

f.

Some OCR applications, such as Acrobat, allow one to check the OCR results for errors and to correct them before the PDF is finally saved.

That’s a time-consuming process, and can be frustrating, especially if one can’t see the original copy or see the context of an error.

I don’t bother to do that. I’m usually batch-processing a lot of paper. As a practical matter, the OCR results from good copy are sufficiently accurate to make the documents valuable in my database. Although it’s possible that a single critical word in a document might be unsearchable, there’s almost always enough redundancy that I can find what I need, anyway.

I’ve noted with some amusement that there are often more typos in the original copy than OCR errors – even in published material. :slight_smile:

I don’t bother to correct original typos, either.

I’m thinking of scanning many old legal papers that contain important data, like names or numbers. In that case, the ability to edit the text before it is put into the PDF would be useful for correcting that data.

f.

If you take a look at the workflow presented by current OCR applications to check for conversion accuracy before PDF conversion, I think you would agree that it’s not practical.

But if you’ve captured an important document and have reason to believe that there are OCR errors that affect searching and analysis, the procedure below is much, much faster.

Select that PDF, then Data > Convert > to plain text. This will create a new document in the database with the same name as the PDF, except that it has a .txt extension.

Spell-check the document. Compare names, dates, or other important elements to the original PDF and change them if necessary. The result is correction of OCR errors in the text version. Leave that text file in your database along with the PDF version.
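(For those who prefer to script it, the same text twin can be produced outside DTPO. A sketch, again assuming the third-party pypdf library and a placeholder file name:)

```python
from pathlib import Path
from pypdf import PdfReader  # third-party: pip install pypdf

pdf_path = Path("hearing.pdf")  # placeholder name

# Pull the searchable OCR layer out of the PDF...
reader = PdfReader(pdf_path)
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# ...and write it as an editable plain-text twin next to the PDF.
# Open the .txt in any editor, spell-check it, and fix names and dates.
pdf_path.with_suffix(".txt").write_text(text, encoding="utf-8")
```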

Now, when you search for a person’s name, for example, it will be found in that text version even if it was ‘hidden’ by an OCR error in the PDF+Text file.

Does it sound like I’m creating lots of duplicated documents in my database? No, that’s not the case. In reality, I’ve got less than ten ‘twinned’ files to handle OCR errors resulting from poor quality of the original paper copy.

There are two reasons for that:

[1] OCR errors have no impact on the text you read on-screen or from a printout of your PDF documents. What you read is an image of the original paper copy. In fact, you can read PDFs containing my handwritten notes (that is, if you can read my handwriting; sometimes, I can’t, either).

[2] OCR’d documents from reasonably good copy have few errors, and most documents have sufficient redundancy that a query will find the document anyway. In cases where that isn’t true, such as an important name to be searched for, one can enter the correct spelling of the name in the document’s Comment field, or create a plain text twin with OCR errors corrected.

There’s no need to be obsessive about correcting each and every OCR error, as a practical matter. But if you are working from an old, yellowed and smudged piece of paper you may want to make a text file ‘twin’ to check for critical conversion errors.

Example: I scanned and OCR’d a magazine article that contained references to Superfund hazardous waste sites, some of which I had investigated. Later, I did a database search and that document didn’t come up in the results, although I remembered it.

On checking, I found that the magazine article had misspelled a Superfund site’s name. Duh! OCR is unlikely to correct typos in the original. So I just inserted a note in the Comment field for that document, including the correct spelling of the site’s name.

If you are scanning a historical document, such as the death certificate of Elijah Somerwell (whoever that may be), and the paper is old and yellowed, it’s prudent to do a Find in the PDF just to make sure that a search for Elijah Somerwell finds those terms.
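(That Find can also be automated for a batch of critical terms. A quick sketch, assuming pypdf and the example name above; the file name is a placeholder:)

```python
from pypdf import PdfReader  # third-party: pip install pypdf

# Collect the OCR text layer of the whole document.
pages = PdfReader("certificate.pdf").pages  # placeholder file name
text = " ".join(page.extract_text() or "" for page in pages)

# Verify that each critical term survived OCR and is searchable.
for term in ("Elijah", "Somerwell"):
    found = term.lower() in text.lower()
    print(f"{term}: {'found' if found else 'MISSING -- correct the twin or Comment'}")
```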

But perfection is the enemy of the good. Don’t waste time.

On the other hand, I’ll never forget an amusing incident. I was packaging up print-ready copy of a book to send to the printer for publication. We had used a very experienced editor to help prepare the material, which was one in a series of bibliographies on science and technology policy. As I was closing the box I took one final look at the cover page copy. The editor’s name was misspelled! She had done a great job on the project. But the cover page was done late in the project; she glanced at it and missed the error in her own name. So did the rest of us; we all overlooked that obvious error.

I want to add that one of DT’s strengths helps here too: it relates similar words. When we were doing a demo at CeBIT, we once completely crumpled a page and scanned it with the ScanSnap (it was in German). And I was still able to find specific words using the fuzzy search, even though some were garbled during OCR.
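(For the curious: the effect of that kind of fuzzy matching can be approximated in a few lines with Python’s standard difflib. This only illustrates the idea; it is not DEVONthink’s actual algorithm, and the garbled words are invented examples.)

```python
from difflib import get_close_matches

# Words as OCR might have garbled them on the crumpled page.
ocr_words = ["Umweltschutz", "Recyc1ing", "Abfallw1rtschaft", "Deponie"]

# A fuzzy lookup still finds "Recycling" despite the 1-for-l OCR error.
print(get_close_matches("Recycling", ocr_words, n=1, cutoff=0.8))
# -> ['Recyc1ing']
```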