How to correct mistakes after OCR?

matschx · January 6, 2008, 12:03am

After importing existing PDF files into DT PO with OCR, I would like to correct the recognition errors in the new PDFs that I keep in DT. The errors become visible when I convert the new PDFs to text files. I need to correct them so that I can search for the terms reliably, but how? I do not want to convert the new PDFs to text files and keep both files, one for the correct spelling and one for the layout. I tried the spell check on the PDF, but DT found only the first mistake and could neither correct it nor continue to the next. So I opened the new PDF in Acrobat 8 Professional, because it has a function that shows the mistakes made during the OCR process and lets you correct them. But it cannot repeat the OCR process because it has already been done, and so it cannot find any mistakes.
So, is there a way to correct these mistakes in DT PO? Or would you suggest another application?

Thank you!

Bill_DeVille · January 6, 2008, 12:30am

I’ve been wishing for a very long time for an application that would let me correct OCR errors in the text layer of a PDF without changing the image layer. The image layer is a faithful representation of the original and should not be changed.

But I’m not aware of such an application.

The only workaround is to create a text file version of the PDF using Data > Convert > To Plain or Rich text, and use that copy for searches, excerpts for use as quotations, etc. If it’s rich text, a hypertext link to the PDF can be inserted.

Adobe’s Acrobat has very rudimentary text editing capability, which is much too clumsy for extensive editing – and it changes the image layer.

matschx · January 6, 2008, 1:17am

Thank you for the quick answer. As I am a DT beginner trying hard to improve my skills, could you tell me how to create a hypertext link to another file in my database? I managed to turn some words in the text file into a link to the PDF, but I could not do the same in the PDF, because I could not turn anything there into a link. Is there no way to do it? Of course I will find the text file whenever I search for a word that is in the PDF, but it would be great to see immediately that I have already converted the PDF to a text file, so that I do not convert it again when I want to work with its text and end up with many similar copies.
Is there a good way to mark two items in the database as belonging together in a case like this?

Thank you again for being my private teacher and helping me get started with DT!

Bill_DeVille · January 6, 2008, 2:01am

True, you can’t create a hyperlink from a PDF document.

But the text file has the same name as the PDF from which it was converted. So you can find the text file from the PDF by selecting the Name of the PDF file and pressing Command-/ (the Lookup command). That will open a Search window with the Name already entered. To make the search very fast, check the Name option.
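
The same name-matching idea can also be sketched outside DT, for example on a folder of files exported from the database. The Python snippet below is only an illustration under stated assumptions: the function name `find_text_companions`, the folder layout, and the `.txt`/`.rtf` extensions are hypothetical, and it works on ordinary files on disk rather than on the DT database itself.

```python
from pathlib import Path

# Assumed extensions for the plain and rich text conversions (hypothetical).
TEXT_SUFFIXES = {".txt", ".rtf"}


def find_text_companions(folder):
    """Map each PDF in `folder` to the text files that share its name."""
    root = Path(folder)
    by_stem = {}
    # Index the text files by their name without extension, ignoring case.
    for path in root.iterdir():
        if path.is_file() and path.suffix.lower() in TEXT_SUFFIXES:
            by_stem.setdefault(path.stem.lower(), []).append(path.name)
    # Look up each PDF's name in that index; an empty list means no match.
    return {
        pdf.name: sorted(by_stem.get(pdf.stem.lower(), []))
        for pdf in sorted(root.iterdir())
        if pdf.is_file() and pdf.suffix.lower() == ".pdf"
    }
```

A PDF that maps to an empty list has no same-named text conversion in that folder yet, so it would be safe to convert it without creating a duplicate.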

And of course you can also place a reminder that a text conversion file exists, in the Comment field of the PDF’s Info panel.