Hi, Hilko. The problem you have encountered is not uncommon. It isn’t caused by DTPO’s indexing of the text content, but by OCR errors in converting an image-only PDF to PDF+Text.
The reason is that no OCR software can read and convert the text in an image-only PDF as accurately as a human can. The overall accuracy of OCR software has improved dramatically over the last ten years or so, but I don’t expect OCR to attain human-level ability to read and convert an image of text to computer-readable text within the next ten or twenty years, if ever.
You should expect very good accuracy, with few if any errors, when the scanned paper copy meets all of these criteria:
- it is “clean” (no marks or blemishes);
- it was scanned at an adequate resolution for OCR (usually at least 300 dots per inch);
- the text is set in a “standard” font at 12 points or larger;
- there are no mixed fonts near each other; and
- there are no graphics closely juxtaposed to the text.
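As a rough sanity check, the 300-dpi rule of thumb above can be verified from a scan’s pixel dimensions. This is my own illustrative sketch (the function names are mine, not anything in DTPO):

```python
def scan_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Effective horizontal resolution of a scan, in dots per inch."""
    return pixel_width / page_width_inches

def adequate_for_ocr(pixel_width: int, page_width_inches: float,
                     min_dpi: int = 300) -> bool:
    """True if the scan meets the usual 300-dpi minimum for reliable OCR."""
    return scan_dpi(pixel_width, page_width_inches) >= min_dpi

# An 8.5-inch-wide page scanned to 2550 pixels is exactly 300 dpi.
print(adequate_for_ocr(2550, 8.5))   # True
print(adequate_for_ocr(1700, 8.5))   # 200 dpi -- too coarse for good OCR
```

The same arithmetic works in reverse: to hit 300 dpi on a standard 8.5-inch page, the scanner needs to produce an image at least 2550 pixels wide.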
I’ve scanned and OCR’d hundreds of pages from paper with no errors at all in the resulting PDF+text. That’s the ideal case, where the original paper copy meets all of the above criteria.
But I’ve scanned thousands of pages from paper whose converted versions do contain OCR errors, because the original documents didn’t meet the criteria listed above.
In the vast majority of cases, transforming paper to searchable PDF+text in my database still makes the information content of those paper documents more accessible to me, even with some OCR errors. Overall, I’m delighted by the results.
I always send OCR’d material to my database as PDF+Text. That’s because the PDF+Text format is “self-validating”: even an original paper copy in poor condition, one that produces numerous OCR errors, can still be viewed, printed, and interpreted by me, so the document will likely make sense. But in such a case, I’ll probably add notes to the document’s Comment field to help DTPO find it for me (just as I do when I send handwritten notes to my database as PDF images). If I need to extract the text from that document using Data > Convert > plain or rich text, I can correct errors by consulting the “original” image layer in the corresponding PDF+Text document. Even on my MacBook Pro screen, I can place the text version side-by-side with the PDF+Text version as an aid to proofing and error correction.
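For that proofing step, any plain text diff will do. As a hypothetical example (the function name and workflow are mine, not a DTPO feature), Python’s standard difflib can flag exactly where a hand-corrected copy diverges from the OCR output:

```python
import difflib

def ocr_diff(ocr_text: str, proofed_text: str) -> list[str]:
    """Unified-diff lines showing where the OCR output differs
    from a proofread copy of the same document."""
    return list(difflib.unified_diff(
        ocr_text.splitlines(),
        proofed_text.splitlines(),
        fromfile="ocr",
        tofile="proofed",
        lineterm="",
    ))

# A typical OCR confusion: "rn" misread where "m" was printed.
for line in ocr_diff("The quick brown fox jurnped.",
                     "The quick brown fox jumped."):
    print(line)
```

Running the extracted text through something like this against a corrected draft makes it easy to confirm that every change was intentional before replacing the text version.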
I believe I’ve bought and tested every OCR application available on the Mac since the earliest ones (I’ve owned just about every version of Acrobat Pro, including version 7.x, but haven’t tested version 8 yet). In my own experience, the IRIS 11 engine used in DTPO is overall the fastest and most accurate I’ve used.
The training and pre-editing features in some OCR software have never seemed useful to me. I’ve tried them. In my experience, training for one document is likely to lead to even more errors in the next document I wish to OCR, because the next scan will probably contain different fonts, and so on. Pre-editing can be very time-consuming and (to me) irritating, so I don’t bother. It’s easier to edit a text conversion later on.
Bottom line: one must accept that while OCR can be very useful, the current state of the art is not error-free, for a number of reasons, and it will be a long time, if ever, before error-free OCR is attained.
Near the top of my wish list would be an application that would allow me to correct OCR errors in a PDF+Text document without changing the existing image layer. I’m not aware of such an application, but I hope someone is developing it. 