Searching through PDFs created by Acrobat, poor results

I’m using DT Pro 1.3.beta3, and Acrobat 8.0.

I have a PDF that I downloaded from an online journal storage. I OCRed the PDF using Acrobat 8.0.

When I index the resulting PDF, and try to search through it, DT Pro is not able to read it. For example, within the PDF is the phrase “special reverence”. If I search for this phrase in Acrobat, it finds it correctly. If I search for it using Spotlight, it finds it correctly. If I search for it in the file in DT, it does NOT find it.

Curious about what was happening, I highlighted the two words in DT Pro, copied them, and pasted them. They came out as “s p e c i a l r e v e r e n c e” (lifted directly from the DT Pro file); in other words, with spaces throughout. When I did the same copy-paste experiment from the PDF in Acrobat, it came out fine. Strangely enough, when I tried the same in Preview, it gave me the same results as DT, even though Spotlight was able to find the phrase, as mentioned above.

I assume the problem is that the PDF (originally an image) was typeset long ago, and perhaps has a bit more white space between the typeset letters than modern printings. Thus, when it is rendered by Acrobat, Acrobat’s search algorithm might well be able to search and interpret it correctly, as does Spotlight’s, but DT’s can’t.
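To illustrate the guess above, here is a minimal Python sketch of why a plain substring search misses the phrase in letter-spaced text, while a whitespace-insensitive comparison finds it. The function names are made up for illustration, and this is not a claim about how DT, Spotlight, or Acrobat actually search.

```python
import re

def squash(text: str) -> str:
    """Remove all whitespace and lowercase, so 's p e c i a l' equals 'special'."""
    return re.sub(r"\s+", "", text).lower()

def find_ignoring_spaces(haystack: str, phrase: str) -> bool:
    """Hypothetical helper: match a phrase while ignoring every space."""
    return squash(phrase) in squash(haystack)

# Text as it came out of the copy-paste experiment
extracted = "with s p e c i a l r e v e r e n c e for"

print("special reverence" in extracted)                      # False
print(find_ignoring_spaces(extracted, "special reverence"))  # True
```

The trade-off is that ignoring all spaces can also match across word boundaries (for example, “pens” inside “open source”), so a real search engine would need something more careful.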

Two questions:

  1. Does anyone have any advice – Acrobat side – on how to mitigate this problem?

  2. Since one doesn’t always have full control over the PDFs that one receives, maybe Christian and crew could look into why DT’s search engine isn’t able to search as effectively as, say, Spotlight’s in this instance. If possible, maybe this could be tweaked for a future release.

Again, I do recognize it’s an OCR rendering issue that begins in Acrobat. But if one wants to rely on and trust DT to search and use its AI features on any printed material one puts in it (such as PDFs), then DT should be able to search and find at least as efficiently as Spotlight and Acrobat, the major PDF-producing software out there.

If Christian or anyone else would like me to send them a sample PDF page where this is occurring, I would be happy to.

Thanks.

The text conversion (and therefore the search results) depends on pdftotext/PDFKit. Which one is selected in your preferences?

My preferences show that “PDFKit (Tiger)” is selected.

You might try pdftotext instead. Sometimes it returns better results, sometimes worse.
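One way to check a file outside DT is to run the standalone pdftotext command-line tool on it and see how letter-spaced the output looks. The following is a rough Python sketch that assumes pdftotext (from Xpdf or Poppler) is installed and on the PATH. The standalone tool may not behave exactly like the converter built into DT, and the function names and the ratio heuristic are only illustrative.

```python
import subprocess
import sys

def extract_with_pdftotext(pdf_path: str) -> str:
    """Run the pdftotext CLI; '-' as the output file sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def single_letter_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are a single letter.

    A value near 1.0 suggests letter-spaced output like 's p e c i a l'.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    singles = sum(1 for t in tokens if len(t) == 1 and t.isalpha())
    return singles / len(tokens)

if __name__ == "__main__":
    # Usage: python check_pdf.py path/to/file.pdf  (script name is hypothetical)
    text = extract_with_pdftotext(sys.argv[1])
    print(f"single-letter token ratio: {single_letter_ratio(text):.2f}")
```

Ordinary English text contains only a few one-letter words (“a”, “I”), so a high ratio is a hint that the extractor is splitting words into separate letters.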