Search oddities with .pdfs - cautionary tale

Not sure I’m going to be able to explain this well.

I have a .pdf on my computer, within which is the text “Do low cost carriers intensify airport competition”. In fact, that is the title of the document (not the name of the .pdf, but the main title of the document).

If I search for “intensify airport competition” with the following search parameters:

ALL - PHRASE - ignore case = finds it no problem
CONTENT - PHRASE - ignore case = finds it no problem

If I search for “Do low cost carriers”, here’s what I get:

ALL - PHRASE - ignore case = does not find it
CONTENT - PHRASE - ignore case = does not find it

I then tried putting “Do low cost carriers” in quotation marks:

ALL - PHRASE - ignore case = does not find it
CONTENT - PHRASE - ignore case = does not find it

Thinking that there might be something a little amiss with the .pdf, I converted it to multiple TIFF docs (one for each page of the original .pdf) and then recombined in Acrobat 8 into a new .pdf. Same results.

So, doing a little more experimentation, I found that it turned out to be the rendering of the original .pdf itself.

For example, if I have the pdf open, copy what I see on my screen as “Do low cost carriers”, and paste it into DTPO’s search, this is what shows up: "Dolow cost carriers ". When I ask DTPO to search for that, it finds the pdf no problem, but it does not find “Do low cost carriers” (notice the spacing).
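
To make the spacing problem concrete, here is a rough Python sketch. It is not how DTPO or PDFKit search; the function names are made up for illustration, and the `extracted` string is my assumption of what the pdf's text layer looks like, based on the pasted "Dolow cost carriers" above. It just shows why a literal phrase match misses the title while a whitespace-insensitive comparison would still catch it.

```python
import re

def squash(text):
    """Lowercase the text and drop every whitespace character."""
    return re.sub(r"\s+", "", text).lower()

def phrase_search(haystack, phrase):
    """Literal, case-insensitive phrase match; spaces are significant."""
    return phrase.lower() in haystack.lower()

def loose_search(haystack, phrase):
    """Case-insensitive match that ignores all whitespace differences."""
    return squash(phrase) in squash(haystack)

# Assumed text layer: the space between "Do" and "low" is missing.
extracted = "Dolow cost carriers intensify airport competition"

print(phrase_search(extracted, "Do low cost carriers"))           # False
print(phrase_search(extracted, "intensify airport competition"))  # True
print(loose_search(extracted, "Do low cost carriers"))            # True
```

The catch with the loose version is that ignoring spaces can also match across word boundaries you did not intend, so it is more of a fallback than a replacement for a normal phrase search.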

So, just in case anyone happens across this, it is not DTPO’s fault - it is how the pdf was rendered when it was created. Unfortunately, this might make quite a few of my pdfs un-searchable, but not a huge amount I can do about it.

Cheers ears -

David

David, I’ve seen some apparently similar problems with PDFs created by Acrobat 8. In those files, PDFKit and Preview were unable to reliably search the text of the PDFs.

When I converted such a PDF to plain text within DT Pro, there were many extra spaces inserted in words. But Acrobat 8 could save the text without those space artifacts.

My guess is that PDFKit isn’t entirely compatible with Acrobat 8 files. Hope that gets corrected soon. In the meantime, I’m holding off on Acrobat 8 and sticking with 7.

Thanks, Bill. Might do a little more research on this and post back what I find. Odd that Acrobat 8 and PDFKit don’t play well together.