Text in some PDFs is not indexed by Preview or DTP


I’ve noticed that some PDFs I get from academic journals have selectable text, but neither Preview nor DTP can search it. That is, I type in a search term and I get 0 hits.

When I open the same PDFs in Acrobat, searching works just fine.

Does anyone know why this is happening? Is there an option in Acrobat that “locks” the text away from PDFKit? If so, can I unset this option and then save the file so that DTP can import it and actually see the text?

I suppose that I could upgrade to DTPO, buy a scanner, print out the pdf, and then scan it in with OCR. But it seems wasteful to print out a file for the express purpose of scanning it back in.

I’ve noticed that some PDFs produced by Acrobat 8 display lots of extra spaces in words when read by Apple’s PDFKit, which is used by the DEVONthink applications.

Try this with one of your ‘unsearchable’ PDF files. Select it and choose Data > Convert > to (rich or plain) text.

Examine the resulting text file. Does it have extraneous spaces?
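
If the converted text file is long, a rough automated check may be quicker than reading it by eye. The following is only a sketch, assuming Python is available: the function names and the 0.3 threshold are made up for illustration, and it catches only the letter-by-letter case ("s e a r c h"), not words split into larger fragments.

```python
def single_letter_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that are one letter long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    singles = sum(1 for token in tokens if len(token) == 1 and token.isalpha())
    return singles / len(tokens)


def looks_space_broken(text: str, threshold: float = 0.3) -> bool:
    """Heuristic: True when an unusually large share of tokens are single letters.

    The threshold is an arbitrary assumption; ordinary English prose has few
    one-letter words ("a", "I"), so a high ratio hints at inserted spaces.
    """
    return single_letter_ratio(text) > threshold
```

Paste the contents of the converted file into `looks_space_broken(...)`; a `True` result suggests the extra-space problem rather than a missing text layer.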

Although PDF is close to a universal standard, Adobe has made some changes, and I would expect PDFKit to handle them in a future OS update or upgrade.

When I create PDFs I usually save them as PDF version 1.4, to prevent such problems.

Based on my experience with DTPO, I will say that you don’t have to print out PDFs and then scan them back in to get searchable text. I do it all the time by choosing File -> Import -> Images (with OCR) and pointing the file browser at the PDF file. DTPO runs OCR on the file during import.

Those menu commands are greyed out. I think it’s because when I import the files, the log says there was no text.

When I look at the document properties for the pdf (in Acrobat), it says “PDF version 1.3” and that it was made by Distiller for Windows 3.01 (shouldn’t matter, right?). Also there are no document restrictions. How do I change it to PDF version 1.4?

:open_mouth: Really? Sweet! DTPO might just do the trick then.

[1] It’s pretty clear that the PDFs from your academic journal source are image-only (just pictures of text). So there’s no text to convert, hence the command is grayed out when such a file is selected.

You would have to run image-only PDFs through OCR to convert them to PDF+text.

Most academic sources are now supplying PDF+text files rather than image-only files, as there’s a big advantage to users if they can search and/or extract text.

[2] Versions 1.3 and 1.4 are simply earlier versions of the PDF “standard” file format. There’s little difference between them. Both are just about universally readable on any computer platform. Both can be image-only (just a picture of text) or PDF+text (searchable text); that’s not the difference between them.
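
For anyone who wants to check the version without opening Acrobat: a PDF file starts with a header comment of the form `%PDF-1.3`. Here is a minimal sketch in Python that reads it; the function name is invented for this example. One caveat: from PDF 1.4 onward the document catalog may carry a `Version` entry that overrides the header, and this sketch does not look at that.

```python
import re
from typing import Optional


def pdf_header_version(path: str) -> Optional[str]:
    """Return the version string from the '%PDF-x.y' header, or None if absent.

    Only the first 1024 bytes are searched, because some files have a little
    junk before the header.
    """
    with open(path, "rb") as pdf_file:
        head = pdf_file.read(1024)
    match = re.search(rb"%PDF-(\d\.\d)", head)
    return match.group(1).decode("ascii") if match else None
```

Calling `pdf_header_version("paper.pdf")` (a placeholder file name) on one of the journal files should return the same version that Acrobat’s document properties report.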

There are more ‘flavors’ of PDF than most people realize. For example, there are special standards used for color management in pre-press PDFs going to printers who will print magazines or brochures where color management is very important.

Adobe has been moving Acrobat 8 toward group or consultative features and the highest-numbered versions of PDFs may have some compatibility problems with Apple’s current version of PDFKit, for example.

As software evolves, every now and then compatibility problems rear their ugly heads. It’s happened before and will happen again. Usually everybody ends up adapting to changes and the compatibility problems disappear until the next time features are changed.

In the meantime, I’m not updating Acrobat 7 to Acrobat 8. :slight_smile:

No, that is not true. I can use selection tools to select the text in Preview and in Acrobat. And my searches in Acrobat do find words that I’m looking for. But here’s something interesting: if I select and copy the text in Preview, and then try to paste it in TextEdit, I get blank spaces. If I do the same thing with Acrobat, I get the text. So it’s definitely a matter of access: PDFKit is allowed to display the text, but not allowed to copy or index it. I wonder if this has to do with copyright protection by the publisher? But if that’s so, then it’s terrible protection because Acrobat can still grab the text.
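
One rough way to settle the image-only question from outside any viewer is to scan the raw file for font references, since an image-only scan normally has none. This is a heuristic sketch in Python with a made-up function name, not a real parser. It assumes an older file (1.3 or 1.4) where dictionaries are stored uncompressed; newer files can hide them in compressed object streams, so a `False` there proves nothing. A `True` only shows that fonts are referenced, not that the text maps to readable characters.

```python
def mentions_fonts(path: str) -> bool:
    """Heuristic: True if the raw bytes of the file contain a '/Font' name.

    Also matches names such as '/FontDescriptor', which likewise imply
    that the file embeds or references fonts.
    """
    with open(path, "rb") as pdf_file:
        data = pdf_file.read()
    return b"/Font" in data
```

If this returns `True` for a file that pastes as blanks, the text layer is probably present and the trouble lies in how it is being extracted.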

:frowning: Oh, that’s a bummer. I don’t suppose there’s a way for Acrobat to render each page as an image? I guess few people would ever consider that to be a feature. I’ll have to pull GraphicConverter off the shelf and see what it can do.

One more question about DTPO: can it do OCR on a file that it has already imported? When I import these “non-searchable” files, the log says there’s no text, but I can clearly read it. Does that mean that DTP somehow rendered the text as images? If so, maybe OCR on the imported file would work?

Thanks for helping to demystify PDF, Bill. :slight_smile: I’ll look forward to an updated PDFKit in future OS releases. I have a tangential question. :question: Why is it that Apple can include built-in PDF conversion in OS X, allowing you to create PDFs from any print menu, whereas PCs don’t have this built-in capability? Does Apple have a deal with Adobe that allows them to do this? My Mac-using students can send me their lab reports as PDFs, which I prefer, but PC-using students send Word documents, which can be a real hassle. But I can’t require them to send me PDFs, because that capability isn’t built in to their computers. Just wondering if you knew.