PDF Imports as Image?

julian · July 1, 2004, 2:28am

So, I’ve tried importing and indexing a couple of different PDF documents, but they appear in DEVONthink as type “Image”, which is problematic because I can’t do any sort of Find on the actual text of the PDF. Has anyone seen this before? Is there a way to get a PDF treated as a PDF?

julian

cgrunenberg · July 1, 2004, 4:35am

Some PDF documents don’t contain any text (e.g. only bitmap graphics), are encrypted, or use the PDF 1.5 format (which is not supported by the pdftotext 2.x used in DT 1.8.x). However, v1.9 will use pdftotext 3.0 and will therefore support PDF 1.5 and all PDF character tables (e.g. Japanese).
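For anyone who wants to check which of these cases a given file falls into before importing it, here is a rough Python sketch. It is only a heuristic on the raw bytes and has nothing to do with how DEVONthink or pdftotext work internally: it reads the version from the `%PDF-x.y` header and looks for an `/Encrypt` entry. The function names `inspect_pdf_bytes` and `inspect_pdf` are made up for this example, and a file can declare a newer version elsewhere than the header, so treat the result as a hint only.

```python
import re


def inspect_pdf_bytes(data: bytes) -> dict:
    """Heuristic look at raw PDF bytes (hypothetical helper, no PDF library)."""
    # The header normally looks like b"%PDF-1.4" near the start of the file.
    match = re.search(rb"%PDF-(\d)\.(\d)", data[:1024])
    version = (int(match.group(1)), int(match.group(2))) if match else None
    # Encrypted files reference an /Encrypt dictionary; a plain byte search
    # is an assumption-laden shortcut, not a real parse of the trailer.
    encrypted = b"/Encrypt" in data
    return {
        "version": version,
        "encrypted": encrypted,
        # PDF 1.5 and later is the case described as unsupported above.
        "newer_than_1_4": version is not None and version >= (1, 5),
    }


def inspect_pdf(path: str) -> dict:
    """Read a file from disk and run the same heuristic on its contents."""
    with open(path, "rb") as handle:
        return inspect_pdf_bytes(handle.read())
```

Calling `inspect_pdf("paper.pdf")` (the file name is just a placeholder) returns a small dictionary; if `encrypted` or `newer_than_1_4` is true, that may explain why the document shows up as an image.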

ccheney · August 26, 2004, 1:40am

I was glad to find this message, after finding that I could not highlight or otherwise work with PDFs I had imported. [By the way, the manual doesn’t address this problem: I followed its instructions, opening the original file in Preview and looking for checkmarks to indicate that some features were turned off, but it opened okay in Preview, tho’ just as uneditable as in DT.]

I guess the next question is, when might 1.9 appear, and are there workarounds other than opening the PDF in Adobe Reader and trying to save a copy as text?

Cynthia

cgrunenberg · August 29, 2004, 4:04pm

Although DT 1.9 (to be released within 2-3 weeks) should be able to index/convert more PDF documents, it will still be necessary to open encrypted ones in Acrobat Reader for searching/copying (or to use TextLightning 3.0).

This situation will probably not change until Tiger and its PDFKit (which offers capabilities similar to Panther’s Preview) are available.

ldrosenblum · September 2, 2004, 9:04pm

Here’s a pretty quick workaround that seems to have worked for me:

In Acrobat (Professional, others?) save the file in PostScript format.
Then open the PS file in Apple’s Preview, which will convert it to PDF.
Then save normally in Preview (which keeps it in PDF format).
And then bring this new file into DEVONthink.
The new file is saved as ‘PDF + Text’ by DEVONthink and is fully searchable.

Hope this works for everyone and for many of these problematic files.

cgrunenberg · September 3, 2004, 5:02am

Another workaround (not sure if this still works under 10.3.x) is to open an encrypted PDF document in Preview and to print the document (if possible) directly to DEVONthink.

Note: This requires the "Save To DEVONthink" PDF services script.

Wuddel · October 11, 2004, 4:21am

The funny thing is that I have tons of protected PDFs in my database and only one imports as an image. It is a paper from the EMBO Journal from 2001. Another EMBOJ paper from 2004 imports as PDF+Text.

Bill_DeVille · October 11, 2004, 6:17am

Julian (and others):

The two most common reasons that text from a PDF file can’t be imported into DEVONthink are:

[1] The PDF file doesn’t contain text – but only an image of text. Assuming that the image has adequate resolution, an OCR program such as Read IRIS 9 Pro may be able to convert the file to a PDF file containing text that can then be read by DT.

[2] The file is encrypted so that full access to it can only be had using a password. Often, the copy-protection allows viewing, but not copying text (sometimes, not even printing the file). In that case, DT cannot import the file’s text.
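A quick way to see whether a file falls under [1] or [2] is to try extracting its text outside of DT and check whether anything comes back. The sketch below assumes the `pdftotext` command-line tool is installed and on the PATH; the helper names `has_extractable_text` and `extract_text` and the 20-character threshold are arbitrary choices for illustration, not anything DT itself does.

```python
import shutil
import subprocess


def has_extractable_text(text: str, min_chars: int = 20) -> bool:
    """Return True if the extracted text holds enough letters or digits.

    Image-only or copy-protected PDFs typically yield nothing but
    whitespace and form feeds. The threshold is an arbitrary assumption.
    """
    meaningful = [ch for ch in text if ch.isalnum()]
    return len(meaningful) >= min_chars


def extract_text(path: str) -> str:
    """Run pdftotext on a file and return whatever it writes to stdout."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # "-" as the output file asks pdftotext to write to standard output.
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True,
        text=True,
        errors="replace",
        check=False,
    )
    return result.stdout
```

If `has_extractable_text(extract_text("paper.pdf"))` is false (again, the file name is a placeholder), the file probably needs OCR or one of the workarounds discussed in this thread before DT can index it.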

Most encryption schemes can be broken, but sometimes only with great difficulty and the expenditure of time.

Sometimes the trick of saving the file as a PostScript file under Acrobat Professional, then opening the PS file as PDF under Preview will work. But it fails so often that I’ve given up on that approach.

Printing the file to disk as a PDF file sometimes works, but often "blows up" the resulting file size. The current version of Preview won’t print most encrypted PDFs to disk (early versions would happily break encryption in this way, however).

Some OCR programs can read and convert encrypted PDFs, but that’s usually a time-consuming task and may require editing to remove conversion errors.

The current version of TextLightning can read the text of many encrypted PDF files, and has the advantage of integration with DT’s preferences on PDF import.

I’ve settled on Ovis pdf-Recover as a relatively inexpensive ($24 US) application that can very quickly and easily convert files so that they can be read into DT. A MAJOR advantage of pdf-Recover is that the unencrypted PDF file retains any hyperlinking contained in the original file. (That, alone, is worth the price to me.)

Remember that there can be ethical and even legal implications about "looking into" encrypted files, especially depending on what use may be made of the information. I’ve had the original file owner’s permission to decrypt most of the files that I’ve tinkered with.

Information about any of the applications mentioned above can be found on the Internet using Google.