"no text" in DT log after importing pdf?

Not sure what it means when the DT log shows an entry of “No Text” after importing a pdf.

Any clues?

I think it means that pdftotext (or TextLightning?) didn’t convert anything in the PDF file to plain text and the imported document will be an image instead of PDF+text.
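One way to test this outside DEVONthink is to run pdftotext on the file yourself and see whether anything other than whitespace comes out. Below is a minimal sketch, assuming the `pdftotext` command-line tool (from Xpdf or Poppler) is installed and on the PATH. The helper names `has_text` and `pdf_has_text` are made up for illustration, and this is not necessarily how DEVONthink itself decides to log “No Text”.

```python
import subprocess


def has_text(extracted: str) -> bool:
    """Return True if the extracted string holds any non-whitespace character.

    pdftotext separates pages with form feeds, which strip() treats as whitespace.
    """
    return bool(extracted.strip())


def pdf_has_text(path: str) -> bool:
    """Run `pdftotext <path> -` (text goes to stdout) and inspect the result.

    Assumption: the pdftotext binary is available on the PATH.
    """
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # tolerate output that is not valid UTF-8
        check=False,
    )
    return result.returncode == 0 and has_text(result.stdout)
```

If `pdf_has_text("some.pdf")` returns False for a file that visibly contains text in Preview, the cause is probably either an image-only scan or a protected file that the converter refuses to open.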

Exactly. This happens e.g. when you import a scanned document without OCR text layer (like the one generated by Canon scanners).



That was my first understanding as well, but I got the “no text” message for a few PDF files that definitely contain text, the proof being that I can search in them with Apple’s Preview application.

For an example of such a file, go to mensa.de/ and click on the color image in the right column. That will download a PDF file which according to DEVONthink contains no text.

The PDF you’re referring to here is password-protected. The pdftotext engine we use cannot open password-protected PDFs at all, not even just to read the text, so DEVONthink simply imports the file as an image.

To get access to the contents of the PDF, you should contact the creator of the document.
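For anyone who wants to check whether a given PDF is protected before importing it: an encrypted PDF refers to an `/Encrypt` dictionary in its trailer, so a rough check is to look for that name in the raw bytes. This is only a heuristic sketch with hypothetical helper names. It can report a false positive if the literal string happens to appear elsewhere in an uncompressed part of the file, and it says nothing about which actions are restricted.

```python
def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic: protected PDFs carry an /Encrypt entry in the trailer."""
    return b"/Encrypt" in pdf_bytes


def file_looks_encrypted(path: str) -> bool:
    """Read the whole file and apply the heuristic above."""
    with open(path, "rb") as handle:
        return looks_encrypted(handle.read())
```

Calling `file_looks_encrypted("downloaded.pdf")` on a file that opens without a password prompt but still returns True would be consistent with an owner-password restriction like the one described here.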



Strange! I can read it with Apple’s Preview, and with Acrobat Reader, both on my Mac and under Linux, without any password. I can also read it page by page from within DEVONthink, but the words are not indexed.

Reading and printing are possible, but editing and content extraction, for example, are prohibited. Open it with Acrobat (not Reader) and try to edit or copy some text; it doesn’t work.
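The combination described above (printing allowed, editing and extraction refused) is recorded in the PDF's encryption dictionary as an integer `/P` whose individual bits grant permissions. A small sketch of decoding the four classic bits follows. The function name is hypothetical, and the sample value -1852 is only an illustration, not a value read from the mensa.de file.

```python
# Bit values of the /P permission flags in the PDF standard security handler
# (bit positions 3 to 6, counting from 1 at the low-order end).
PRINT = 1 << 2     # bit 3: print the document
MODIFY = 1 << 3    # bit 4: modify the contents
EXTRACT = 1 << 4   # bit 5: copy or extract text and graphics
ANNOTATE = 1 << 5  # bit 6: add or modify annotations


def decode_permissions(p: int) -> dict:
    """Map a /P value (a signed 32-bit integer) to the actions it allows.

    Python's & on negative ints follows two's-complement rules, so the
    value can be tested directly without masking first.
    """
    return {
        "print": bool(p & PRINT),
        "modify": bool(p & MODIFY),
        "extract": bool(p & EXTRACT),
        "annotate": bool(p & ANNOTATE),
    }


print(decode_permissions(-1852))
# {'print': True, 'modify': False, 'extract': False, 'annotate': False}
```

A text converter that honors a cleared extract bit will refuse to produce any text, even though the pages display and print normally.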



Yes, the owner of such PDF documents has password-protected the file to prevent one or more user actions, such as copying the text or printing the document.

There are many ways in which a determined user can bypass such protection of PDF documents. Doing so, however, can raise ethical and legal issues, especially in light of recent expansions of intellectual property legislation in the United States. As Eric notes, the safest course may be to contact the owner of a copy-protected PDF file to request permission to copy it. I have done so in several cases.

Traditionally, scholars and researchers have been able to make “fair use” of copyrighted material, such as copying and quoting excerpts (with appropriate attribution).

My own attitude is that when I import or index a PDF file into my DT database, my intent is covered by fair use. I do not intend to republish the material as my own or to redistribute a “broken” copy. My purpose is simply to incorporate the material into my searchable database for research purposes. All that said, it should be remembered that inappropriate use of copyrighted and encrypted material could get a user in legal trouble!

Here are four ways (not an exhaustive list) to incorporate the text of copy-protected PDFs into DEVONthink:

[1] Print the PDF to disk (or directly to DT) as a PDF file. This won’t work if the file is print-protected. Also, this method loses attributes such as bookmarks and hyperlinks.

[2] The current version of TextLightning will create RTF text from most copy-protected PDF files.

[3] Some OCR programs (including ReadIRIS Pro) will make a new copy of the PDF file that can be read by pdftotext. (A lot of work, and with potential loss of document attributes.) Note: If a PDF file is image-only, and if the resolution of the image is sufficiently high, OCR can make a PDF copy with readable text.

[4] Quick and easy, but it requires purchasing pdf-Recover ($27 US, http://www.versiontracker.com/dyn/moreinfo/macosx/23618), which will make a new copy of the copy-protected PDF file, with all attributes included. This program warns the user about legal restrictions on use. DT can easily read the text of the converted PDF file.

Adobe has not modified the security features of Adobe Acrobat, though they must know about these and other measures that can break “secured” PDF files. So, those among you who may depend on Acrobat’s security measures, be warned! :slight_smile:

Thanks for those explanations. I wasn’t aware of the various protection schemes in PDF, nor of the ways to get around them. Your post made it into my DEVONthink database :slight_smile:

I’m confused by your comment since it seems to contradict the recent DEVONtalk newsletter and some MacInTouch reader reports claiming the Canon LiDE scanners do generate the OCR text layer when creating PDF documents. If true, is the only way to do that through the CanoScan Toolbox? Normally I scan using the ScanGear CS driver plugin from GraphicConverter, then select the output file format in GC.

I wanted to check on that capability of CS Toolbox before retroactive installation (it’s meant to be installed first so the driver installer locates it as one of the possible plugin targets, but I can do it manually) since I’d rather not use that software if possible. And what advantage (if any) would that have over scanning with Readiris 9?

Thanks for any info.

Ehrr, yes, in my message “without” should read “with”. Sorry :frowning:

Yes, this only works when you use the CanoScan Toolbox to scan a document directly into a multi-page (multi-scan) PDF.

The advantage is only that most people with a CanoScan have the CanoScan Toolbox, not ReadIRIS. Me, for example :wink:



So if I purchase TextLightning and select the text box in prefs to have TextLightning do the PDF conversion, will I then get PDF+text files from protected PDF files?

Some, but not all. That’s why I got pdf-Recover. :slight_smile:

The current version of TextLightning does a pretty fair job of retaining the formatting and layout of the original PDF (as much as Cocoa text allows), so the converted text looks nicer than the results from pdftotext. There are times when that’s useful, so I’ve got TextLightning, too. Note, however, that pdftotext is faster than TextLightning, as one would expect. There always seem to be trade-offs. :slight_smile: