PDF import (no text)

rcarlan · November 18, 2008, 12:32pm

I have a 200+ page PDF that contains mostly text, plus some graphics (e.g. the cover page).

When I import this PDF into DEVONthink (using File | Import | Files and Folders), the Log window pops up and reports “No Text” in the Info column.

The PDF is not password protected and, if I open it in Preview (the built-in Mac OS X PDF viewer), I can copy text from the PDF and paste it into, for example, TextEdit.

I have imported many other PDFs into DEVONthink without encountering any problems with the text extraction facility; this is the only PDF (thus far) that gives me grief.

Any suggestions on how to get around this problem?

I can provide the PDF for diagnosis, if required. The file is approximately 10 MB.

Regards,
Radu Carlan

cgrunenberg · November 18, 2008, 12:34pm

You might check if switching from pdftotext to PDFKit (or vice versa, see PDF & PS preferences) will make a difference.

mattmc · November 19, 2008, 3:41pm

I’m getting the same problem, but with all PDF scans and imports. Whether the documents contain images and text or just text, the Log panel says ‘no text’ in the Info column. I use a ScanSnap 500. I also have IrisScan 2 installed (not sure if that would make a difference).

I’ve tried selecting different indexing methods (pdftotext and PDFKit) but the same keeps happening. I’m not sure what else to try!

rcarlan · November 19, 2008, 10:23pm

I encountered this problem only with one particular PDF (thus far). There was nothing special about this PDF, as far as I could ascertain. It wasn’t protected in any way, and it wasn’t any larger than other PDFs which I imported without problems. Also, its content was similar to other PDFs that imported OK.

I followed Christian Grunenberg’s advice and switched from PDFKit to the built-in pdftotext. This appeared to fix the problem - i.e. I was able to import the PDF I previously had difficulties with.

For what it’s worth, when using PDFKit, the failure to convert to text seemed related to size and available resources. With no other applications running (immediately after reboot), the import was successful for up to around 100 pages. It did not seem to matter too much which 100 pages - although the actual number of pages that imported successfully would vary slightly with content. With several other applications running (including a VMware Fusion virtual machine), the import failed at around 70 pages - the same 70 pages that would otherwise succeed immediately after reboot.
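For anyone who wants to repeat this kind of page-range test outside DEVONthink, here is a minimal sketch in Python. It assumes a standalone pdftotext command-line tool (Xpdf or Poppler) is installed and on the PATH, which is not necessarily the same build DEVONthink bundles, so the results may differ from what the import log shows. The function names, the chunk size and the file name in the usage note are made up for illustration, and you have to supply the page count yourself.

```python
import shutil
import subprocess


def page_chunks(total_pages, chunk_size):
    """Yield inclusive (first, last) page ranges covering 1..total_pages."""
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def pdftotext_command(pdf_path, first, last):
    """Build a pdftotext call that writes pages first..last to stdout ("-")."""
    return ["pdftotext", "-f", str(first), "-l", str(last), pdf_path, "-"]


def chunk_has_text(pdf_path, first, last):
    """Return True if pdftotext extracts any non-whitespace text from the range."""
    result = subprocess.run(
        pdftotext_command(pdf_path, first, last),
        capture_output=True,
        text=True,
        errors="replace",  # do not fail on odd bytes in the extracted text
        check=False,
    )
    return result.returncode == 0 and bool(result.stdout.strip())


def report(pdf_path, total_pages, chunk_size=25):
    """Print, per page range, whether any text could be extracted."""
    # Assumption: a standalone pdftotext binary is available on the PATH.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    for first, last in page_chunks(total_pages, chunk_size):
        status = "text" if chunk_has_text(pdf_path, first, last) else "NO TEXT"
        print(f"pages {first}-{last}: {status}")
```

Calling something like report("manual.pdf", total_pages=210, chunk_size=25) would print one line per 25-page range. This only tells you whether the pages carry extractable text at all; it does not reproduce the PDFKit behaviour described above, where the failure seemed to depend on size and available resources.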

It seems pdftotext is superior (at least in this regard) to PDFKit.

Btw, are there any disadvantages in using pdftotext as opposed to PDFKit?

Regards,
Radu Carlan

cgrunenberg · November 20, 2008, 8:52am

Sometimes pdftotext works better, sometimes PDFKit. Although pdftotext has been a little more reliable, we’ll use only PDFKit starting with v2 (I guess we’ve ironed out all issues).