How well can DEVONthink find words in PDF files?

Eric · May 26, 2005, 2:44am

I just installed DEVONthink, and thus far I’m pretty disappointed. The biggest reason that I purchased DEVONthink was so that I could deal with the massive number of PDF files that I’ve accumulated. I wanted to be able to search the contents of the PDF files rather than just their titles. However, DEVONthink doesn’t seem to work well at all for this. For example, when I search on an author of a journal article - something that’s even present in the title of the file and that is a very unusual word - it can’t find it. Am I doing something wrong, or is this just a limitation of the program?

Thanks.

Eric

Bill_DeVille · May 26, 2005, 5:06am

Eric:

It’s not a limitation of the program. DEVONthink can read the text of most PDF files very well. But that doesn’t necessarily mean that you’ve done something wrong.

There are two kinds of PDF files from which DEVONthink can’t read text:

[1] Image-only PDFs. There isn’t any text to read, just a picture of text. Very few journals distribute their content in image-only form nowadays, but I have seen this – one case being a statistics journal. If the image has a high enough resolution – 300 dpi would be nice – an OCR program such as Readiris Pro can convert the file to one containing text.

[2] PDF files with security enabled, usually copy protection. In this case, the distributor of the PDFs has used an Acrobat security feature that prevents the user (and DEVONthink) from copying the text. There have been a number of discussions of this issue on the forum. Search for the string “copy protection” and you will probably find them.

During file import, DEVONthink keeps a log of import failures. To see this log, select Tools > Log (preferably right after the import operation). For example, if DEVONthink encounters a copy-protected PDF file, it will note in the log that the PDF contains no text.

It is possible to set up searches that give incorrect negative results. One clue is your mention of not being able to find a file in which the author’s name is part of the file name. Let’s assume for the moment that DEVONthink successfully imported this document, and it’s in your database. (It’s a good idea to read the user manual about searches.)

Select Tools > Search. Click on the Options button to examine the search operators. We’ll see that there are many possible searches that would fail to find text that is in fact in the database. I’ll list just a few here:

[1] Do an AND search for “AuthorName” and “Foo”, where “AuthorName” really is in the database, but “Foo” is not. Since the result is true for one term and false for the other, the AND will be false, and no result will be found.

[2] Search for “AuthorName” in the Comments field rather than in the ALL list – and “AuthorName” appears in the Name and also in the Contents of one or more documents but not in the Comments field of any document. Another false negative.

[3] Set up an otherwise correct search for a text string that’s present in the database, but search in a group in which no document contains the search term(s). Again, a false negative.
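
The three false negatives above can be illustrated with a small toy model in Python. This is only a sketch of the logic, not DEVONthink’s actual search code: the `Doc` class, the `search` function, and the field and group names are all hypothetical and chosen just for the illustration.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a database item; not a DEVONthink data structure.
@dataclass
class Doc:
    name: str
    contents: str = ""
    comment: str = ""
    group: str = "Inbox"

FIELDS = ("name", "contents", "comment")

def search(docs, terms, field="all", group=None, mode="all"):
    """Toy search: case-insensitive substring match.

    field: "all" searches name, contents and comment; otherwise one field.
    group: None searches the whole database; otherwise only that group.
    mode:  "all" requires every term (AND); "any" requires one (OR).
    """
    wanted = [t.lower() for t in terms]
    hits = []
    for doc in docs:
        if group is not None and doc.group != group:
            continue  # document is outside the group being searched
        fields = FIELDS if field == "all" else (field,)
        text = " ".join(getattr(doc, f) for f in fields).lower()
        matches = [t in text for t in wanted]
        if all(matches) if mode == "all" else any(matches):
            hits.append(doc.name)
    return hits

docs = [
    Doc(name="AuthorName 2004 review.pdf",
        contents="A study by AuthorName on sampling.",
        group="Journals"),
    Doc(name="Shopping list", contents="milk, eggs", group="Personal"),
]

print(search(docs, ["AuthorName"]))                    # found
print(search(docs, ["AuthorName", "Foo"]))             # case [1]: AND fails
print(search(docs, ["AuthorName"], field="comment"))   # case [2]: wrong field
print(search(docs, ["AuthorName"], group="Personal"))  # case [3]: wrong group
```

In this model only the first call returns the document. The other three return an empty list even though “AuthorName” is in the database, which is the same kind of false negative described in cases [1] to [3].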

Suggestion: Select File > Index, then select one of the PDFs you wish to import into DEVONthink. That PDF with a file name containing an unusual word would be a good selection. Now select Tools > Search and set the options to ALL, All Words, Ignore Case, Database. Enter just the unusual word as the search string. Any luck? If so, select that document. On the right, above the document text window, you should see how many paragraphs, words and characters were read from the PDF file.

Hope this helps.

Maria · May 26, 2005, 5:36am

Hi,

Bill’s explanations were exhaustive as always. There is just one small thing to add: The info panel (and the list views) shows you whether an item is a group, picture, text, RTF, PDF+text or something else.

If the PDFs you want to search are not “PDF+text”, you cannot search them for the reasons Bill mentioned in his post.

Maria

Eric · May 27, 2005, 1:20am

Thanks so much for the replies. I’ll try these things out. I really appreciate the comments.

Eric