Search PDF text

Knight_of_Nee · December 17, 2004, 12:40am

Can DEVONthink search the text of PDF files without converting them to RTF docs? I use Acrobat Pro so that I can search PDF documents by content, but I would rather just leave it to DEVONthink.

Bill_DeVille · December 17, 2004, 1:39am

You have several options to import PDF files into DEVONthink, all of which result in text capture by DT (assuming, of course, that the PDF contains text and is not merely an image).

Option 1: Where do you want the PDF files to go? Set this in Preferences > Import > Files:
Make external link to PDFs;
Copy the PDFs to the Files folder in the database; or
Import PDFs into the database file.
(I recommend the first or second of these choices.)

Option 2: How do you want DT to “collect” text from PDFs? Set this in Preferences > PDF & PS > Index & Convert:
Use built-in pdftotext, check plain text; or
Use TextLightning (if it’s available), check rich text.

Actually, there’s a third way: File > Index, which creates a PDF image + text in your database.

Of the three text collection options, I recommend using built-in pdftotext to create a plain text version of the PDF’s text. Reason: If you do a “phrase” search, the plain text is immediately available to DT and the search is very fast. RTF text collection using TextLightning looks nicer, but in the past “phrase” searches were slow because each PDF had to be reprocessed (that may not be true in later DT versions). Another advantage of the plain and rich text options is that the search terms are highlighted in documents viewed in the search results. But if you’ve used the “Index” option, each PDF may have to be processed again, which slows phrase-type searches (and probably some other search types as well), and search terms are not highlighted when a search hit is viewed.
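To illustrate why pre-extracted plain text makes phrase searches fast, here is a minimal Python sketch. It is not how DEVONthink works internally; `PlainTextIndex` and `fake_extract` are hypothetical names, and the extractor is a stand-in for a real tool such as pdftotext. The point is only that extraction runs once at import time, and every later search reads the stored text.

```python
from typing import Callable, Dict, List


class PlainTextIndex:
    """Keeps extracted text per PDF so phrase searches never re-read the PDF."""

    def __init__(self, extract: Callable[[str], str]) -> None:
        self._extract = extract          # hypothetical extractor (assumption)
        self._text: Dict[str, str] = {}  # PDF path -> extracted plain text
        self.extract_calls = 0           # how often the slow extraction step ran

    def add(self, pdf_path: str) -> None:
        # The slow step happens once, when the PDF is imported.
        self._text[pdf_path] = self._extract(pdf_path)
        self.extract_calls += 1

    def phrase_search(self, phrase: str) -> List[str]:
        # Case-insensitive substring match against the stored text only.
        needle = phrase.lower()
        return [path for path, text in self._text.items() if needle in text.lower()]


def fake_extract(pdf_path: str) -> str:
    # Stand-in for a real extractor so the sketch runs without any PDF files.
    samples = {
        "a.pdf": "The quick brown fox jumps over the lazy dog.",
        "b.pdf": "Quick notes on brown bears.",
    }
    return samples[pdf_path]


index = PlainTextIndex(fake_extract)
index.add("a.pdf")
index.add("b.pdf")
hits = index.phrase_search("quick brown")
print(hits)
```

Searching repeatedly leaves `extract_calls` unchanged, which is the difference from a setup where each PDF has to be reprocessed for every phrase search.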

Hope this helps.

Knight_of_Nee · December 17, 2004, 1:57am

Thanks for the suggestions. I really would like to have my search results shown in the original PDF format. It would take so much longer to find the hit in an imported text file and then link to the original PDF. I’ll try the “Index” option and see how I like that. Thanks again.

Bill_DeVille · December 17, 2004, 2:38am

Unfortunately, the search terms are not highlighted in the Search window view of a selected hit, nor does the view scroll to the document page on which a search term occurred, if Index import has been used. That’s why I like plain or rich text import. True, I prefer reading PDF under Preview, but that’s just a click away.

I’m hopeful that when the Tiger OS is released DT will be able to display PDF imports with much improved capabilities (if Apple releases the proper hooks to developers).

Knight_of_Nee · December 17, 2004, 3:22am

I think the best temporary solution for me is to merge the text and PDF documents into one. Now I just need to come up with a script to import a PDF, create a text version (convert to text), merge the two, and delete the originals so that I end up with one single document containing the converted text and the PDF document. Thanks again.

moses · December 17, 2004, 6:37am

This discussion reminds me of a suggestion I posted under a different topic a little while ago… I have been thinking about it and think it has some merit, so here it is again:

What about a feature where DT can copy a PDF into DT, convert it to uneditable Plain Text, and then invisibly link the two together, so that as far as the user is concerned they are one. [Actually, I am informed that this is already done, so here’s the new part…] When such a document is viewed a button could be displayed that says “View as PDF” or “View as Plain Text”. Thus, searches in such documents could reveal the actual words in the Plain Text mode, and then to see formatting and images the user could click the “View as PDF” button. This would save users from having to manage two documents, a plain text one in DT for searching and a linked PDF…
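A rough Python sketch of that idea, purely as an illustration: one record holds both the PDF and its plain text, searches run against the text, and a toggle decides which view is shown. `LinkedDocument`, `ViewMode`, and the file name are hypothetical; nothing here reflects DEVONthink's actual data model.

```python
from dataclasses import dataclass
from enum import Enum


class ViewMode(Enum):
    PDF = "pdf"
    PLAIN_TEXT = "plain text"


@dataclass
class LinkedDocument:
    """One user-visible document backed by a PDF and its extracted text."""

    pdf_path: str
    plain_text: str
    mode: ViewMode = ViewMode.PLAIN_TEXT  # start in the searchable view

    def toggle(self) -> ViewMode:
        # Switch between the two views and report the new one.
        if self.mode is ViewMode.PLAIN_TEXT:
            self.mode = ViewMode.PDF
        else:
            self.mode = ViewMode.PLAIN_TEXT
        return self.mode

    def button_label(self) -> str:
        # The button always offers the view that is not currently shown.
        if self.mode is ViewMode.PLAIN_TEXT:
            return "View as PDF"
        return "View as Plain Text"

    def contains(self, term: str) -> bool:
        # Searches always use the plain text, whichever view is active.
        return term.lower() in self.plain_text.lower()


doc = LinkedDocument("report.pdf", "Annual report text")
print(doc.button_label())
doc.toggle()
print(doc.button_label())
```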

Of course, highlightable search results directly in the PDF (like in Preview) will be the best solution, but until then…

cgrunenberg · December 17, 2004, 10:25am

DT 2.0 will probably include an option for all contents to view either the full layout (PDF/HTML/RTF) or only plain text. In addition, DT 2.0 will of course use the upcoming features of Tiger including enhanced PDF support (e.g. highlighting of occurrences, scrolling to first occurrence, maybe contents/bookmarks drawers etc.)

Maria · December 18, 2004, 1:16am

Sounds great. Will the option to view only plain text also include an option to view XML as plain text?

Maria

cgrunenberg · December 18, 2004, 7:57am

Probably. But I’m not sure if this will be very useful as XML contains no layout information. And this affects even the conversion to plain text, as the “plain text” view of XML will be just a concatenated list of strings (separated by line feeds).
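To show what such a “plain text” view of XML looks like, here is a small Python sketch using the standard library’s `xml.etree.ElementTree`. It only approximates the behaviour described above (it is an assumption that whitespace-only strings are dropped), and `xml_to_plain_text` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def xml_to_plain_text(xml_source: str) -> str:
    """Concatenate every text string in the XML, one per line, ignoring markup."""
    root = ET.fromstring(xml_source)
    # itertext() walks the tree in document order and yields each text fragment.
    strings = (s.strip() for s in root.itertext())
    # Drop empty fragments and separate the rest with line feeds.
    return "\n".join(s for s in strings if s)


sample = "<note><to>Maria</to><body>Stored <b>without</b> layout</body></note>"
print(xml_to_plain_text(sample))
```

The element names and nesting disappear entirely, which is why the result carries no layout information.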

Maria · December 18, 2004, 8:19am

Oops, I did not realize that DT does not read CSS stored in the database itself, since my computer is always online. Style sheets stored on the web do work for Web pages, but HTML pages whose CSS files are stored in DT show no effect of the CSS, as I realize now. I have to export them and view them in my browser to see the CSS at work.

I thought that I could store a default CSS in one group and link to it absolutely from my XML files. Without formatting, it is of no use, of course.

Perhaps in a future version?

Best,
Maria

Knight_of_Nee · December 18, 2004, 1:21pm

Any updated info on when the Pro version will be out of beta and on my computer? Is DEVONtechnologies waiting for the next OS release? I keep putting off organizing all of my files until I get the Pro version, simply because it seems to have the features I really need. The wait is painful, but I am sure it is worth it.

CatOne · December 19, 2004, 12:45am

You know, the “Personal” version is upgradable to the Pro version, so you could always use what’s out there now (which is VERY capable) and then upgrade for just the price difference when it ships.

Ship dates in DT tend to… well… the software tends in general to ship later than initial ship date estimates, so if Q1 2005 is the announced date, I wouldn’t bet my life on something before June

But it’s VERY capable now, so why not use what’s already there?

cgrunenberg · December 19, 2004, 2:38pm

Will be available in January (assuming there will be no major hardware failures, diseases, earthquakes etc.)