PDF Import - "PDF-to-Text" or "Text

adrians · April 2, 2003, 11:01pm

Hi all,
I have a lot of pdf files that I want to include in my DEVONThink Database. I’m wondering what the difference is between using the simple “PDF-to-Text” tool versus paying the extra to get “Text Lightning”.

I know Text Lightning will give me RTF but what practical benefit does it add to have the contents of the PDF files available as RTF as opposed to plain text?

If I want to print a PDF that comes up in a search (for example) I would print it from Acrobat. If I want to get RTF I have the full version of Acrobat which can do that…

So is there any point to getting TextLightning? I want to know what I am missing out on now, before I start adding all the files to DEVONThink, so I don’t have to go back and redo them later if I decide there is something worthwhile in having the RTF version.

Thanks,

Adrian

cgrunenberg · April 3, 2003, 7:18am

If you don’t need the conversion to RTF (see the "Import" preferences) and don’t plan to use TextLightning in combination with other applications like TextEdit, there’s usually no benefit to using this application.

Actually the combination DT+pdftotext is faster as pdftotext just converts to plain text and is a simple CLI utility.
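
For anyone who wants to try the plain-text route outside DT, here is a minimal Python sketch. It assumes the pdftotext command-line tool (as shipped with Xpdf or Poppler) is installed and on the PATH. The function names and the file name are hypothetical, and this thread does not describe how DT itself calls pdftotext.

```python
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Build the argument list for a pdftotext call that writes to stdout."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout.
        command.append("-layout")
    # A text-file argument of "-" sends the extracted text to stdout.
    command += [pdf_path, "-"]
    return command


def pdf_to_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the extracted plain text (requires the tool)."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling pdf_to_text("paper.pdf") would return only the text; formatting, graphics and tables are not carried over, which is the trade-off discussed in this thread.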

Bill_DeVille · April 3, 2003, 11:48pm

Adrian:

The built-in PDF-to-Text feature in DT works precisely as Christian stated in his response to your query. The text in a PDF document dragged into DT is available to the DT Concordance and Search Tool.

If you do a search on the DT database for a term contained in the PDF file, it will be listed among the search results. If you highlight the listed file, its first page will be displayed as an image in the second column of the Search window (even if the first occurrence of the search term is on page 42 of the PDF document).

That’s useful; I now know that this particular file may be relevant to my search, and I can either read it by opening the DT content, or launch the original PDF file under Acrobat.

But for my way of working, that’s not quite satisfactory. I prefer seeing the context of the search term, and where it occurs within the document. That’s why I prefer using TextLightning to convert the PDF to RTF.

RTF produced by TextLightning often doesn’t look good, especially for multicolumn PDF documents, and paragraphs tend to get broken. Graphics are gone, and tables are not displayed properly. I can live with those shortcomings, because I can see the text immediately within DT, and I can copy excerpts, add notes and comments, etc. In a future version of DT, I’ll even be able to add links to other DT contents or to external files. That’s why I prefer TextLightning for grabbing PDF file text into DT; I can do many more things with RTF content than with image content.

Perhaps, in the future, Christian will have DT read PDF files with RTF plus images and tables!

Bill

neural · June 16, 2003, 1:57am

Hi all,

Just a follow-on question. As Bill_DeVille said, it’s useful to use TextLightning to convert to RTF because you can see the search result in context. However, under this configuration the original PDF file doesn’t get copied to the DEVONThink\Files folder.

As I have the ‘import to folder’ feature turned on, I would expect all PDF files, regardless of the method I used to import the content into the database, to be copied to the relevant folder. Currently the system converts the file to RTF and then creates a link to the original PDF file (which is usually on my desktop). As I normally delete these files from the desktop once imported, I am unable to view the PDF again because DT still looks on the desktop to find the source.

Is this a bug or just the way it was designed?

Thanks

Bill_DeVille · June 16, 2003, 6:10am

Neural:

You are correct. DT/TextLightning import of PDF files results in incorporation of the RTF text into the database, and a link to the original PDF file.

That’s not a bug. I’ve got thousands of PDF files, not all of which I’ve yet imported to DEVONthink. When I import a PDF to DT, I want to keep the original PDF intact for several reasons. Many of my PDFs contain bookmarks and have clickable links, so they are best read under Acrobat. Many of my PDFs are large documents and have important graphics, tables and chemical or math expressions – again, best read under Acrobat. As PDF, I can easily email files to others.

DEVONthink lets me catalog and ‘knowledge mine’ my reference files. DT provides extremely fast searching of text content, and through contextual analysis suggests similar items – which I find extremely helpful.

That gives me the best of both worlds; the integrity of my original files (PDF, Word, etc.) is maintained, yet their searchable text is incorporated in DT’s database so that I can find and go to items of interest immediately. I’ve found DEVONthink to be very stable and reliable, but I wouldn’t trust any database as the sole repository of my reference file collections. I’m pleased by the way DT handles my file imports.

Fred · July 6, 2003, 6:53pm

Bill:

First…I’ve learned a lot about "best practises" uses of DT by reading your posts on this board. Thanks.

I understand this behavior, and why it works the way it does. However, since DT links to original files (PDF or otherwise) via a static pathname, rather than creating an alias…what do you do when you want/need to reorganize your Finder files/folders hierarchy? My organizational needs evolve. Re-organizing the structure of my PDF library would then break every DT link…
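
To make the difference concrete, here is a rough Python sketch of a path-only link versus a reference that also remembers the file's identity. It is only an illustration under stated assumptions: the function names are hypothetical, the identity is the device and inode number reported by the file system (so it only survives moves within one volume), and it is not a description of how Mac OS aliases or DT are implemented.

```python
import os
import tempfile


def make_reference(path):
    """Record both the path and the file's identity (device + inode)."""
    info = os.stat(path)
    return {"path": path, "file_id": (info.st_dev, info.st_ino)}


def resolve(reference, search_root):
    """Return the file's current location, or None if it cannot be found."""
    # Fast case: the stored path still points at the same file.
    try:
        info = os.stat(reference["path"])
        if (info.st_dev, info.st_ino) == reference["file_id"]:
            return reference["path"]
    except FileNotFoundError:
        pass
    # Slow case: walk search_root looking for the same file identity.
    for folder, _subfolders, names in os.walk(search_root):
        for name in names:
            candidate = os.path.join(folder, name)
            info = os.stat(candidate)
            if (info.st_dev, info.st_ino) == reference["file_id"]:
                return candidate
    return None


def demo():
    """Move a file and show that the path breaks but the identity survives."""
    with tempfile.TemporaryDirectory() as root:
        old_path = os.path.join(root, "paper.pdf")
        with open(old_path, "wb") as handle:
            handle.write(b"%PDF-1.4 placeholder")
        reference = make_reference(old_path)

        new_folder = os.path.join(root, "archive")
        os.mkdir(new_folder)
        new_path = os.path.join(new_folder, "paper.pdf")
        os.rename(old_path, new_path)

        path_still_valid = os.path.exists(reference["path"])
        found = resolve(reference, root)
        return path_still_valid, found == new_path
```

In demo() the stored path stops working as soon as the file is moved into the "archive" folder, while the identity-based lookup still finds it.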

What I really want, I suppose, is to import the text of anything into DT, simultaneously creating an alias to the original file. I could then pretty much make DT my primary "Finder," with only occasional need to dig into the archives (aka the real Finder) for the original file in original file format. I’d be free to reorganize the Finder hierarchy of my files, and still use DT to find them as needed.

Maybe there’s a way to do this today and I’m overlooking it?

–Fred

cgrunenberg · July 6, 2003, 10:51pm

Not at the moment. But in one of the next releases (probably v1.8) we’ll split the current URLs into URLs and paths/aliases. This should fix your problem and provide the possibility to attach additional URLs to any content (even those referencing local files).

Fred · July 6, 2003, 11:42pm

[sound of loud cheering]

Awesome!! :D :D

[wild applause and whistling]

Made my day, thanks!

Bill_DeVille · July 7, 2003, 1:09am

Fred:

I share some of your concerns about path links from a DEVONthink content to the original file (PDF or whatever). Alias links would be better, as they can allow reorganization of the original file storage scheme without breaking links. (More on that in a moment.)

My TiBook has a 60 GB hard drive, which gives me considerable free space (although I’m using it rapidly – have only about 20 GB free at the moment).

I created a folder called "Fodder for DEVONthink" into which I toss almost everything that is imported to DEVONthink. I do my organizing via DEVONthink categories, and depend on DT to find things for me. I prefer not to incorporate the actual files into DEVONthink, believing that PDF files are best read/printed from Acrobat (many of my PDFs have bookmarks and hyperlinks), Word files are best read/printed from Word, and so on. This also makes it easy to share PDFs with others. The exception is Web documents. Unless a Web page has lots of math or chemical expressions or important internal hyperlinks (in which case I use Acrobat Web capture to collect the material as PDF), I select the Web text and use the DT "Lookup" Service to capture it into DT (also copy and paste the URL).

Back to the topic of aliases. NoteTaker 1.5 introduced what AquaMinds calls "Intelligent Alias Management." External files on your drive can be moved to another folder or partition without breaking the link. This seems to work even with notebooks created in early versions of NoteTaker – the links now act like aliases, without user intervention.

Christian, I was delighted to see your response about upcoming aliases in DEVONthink. How about automatic updating or “alias management” of the many hundreds of existing path links in my database – wouldn’t that be a great feature in DEVONthink?

cgrunenberg · July 7, 2003, 7:32am

Bill,

updating of valid URLs (meaning that the referenced file is where it should be) should be no problem at all.

Poetsfolly · October 9, 2003, 5:15am

Actually, I can think of a couple ways of making this a little more convenient.

DT could maintain a copy of imported files automatically. They could be stored in the file system for access by other programs, but the copy would be made automatically by DT during the import process. This could be limited to certain types of files.

A bit more code, but perhaps even better would be to make it possible to mount DT as a disk on the desktop, similar to *.dmg files. Then the DT database could contain both the *.pdf files and extracted text. The *.pdf files would be directly available to backup programs (Retrospect, etc.) in their original form as well as viewers such as Acrobat and processors such as email/spreadsheets/etc. You wouldn’t have to worry about losing the original files in a corrupted database.

cgrunenberg · October 9, 2003, 6:16pm

Thanks for the suggestion but DMG files just contain common file systems. To mount Devon databases, we would have to create a DEVONfs ("Devon file system") kernel extension. Lots of work but maybe in the long run… (as a file system together with some Finder and OS X plugins could add some of the functionality of DEVONthink to OS X).

Poetsfolly · October 10, 2003, 12:10am

Yes, .dmg probably isn’t the best example. A much better example is probably how the .Mac iDisk is mounted using the mount_webdav command. If DT provided a WebDAV protocol interface to the database, then mount_webdav could make it appear as a file system. And WebDAV would provide an interface to a bunch of web authorship and collaboration systems. Hopefully, there aren’t any OS X deadlock situations when mounting a local HTTP server as a WebDAV file system.
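
To give a feel for what such an interface would have to produce, here is a small Python sketch. A WebDAV client lists a collection with a PROPFIND request, and the server answers with a 207 Multi-Status XML body. The sketch builds that body for a handful of records. The function names, the record paths and the idea of mapping database contents to URL paths are all assumptions for illustration; DT offers no such interface today.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"


def multistatus_for(records):
    """Build a WebDAV 207 Multi-Status body listing records as resources.

    `records` maps a URL path (hypothetical) to a content length in bytes.
    """
    ET.register_namespace("D", DAV_NS)
    root = ET.Element(f"{{{DAV_NS}}}multistatus")
    for href, size in records.items():
        response = ET.SubElement(root, f"{{{DAV_NS}}}response")
        ET.SubElement(response, f"{{{DAV_NS}}}href").text = href
        propstat = ET.SubElement(response, f"{{{DAV_NS}}}propstat")
        prop = ET.SubElement(propstat, f"{{{DAV_NS}}}prop")
        ET.SubElement(prop, f"{{{DAV_NS}}}getcontentlength").text = str(size)
        ET.SubElement(propstat, f"{{{DAV_NS}}}status").text = "HTTP/1.1 200 OK"
    return ET.tostring(root, encoding="unicode")


def hrefs_in(multistatus_xml):
    """Parse a Multi-Status body and return the listed resource paths."""
    root = ET.fromstring(multistatus_xml)
    return [node.text for node in root.iter(f"{{{DAV_NS}}}href")]
```

A real server would also need to handle GET, PUT, locking and so on; this only covers the directory-listing reply that lets a mounted volume show the PDF and its extracted text side by side.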

There is a very nice overview of WebDAV as a front end for a database doc storage system:

webdav.org/papers/catacomb-apachecon2002.pdf

If I didn’t already have a job, I’d offer to implement it for you…