webarchive PDFs not indexed?

Hello:

I think PDFs added to DT from Safari as webarchives aren’t getting indexed.

I’ve put an example DT database at

[itee.uq.edu.au/~ianm/dt/](http://www.itee.uq.edu.au/~ianm/dt/)

It contains copies of the PDF at:

[lcsd05.cs.tamu.edu/slides/keynote.pdf](http://lcsd05.cs.tamu.edu/slides/keynote.pdf)

In ascending order of modification date, these were added as follows:

  • script from Safari (Add web archive to DEVONthink)
  • save from Safari to desktop and drag PDF to DT
  • DA action menu (Add to DEVONthink > Web Archive)
  • DA action menu (Add to DEVONthink > PDF paginated)

Strangely, the third of these (DA saving as Web Archive) resulted in a Link rather than a webarchive. The second and fourth are correctly identified as duplicates.

The real problem is that the first (webarchive from Safari) doesn’t appear to be indexed and is an order of magnitude larger than the others (567KB rather than 42KB). The search problem shows up when searching for “Bloch” (the PDF author): only the second and fourth versions (dragged from Finder and PDF from DA) are found. The much larger first one should also be found, and presumably the third one too if it had been added as a webarchive rather than a link.

As a workaround I can add PDFs via DA only.

thanks

Ian

Ian, you can’t save a PDF as a WebArchive file. They are different file types.

The command or script to save a Web page as a WebArchive file results in the capture of the HTML source and images into the WebArchive package file, so that one can view the page and see the images even when offline.

But that command or script doesn’t disassemble a PDF file into its text and graphic components, which would be quite a trick, then reassemble those components into the WebArchive file type.
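For anyone who wants to check what actually ended up inside a captured .webarchive file, here is a minimal Python sketch. It assumes the file is a property list with the usual Safari layout (a `WebMainResource` entry plus an optional `WebSubresources` list); those key names are my assumption about the format, and `summarize_webarchive` is just a hypothetical helper name, not anything DT or DA provides.

```python
import plistlib


def summarize_webarchive(data: bytes) -> dict:
    """Summarize the main resource and subresources of a .webarchive file.

    Assumes the Safari-style layout: a property list whose top-level
    dictionary has "WebMainResource" and, optionally, "WebSubresources".
    """
    # plistlib.loads detects XML vs. binary property lists automatically.
    archive = plistlib.loads(data)
    main = archive.get("WebMainResource", {})
    subresources = archive.get("WebSubresources", [])
    return {
        # URL the main resource was captured from, if recorded.
        "url": main.get("WebResourceURL"),
        # MIME type of the main resource, e.g. text/html for a Web page.
        "mime_type": main.get("WebResourceMIMEType"),
        # Size in bytes of the raw data stored for the main resource.
        "main_bytes": len(main.get("WebResourceData", b"")),
        # MIME types of images, stylesheets and other captured resources.
        "subresource_types": [r.get("WebResourceMIMEType") for r in subresources],
    }
```

To use it, read the file in binary mode and pass the bytes in, e.g. `summarize_webarchive(open(path, "rb").read())`. The reported MIME type and sizes should give a hint as to what was really stored in the archive.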

If I want to capture a PDF file that has bookmarks and hyperlinks, I would avoid all of the techniques you used except for saving the PDF to the Finder and importing it from there into DT Pro. The DA command to add to DT as paginated PDF is equivalent to printing to PDF, which doesn’t currently retain hyperlinks in the PDF.

It is possible to capture a “complete” PDF using a script, but currently the PDF is stored in the body of the database rather than in the internal Files folder. That would increase the memory footprint of the database were I to import many big PDFs that way, so I don’t do that. In version 2.0, PDFs captured in that way will be stored in the internal Files structure.

Thanks, Bill.

I hadn’t noticed the problem with the links in the PDF (most of the PDFs I capture lack links).

The reason I’m interested in importing from the browser rather than the Finder is to keep the source URL. When the rush is over, I’ll think about a script to do this.

thanks again

Ian