Storing as HTML, Web Archive, PDF...?

anthrovisual · March 14, 2009, 2:07am

I’m beginning to gather a great many web pages that I would like to store for the long run in my database, yet I can’t decide which format is best.

DEVONthink 2.0 seems to allow several formats: bookmark, HTML, web archive, and PDF (the last of which allows the whole page to be viewed, not just the text).

What are the benefits and drawbacks of each, and which should I use?

I am interested in preserving the pages as-is, and I want them to be easily searchable, possibly editable in case I want to highlight, interactive if there are links, and of course able to stand the test of time.

Bill_DeVille · March 14, 2009, 2:30am

There’s another capture mode: rich text capture of selected text/images/tables from a Web page. That’s the one I almost always use, as I can restrict the capture to just the material I want in my database, avoiding extraneous elements such as ads or other, unrelated text.

Note that if the DT Pro Services are used, e.g., in viewing a page in Safari, one can restrict the page content for a WebArchive capture by selecting just the desired portion of the page and invoking the Command-% keyboard shortcut. (Not available in Firefox.)

PDFs also retain the images and text as of the time of capture, whether or not that page subsequently disappears from the Web. But if one wishes to extract text from a PDF, the hard line endings in the extracted text will require editing in order to properly insert a quoted excerpt into another document.
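For anyone who does copy text out of PDFs often, the cleanup of hard line endings can be scripted. What follows is only a minimal sketch in Python, not anything built into DEVONthink; the function name `unwrap_hard_line_endings` is made up for illustration, and it assumes that a blank line marks a real paragraph break while a single line break is just a wrap.

```python
import re


def unwrap_hard_line_endings(text: str) -> str:
    """Join wrapped lines within each paragraph; keep blank lines as paragraph breaks."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    cleaned = []
    # Assumption: one or more blank lines separate paragraphs.
    for para in re.split(r"\n\s*\n", text):
        # Rejoin words hyphenated across a line break ("ex-\ncerpt").
        # Caveat: this also joins genuinely hyphenated words split at the hyphen.
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Collapse the remaining line breaks and runs of spaces into single spaces.
        para = re.sub(r"\s+", " ", para).strip()
        if para:
            cleaned.append(para)
    return "\n\n".join(cleaned)
```

Pasting the copied PDF text through a function like this gives flowing paragraphs that can be dropped into another document as a quote. The hyphen handling is a guess at what one usually wants, so the result is still worth a quick read.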

HTML captures do not download images to your computer, so they will not be available when offline or if the page subsequently disappears from the Web.
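To see how much of a saved HTML page still depends on the Web, one can list the image references that point at remote servers. This is a rough sketch using only Python's standard library; the names `ImageSourceCollector` and `remote_image_sources` are hypothetical, and it only flags absolute `http`, `https`, and protocol-relative (`//host/...`) URLs, so a relative path may still resolve to the Web depending on how the capture was stored.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImageSourceCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def remote_image_sources(html: str) -> list[str]:
    """Return the image URLs that would have to be fetched from a remote server."""
    collector = ImageSourceCollector()
    collector.feed(html)
    return [
        src
        for src in collector.sources
        if urlparse(src).scheme in ("http", "https") or src.startswith("//")
    ]
```

Any URL this returns is an image that will be missing when offline, or for good if the original page goes away.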

anthrovisual · March 14, 2009, 2:53am

So it sounds like an RTF selection or a selected Web archive is the way to go. I did not think about being selective in the capture, and am now realizing its importance upon looking at some pages I captured a long time ago.

I suppose my question then becomes: what is the difference between an RTF selection and a selective Web archive?

I might be able to figure out the difference myself with some testing, but any thoughts would be helpful.

Bill_DeVille · March 14, 2009, 5:29am

I’ve got tens of thousands of rich text clippings from Web pages, and find that selecting just the material I want improves the focus of Search and See Also, as well as saving space.

Selecting some text and images from a Web page as selected WebArchive or as RTFD obviously brings down the same content, and the source URL is captured in the Info panel of the document in either case. They are different file types, though. Storage space will be pretty much the same.

Obviously, I can add notes and hyperlinks to a rich text document, but not to a WebArchive document (unless I edit the source code, which is more of a pain).

anthrovisual · March 14, 2009, 11:45am

So with editing capabilities and perhaps a more universal format, it seems that rich text clippings win out, especially since I do not think I can send a Web archive to a friend on Windows (at least not from what I recall).

Thank you for your help. I’m glad I asked.

Now, if only you could help me with the information overload I am experiencing now that I can capture all this data.