Best format for importing a web page?

rmathes23 · December 31, 2008, 9:30pm

If I have web content I wish to import, I can import it as a web archive, a bookmark, a PDF, or an RTF. I tried to see if there was a single topic discussing the relative merits of these different formats but couldn’t find one.

So, I thought I’d ask, and I apologize in advance if this has been already discussed at length and I simply couldn’t find it.

Any suggestions on the best way to do this? I’m starting to think that capturing it as an RTF might make the most sense. I can edit out unneeded information, and it seems to skip importing a lot of the eye candy from a page.

Just curious how other people prefer to do this, and why.

And happy new year to everyone! May 2009 be calmer and saner than 2008!

kewms · January 1, 2009, 12:10am

I generally use RTF, for exactly the reasons you give.

I’ll occasionally use a bookmark, for instance if the page is primarily an index to other pages that I want to examine more closely later.

Katherine

rabourn · January 1, 2009, 3:40am

I’ve wondered the same thing. However, I need to capture the state of a web page, and I want an exact snapshot of it. Saving as RTF frequently doesn’t preserve everything exactly as viewed. In the past I’ve used the AppleScript installed with the DT extras, “add page to DevonThink,” but I also have two other scripts: “add web archive to DevonThink” and “add web document to DevonThink.” I’m not sure which to use. Using “add page” seems to reload the page every time I view it. However, I’m still able to view it when I’m not online. So, I’m not really sure what the differences are among these three options.

sjk · January 1, 2009, 5:14am

There’s been recent discussion about different formats and capture methods. I wasn’t able to find some specific posts (sorry) and, in general, can certainly sympathize with the difficulty of finding that type of information. Earlier today:

A more prominent location for keeping an updated list of open issues and other recurring, FAQ-like info is desperately needed.

And last week:

Why not create an easily referenced sticky thread about submitting bug reports (and other oft-repeated FAQ-like information)? Seems silly to post it where it’ll eventually get buried, then redundantly reposted again in some other thread, buried…

Anything to help cut down those apply, lather, rinse, repeat cycles here, eh?

I hope other people will contribute to that suggestion if they see its value to the DEVON community so I’m not the only one making it.

Jones · January 3, 2009, 8:02pm

RTF was developed and is owned by Microsoft.

Although PDF was developed by Adobe, it has now gone through the international standards process and is defined in ISO 32000-1:2008 as an open standard.

I’m not making a value judgment here, just offering info that may be useful for some of you.

sjk · January 3, 2009, 9:07pm

And the “Web Archive” format is Apple-developed and still OS X-only (AFAIK), making it more restrictive than RTF.

rabourn · January 3, 2009, 9:51pm

So, if your goal is to use the most open standard, then PDF might be the right choice. What if your goal is to capture a page exactly as it appears at the moment you are viewing it (plus you want the text indexed, so a screenshot is not an option)? Is PDF the best option in that case too? Is web archive considered chancy because future versions of the Mac OS might display a web archive differently from how it originally appeared?

Bill_DeVille · January 3, 2009, 10:36pm

One can capture a Web page into DEVONthink as PDF, WebArchive, HTML, rich text or plain text.

PDF and WebArchive captures most closely approximate the layout of a Web page. HTML captures the layout, but not images. RTF only approximately does that (sometimes well, sometimes badly, depending on the page layout), and plain text captures only text content.

Almost always, my interest is a specific article, perhaps with associated images, on a Web page. I want to “freeze” that information in my database so that, whether or not the source page disappears later, and whether or not I’m online, it is now contained in my database. WebArchive (if the contained images are inline), PDF and RTF captures can do that, but HTML or plain text captures cannot do that.

Another consideration is that once a document is in my database, I may want to extract text from it, or add links or notes to it. If I make a PDF capture, extracted (copied) text has hard returns at line endings, and I hate that. I cannot link from a PDF document, and any notes I add are plain text only and may not be searchable. So PDF isn’t my favored format. Although I can copy text from a WebArchive without problems, I can’t add links or write on it (I’m not about to do source code editing).
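(For anyone who does end up with text copied out of a PDF, the hard returns at line endings can be cleaned up with a short script. The following is only a minimal sketch in Python, not anything DEVONthink provides: the function name `unwrap_hard_returns` is made up for illustration, and it assumes paragraphs in the copied text are separated by blank lines, which is not always true of PDF extractions. It also does not rejoin words hyphenated across a line break.)

```python
import re


def unwrap_hard_returns(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: a blank line marks a real paragraph break; any other
    line break is treated as a hard return left over from the PDF layout.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Split into paragraphs on one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text)

    cleaned = []
    for para in paragraphs:
        # Drop empty lines and surrounding whitespace, then join with spaces.
        lines = [line.strip() for line in para.split("\n") if line.strip()]
        if lines:
            cleaned.append(" ".join(lines))

    # Keep one blank line between paragraphs.
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "This line was\nwrapped by\nthe PDF.\n\nSecond paragraph\nhere."
    print(unwrap_hard_returns(sample))
    # Prints:
    # This line was wrapped by the PDF.
    #
    # Second paragraph here.
```

(If the copied text has no blank lines between paragraphs, everything collapses into one paragraph, so the result still needs a quick look by eye.)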

Still another consideration is that a PDF or WebArchive capture is likely to contain content that I don’t want or need, such as ads or unrelated text and images.

Only the RTF(D) format survives those considerations for a capture that permanently holds what I’m interested in, allows easy extraction of text that doesn’t have to be edited to remove hard returns at line endings and is easily editable to add my own hyperlinks and notes, and doesn’t contain extraneous content.

So, about 99% of the time, I capture information from Web pages as rich text.

Now that I’ve got rich text, I can do anything I wish with it. I can drop it into various word processors (with or without images), turn it into PDF, Word, HTML, WebArchive or whatever. So this is as “universal” as I need it to be.

kewms · January 3, 2009, 11:30pm

I would say yes, PDF is the best option for capturing a page exactly as it appears. That’s pretty much the application for which PDF was invented. The web wasn’t a factor yet (1993), but people needed a platform-independent way to send documents around so that printers, graphic artists, and layout people could all know they were looking at the same thing. Since then, the format has also added security features, which are important if you need to be able to prove that yes, this really was what you had on the screen. (Although to implement those you need more than a simple viewer.)

Katherine

rmathes23 · January 4, 2009, 1:00am

Bill, great information, and it parallels my line of thought. I just tried extracting text from a PDF and was not pleased with the experience.

Interesting that others find PDF to closely resemble the page layout. I find the opposite is true: PDF doesn’t look much like the web page at all. Neither does RTF. Both seem to do a pretty reasonable job of stripping out ads and other unrelated graphics. If I want the page to look like it did when viewed online, I take a web archive. Otherwise, at this point I’m leaning towards RTF.