Creating an Offline Archive of a Web Site

Can anyone explain how to create an offline archive of a web site in DT Pro?

I have tried again and again to make an archive of this site:

socrates.clarke.edu/index.htm

Usually what I get is one file (index.htm), or about 172 graphics files. I certainly don’t get any content I can access while offline.

Further, the option settings make no sense, and/or they keep changing with no apparent rhyme or reason (i.e., I’ll set the options, try an unsuccessful download, then try another download, only to find the settings have inexplicably changed). Turning off the “automatic” setting has no effect, either–it just proceeds with yet another fruitless download.

The help files are no help here–they seem to assume that the process will work without a hitch, and that the options and settings are transparent.

As far as I can tell, this feature doesn’t work at all. Why isn’t anyone else complaining? What am I missing?

I’ve had hit or miss experiences with “Import Website” too.

I thought it actually captured a page and linked sites, and sometimes it seems to do so.

But if I use "Capture Page" on an imported site page, the result is subtly different: it uses the page's title tag instead of the actual HTML file name, and so on.

Unfortunately, linked CSS files and off-site images in the page don't get captured. (I suppose that with Import Site I could just keep the CSS file in the archive, but I like the apparent self-containment of the captured page.)
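
For what it's worth, here is a minimal sketch of what I mean by a self-contained capture, in plain Python with only the standard library (the URL and output names are just placeholders, not anything DT does internally): fetch the page, pull down the CSS files and images it references, save them alongside it, and rewrite the references to point at the local copies.

```python
# Sketch only: grab one page plus the CSS files and images it links to,
# and rewrite those references so the saved copy works offline.
# "example.org" and "captured_page" below are just placeholders.
import os
import re
import urllib.parse
import urllib.request

def capture_page(url, out_dir="captured_page"):
    os.makedirs(out_dir, exist_ok=True)
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

    # Find href/src attributes that point at CSS files or images.
    pattern = re.compile(r'(href|src)="([^"]+\.(?:css|png|gif|jpe?g))"', re.I)

    def localize(match):
        attr, ref = match.group(1), match.group(2)
        absolute = urllib.parse.urljoin(url, ref)           # resolve relative links
        local_name = os.path.basename(urllib.parse.urlparse(absolute).path)
        try:
            data = urllib.request.urlopen(absolute).read()  # download the asset
            with open(os.path.join(out_dir, local_name), "wb") as f:
                f.write(data)
            return f'{attr}="{local_name}"'                 # point at the local copy
        except OSError:
            return match.group(0)                           # leave it alone on failure

    rewritten = pattern.sub(localize, html)
    with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as f:
        f.write(rewritten)

capture_page("http://example.org/")
```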

Also, say I have opened a second-level imported page and click the link that should take me to the table of contents. If I have just imported the site, it will open Safari (wrong!) and fetch the page from the internet, when I think it should open the captured site page instead. And sometimes, even though I have imported a site, it will tell me that I'm not connected to the internet, which most of the time I already know; that's why I imported the site(!).

This linking problem goes away if I right-click on an imported page and choose "Capture Page"; then it behaves correctly. But if I understand the process correctly, importing was supposed to capture the page already, yet these kinds of hiccups remain and complicate my workflow.

Of course, I could just select all the imported site pages, right-click on them, and choose "Capture Page", but

A) this option, which is available on individual pages, is mysteriously unavailable for multi-page selections, though in most cases the options are identical;

and B) capturing creates a duplicate of the file, which means more browsing and deleting.

All that being said, I am pretty optimistic about DT Pro and look forward to further improvements; onward toward perfection!

Do you also have DA? With that program it's really simple to archive a website for offline viewing into DT.

Just open the desired website in your favourite web browser, click on the favicon (the blue jelly bullet before the http://) in the address bar, go to the menu [Browser name] > Services > DEVONagent > Open URL, and when the page has loaded in DA, just press Command-I. That's it! Now you have a web archive of the desired website in DT.

In Safari, Command-I opens the website in Mail; it would be great if there were a way to hack this into "Add web archive to DEVONthink"…


Another option is to open the website in Safari, press Command-S (Save As), and save it as a web archive. Then you can import it into DT by drag and drop. But this is still cumbersome. It would be great if you could add a site as a web archive with just one shortcut from Safari, something like Command-I…

Well, maybe I am missing something, but I thought the whole point of the “Download Manager” was that it could download not just the index page but the links–the rest of the site, including all of the content that is linked from the index. In other words, I thought the “Download Manager” was a site-sucker.

All the replies so far seem to be about copying a single page, which is not really a problem.

Am I wrong about DT's intent to suck down entire sites? If not, can someone who has been successful in doing this give us a tutorial on the proper options, settings, etc.?

1.) Go to Windows > Download Manager.
2.) Open the Action menu in the bottom right corner.
3.) Select "Options" (the last entry).
4.) There you can choose to follow all links from the same host, the same folder, a subfolder, or up to 2 levels deep.

Of course, this only works correctly with static HTML websites. Dynamic ones built with PHP and MySQL don't work.
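
Conceptually, that last option behaves like a small breadth-first crawl that stays on one host and stops after a fixed number of levels. A rough sketch in Python (standard library only; this is just to illustrate the idea, not what DT does internally):

```python
# Sketch of what "follow all links from the same host, up to 2 levels" means:
# a breadth-first crawl that stays on one host and stops at a fixed depth.
import re
import urllib.parse
import urllib.request
from collections import deque

def crawl(start_url, max_depth=2):
    host = urllib.parse.urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}                                   # url -> html source

    while queue:
        url, depth = queue.popleft()
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue                             # skip pages that fail to load
        pages[url] = html
        if depth == max_depth:
            continue                             # don't follow links any deeper
        for ref in re.findall(r'href="([^"#]+)"', html):
            link = urllib.parse.urljoin(url, ref)
            # only follow links that stay on the same host and haven't been seen
            if urllib.parse.urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages

# e.g. crawl("http://socrates.clarke.edu/index.htm") would fetch the index page
# plus everything reachable from it within two clicks on the same host.
```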

Does this match your inquiry?

(I use the German version of DT, so the menu entries may not be labeled exactly right in my post.)

As one who stores mostly URLs, I’m curious about why others want offline copies of sites. If a site updates tomorrow, the offline copy is old data. I’m guessing that a slow connection must be the main reason. Or else there’s a concern that the site will soon vanish?

I use DT to collect material for my Master's thesis, and therefore I need offline copies of Web documents for citations and excerpts. I also need the exact date and time of retrieval so I can prove that the document looked like this when I retrieved it.

I'm more curious why people want to store whole websites in DT. If you read the blog post by Steven Johnson stevenberlinjohnson.com/mova … 00230.html you'll understand why it doesn't make much sense to store overly voluminous documents: "I don't want the software to tell me that an entire book is related to my query. I want the software to tell me that these five separate paragraphs from this book are relevant. Until the tools can break out those smaller units on their own, I'll still be assembling my research library by hand in DevonThink." I think the same applies to whole websites. The AI/semantic search functions of DT don't work well when someone stores whole websites.

What does DEVON say about this issue?

The problems I described above still remain. The behaviour described in another post also remains: the randomness of the options, and how they don't stick when you set them in the Import Site dialog.

Well in my case the main reason to do so would be because I want to. :slight_smile:

OK, jokes aside. Why buy books when they are all in the library? I download a lot of textbooks, and things like the original text of Frankenstein don't change a whole lot, for instance.

It’s useful to me to have these texts available wherever I go, and pervasive wireless isn’t really there yet, and neither am I.

Also, I want to edit the HTML and write in a named anchor for each paragraph in the text. That way I can navigate with high resolution within a captured HTML page, in a DEVONthink way.

And I don't think the administrators of the sites where the texts I have sucked down are hosted would let me do that.
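
In case it helps anyone doing the same thing, here is a minimal sketch of that anchor-adding step in Python (standard library only; the file names are just placeholders):

```python
# Sketch: give every <p> tag in a captured HTML file an id so you can link
# straight to individual paragraphs (e.g. page.html#para-42).
import re

def add_paragraph_anchors(in_path, out_path):
    with open(in_path, encoding="utf-8") as f:
        html = f.read()

    counter = 0
    def tag_paragraph(match):
        nonlocal counter
        counter += 1
        return f'<p id="para-{counter}"'   # insert an id into each opening <p>

    # Only touch opening <p> tags that don't already carry an id attribute.
    html = re.sub(r'<p(?=[\s>])(?![^>]*\bid=)', tag_paragraph, html, flags=re.I)

    with open(out_path, "w", encoding="utf-8") as f:
        f.write(html)

# add_paragraph_anchors("frankenstein.html", "frankenstein_anchored.html")
```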

Hello,

After a great deal of frustration at trying to achieve offline browsing that works, I have chosen to use Web Devil (WD) with DTP. It is simple and it works. WD is a site sucker that downloads web sites to my iBook for offline viewing. I have created a folder in Documents called Web Devil Sites. I open a new session in WD, enter the target URL, and click Start. WD creates a new site folder in Documents/Web Devil Sites. Each site I download goes into its own folder for easy management. Once a site is downloaded, I open its folder, find the index file, and drag it to DTP. I now have offline browsing coupled with the power of DTP. I can also click on the index file to have the site displayed in a browser.

For those who need to have all of a site's data offline, this, I believe, is the most effective solution. It certainly works for me… and no more frustration!!! Of course there is additional cost involved: WD costs just $34.95. The site at the top of this post downloaded quickly and without error. If you are interested in WD you can find it here:

chaoticsoftware.com/ProductP … Devil.html

Or there is this:

maxprog.com/WebDumper.html

hardcat

To a considerable extent, the Download Manager and other site suckers will remain problematic until one can specifically 'tell' the process which links to follow and which to exclude. If the 'follow links' depth is too shallow, not all of the desired pages will be captured. But, as many find out, if the depth is too deep, it's easy to run out of hard disk space.
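
To put rough numbers on that: assuming, say, 20 followable links per page, a depth of 1 pulls in about 20 pages, a depth of 2 about 400, and a depth of 3 already about 8,000, and that's before counting images and other page resources. The growth is roughly exponential in the depth, which is why the 'too deep' case fills a disk so quickly.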

As noted by others, download manager and other site suckers work with static Web sites, but not with dynamic ones. And of course the design and layout chosen by the site’s designer often impacts the scope and appearance of a site capture. Personally, I make very little use of Download Manager or any other page sucker. I can recall only one totally successful site download, i.e., one that brought in all of the wanted pages and didn’t bring in extraneous material. I sent a fervent letter of thanks to the site administrator. Some months later, I tried to download that site to check for changes. They were using a new site design and the download was useless. So it goes.

Most of the time, I do single-page downloads. Sometimes I do want to capture for reference purposes the exact state of a page, and now we can do a Web Archive download.

If I want to capture a page together with a lot of related links I use Acrobat’s Web page capture feature, as it gives me total control of the links that I want to capture (and the ability to discard pages that turn out to be unrelated, after download).

Example: I've got a PDF document that is over a thousand pages long. It contains technical references of the type that often change over time, e.g., laboratory procedures that may be revised. Each page of my downloaded PDF references the date and time at which the capture was made, which can be important for quality control purposes. Acrobat is smart enough to 'internalize' the links in the PDF file. That is, if I had captured a page that is referenced by another page I've captured, the link is followed in the PDF itself, not out to the Web. Note that there are a lot of user decisions and some time invested in this approach. I went even further; because the actual page order was quite random rather than in linear topical order, I added Bookmarks to the document to make topical navigation very easy for users. That also took time.

In this case, I was providing reference materials intended for graduate student training in developing countries, and needed to assure that the included content was ‘on target’ and without irrelevant and potentially distracting or frustrating content. Unfortunately, at this time neither the Download Manager in DT/DT Pro nor any other site sucker software can produce results with the specificity and quality control of the example above. Maybe some time in the future?

Note: Johnson’s article in the New York Times has stimulated a lot of discussion among DT/DT Pro users about the ‘ideal’ size of documents in a DT/DT Pro database, especially for See Also purposes. I don’t ‘split’ large documents, but find that See Also still works pretty well for me. That 1,000+ page manual of lab procedures sits in my database. If I’m looking for analytical procedures for organic mercury compounds in fish tissue, it pops up – but it does not ‘dominate’ the list of suggestions, because relatively small portions of the big document are specific to that interest. And it’s a very useful reference, of course. Tip: Start See Also from a small document or selection, rather than from a big, diverse document.

Of course, my original question was, Has anyone actually gotten this to work? It seems not, and DT fails miserably (i.e., sucks) at site-sucking.

The interface is confusing and definitely needs some reworking. But I recently needed to archive two sites that I wanted frequent quick reference to (due to my slow dial-up connection; if I had broadband I wouldn't have bothered). I got it to work beautifully. I may have lucked out in that these sites were formatted well for this.

As for the settings in the Download Manager, I found that simply choosing “Subdirectory (complete)” from the Action menu worked to grab the whole site. If you want the site to go into your DT database then also select “Import files to database” – otherwise it will go to a folder elsewhere on your hard drive (which you can select).

Hmm, I think this is a very important thread. One of the main reasons I’ve invested in DT/Pro is to have a simple way of achieving an off-line archive that I can search and browse. For example I’d like to have an off-line archive of the RubyRails Wiki, so that I can refer to it whilst travelling.

I too have had very mixed experiences using the import facility and/or the download manager. Perhaps it does all work, but we just need better documentation or examples?

One issue I can't find a way around is that the Download Manager seems to put the imported documents into the top level of the DB. I'd like to be able to specify a folder to hold the imports.

Also, once an import has completed, it's difficult to tell which linked pages have also been sucked into the DB, as they still appear with their real URLs.

Any thoughts?

I like the capability to capture entire sites as well (for reference… some materials won't change much, especially academic papers, etc.).

IMO, this functionality is somewhat confusing and seemingly not well documented in DT.

I’ve experimented with it a bit but can’t seem to get it to produce something simple with the appropriate links/depth… so I basically gave up on that feature.

If I missed something in the tutorial, I apologize.

On another note, many of the great features of DT involve something of a learning curve and will turn away more than a few potential users. They are great products and I'd love to see the company prosper.

cheers

:slight_smile:

Not that I wish frustration on anyone else, but it's good to see I'm not the only one who can't get this to work. At this point, this aspect of DT Pro seems absolutely useless. I tried following the instructions to download the DEVONtechnologies home page, including links. I got the index page and that was it. I selected 'offline archive', then File > Import Site.

I've tried this on other sites. If I did something wrong, I'd love to know about it. I'd love a way to download a web page, go either one or two links deep, and exclude images and movies from the download.

Honestly, if all this feature does is download the top-level page specified, then it would appear this is an elaborate and time-consuming way to do what I could already do in the previous Personal edition by copying the URL into DT and hitting the 'capture' button.

This was one of the features I paid money for when upgrading to DT Pro, but so far either I’m being dense or this thing flat out doesn’t work.

Would be nice to get a comment from the developers on a topic so many appear to be interested in and frustrated with.
