How to save linked URLs?

Is there any way to get DA to pull a page, and all of the linked URLs in that page to a certain depth, into its archive? I can't seem to figure out how to do this.

For example: I would like to save a page that has approximately 50 links to other domains. I want just the initial page, plus a single level from each of the linked URLs, along with the pertinent images and CSS files.

How can I accomplish this? Thanks!

Although a crawler set containing the initial page in combination with the "Follow Links" option (lowest level) is very similar, this is currently not possible (as DEVONagent scans all pages). This would only work if the links and the destination pages contain some common words.

E.g., assuming all pages contain the word "the" and all links contain the word "Link", then the following procedure should work:

  1. Create a new crawler set
  2. Add the page to that crawler
  3. Enter "the" as the default term
  4. Enter "Link" as the follow links term
  5. Activate the follow links option (lowest level) and deactivate all filters
  6. Press "Crawl"

But this won’t retrieve images or CSS files.
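For anyone wanting to script a depth-1 crawl outside DA, the core step is extracting the links from the initial page and resolving them against its URL. The sketch below (a hypothetical helper, not part of DA or DT) shows just that step using Python's standard library; fetching each resolved link, plus its images and CSS, would be the follow-up pass:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects href targets from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """Return all link targets in `html`, as absolute URLs."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

A depth-1 archive would then fetch each URL returned by `extract_links` once, without recursing further.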

Although I don’t see DA as being primarily a web archiving tool, I agree it would still be useful to be able to archive whole or partial websites. While there’s nothing wrong with using Internet Explorer (which I virtually never use anymore), it’s not ideal for a number of reasons: no services, MS glitches, and a proprietary WAFF file - which is not, incidentally, compatible with DT (need I say more!). To make DA a central repository for archived web materials - in conjunction with DT - would be ideal!

Sure, DA has the potential to do many more things, e.g. to become almost a complete browser, a web archive, a site sucker, an interface for databases or mailing list archives (some of the latest plugins not included in 1.0b1), etc. But first we’ll have to finish v1.0 and afterwards see what people really want/need.

I wholeheartedly agree. I’ve been waiting for a while for a really good web page(s) archiving system that displays pages as well as it archives them. With WebCore, this can finally become a reality - hopefully DT will have WebCore baked in soon, too!

This will probably be added to v1.8 (and then we’ll drop OS X 10.1.x support).

Please add my strong vote for web page/site archiving capabilities. For me, this is the biggest hassle. There’s no quick, elegant, searchable solution for Macs that archives web pages in an integrated way with other non-web data.

My work requires me to do fast web research, grab a lot of material quickly in case I need it later, and then, when writing documents, access the info really fast through robust searches.

I often need to archive entire sites, or numerous pages from a site, because my research takes me to sites that aren’t well maintained and can’t be relied upon to be there the next time I need the info.

DT seems to be the best for text and RTF. And DA is promising to be a great way to search the web, and to archive search results. But what the world needs, and what I need, is an integrated approach to all of that, plus archiving of entire pages, pages to a specified depth, and entire sites.

Until DA was released, Acrobat 6 was the best for archiving web pages or entire sites. You have to put up with the page breaks of a print-oriented application, but they’re always clean breaks. Links remain active. The original URL is saved in the footer. And, if you place all the web grabs in a common folder, you can build an index in Acrobat 6 that makes searches lightning fast. But you have to rebuild the index whenever the folder changes. The results are displayed in a reasonably convenient way. And Acrobat is cross-platform. Archiving in Acrobat is not likely to lock one’s data into a file format that might be only a memory in a few years.

But it’s a kludgy process. I’ve automated it somewhat with a QuicKeys shortcut.

DA/DT has the potential to be an elegant, seamless solution. And I’m going to support the cause with my registration in the next few hours even though the product is still in beta.

Archiving from DA>DT is one-click simple if all you want is one web page at a time. But for the moment, graphics that show up in DA often don’t show up in DT, and once in DT, links don’t work. At least some JavaScript links didn’t work today. Also, the preview text is too small, displays in an undesirable font, and isn’t displayed the way it is in Safari.

In my mind, one of the biggest services DT could perform would be to simplify, automate, and expand the choices the user has to conform all incoming text, which invariably arrives in different font styles, sizes, colors, and formats, to a user-defined style sheet. In a manner of speaking, DT would have a built-in textSOAP function that deletes multiple spaces, ensures there are two carriage returns between paragraphs, and changes the font to a user-preferred style. It would let me set that as an import default, and also let me decide on a case-by-case basis. I don’t have the processing power between the ears to deal with complex content AND a complex array of font styles too.
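As an illustration only (not anything DT actually does), the whitespace half of such a clean-up is a few lines of code; the function name and the exact rules here are assumptions:

```python
import re

def normalize_text(text):
    """Collapse runs of spaces/tabs and cap paragraph gaps at one blank line."""
    text = re.sub(r"[ \t]+", " ", text)      # delete multiple spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # two carriage returns between paragraphs
    return text.strip()
```

Normalizing fonts and colors would of course need access to the styled text, not just the plain characters, which is why it belongs inside the application rather than in a filter like this.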

Sorry for the length. Consider it a sign of a supportive user who is rooting for your guys in a big way.

This is a good suggestion, well stated. Unlike some of the feature suggestions that have been offered here (including some by me, no doubt), a feature set that "rationalizes" incoming text seems to be right in line with the goals/purpose of DEVONthink as I understand them. Some of the ideas presented in the quote above are already available "manually" through Word Services. Further integration and development of these features could lead to the creation of a "default style sheet" for incoming text. This seems very desirable to me.


Just for vote-tallying purposes, I also agree with the style sheet idea.