Batch Capturing of Web Pages

I just discovered DEVONthink Pro and consider it to be an absolutely awesome piece of software, and I would now like to migrate my previous solutions (VoodooPad, OmniOutliner, as well as my bookmark and document archives) into DTP.

I have quite a large file structure in which, over several years, I have saved a plethora of web links (.url, .webloc) as well as other documents (.pdf, .xls, .doc, .htm, .html, among others), sorted in the folder hierarchy according to my own taxonomy. The total number of links and documents is more than 20,000.

What I would now like to do is:

  1. import the entire folder structure into DTP (already done, in fact; it was a breeze…)

  2. let DTP automatically load every web page to which a webloc/url document points, save the page in the SAME LOCATION as the referring webloc/url document, and index the loaded documents

  3. have DTP mark (label? comment?) the webloc/url documents according to download success (OK/404/timeout, etc.), so that I can delete or check them manually

and later, as another workflow:

  4. drag & drop further folder structures containing unsorted webloc/url documents into DEVONthink Pro, let DEVONthink Pro load the corresponding web pages (see step 2 above), and, after it has analyzed the documents, insert them automagically into the existing taxonomy at the ("best guess") appropriate places.

Can steps 2, 3, and 4 be automated via script somehow? Doing the task manually, folder by folder, would be very cumbersome, and if I understand the program correctly, DTP actually seems to contain all the components required to automate the described task.

Should this scenario already have been discussed in the forum, the FAQs, or elsewhere, could you please post the link? After a brief search, at least, I haven't been able to find any corresponding info so far.

Thank you

Juha Krapinoja

  1. This thread (http://www.devon-technologies.com/phpBB2/viewtopic.php?t=2475) contains a basic script to convert links to web archives, storing them in the same parent group; a sketch of that approach follows below.

  2. You could use the command Data > Auto Classify for this task.
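Since the linked thread may no longer be available, a minimal sketch of that kind of script could look like the following. It assumes DEVONthink Pro's AppleScript dictionary (the `create web document from` command and the `selection`, `type`, `URL`, and `parents` properties); exact names can differ between versions, so treat it as an illustration rather than the original script:

```applescript
-- Sketch: convert the selected bookmarks to web documents,
-- storing each capture in the same parent group as the bookmark.
tell application "DEVONthink Pro"
	set theSelection to the selection
	repeat with theRecord in theSelection
		if type of theRecord is bookmark then
			-- the first parent group of the bookmark
			set theGroup to item 1 of (parents of theRecord)
			-- load the page and store it next to the bookmark
			create web document from (URL of theRecord) in theGroup
		end if
	end repeat
end tell
```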

Thank you, Christian.

As I am not a script programmer, I unfortunately will not be able to modify the script in the required manner (recursively working through the subfolders, or marking the URL files depending on the capture results, for example).

Would any scripter perhaps "by coincidence" be interested in trying to code this…? :)

A recursive solution is possible, but it's not possible to "mark" the bookmark.
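For the recursive part, a sketch along these lines might work, walking the `children` of a group and descending into subgroups (again untested, and note that the `try` block only catches AppleScript errors; HTTP-level failures such as a 404 page are not necessarily reported, which is why marking by capture result is problematic):

```applescript
-- Sketch: recursively capture a web document for every bookmark,
-- storing it in the group where the bookmark lives.
tell application "DEVONthink Pro"
	my processGroup(current group)
end tell

on processGroup(theGroup)
	tell application "DEVONthink Pro"
		repeat with theChild in (children of theGroup)
			if type of theChild is group then
				my processGroup(theChild) -- descend into the subgroup
			else if type of theChild is bookmark then
				try
					create web document from (URL of theChild) in theGroup
				on error
					-- the capture command itself failed; a "404 Not
					-- Found" page may still load without raising an error
				end try
			end if
		end repeat
	end tell
end processGroup
```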

Do you mean that, for example, the label could not be set depending on the result of the capture?

I assume that, basically, either the labels could be set via script(?) or some text could be appended to the comments via script(?), but that finding the trigger for these actions would be the core of the problem. Is this assumption correct?

I would, however, already consider it quite helpful if all the URLs for which a capture attempt has been made were marked in some way (user-selectable?), so that they can be skipped in the future.
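For illustration: a record's `label` (an integer index into the label colors) and `comment` properties are exposed to AppleScript, so marking every bookmark for which a capture has been attempted could, in principle, be as simple as the following sketch (the label index and comment text are arbitrary choices):

```applescript
-- Sketch: mark the selected bookmarks as "capture attempted" so
-- they can be skipped (e.g. by checking the label) on later runs.
tell application "DEVONthink Pro"
	repeat with theRecord in (the selection)
		if type of theRecord is bookmark then
			set label of theRecord to 7 -- any label index from 1 to 7
			set comment of theRecord to (comment of theRecord) & " [capture attempted]"
		end if
	end repeat
end tell
```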

Thanks for your reply, and happy holidays :)