Importing .html page

I’m fairly new to DevonThink, so forgive me if this has an obvious answer.

I’d like to import a webpage into the View/Edit pane, but I’m not sure how to do it. I drag the icon for the webpage into the View/Edit pane, but all that shows up is the icon itself, not the page. How do I make the actual page show up there?

When you drag a URL into DT, you create a “bookmark” of the web page, which contains zero bytes of text. Selecting that bookmark will open the web page in the DT browser window.

That can be a very useful feature. I’ve created a Bookmarks group, with a number of subgroups such as Scientific Journals, EPA, News, etc. That allows me to build a bookmarks collection of the web sites that I routinely visit.
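
For anyone who would rather build such a collection by script, here is a minimal sketch. It assumes that DT Pro's AppleScript dictionary offers `create location` and that `create record with` accepts `type:link` (the `link` type is the one tested in the scripts later in this thread). The group path, bookmark name, and URL are placeholders, and I have not checked this against every DT Pro version, so look at your own dictionary first.

```applescript
-- Sketch only: add one bookmark to a Bookmarks subgroup.
-- The group path, bookmark name, and URL are placeholders.
tell application "DEVONthink Pro"
	-- Assumption: create location returns the group, creating it if missing
	set theGroup to create location "/Bookmarks/News"
	-- Assumption: type:link creates a bookmark record
	create record with {name:"Example site", type:link, URL:"http://www.example.com/"} in theGroup
end tell
```

Like a dragged URL, a bookmark created this way should hold only the address, not the page text.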

Now, how to download information from the Internet using these bookmarks?

With a web page open, DT provides contextual menu options to capture selected text as a Note Capture, or to capture the page as HTML source (without images) or as a Web Archive (including images, viewable offline). Alternatively, you may use (in Safari, DEVONagent or DT/DT Pro) Services > DEVONthink or DEVONthink Pro > Take Rich Note to capture selected text/images. The keyboard shortcut is “Command-)”.

Note: I do 99+% of information captures as Note Captures, selecting only the text and images that I want to include in my database. Many – if not most – web pages contain extraneous material such as advertisements that I don’t want to download. Many sites, such as the New York Times online site, offer a printable version that eliminates unwanted material.

Note: DEVONthink Pro also provides scripts for various modes of capture of pages from Safari. There’s also a script for ‘printing’ any printable document – including web pages from any browser – as a PDF import to the database. Unfortunately, PDF captures don’t retain hyperlinks in web pages.

Thanks, Bill. I actually just figured out how to do exactly what I’m looking for: drag the icon for the web page into the Documents pane.

The name of the web page shows up in the Documents pane, and the actual webpage shows up in the Edit/View pane.

If by “icon” you mean the URL, you have created a bookmark in the database. Display the Info panel for that selected item (Shift-Command-I). You will see that a bookmark contains no text, and is not searchable for content. To import searchable text, you would have to capture it using one of the procedures in my first post above.

If, however, you mean by “icon” a local HTML file in the Finder (an HTML page saved on your HD), DT Pro will import the text content of that file.
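
If you would like to check several items at once instead of opening the Info panel for each, something along these lines might do. It is only a sketch: it assumes that records expose a `word count` property in your version's AppleScript dictionary, and the labels "bookmark" and "other" are just my own wording for the report.

```applescript
-- Sketch only: report kind and word count for each selected item.
tell application "DEVONthink Pro"
	set theReport to ""
	repeat with theRecord in (get the selection)
		-- Assumption: "word count" is available as a record property
		set theCount to word count of theRecord
		if type of theRecord is link then
			set theKind to "bookmark"
		else
			set theKind to "other"
		end if
		set theReport to theReport & name of theRecord & ": " & theKind & ", " & theCount & " words" & return
	end repeat
	display dialog theReport
end tell
```

A bookmark created by dragging a URL should show up with no words, while an imported local HTML file should show the word count of its text.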

Tip: If you want to drag web page URLs into your database (creating bookmarks), you can use DT Pro’s floating Groups panel. Here’s how:

[1] In DT Pro’s Preferences > General, UNCHECK “Hide ‘Groups’ panel when inactive.”

[2] Move the Groups panel to the right side of your screen, then minimize it to the Dock (click on the yellow button at top left).

[3] While viewing a web page in any browser, click on the icon of the Groups panel to maximize it, then drag the URL (address field) into the desired location on the Groups panel. When finished, minimize the Groups panel back to the Dock so that it will be ready for future use.

Note: The Groups panel is also convenient for copying selected text/images from any application into your database as a text clipping. However, this doesn’t capture metadata (such as the URL of a web page) along with the clipping. You can also import files from the Finder (e.g., a PDF file or Word file) into a desired location in your database in this way. The file(s) will be imported according to your preferences for that file type.

I like to keep my Dock hidden. Instead, I expand the Groups panel to the full depth of my screen and use the green button to do whatever the green button does, moving that smaller window so that only the title bar is visible at the bottom of my screen. Clicking the green button again pops it up.

I tested this by selecting the above paragraph in OmniWeb and dragging it to the Groups panel, drilling down at least two levels to my devonTHINK Tips folder. In this case, at least, the URL was captured.
I have found, though, that on occasion using Take Rich Note from the Services menu will not capture the URL. Not sure if there’s a pattern there or not; maybe I’m just closing the web window too quickly after hitting Command-).

Mark

Hi, Mark:

I’ve got Dock preferences set so that the Dock is hidden until I flick the cursor to the bottom of the screen.

In OmniWeb 5.1.3 the Services options for DT Pro only allow plain text captures. So OmniWeb doesn’t appear to be fully compliant with Cocoa/Services standards.

Is a puzzlement!
I use OW 5.1.3 with DT Pro 1.0.2 on OS X (10.3.9) all day. I can access all the Services options for DT Pro, including both Take Rich Note and Append Rich Note, using key commands where they’re offered.

Mark

Mark:

Correction. You are right. I had just installed OmniWeb to check it out and forgot to log out and back in to get Services working properly. :slight_smile:

Is there a way to “Capture Web Archive” for a whole folder of bookmarks, and have the resulting archives remain in the same folder?

I can only figure out how to do the capture on a page-by-page (not batch) basis… and the resulting archive winds up somewhere else entirely (in my Incoming folder).

Seems like an obvious thing to script but (1) the existing scripts don’t seem to cover it, and (2) my AppleScript talents are clearly not up to the task…

Thanks!

Here’s a script converting all selected bookmarks to web archives (storing them in the same parent group):


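-- Convert links to archives
-- For each selected bookmark, download its page as a web archive
-- and store the archive in the bookmark's parent group.

tell application "DEVONthink Pro"
	set theSelection to the selection
	if theSelection is not {} then
		try
			activate
			-- One progress step per selected item
			show progress indicator "Downloading..." steps (count of theSelection)
			repeat with theRecord in theSelection
				-- Only bookmarks (links) are converted
				if type of theRecord is link then
					set theName to name of theRecord
					set theURL to URL of theRecord
					step progress indicator theName
					set theData to download web archive from theURL
					-- Use the bookmark's own group if it has one
					if exists parent 1 of theRecord then
						set theGroup to parent 1 of theRecord
					else
						set theGroup to missing value
					end if
					-- Create the new record, then fill it with the downloaded data
					set theArchive to create record with {name:theName, URL:theURL, type:html} in theGroup
					set data of theArchive to theData
				else
					-- Not a bookmark: just advance the progress indicator
					step progress indicator
				end if
			end repeat
		end try
		-- Reached even if the try block stopped on an error
		hide progress indicator
	end if
end tell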

Wow… thank you!!!

Um… I hate to look gift horses in the mouth, etc, but it promptly crashed DTPro when I ran it. Let me know if you’d like the crash report… by email or here. Thanks!

Or you can create your very own crash report by pointing the script at a bookmark to the (now) nonexistent Web page:

http://www.lusora.com/index_console.html

(I went through the folder one by one until I found the crasher.)

Thanks for the bug report. I’ll check this; v1.1.2 should fix it.

Christian, thanks; this script works quite nicely.

What I think would be even better is a preference to index the text content of any URL you dragged into DTP in the first place. Although I suppose if you wanted to do that you’d just import the web page as a rich note, so perhaps it’s not necessary after all.

What I find myself really wanting is some sort of “meta” web archive function, for cases where several pages are linked from a single page and you want them all in one indexed archive or local cache.

For instance, this page:

tokohindonesia.com/ensiklope … ndex.shtml

Up at the top there are 6 pages of a bio entry on this guy. It would be heaven if somehow DTP could know to archive / cache all 6 of those pages from the bookmark. No idea how that might be done but…

Right now it’s rather laborious. I create a group called “Surya Paloh” and web archive each page and then drag each of 6 web archive pages to that group.

If only you could, for example, draw a drag box around a group of links in DTP!

For that matter I notice it is not possible to drag the URL icon of a bookmark in DTP to a group, as you could from Safari.

Thanks for hearing these random thoughts.

Actually it’s not that difficult, just have a look at this script:


-- Convert a bookmark and its numbered follow-up pages to archives

tell application "DEVONthink Pro"
	set theSelection to the selection
	if theSelection is not {} then
		try
			activate
			-- The number of pages is not known up front, hence steps -1
			show progress indicator "Downloading..." steps -1
			repeat with theRecord in theSelection
				if type of theRecord is link then
					-- First capture the bookmarked page itself
					set theName to name of theRecord
					set theURL to URL of theRecord
					step progress indicator theName
					set theData to download web archive from theURL
					if exists parent 1 of theRecord then
						set theGroup to parent 1 of theRecord
					else
						set theGroup to missing value
					end if
					set theArchive to create record with {name:theName, URL:theURL, type:html} in theGroup
					set data of theArchive to theData
					
					-- Collect links containing 1, 2, 3, ... and stop at
					-- the first number for which no link is found
					set theLinks to {}
					set theHTML to source of theArchive
					repeat with i from 1 to 99
						set theFoundLinks to get links of theHTML base URL theURL containing (i as string)
						if theFoundLinks is {} then exit repeat
						set theLinks to theLinks & theFoundLinks
					end repeat
					
					-- Capture each linked page that is not in the database yet
					repeat with theLink in theLinks
						if not (exists record with URL theLink) then
							step progress indicator theLink
							set theData to download web archive from theLink
							set theArchive to create record with {name:"", URL:theLink, type:html} in theGroup
							set data of theArchive to theData
							-- Name the archive after the page title, or its URL as a fallback
							try
								set theHTML to source of theArchive
								set theName to get title of theHTML
							on error
								set theName to theLink
							end try
							set name of theArchive to theName
						end if
					end repeat
				end if
			end repeat
		end try
		-- Reached even if the try block stopped on an error
		hide progress indicator
	end if
end tell

This script captures the selected bookmark and all additional numbered pages. Unfortunately, there’s a bug in DT Pro, so this will require the upcoming DT Pro 1.1.2beta5; otherwise it will capture the wrong pages too.

I experience a different kind of crash.

If I try to convert a bunch of selected links (all the links are live) DT Pro crashes when trying to capture this particular link:
reidreviews.com/reidreviews/

If I try to capture them one by one (using the script, not the built-in function) the capture works with no crash, including the link above.

The next release will fix this.

How could the script be changed to save the bookmarks as HTML files instead of webarchives?
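
One possible approach, offered as an untested sketch rather than a definitive answer: keep the structure of the first script above, but fetch the page source instead of a web archive. This assumes that DT Pro's dictionary provides a `download markup from` command that returns the page source as text, and that the `source` property of an HTML record (which the second script above only reads) can also be set. The variable names `theSource` and `thePage` are my own.

```applescript
-- Sketch only: convert selected bookmarks to plain HTML records
-- instead of web archives. Not verified against DT Pro 1.x.
tell application "DEVONthink Pro"
	set theSelection to the selection
	if theSelection is not {} then
		try
			show progress indicator "Downloading..." steps (count of theSelection)
			repeat with theRecord in theSelection
				if type of theRecord is link then
					set theName to name of theRecord
					set theURL to URL of theRecord
					step progress indicator theName
					-- Assumption: returns the HTML source of the page as text
					set theSource to download markup from theURL
					if exists parent 1 of theRecord then
						set theGroup to parent 1 of theRecord
					else
						set theGroup to missing value
					end if
					set thePage to create record with {name:theName, URL:theURL, type:html} in theGroup
					-- Assumption: the source property is writable
					set source of thePage to theSource
				else
					step progress indicator
				end if
			end repeat
		end try
		hide progress indicator
	end if
end tell
```

As with the HTML source capture from the contextual menu mentioned earlier in the thread, records created this way would not include the page's images for offline viewing.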