There are several filetype options for capturing data from a Web page.
- A bookmark is the most minimal, capturing only the URL of the page.
- A ‘page’ or HTML capture saves the source code of the page to the database, but images can only be displayed by downloading from the Web each time the page is displayed in the database.
- A WebArchive capture is like an HTML capture that includes the page images.
- A rich text capture of selected text/images can avoid capture of unwanted areas of the page to the database and will include hyperlinks. (A plain text capture will not include formatting, images or links.)
- A PDF capture ‘freezes’ the Web page as a non-paginated or paginated PDF.
All of these options except the bookmark capture the text content of the page (or a selected area of it) and preserve that text for searches in DEVONthink. All except the bookmark and the plain text capture also keep hyperlinks, if present.
SOME PAGES CHANGE FREQUENTLY, SOME DO NOT
Some Web pages are relatively static and may persist unchanged for a long time, such as a news or journal article. Others change frequently, such as the home page of an online newspaper or journal.
I’ve got a collection of hundreds of bookmarks to Web sites that I find useful, including governmental agencies, journals, news sites, etc. There’s no point in indexing these pages, because they change frequently.
WHAT’S ‘INTERESTING’ ABOUT A PAGE CAN INFLUENCE THE CAPTURE CHOICE
Suppose I find a page interesting enough that it has a place in a DEVONthink database, so that its text content becomes indexed and is searchable.
A capture as HTML would capture the text and links, but if I’m offline I won’t be able to see the images, and if the Web page containing the images were to change in the future, I might not be able to see those images later, even when online. Although HTML captures are compact in file size, I rarely use them.
A Web site designer might be interested in documenting examples of page layouts. Appropriate choices for capturing pages would then be as WebArchive or non-paginated PDF. Such captures will generally retain the appearance of the page.
My interests center on the information content of a page, and I have little or no interest in the page layout or appearance. I want to capture the text, images, tables and hyperlinks contained in an article, while excluding ads and other page content not related to the article. My favored capture mode is as rich text of a selected area of the page. But suppose I wish to capture a thread in our user forum that contains a script in one of those little ‘Select All’ boxes? In that case I would probably select a portion of that page and capture it as a WebArchive, which would allow me to pull out that script text.
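For readers who like to see a decision spelled out, here is a minimal Python sketch of the reasoning above. It is only an illustration of my own preferences, not anything built into DEVONthink: the Page class, its field names and the suggest_capture function are all hypothetical, and the order of the checks is an assumption about which consideration wins when several apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """Hypothetical description of a Web page; field names are illustrative only."""
    changes_frequently: bool = False  # e.g. the home page of an online newspaper
    layout_matters: bool = False      # e.g. documenting examples of page layouts
    has_script_box: bool = False      # e.g. a forum thread with a 'Select All' box


def suggest_capture(page: Page) -> str:
    """Return a capture type following the preferences described in this post."""
    if page.changes_frequently:
        # No point in indexing content that will be different tomorrow.
        return "bookmark"
    if page.layout_matters:
        # These captures generally retain the appearance of the page.
        return "WebArchive or non-paginated PDF"
    if page.has_script_box:
        # A WebArchive of the selection lets the script text be pulled out later.
        return "WebArchive of selection"
    # Default: text, images, tables and links of the article, without the ads.
    return "rich text of selection"


if __name__ == "__main__":
    print(suggest_capture(Page(has_script_box=True)))
```

Someone with different interests, such as the Web site designer mentioned above, would simply reorder the checks or change the default.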
DEVONTHINK’S ARSENAL OF CAPTURE TOOLS
Services: Services provide useful interoperability between Cocoa-based applications such as Safari, DEVONagent and DEVONthink. If I select all or a portion of a Web page displayed in a Cocoa browser (including DEVONthink’s built-in browser) I can invoke a keyboard shortcut “Command-)” to capture the selection as rich text, or “Command-%” to capture the selection as a WebArchive. There are a number of other Services provided by DEVONthink as well, many of which are available as contextual menu items in DEVONthink’s browser. DEVONthink Services are also available in many other Mac applications.
Unfortunately, there are still a number of applications for Mac that don’t recognize Services, such as MS Office applications and Firefox. I find Services so useful to my workflows that I avoid using such applications, especially when I’m capturing content to DEVONthink.
Scripts: DEVONthink Pro and DEVONthink Pro Office have large AppleScript dictionaries and allow use of scripts to extend features and automate actions. With an application such as Safari or even Firefox frontmost, take a look at the Scripts menu in the menu bar. You may find a script there that will allow capture of content to your database.
Bookmarklets: On the DEVONtechnologies Download page there’s an ‘Extras’ link that leads one to a collection of Bookmarklets that provide a number of options for capture of content to a database. Simply drag the desired Bookmarklets into your browser’s Bookmarks Bar.
The Finder’s Inbox ‘Place’: DEVONthink Pro and DEVONthink Pro Office create a Finder folder that’s shown as “Inbox” under Places in the left column of Finder windows.
This provides a versatile way to send new content from a wide range of applications directly to the Global Inbox database, using the application’s Save or Save As command with the Inbox as the destination. For example, if one is viewing a Web page in Safari or Firefox, File > Save As will save a WebArchive of that page to the Global Inbox. If the displayed page is a PDF, it will be saved to the Global Inbox as a PDF. Likewise, one can create a new sheet in Excel or a new document in Pages and use File > Save, choosing the Inbox, to save that file directly to the Global Inbox.
SOME POSSIBLE SURPRISES
Some Web sites prohibit use of some of the capture options described above. A secure banking site or a university portal to journals may direct a Bookmarklet or script capture option to the login page, which will be captured instead of the desired content. That’s because the site prohibits dual access to the viewed page, so any capture option that requires re-download of the page will capture only the login page.
Even on such sites, the displayed page can always be selected and captured as rich text (from Safari) or as plain text (from Firefox). The viewed page can also be captured by ‘printing’ it as PDF to the database. When on my bank’s online site, to record a transaction I press “Command-P” to invoke the Print panel, click on the PDF button and then select the script to Save to DEVONthink as PDF to my database.