Newbie Q about Saving Website Info

cheezenip · May 10, 2010, 3:56pm

Hi,

Newbie question here. I’m saving webpages with great info in DT so I can access them later with keyword searches in the article (as opposed to using tags). I’ve noticed that when I access these files later, DT links directly to the original website link.

If that original website link is deleted in the future by the original website administrator, will I still be able to access that article I saved in DT? If not, will I need to save the website info into another format before saving it to DT to make it searchable?

Bill_DeVille · May 10, 2010, 7:27pm

There are several filetype options for capturing data from a Web page.

A bookmark is the most minimal, capturing only the URL of the page.
A ‘page’ or HTML capture saves the source code of the page to the database, but images can only be displayed by downloading from the Web each time the page is displayed in the database.
A WebArchive capture is like an HTML capture that includes the page images.
A rich text capture of selected text/images can avoid capture of unwanted areas of the page to the database and will include hyperlinks. (A plain text capture will not include formatting, images or links.)
A PDF capture ‘freezes’ the Web page as a non-paginated or paginated PDF.

All of these options except the bookmark capture the text content of the page (or a selected area of it) and preserve that text for searches in DEVONthink. All except the bookmark and the plain text capture also keep hyperlinks, if present.
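
For anyone who finds a lookup table easier to scan than prose, here is a rough Python sketch of the options listed above. It is only an illustration and not anything DEVONthink provides: the `CaptureType` class, its field names and the `candidates` helper are hypothetical, and marking PDF as storing images is my reading of the statement that a PDF 'freezes' the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureType:
    """Hypothetical summary of one capture option described in the post."""
    name: str
    keeps_text: bool      # page text is stored in the database and searchable
    keeps_links: bool     # hyperlinks survive the capture
    stores_images: bool   # images are stored, not re-downloaded on display


# Values follow the descriptions in the post. The PDF row is an assumption
# based on the statement that a PDF capture "freezes" the page.
CAPTURE_TYPES = [
    CaptureType("bookmark", False, False, False),
    CaptureType("HTML", True, True, False),
    CaptureType("WebArchive", True, True, True),
    CaptureType("rich text", True, True, True),
    CaptureType("plain text", True, False, False),
    CaptureType("PDF", True, True, True),
]


def candidates(need_links=False, need_images=False):
    """Return the names of searchable capture types that meet the given needs."""
    return [
        c.name
        for c in CAPTURE_TYPES
        if c.keeps_text
        and (c.keeps_links or not need_links)
        and (c.stores_images or not need_images)
    ]


if __name__ == "__main__":
    print(candidates(need_images=True))
```

Calling `candidates(need_images=True)` narrows the list to the options that keep images inside the database, which is the property that matters if the original page later disappears.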

SOME PAGES CHANGE FREQUENTLY, SOME DO NOT

Some Web pages are relatively static and may persist unchanged for a long time, such as a news or journal article. Others change frequently, such as the home page of an online newspaper or journal.

I’ve got a collection of hundreds of bookmarks to Web sites that I find useful, including governmental agencies, journals, news sites, etc. There’s no point in indexing these pages, because they change frequently.

WHAT’S ‘INTERESTING’ ABOUT A PAGE CAN INFLUENCE THE CAPTURE CHOICE

Suppose I find a page interesting enough that it has a place in a DEVONthink database, so that its text content becomes indexed and is searchable.

A capture as HTML would capture the text and links, but if I’m offline I won’t be able to see the images, and if the Web page containing the images were to change in the future, I might not be able to see those images later, even when online. Although HTML captures are compact in file size, I rarely use them.
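
To make the HTML limitation concrete: saved page source only contains references to its images, so each one still has to be fetched from the Web. The sketch below is a minimal illustration using Python's standard library and has nothing to do with DEVONthink's own code. The `ImageRefCollector` class, the `remote_image_urls` helper and the example.com addresses are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageRefCollector(HTMLParser):
    """Collects the URLs that <img> tags in saved HTML source point to."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page's original address.
                self.image_urls.append(urljoin(self.base_url, src))


def remote_image_urls(html_source, base_url):
    """Return every image URL the saved page would still need to download."""
    collector = ImageRefCollector(base_url)
    collector.feed(html_source)
    return collector.image_urls


# Hypothetical saved page source; the addresses are placeholders.
sample = (
    "<html><body><h1>Article</h1><p>Some text.</p>"
    '<img src="/img/chart.png">'
    '<img src="http://cdn.example.com/photo.jpg">'
    "</body></html>"
)
urls = remote_image_urls(sample, "http://www.example.com/articles/1.html")

if __name__ == "__main__":
    for url in urls:
        print(url)
```

Every address the script prints is something the saved page depends on. If the site removes those files, the text in the HTML capture stays searchable but the pictures are gone, which is why a WebArchive, rich text or PDF capture is the safer choice when the images matter.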

A Web site designer might be interested in documenting examples of page layouts. Appropriate choices for capturing pages would then be as WebArchive or non-paginated PDF. Such captures will generally retain the appearance of the page.

My interests center on the information content of a page, and I have little or no interest in the page layout or appearance. I want to capture the text, images, tables and hyperlinks contained in an article, while excluding ads and other page content not related to the article. My favored capture mode is as rich text of a selected area of the page. But suppose I wish to capture a thread in our user forum that contains a script in one of those little ‘Select All’ boxes? In that case I would probably select a portion of that page and capture it as a WebArchive, which would allow me to pull out that script text.

DEVONTHINK’S ARSENAL OF CAPTURE TOOLS

Services: Services provide useful interoperability between Cocoa-based applications such as Safari, DEVONagent and DEVONthink. If I select all or a portion of a Web page displayed in a Cocoa browser (including DEVONthink’s built-in browser) I can invoke a keyboard shortcut “Command-)” to capture the selection as rich text, or “Command-%” to capture the selection as a WebArchive. There are a number of other Services provided by DEVONthink as well, many of which are available as contextual menu items in DEVONthink’s browser. DEVONthink Services are also available in many other Mac applications.

Unfortunately, there are still a number of applications for Mac that don’t recognize Services, such as MS Office applications and Firefox. I find Services so useful to my workflows that I avoid using such applications, especially when I’m capturing content to DEVONthink.

Scripts: DEVONthink Pro and DEVONthink Pro Office have large AppleScript dictionaries and allow use of scripts to extend features and automate actions. With an application frontmost, such as Safari or even Firefox, take a look in the menubar in the Scripts menu. You may find an available script that will allow capture of content to your database.

Bookmarklets: On the DEVONtechnologies Download page there’s an ‘Extras’ link that leads one to a collection of Bookmarklets that provide a number of options for capture of content to a database. Simply drag the desired Bookmarklets into your browser’s Bookmarks Bar.

The Finder’s Inbox ‘Place’: DEVONthink Pro and DEVONthink Pro Office create a Finder folder that’s shown as “Inbox” under Places in the left column of Finder windows.

This provides a versatile way to send new content from a wide range of applications directly to the Global Inbox database, using the application’s Save or Save As command. For example, if one is viewing a Web page in Safari or Firefox, File > Save As will save a WebArchive of that page to the Global Inbox. If the displayed page is a PDF, it will be saved to the Global Inbox as a PDF. Likewise, one can create a new sheet in Excel or a new document in Pages and use File > Save to the Inbox to save that file directly to the Global Inbox.

SOME POSSIBLE SURPRISES

There are a number of Web sites that one might visit that prohibit use of some of the capture options described above. A secure banking site or a university portal to journals may direct a Bookmarklet or script capture option to the login page, which will be captured instead of the desired content. That’s because the site prohibits dual access to the viewed page. A capture option that requires re-download of the page on such sites will capture only the login page.

In every case, though, the page can be selected and captured as rich text (from Safari) or as plain text (from Firefox). The viewed page can also be captured by ‘printing’ it as PDF to the database. When on my bank’s online site, to record a transaction I press “Command-P” to invoke the Print panel, click on the PDF button and then select the script to Save to DEVONthink as PDF to my database.

physicistjedi · May 12, 2010, 9:12pm

Actually, the latest Firefox recognizes Services and can capture into DEVONthink as rich text, but it doesn’t pass through the URL. Is this something the Devon people can take care of?

pj

cgrunenberg · May 13, 2010, 9:32am

Only if Firefox is going to support AppleScript some day (or provide the URL to the service too). DEVONthink is prepared for both approaches.

cheezenip · May 27, 2010, 2:35pm

I’ve been out of town for the last few weeks and am now playing catch up. Thank you so much for a thorough answer to my question.

In many ways, it may be best to use something like Yojimbo to save the website and convert it to PDF before saving it to DT P Office, to get the best of all worlds: searchable text and preserved formatting.

As I become more familiar with the software, I’m sure I’ll find better ways to optimize the power of DT.

cheezenip · May 27, 2010, 2:42pm

Just saw the following on Devon’s blog:

…If you are only interested in the actual information the best option may be to select text and images and drag them from Safari or DEVONagent to your database. This saves the selection including images as a rich text document which should be relatively future-proof (as it is a widely used standard) and saves only the data you are interested in, not n copies of the words ‘Home’ and ‘Back’.

So, depending on your needs, saving only the interesting parts of a web page can be more efficient than saving the whole page as a web archive or PDF. If you are interested in the original look of the page, PDF is a future-proof, standard-based option…