Web archives of blog pages are noisy


When I do research on a topic, I typically surf the web and take lots of web archives. Many of the pages I archive are blog pages, where there is a lot of “noise” (advertisements, history, …) around the main article I am interested in.

Because of the noise, I get fairly poor results from the “categorize” and “see also” functions.

Is there a best practice for handling this? How do you guys handle it?


You could edit the web archives and delete all unnecessary parts (DEVONthink Pro only), but personally I prefer to take rich notes of the interesting parts.

I will try taking rich notes then. Hmm, but I guess you can’t refer back to the original web page later because the URL is missing. Still, I’ll give it a try.


If you use drag-n-drop or the Services menu, DT will try to capture the URL as well. Ditto for the scripts in the Script menu (on the right-hand side of the menu bar, if you’ve enabled it).

I do the vast majority of captures from the Web as rich text, which includes selected text, tables and images.

When doing this from Safari or another Cocoa Web browser using the shortcut Command-) (that is, Shift-Command-0), the URL is automatically included in the new document’s Info panel.

Rich text capture of selected material is also available via a contextual menu option in DEVONagent and in the built-in DEVONthink browser.

But you cannot capture rich text if Firefox is the browser.

Tip: To select a long article, place the cursor at the bottom of the material, press the Shift key and “swoop” the cursor upwards.

Ahh! Great. I have used Firefox as my primary browser so far, but this is a good reason to switch back to Safari.


I may be taking this to an extreme, but here is how I handle this:

  1. I use Aardvark [karmatics.com/aardvark/] to cut away all the parts of the page that I don’t want.
  2. I then use the make.text bookmarklet [homepage.mac.com/tjim/] to convert the HTML into Markdown [daringfireball.net/projects/markdown/].

I then import the notes as plain text. You don’t get the URL passed to the URL field this way; however, the URL of the page is preserved in the Markdown. This prevents DEVONthink from updating with the noisier version. Naturally, you lose all images this way, but I don’t work with images, so it doesn’t really affect me.
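
For anyone who would rather script this than click through it, here is a rough Python sketch of the same idea: drop the noisy parts of a page, keep the article text, and put the page URL at the top of the note so it is not lost. This is not what Aardvark or make.text do internally. The list of "noise" tags, the function name to_note and the "Source:" line are all my own assumptions, and the output is plain text rather than real Markdown.

```python
from html.parser import HTMLParser

# Assumption: these tags usually hold ads, navigation and other clutter.
# Adjust the set for the blogs you actually archive.
NOISE_TAGS = {"script", "style", "nav", "aside", "footer", "header", "form"}
# Tags whose closing marks the end of a paragraph-like unit.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "li", "blockquote", "div"}


class ArticleText(HTMLParser):
    """Collect text that is not nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a noise tag
        self.buffer = []      # text pieces of the current paragraph
        self.paragraphs = []  # finished paragraphs

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buffer.append(data)

    def flush(self):
        # Join the pieces and collapse runs of whitespace.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.paragraphs.append(text)
        self.buffer = []


def to_note(html, url):
    """Return a plain-text note with the source URL on the first line."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text that was not in a block tag
    body = "\n\n".join(parser.paragraphs)
    return f"Source: <{url}>\n\n{body}\n"
```

You would call it as to_note(page_html, page_url) and save the result as a .txt file for import. Like my manual workflow, it throws away images.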