Hiding Elements in Web Archives

Future51 · September 5, 2009, 10:38am

Is there any chance we’ll get support for hiding elements in Web Archives (think of the way Adblock Plus hides advertisements), or some other solution to the following problem? Many web pages, especially news articles, are full of terms (section names like “World”, “Economics”, “Weather”, etc.) that are unrelated to the topic at hand but numerous enough to cloud the AI’s ability to classify information correctly and make connections.

I’ve tried using the webclipper to capture just RTF versions of webpages but it never, ever gets the actual article text and usually just grabs an advertisement or nothing.

cturner · September 5, 2009, 11:39am

That’s an interesting question. I have similar issues with some large databases of newspaper articles: often my content is on a full-page scan with movie listings, obituaries, whatever…

My approach, which fits in perfectly with my research method, is to highlight my reading in Skim (I’m dealing with scanned, OCRed PDFs) and maintain a separate database with the text exports of my highlighting. I’ve got links back to the original articles, so if I need to go back to the full text, I can.

This works really well for me because I’m dealing with a smaller subset of my data, and can always go back (in DTPO) and bring in more from the “originals” if it seems worthwhile.

I imagine you could convert your web materials to PDF and use Skim if you aren’t primarily concerned with synchronizing to updated versions on the Internet.

HTH, Charles

KP1 · September 5, 2009, 12:29pm

Have you tried GlimmerBlocker? It works as an http proxy, so it never allows the ads to load. In my experience, it works fine with DT.

I’ve also had some luck with Readability. But most of the time it loads the article and cuts out the pictures (whether they are ads or not).

Good Luck.

Future51 · September 5, 2009, 1:15pm

It’s not so much the ads that are the problem. It’s the drop-down menus, story previews, section headings and whatnot that exist on pretty much every website. That breadth of unrelated content is likely to wreck the AI’s ability to classify information.

Johannes · September 5, 2009, 1:58pm

Did you select the text before capturing? That works fine for me in combination with capturing RTF from the Services menu.

Another thing: You can edit webarchives in DTP and cut out everything you don’t want. But of course this is a lot of manual work for some web pages.

Johannes
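The manual clean-up described above could in principle be automated. The following is only a rough sketch in Python, assuming the page's HTML is already available as a string (for example, exported from the archive). The tag list and the id/class markers are hypothetical guesses, not anything DEVONthink provides, and would need adjusting per site.

```python
from html.parser import HTMLParser

# Assumption: elements with these tag names rarely contain article text.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Hypothetical substrings of id/class values that suggest page furniture.
SKIP_MARKERS = ("nav", "menu", "sidebar", "footer", "related")
# Void elements never have a closing tag, so they cannot open a skipped region.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ArticleTextExtractor(HTMLParser):
    """Collect text, ignoring everything inside unwanted elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_tag = None  # tag name that opened the region being skipped
        self.depth = 0        # nesting depth of that same tag name
        self.chunks = []

    def _is_unwanted(self, tag, attrs):
        if tag in SKIP_TAGS:
            return True
        labels = " ".join(v.lower() for k, v in attrs
                          if k in ("id", "class") and v)
        return any(marker in labels for marker in SKIP_MARKERS)

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is None:
            if tag not in VOID_TAGS and self._is_unwanted(tag, attrs):
                self.skip_tag, self.depth = tag, 1
        elif tag == self.skip_tag:
            # Same tag nested inside the skipped region, e.g. div in div.
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.skip_tag:
            self.depth -= 1
            if self.depth == 0:
                self.skip_tag = None

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_tag is None:
            self.chunks.append(text)


def extract_text(html):
    """Return the text of `html` with menus, scripts, footers etc. removed."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Only the tag that opened a skipped region is counted, so unclosed `<li>` or `<p>` tags inside a menu do not throw the nesting off. Substring matching on id/class is crude and will occasionally drop wanted content, so the marker list is the part to tune.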

Bill_DeVille · September 5, 2009, 3:19pm

I usually avoid all the extraneous content on a Web page by selecting only the portion of the page that I wish to capture, such as a journal article, and capturing as rich text using the keyboard shortcut Command-) for that Service.

I could also capture just that selected area of the Web page as a WebArchive by using the keyboard shortcut Command-% for that Service.

Note: Under Snow Leopard the user must enable those Services in System Preferences; once that has been done they are available in Cocoa browsers. Snow Leopard users will find Services available as a contextual menu, including an option for Preferences to add or remove available Services.

Many pages offer an alternative to selecting an area: a Print version that contains only the desired article. That’s especially convenient for articles that are continued in segments over several pages. Then I make my RTF or WebArchive capture of the printer-friendly version of the article.

Future51 · September 5, 2009, 3:54pm

I’d love to use the Services menu, but one of the most popular, and arguably the most easily extensible, browsers (Firefox) doesn’t support Services at this time.

I don’t blame DEVONtechnologies for that, which is why I’m here looking for solutions.

Thanks for the heads up on Readability. It’s almost what I’m looking for, although it’s not quite perfect.

Bill_DeVille · September 5, 2009, 4:50pm

That’s why I don’t use Firefox when grabbing information from the Web into my databases. Perhaps one of these days the developers of Firefox will join the Mac community and add Services compatibility for interapplication communication.

sjk · September 5, 2009, 6:06pm

Apparently Services will be supported in Firefox 3.6, currently available in nightly downloads:

http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla-1.9.2/

sjk · September 5, 2009, 6:33pm

Also, under the Web tab of Preferences you can select a Style Sheet to use.
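If a custom style sheet is the route taken, the rules themselves would be plain CSS `display: none` declarations. Here is a small sketch in Python that generates such a sheet. The selectors are made up for illustration and depend entirely on the site's markup. Whether hiding elements this way also keeps their text away from the AI is an open question, not something confirmed in this thread.

```python
def build_hide_stylesheet(selectors):
    """Return CSS that hides every element matched by the given selectors."""
    cleaned = [s.strip() for s in selectors if s.strip()]
    if not cleaned:
        return ""
    # One grouped rule; !important lets it win over the page's own styles.
    return ",\n".join(cleaned) + " {\n  display: none !important;\n}\n"


if __name__ == "__main__":
    # Hypothetical selectors for a news site's menus and story previews.
    print(build_hide_stylesheet(["#nav", ".sidebar", ".story-preview"]))
```

The output can be saved to a `.css` file and then chosen as the style sheet.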