Is there any chance we’ll get support for hiding elements in Web Archives (think of the way Adblock Plus hides advertisements), or some other solution to the problem of cluttered web pages? News articles in particular carry tons of terms, such as section names like "World", "Economics", "Weather" and so on, that are unrelated to the topic at hand but numerous enough to cloud the AI’s ability to classify information correctly and make connections.
I’ve tried using the webclipper to capture just RTF versions of webpages but it never, ever gets the actual article text and usually just grabs an advertisement or nothing.
That’s an interesting question. I have similar issues with some large databases of newspaper articles: often my content is on a full-page scan with movie listings, obituaries, whatever…
My approach, which fits in perfectly with my research method, is to highlight my reading in Skim (I’m dealing with scanned, OCRed PDFs) and maintain a separate database with the text exports of my highlighting. I’ve got links back to the original articles, so if I need to go back to the full text, I can.
This works really well for me because I’m dealing with a smaller subset of my data, and can always go back (in DTPO) and bring in more from the “originals” if it seems worthwhile.
I imagine you could convert your web materials to PDF and use Skim if you aren’t primarily concerned with synchronizing to updated versions on the Internet.
The ads aren’t so much the problem. It’s the drop-down menus, story previews, section headings and whatnot found on pretty much every website, which are likely to wreck the AI’s ability to classify information because they add so much unrelated content.
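For anyone comfortable with scripting, here is a minimal sketch of the kind of filtering being asked for, applied to a saved HTML file before it goes into a database. It is not a DEVONthink feature. It assumes the page marks its menus and section lists with tags such as nav, header, footer and aside, which many sites do not. The tag list and the names ArticleTextExtractor and extract_article_text are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Assumed list of container tags treated as page clutter rather than article text.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style", "form"}


class ArticleTextExtractor(HTMLParser):
    """Collects text that is not nested inside any tag listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # number of skipped containers currently open
        self.chunks = []     # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when we are outside every skipped container.
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_article_text(html):
    """Return the page text with menus and similar clutter left out."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

The result could be saved as a plain text file and imported alongside a link to the original page. Sites that build their menus out of plain div elements would need a different rule, so treat this as a starting point rather than a general fix.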
I usually avoid all the extraneous content on a Web page by selecting only the portion of the page that I wish to capture, such as a journal article, and capturing as rich text using the keyboard shortcut Command-) for that Service.
I could also capture just that selected area of the Web page as a WebArchive by using the keyboard shortcut Command-% for that Service.
Note: Under Snow Leopard the user must enable those Services in System Preferences; once that has been done they are available in Cocoa browsers. Snow Leopard users will find Services available as a contextual menu, including an option for Preferences to add or remove available Services.
Many pages also offer a print version, which is an alternative to selecting an area when I want to capture only the article. That’s especially convenient for articles that are continued in segments across several pages. I then make my RTF or WebArchive capture from the printer-friendly version.
That’s why I don’t use Firefox when grabbing information from the Web into my databases. Perhaps one of these days the developers of Firefox will join the Mac community and add Services compatibility for interapplication communication.