Metadata scraper

Ajay · January 9, 2009, 4:13pm

I would love to have DA scrape metadata from a page that interested me.

For instance, here is a science news story from BBC News (which I already have in a DTPro database). Meta tags in the header contain (among other things) the publication date, the headline, the content type (the fact that it’s a story), and a description of the story. Two tags contain the reporter’s byline, and an element at the bottom of the story contains the reporter’s email address. I would love to be able to capture all that metadata during download and have it mapped automatically to Dublin Core and FOAF elements.

I imagine I could script the extraction of this metadata myself after download (I’m already looking at how to do this to my existing DTPro 1.5.4 records) but it would be more efficient to scrape the page at download. Also in the interest of efficiency, scraping wouldn’t have to happen during the initial search, but could be an option after I’d read a page summary in DA, maybe even a background process for pages in the archive.
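For anyone curious what the do-it-yourself version might look like, here is a minimal Python sketch using only the standard library. The meta tag names in the sample page and the META_TO_DC mapping are assumptions made up for illustration, not the names any particular site uses, and FOAF mapping of the byline is left out.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content:
            # Meta names are compared case-insensitively.
            self.meta[name.lower()] = content.strip()


# Hypothetical mapping from site-specific meta names to Dublin Core elements.
META_TO_DC = {
    "headline": "dc:title",
    "description": "dc:description",
    "originalpublicationdate": "dc:date",
    "contenttype": "dc:type",
}


def scrape_dublin_core(html_text):
    """Return a dict of Dublin Core elements found in the page's meta tags."""
    collector = MetaCollector()
    collector.feed(html_text)
    return {
        dc_name: collector.meta[meta_name]
        for meta_name, dc_name in META_TO_DC.items()
        if meta_name in collector.meta
    }


# Made-up sample page; a real page would use its own tag names.
SAMPLE_PAGE = """
<html><head>
<meta name="Headline" content="Example science story" />
<meta name="Description" content="A short summary of the story." />
<meta name="OriginalPublicationDate" content="2009/01/09 12:00:00" />
<meta name="ContentType" content="STORY" />
</head><body><p>Story text.</p></body></html>
"""

record = scrape_dublin_core(SAMPLE_PAGE)
print(record)
```

Each site would need its own mapping table, which is part of why having this built into the download step would be nicer than maintaining scripts by hand.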

There. It’s a big request, but I believe it’s going to become more important as interest grows in repurposing data for semantic publishing.

Cheers.

Trillium · October 15, 2010, 7:04pm

I second that request. Many pages have embedded metadata, often in standard formats such as Dublin Core, that could be useful for tagging.

Here is the DC spec for tagging web pages…
dublincore.org/documents/dc-html/

(Scientific sites often do use the metadata properly.)

If a page lacks the metatags, perhaps dates in the page text could also be recognized, and a popup could ask whether you want to use them as the page metadata.
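As a rough illustration of the date-recognition idea, here is a Python sketch that pulls candidate dates out of plain page text. It is only an assumption about how this could work: it covers three English-language formats, and the function name find_candidate_dates is made up for the example. The popup step is not shown; the function just returns the candidates to offer.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}
MONTH_NAMES = "|".join(MONTHS)

PATTERNS = [
    # ISO style: 2009-01-09
    re.compile(r"\b(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\b"),
    # Day first: 9 January 2009
    re.compile(
        r"\b(?P<day>\d{1,2}) (?P<month>" + MONTH_NAMES + r") (?P<year>\d{4})\b",
        re.IGNORECASE,
    ),
    # Month first: January 9, 2009
    re.compile(
        r"\b(?P<month>" + MONTH_NAMES + r") (?P<day>\d{1,2}), (?P<year>\d{4})\b",
        re.IGNORECASE,
    ),
]


def find_candidate_dates(text):
    """Return the dates found in text, in order of appearance."""
    found = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            month_text = match.group("month").lower()
            month = MONTHS[month_text] if month_text in MONTHS else int(month_text)
            try:
                candidate = date(int(match.group("year")), month,
                                 int(match.group("day")))
            except ValueError:
                # Skip impossible dates such as 2009-13-45.
                continue
            found.append((match.start(), candidate))
    return [candidate for _, candidate in sorted(found)]


sample = "Published 9 January 2009. Updated January 12, 2009; archived 2009-02-01."
print(find_candidate_dates(sample))
```

A page can mention many dates that have nothing to do with publication, which is why asking the user to confirm seems safer than picking one automatically.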

Web servers also often send a last-modified date (the HTTP Last-Modified header).
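Reading that date is straightforward. Here is a small Python sketch, assuming the standard HTTP date format; the helper names parse_last_modified and fetch_last_modified are made up for the example, and a server may leave the header out entirely.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def parse_last_modified(header_value):
    """Turn a Last-Modified header value into a datetime, or None."""
    if not header_value:
        return None
    try:
        return parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Unparseable header values are treated as missing.
        return None


def fetch_last_modified(url):
    """Ask the server for headers only and return the parsed date, if any."""
    request = Request(url, method="HEAD")
    with urlopen(request) as response:
        return parse_last_modified(response.headers.get("Last-Modified"))


# Offline example with a made-up header value.
modified = parse_last_modified("Fri, 09 Jan 2009 16:13:00 GMT")
print(modified)
```

Note that this date says when the server's copy last changed, which is not necessarily when the story was published.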