Google Blog Search Plugin Issue

ajcowell · November 26, 2007, 8:06pm

Hi, I’m trying to harvest blog entries from certain websites via Google Blog Search. So, let’s say I want to harvest all the blog entries from Engadget. The following GBS URL seems to do a decent job:

blogsearch.google.com/blogsearch?as_q=
&num=100
&hl=en
&ctz=480
&c2coff=1
&as_epq=
&as_oq=
&as_eq=
&as_drrb=q
&as_qdr=a
&as_mind=1
&as_minm=1
&as_miny=2000
&as_maxd=26
&as_maxm=11
&as_maxy=2007
&lr=lang_en
&safe=active
&q=blogurl:www.engadget.com
&ie=UTF-8
&scoring=d
&filter=0

(broken down to each of the elements so you can see what I’m sending, full line below):

blogsearch.google.com/blogsearch … &scoring=d

Running this in my browser gives me the first 100 results out of 28,725, arranged in date order.

Here is the plugin I wrote:

<?xml version="1.0" encoding="UTF-8"?>
Description Google Blog Search Search
EncodingUrl UTF-8
EngineUrl http://blogsearch.google.com/blogsearch?as_q=_agentQuery_&num=_agentNumber_&hl=en&ctz=480&c2coff=1&btnG=Search+Blogs&as_epq=&as_oq=&as_eq=&bl_pt=&bl_bt=&bl_url=www.engadget.com&bl_auth=&as_drrb=q&as_qdr=a&as_mind=1&as_minm=1&as_miny=2000&as_maxd=26&as_maxm=11&as_maxy=2007&lr=lang_en&safe=active&scoring=d
Identifier gov.pnl.google_blog.plugin
Info Google Blog Search Plugin 1.0 ©2007 Pacific Northwest National Laboratory
Keyword googleblogsearch
LinksStart
LinksEnd
LinksNotMatching *.google.* *search?q=cache:*
Name Google Blog Search
OffsetPerPage 100
ResultsPerPage 100
Start 0
Version 1.0

Everything seems to work pretty well, although I can’t work out how to get it to go beyond the first results page. Based on little changes here and there I get between 50 and 90 results, but the search only ever hits around 103 pages. Looking at GBS, they only seem to provide up to the first 1000 hits, which would be fine (you can use date ranges to get them all in the end).
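The date-range idea can be sketched in Python. This is only an illustration: it reuses the as_mind/as_minm/as_miny and as_maxd/as_maxm/as_maxy parameter names from the URL above, the helper names month_windows and window_params are made up for the example, and whether GBS actually honours these fields for a given query is an assumption.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def month_windows(start, end):
    """Split [start, end] into consecutive windows, one per calendar month."""
    windows = []
    cur = start
    while cur <= end:
        # First day of the month after `cur`.
        if cur.month == 12:
            nxt = date(cur.year + 1, 1, 1)
        else:
            nxt = date(cur.year, cur.month + 1, 1)
        # The window ends on the last day of the month, or at `end` if sooner.
        windows.append((cur, min(nxt - timedelta(days=1), end)))
        cur = nxt
    return windows


def window_params(lo, hi):
    """Encode one window using the date fields from the URL in the post."""
    return urlencode({
        "as_mind": lo.day, "as_minm": lo.month, "as_miny": lo.year,
        "as_maxd": hi.day, "as_maxm": hi.month, "as_maxy": hi.year,
    })


for lo, hi in month_windows(date(2007, 10, 15), date(2007, 11, 26)):
    print(window_params(lo, hi))
```

Each window would then be run as its own search, so that no single search needs more than the first 1000 hits. A month is an arbitrary window size; a busy blog might need shorter windows.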

Is there something pretty simple I’m missing here? I read a little about LinksStart and LinksEnd but didn’t really understand. I used the Google plugin .plist as a template for this so I was hoping that most of the mechanics would just work out. Thanks for any help you can provide.

Andrew.

ajcowell · November 26, 2007, 8:18pm

Not sure if this is a solution or just a quick fix, but it looks like if I increase the number of links being followed (previously 0) that I do get more of the results. Not sure if this is due to linkages from the pages identified in the first 100 search results or not. I was hoping to avoid that as, of course, blogs link to other pages and I just want to end up with all the blog entries from the site in question (e.g., Engadget).

I can filter out a lot of them in post processing (does the site url contain engadget.com) but if there is a simple fix to the question above, I would certainly appreciate it!
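For what it's worth, the post-processing filter could look something like the Python sketch below. It assumes the harvested links are available as a plain list of absolute URLs; the function name is_from_site and the sample URLs are made up for the example. Comparing the host name, rather than checking whether the whole URL contains engadget.com, avoids keeping pages that merely mention the site in their path or query string.

```python
from urllib.parse import urlparse


def is_from_site(url, site="engadget.com"):
    """True if the URL's host is `site` or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == site or host.endswith("." + site)


# Hypothetical harvested links, for illustration only.
harvested = [
    "http://www.engadget.com/2007/11/26/some-post/",
    "http://example.com/story?ref=engadget.com",
    "http://notengadget.com/",
]

kept = [u for u in harvested if is_from_site(u)]
print(kept)
```

Only the first of the three sample links passes the filter.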

annard · November 26, 2007, 8:59pm

In the online help there is a chapter on writing your own plugin. There it also explains how to go beyond the first page of search results. Search for the keyword “EngineUrl” in the online help for DEVONagent. Please read that and then the following paragraph should make more sense.

The trick is to compare the URL of the first results page with the URL of the (say 3rd) results page. If these are completely different you can use the “EngineNextUrl” keyword, otherwise it normally only takes 1 parameter that you can catch with the “agentOffset” variable.
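As a rough illustration of that comparison, here is a small Python sketch that diffs the query strings of two results-page URLs and lists the offsets a paged search would step through. The two example URLs are shortened, and the start parameter in them is an assumption about what the paged URL looks like, not something taken from the plugin; changed_params and page_offsets are hypothetical helper names.

```python
from urllib.parse import urlparse, parse_qs


def changed_params(url_a, url_b):
    """Return the query parameters whose values differ between two URLs."""
    qa = parse_qs(urlparse(url_a).query, keep_blank_values=True)
    qb = parse_qs(urlparse(url_b).query, keep_blank_values=True)
    return {
        key: (qa.get(key), qb.get(key))
        for key in sorted(set(qa) | set(qb))
        if qa.get(key) != qb.get(key)
    }


def page_offsets(offset_per_page, max_results):
    """Offsets for successive result pages, e.g. 0, 100, 200, ..."""
    return list(range(0, max_results, offset_per_page))


# Shortened example URLs; "start" is an assumed offset parameter.
page1 = "http://blogsearch.google.com/blogsearch?q=blogurl:www.engadget.com&num=100&scoring=d"
page3 = page1 + "&start=200"

print(changed_params(page1, page3))
print(page_offsets(100, 1000))
```

If only one parameter shows up in the difference, that is the one to fill with the offset variable in EngineUrl. If the two URLs have little in common, EngineNextUrl is the keyword to look at instead.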

ajcowell · November 26, 2007, 10:01pm

Awesome, I knew there was something missing. I remember reading the plugin chapter a little while ago. I knew I had the OffsetPerPage in there, but forgot to include it in the EngineUrl.

Thanks! Maybe consider including this in the blog section; I seem to get much better results with this plugin than with the other three combined.

ajcowell · November 27, 2007, 12:22am

PS: One potential reason why DT might choose not to include such a plugin is due to Google keeping a close check on hits to their servers. After using the plugin for a while I start to get ‘hey, you seem to be doing something automated, you probably shouldn’t do that’ messages. I think at one time they used to have something in their terms of service about using automated means to capture search results. So, if anyone decides to use this, they might want to use it sparingly.

kewms · November 27, 2007, 7:37am

If you use automated means to capture search results, then you could at least theoretically set up your own search service, piggybacking on Google’s huge infrastructure investments while siphoning off the resulting ad revenue. So yes, I can see how they might become annoyed.

OTOH, they do offer a variety of tools for people who want to build custom searches, either for personal use or for web sites. So they’re not completely opposed; you just need to play by the rules.

Katherine

PS Full disclosure: My husband works for Google. But he’s asleep right now, and even if I asked he would send me to the public information pages.