Trying to Write First Plugin for Google Federal Case Law

AZPatentAttorney · January 18, 2010, 1:04pm

Hi,

I’m new to DevonAgent (and DevonThink too). I’m partly reporting my experience so far and partly requesting help. Here is the code I have for my Google Scholar plugin for Federal Case Law:

Don’t use this; go to the end for a better version.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>Description</key>
	<string>Google Federal Case law</string>
	<key>EncodingUrl</key>
	<string>UTF-8</string>
	<key>EngineUrl</key>
	<string>http://scholar.google.com/scholar?start=_agentOffset_&amp;q=_agentQuery_&amp;hl=en&amp;as_sdt=2003&amp;as_vis=1</string>
	<key>Identifier</key>
	<string>com.devon-technologies.GoogleFederalCaseLaw.plugin</string>
	<key>Info</key>
	<string>Google Federal Case law</string>
	<key>LinksEnd</key>
	<string>&lt;font size=-1&gt;Result Page: &lt;/font&gt;</string>
	<key>LinksMatching</key>
	<array>
		<string>*/scholar_case*</string>
		<string>*/scholar?cites*</string>
		<string>*/scholar?q=related*</string>
	</array>
	<key>LinksNotMatching</key>
	<array>
		<string>*images.google.com/images*</string>
		<string>*video.google.com/videosearch*</string>
		<string>*news.google.com/news*</string>
		<string>*google.com/products*</string>
		<string>*mail.google.com/mail*</string>
		<string>*www.google.com*</string>
		<string>*books.google.com/books*</string>
		<string>*translate.google.com*</string>
		<string>*youtube.com*</string>
		<string>*docs.google.com*</string>
		<string>*sites.google.com*</string>
		<string>*groups.google.com*</string>
	</array>
	<key>LinksStart</key>
	<string>&lt;font size=-1&gt;Results &lt;b&gt;</string>
	<key>Name</key>
	<string>Google Federal Case Law</string>
	<key>OffsetPerPage</key>
	<integer>10</integer>
	<key>ResultsPerPage</key>
	<integer>10</integer>
	<key>Start</key>
	<integer>1</integer>
	<key>Version</key>
	<string>1.0</string>
</dict>
</plist>

This code returns results. The Log tab shows many errors. I have defined LinksStart, LinksEnd, LinksMatching, and LinksNotMatching. I’m just unsure if it is working properly.
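One way to check whether marker keys such as LinksStart and LinksEnd can match the page at all is to look at the strings as they read after XML decoding. The sketch below is only an illustration: the two-key plist and the helper names `decoded_markers` and `looks_double_escaped` are made up for the example, and it assumes the plugin compares the decoded string literally against the page's HTML.

```python
import plistlib

# Hypothetical two-key plist: one marker escaped once, one escaped twice.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
	<key>EscapedOnce</key>
	<string>&lt;font size=-1&gt;Results &lt;b&gt;</string>
	<key>EscapedTwice</key>
	<string>&amp;lt;font size=-1&amp;gt;Results &amp;lt;b&amp;gt;</string>
</dict>
</plist>
"""


def decoded_markers(plist_bytes):
    """Return every string value as it reads after XML decoding."""
    data = plistlib.loads(plist_bytes)
    return {key: value for key, value in data.items() if isinstance(value, str)}


def looks_double_escaped(value):
    """A decoded value that still contains entities was escaped twice."""
    return any(entity in value for entity in ("&lt;", "&gt;", "&amp;"))


markers = decoded_markers(SAMPLE)
for key, value in markers.items():
    status = "double-escaped" if looks_double_escaped(value) else "ok"
    print(key, repr(value), status)
```

A value flagged as double-escaped still contains literal `&lt;` text after decoding, so it would not equal the `<font ...>` markup found in a results page.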

The digest tab always shows certain topics, even though I have excluded the links: Youtube, Photos, Maps, Images, Groups, Gmail, Blogs, Translate, Reader, Calendar, Books, Video, Shopping, Finance, Sign, Documents, Scholar. I experimented with the TitleStart, TitleEnd, TextStart, and TextEnd keys, but that did not help.

Also, while developing this, Google eventually set up a page that said, in essence, that I had made too many automated queries. Is it normal to get such a message when working with a tool like this?

Thanks in advance for any help.

cgrunenberg · January 19, 2010, 10:00am

You should exclude the cites/about/related/cluster links too, otherwise 4 links per result are queried…

…and therefore the plugin downloads up to 400 pages from Google.

Using TextStart/TextEnd usually improves the digest & the topics. To exclude certain words, use the NoTopics key in the plugin (this will be simplified by v2.5).
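The effect of the link patterns can be estimated offline. The sketch below is a rough model, not DEVONagent's actual matcher: it assumes that `*` is the only wildcard, that a link is followed when it matches a LinksMatching pattern and no LinksNotMatching pattern, and that a result carries links shaped like the hypothetical URLs in `RESULT_LINKS`.

```python
import re


def wildcard_match(pattern, url):
    """Match url against a pattern in which only '*' is a wildcard (assumption)."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, url) is not None


def is_followed(url, links_matching, links_not_matching=()):
    """Assumed rule: follow a link that matches an include and no exclude pattern."""
    included = any(wildcard_match(p, url) for p in links_matching)
    excluded = any(wildcard_match(p, url) for p in links_not_matching)
    return included and not excluded


# Hypothetical links attached to a single search result.
RESULT_LINKS = [
    "http://scholar.google.com/scholar_case?case=123",
    "http://scholar.google.com/scholar?cites=123",
    "http://scholar.google.com/scholar?q=related:abc",
    "http://scholar.google.com/scholar_case?about=123",
    "http://scholar.google.com/scholar?cluster=123",
]

# LinksMatching from the first plugin, and a narrower alternative.
BROAD = ["*/scholar_case*", "*/scholar?cites*", "*/scholar?q=related*"]
NARROW = ["*/scholar_case?case*"]

broad_count = sum(is_followed(u, BROAD) for u in RESULT_LINKS)
narrow_count = sum(is_followed(u, NARROW) for u in RESULT_LINKS)
print("links followed per result:", broad_count, "vs", narrow_count)
```

Under these assumptions the broad patterns follow four of the five sample links per result, while the narrow pattern follows only the case opinion itself.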

AZPatentAttorney · January 23, 2010, 1:00pm

Ok. This should reflect your suggestions.

If you try to copy and paste this, you need to convert the > and < to the single character equivalents whenever you find them inside the XML tags, for example, DateStart, DateEnd, LinksStart, LinksEnd, TextStart, TextEnd, TitleStart, TitleEnd.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>DateEnd</key>
	<string>&lt;/center&gt;

&lt;p&gt;</string>
	<key>DateStart</key>
	<string>&lt;/b&gt;&lt;/p&gt;&lt;/center&gt;

&lt;center&gt;</string>
	<key>Description</key>
	<string>Google Federal Case law</string>
	<key>EncodingUrl</key>
	<string>UTF-8</string>
	<key>EngineUrl</key>
	<string>http://scholar.google.com/scholar?start=_agentOffset_&amp;q=_agentQuery_&amp;hl=en&amp;as_sdt=2003&amp;as_vis=1</string>
	<key>Identifier</key>
	<string>com.devon-technologies.GoogleFederalCaseLaw.plugin</string>
	<key>Info</key>
	<string>Google Federal Case law</string>
	<key>LinksEnd</key>
	<string>&lt;font size=-1&gt;Result Page: &lt;/font&gt;</string>
	<key>LinksMatching</key>
	<array>
		<string>*/scholar_case?case*</string>
	</array>
	<key>LinksNotMatching</key>
	<array>
		<string>*/scholar?q=related*</string>
		<string>*/scholar_case?about*</string>
		<string>*/scholar?cluster*</string>
		<string>*/scholar?cites*</string>
		<string>*images.google.com/images*</string>
		<string>*video.google.com/videosearch*</string>
		<string>*news.google.com/news*</string>
		<string>*google.com/products*</string>
		<string>*mail.google.com/mail*</string>
		<string>*www.google.com*</string>
		<string>*books.google.com/books*</string>
		<string>*translate.google.com*</string>
		<string>*youtube.com*</string>
		<string>*docs.google.com*</string>
		<string>*sites.google.com*</string>
		<string>*groups.google.com*</string>
	</array>
	<key>LinksStart</key>
	<string>&lt;font size=-1&gt;Results &lt;b&gt;</string>
	<key>Name</key>
	<string>Google Federal Case Law</string>
	<key>NoText</key>
	<array>
		<string>Youtube</string>
		<string>Photos</string>
		<string>Maps</string>
		<string>Images</string>
		<string>Groups</string>
		<string>Gmail</string>
		<string>Blogs</string>
		<string>Translate</string>
		<string>Reader</string>
		<string>Calendar</string>
		<string>Books</string>
		<string>Video</string>
		<string>Shopping</string>
		<string>Finance</string>
		<string>Sign</string>
		<string>Documents</string>
		<string>Scholar</string>
	</array>
	<key>OffsetPerPage</key>
	<integer>10</integer>
	<key>ResultsPerPage</key>
	<integer>10</integer>
	<key>Start</key>
	<integer>1</integer>
	<key>TextEnd</key>
	<string>&lt;script type="text/javascript"&gt;</string>
	<key>TextStart</key>
	<string>&lt;div id="gsl_opinion"&gt;</string>
	<key>TitleEnd</key>
	<string>&lt;/h2&gt;</string>
	<key>TitleStart</key>
	<string>&lt;h2 class="gsl_title"&gt;</string>
	<key>Version</key>
	<string>1.0</string>
</dict>
</plist>

It seems to find the title, date, and text of each case, and it excludes the unwanted links. The search results are now very usable.

However, after 2-3 queries, Google still blocks access. Then I have to wait a day or two before I can use Google Scholar again, even from my regular web browser. Regular Google queries continue to work fine. So, I’m still wondering if there is some bug. How can I tell what is actually being sent to Google?
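Short of capturing real traffic, the search-page requests can at least be reconstructed from the EngineUrl template. The sketch below does not observe anything DEVONagent sends. It assumes that `_agentQuery_` is replaced by the URL-encoded query and that `_agentOffset_` is computed as Start plus the page index times OffsetPerPage; the function name `expand_engine_url` and the sample query are made up for the example.

```python
from urllib.parse import quote_plus

# EngineUrl as it reads after XML decoding (&amp; becomes &).
ENGINE_URL = (
    "http://scholar.google.com/scholar?start=_agentOffset_"
    "&q=_agentQuery_&hl=en&as_sdt=2003&as_vis=1"
)


def expand_engine_url(template, query, page, start=1, offset_per_page=10):
    """Fill in the placeholders for one results page (assumed offset formula)."""
    offset = start + page * offset_per_page
    return (
        template
        .replace("_agentOffset_", str(offset))
        .replace("_agentQuery_", quote_plus(query))
    )


for page in range(3):
    print(expand_engine_url(ENGINE_URL, "patent exhaustion", page))
```

Pasting one of the printed URLs into a browser shows what a single search-page request would return. It says nothing about the follow-up requests for each linked page, which is where most of the traffic comes from.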

I was hoping to use this as a model for a series of 52 or so plugins, each of which would query within the scope of one jurisdiction (each individual US state, federal, etc.).
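A series like that could be generated from one template rather than edited by hand. The sketch below is only an outline: it assumes the jurisdiction is selected solely by the `as_sdt` parameter, only the federal value 2003 comes from the plugin above, the Arizona entry is a placeholder and not a real code, and the `com.example` identifiers and the `build_plugin` helper are hypothetical.

```python
import plistlib

# Keys shared by every generated plugin (values taken from the plugin above).
BASE = {
    "EncodingUrl": "UTF-8",
    "LinksMatching": ["*/scholar_case?case*"],
    "OffsetPerPage": 10,
    "ResultsPerPage": 10,
    "Start": 1,
    "Version": "1.0",
}

ENGINE_TEMPLATE = (
    "http://scholar.google.com/scholar?start=_agentOffset_"
    "&q=_agentQuery_&hl=en&as_sdt={scope}&as_vis=1"
)

# Only "2003" is from the thread; the Arizona value is a placeholder.
SCOPES = {"Federal": "2003", "Arizona": "SCOPE_CODE_FOR_ARIZONA"}


def build_plugin(jurisdiction, scope):
    """Return the XML plist bytes for one jurisdiction."""
    name = f"Google {jurisdiction} Case Law"
    plugin = dict(BASE)
    plugin.update({
        "Name": name,
        "Description": name,
        "Info": name,
        "Identifier": f"com.example.Google{jurisdiction.replace(' ', '')}CaseLaw.plugin",
        "EngineUrl": ENGINE_TEMPLATE.format(scope=scope),
    })
    return plistlib.dumps(plugin)  # escapes & and < > for XML automatically


plugins = {name: build_plugin(name, scope) for name, scope in SCOPES.items()}
print(plugins["Federal"].decode("utf-8"))
```

Letting `plistlib` write the file also sidesteps hand-escaping mistakes, since `&`, `<`, and `>` are encoded on output.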

Thanks in advance for any help.

AZPatentAttorney · January 24, 2010, 7:43pm

I removed the “LinksNotMatching” array. This might have fixed whatever caused Google to treat the searches as automated queries. I was able to perform 3-4 queries in under an hour that returned 500-600 pages. The searches were all related, and I don’t know how much was actually retrieved from Google.

If you try to copy and paste this, you need to convert the > and < to the single character equivalents whenever you find them inside the XML tags, for example, DateStart, DateEnd, LinksStart, LinksEnd, TextStart, TextEnd, TitleStart, TitleEnd.
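
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>DateEnd</key>
	<string>&lt;/center&gt;

&lt;p&gt;</string>
	<key>DateStart</key>
	<string>&lt;/b&gt;&lt;/p&gt;&lt;/center&gt;

&lt;center&gt;</string>
	<key>Description</key>
	<string>Google Federal Case Law Only</string>
	<key>EncodingUrl</key>
	<string>UTF-8</string>
	<key>EngineUrl</key>
	<string>http://scholar.google.com/scholar?start=_agentOffset_&amp;q=_agentQuery_&amp;hl=en&amp;as_sdt=2003&amp;as_vis=1</string>
	<key>Identifier</key>
	<string>com.devon-technologies.GoogleFederalCaseLawMatchOnly.plugin</string>
	<key>Info</key>
	<string>Google Federal Case Law Only - searches federal case law on Google Scholar</string>
	<key>LinksEnd</key>
	<string>&lt;font size=-1&gt;Result Page: &lt;/font&gt;</string>
	<key>LinksMatching</key>
	<array>
		<string>*/scholar_case?case*</string>
	</array>
	<key>LinksStart</key>
	<string>&lt;font size=-1&gt;Results &lt;b&gt;</string>
	<key>Name</key>
	<string>Google Federal Case Law Only</string>
	<key>NoText</key>
	<array>
		<string>Youtube</string>
		<string>Photos</string>
		<string>Maps</string>
		<string>Images</string>
		<string>Groups</string>
		<string>Gmail</string>
		<string>Blogs</string>
		<string>Translate</string>
		<string>Reader</string>
		<string>Calendar</string>
		<string>Books</string>
		<string>Video</string>
		<string>Shopping</string>
		<string>Finance</string>
		<string>Sign</string>
		<string>Documents</string>
		<string>Scholar</string>
	</array>
	<key>OffsetPerPage</key>
	<integer>10</integer>
	<key>ResultsPerPage</key>
	<integer>10</integer>
	<key>Start</key>
	<integer>1</integer>
	<key>TextEnd</key>
	<string>&lt;script type="text/javascript"&gt;</string>
	<key>TextStart</key>
	<string>&lt;div id="gsl_opinion"&gt;</string>
	<key>TitleEnd</key>
	<string>&lt;/h2&gt;</string>
	<key>TitleStart</key>
	<string>&lt;h2 class="gsl_title"&gt;</string>
	<key>Version</key>
	<string>1.0</string>
</dict>
</plist>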

AZPatentAttorney · January 28, 2010, 4:13pm

I guess I am learning just how tricky it can be to create a plugin.

This version tries to prevent following YouTube links, which are appearing for reasons that I do not understand.

However, I am happy to report that I have not been booted off Google in some time.

If you try to copy and paste this, you need to convert the > and < to the single character equivalents whenever you find them inside the XML tags, for example, DateStart, DateEnd, LinksStart, LinksEnd, TextStart, TextEnd, TitleStart, TitleEnd.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>DateEnd</key>
	<string>&lt;/center&gt;

&lt;p&gt;</string>
	<key>DateStart</key>
	<string>&lt;/b&gt;&lt;/p&gt;&lt;/center&gt;

&lt;center&gt;</string>
	<key>Description</key>
	<string>Google Federal Case Law Only</string>
	<key>EncodingUrl</key>
	<string>UTF-8</string>
	<key>EngineUrl</key>
	<string>http://scholar.google.com/scholar?start=_agentOffset_&amp;q=_agentQuery_&amp;hl=en&amp;as_sdt=2003&amp;as_vis=1</string>
	<key>Identifier</key>
	<string>com.devon-technologies.GoogleFederalCaseLawMatchOnly.plugin</string>
	<key>Info</key>
	<string>Google Federal Case Law Only - searches federal case law on Google Scholar</string>
	<key>LinksEnd</key>
	<string>&lt;font size=-1&gt;Result Page: &lt;/font&gt;</string>
	<key>LinksMatching</key>
	<array>
		<string>*scholar.google.com/scholar_case?case*</string>
	</array>
	<key>LinksStart</key>
	<string>&lt;font size=-1&gt;Results &lt;b&gt;</string>
	<key>Name</key>
	<string>Google Federal Case Law Only</string>
	<key>NoText</key>
	<array>
		<string>Youtube</string>
		<string>Photos</string>
		<string>Maps</string>
		<string>Images</string>
		<string>Groups</string>
		<string>Gmail</string>
		<string>Blogs</string>
		<string>Translate</string>
		<string>Reader</string>
		<string>Calendar</string>
		<string>Books</string>
		<string>Video</string>
		<string>Shopping</string>
		<string>Finance</string>
		<string>Sign</string>
		<string>Documents</string>
		<string>Scholar</string>
	</array>
	<key>OffsetPerPage</key>
	<integer>10</integer>
	<key>ResultsPerPage</key>
	<integer>10</integer>
	<key>Start</key>
	<integer>1</integer>
	<key>TextEnd</key>
	<string>&lt;script type="text/javascript"&gt;</string>
	<key>TextStart</key>
	<string>&lt;div id="gsl_opinion"&gt;</string>
	<key>TitleEnd</key>
	<string>&lt;/h2&gt;</string>
	<key>TitleStart</key>
	<string>&lt;h2 class="gsl_title"&gt;</string>
	<key>Version</key>
	<string>1.0</string>
</dict>
</plist>

cgrunenberg · February 4, 2010, 12:54pm

In this case all results are hosted on Google’s servers whereas the results of a simple Google search are usually not hosted by Google’s servers. But disabling the option to ignore robot instructions might improve things.
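For readers unfamiliar with robot instructions, the sketch below shows in general terms what honoring them means for a crawler. It uses Python's standard `urllib.robotparser` and is not DEVONagent's implementation. The rules in `SAMPLE_ROBOTS` are hypothetical and are not a copy of any real site's robots.txt; the agent name and the example.com URLs are likewise made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, not taken from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /scholar
"""


def allowed(robots_text, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = allowed(SAMPLE_ROBOTS, "ExampleAgent", "http://example.com/scholar?q=test")
permitted = allowed(SAMPLE_ROBOTS, "ExampleAgent", "http://example.com/about.html")
print(blocked, permitted)
```

A crawler that honors such rules skips the disallowed paths entirely, which lowers the number of automated requests a server sees.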