One of the most important aspects of using plugins is to understand what the different search engines represent. I’m sure we’ve all tried the “select them all” approach, and been horrified by the results.
The list of ‘Search’ plugins is particularly difficult to make sense of, partly because of the huge number of unknown engines (e.g. the many meta search engines which should be under the ‘meta’ category).
Your tool in understanding these plugins is here:
http://searchenginewatch.com/
In looking at their Major Search Engines overview (http://searchenginewatch.com/showPage.html?page=2156221) and their other global engines page (http://searchenginewatch.com/showPage.html?page=2156281), we learn a few things:
- Hotbot is Google, Yahoo, and Teoma
- Altavista is Yahoo
- Wisenut is Looksmart
- Gigablast is unique (who knew?)
- Lycos is Looksmart and Yahoo
- MSN is (was?) Looksmart and Yahoo
- Excite is Overture and Inktomi
The information on other engines can usually be found by going to them directly and checking their About pages. This provides some additional info:
- Alexa is unique
- Alltheweb is Yahoo
- Ask is Teoma
- Complete Planet is a collection of specialized databases and engines
…well, it would take all night to look them all up. The only two observations I have to make at this point are 1) if you are going for coverage, choose search engines that do not overlap, and don’t bother with meta-engines (DA or DT can deal with ranking and relevance), and 2) ‘collections’ such as Complete Planet are useful only for exhaustive searches, where you want every reference no matter how insignificant (e.g. academic and possibly journalistic research) … otherwise you are better off trying smaller, more specialized engines that are likely a part of these collections.
To discover specialized search engines, you really just have to browse around a little:
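As a rough sketch of observation 1, the overlap check can be done mechanically. The table below is transcribed from the lists above; treating a "unique" engine as its own backend, and the greedy fewest-backends-first rule, are my assumptions for illustration, not anything DA does.

```python
# Backing indexes per engine, transcribed from the lists above.
# An engine described as "unique" is modelled as its own backend (assumption).
BACKENDS = {
    "Hotbot": {"Google", "Yahoo", "Teoma"},
    "Altavista": {"Yahoo"},
    "Wisenut": {"Looksmart"},
    "Gigablast": {"Gigablast"},
    "Lycos": {"Looksmart", "Yahoo"},
    "MSN": {"Looksmart", "Yahoo"},
    "Excite": {"Overture", "Inktomi"},
    "Alexa": {"Alexa"},
    "Alltheweb": {"Yahoo"},
    "Ask": {"Teoma"},
}


def pick_non_overlapping(backends):
    """Greedy pick: engines with the fewest backends first, skipping any overlap."""
    chosen, covered = [], set()
    for name in sorted(backends, key=lambda n: (len(backends[n]), n)):
        if backends[name].isdisjoint(covered):
            chosen.append(name)
            covered |= backends[name]
    return chosen, covered


chosen, covered = pick_non_overlapping(BACKENDS)
print(chosen)
print(sorted(covered))
```

Note that a greedy pick like this does not maximize coverage: in this table Google is only reachable through Hotbot, which overlaps with the Yahoo and Teoma engines already chosen, so a Google plugin would have to be added separately.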
http://www.search-engine-index.co.uk/
http://www.searchengineguide.com/searchengines.html
Who knew there were so many? Aside from the searchlores guys, that is, and they have way too much time on their hands.
It also never hurts to try some searches out on the engines directly; SearchSight, for example, sounds good but proved to have mostly commercial (product) sites for topics on which I wanted technical information, and many of the smaller engines (Fybersearch, IncyWincy) had either zero hits or the highest noise-to-signal ratio.
So, back on topic.
My automatic searches tend to be based on the following needs:
- keep abreast of the state of the art of certain ideas or technologies
- acquire (and import into DT) every publication concerning a specific topic (and closely-related topics)
- keep track of references to our company or product
- same, for competitors and partners
I have found that though the first two are similar, they must be distinct searches; the first changes daily and must be tightly focussed, while the second changes infrequently (after the initial deluge) and needs to be fuzzy yet focussed to sites and document types. I think of these now as “what people are working on in my field” and “what work has been done in my field”, which is basically the difference between subscribing to a magazine and going to the library.
The last two are challenging, and the source of no end of frustration: lots of commercial sites and link aggregators, all to get a few press releases.
All of that said, here is what I am experimenting with for search sets.
Coverage (“broad”):
(Options: Follow no links, filter similar/archived/junk pages)
Alexa
Ask (Teoma)
Gigablast
Internet (MSN, Yahoo, Google)
Looksmart Web (*)
LuckyGuess
ScrubTheWeb
Thunderstone
Wotbot
Exhaustive (“deep”):
(Options: Follow +1 level of links, filter similar/archived/junk pages)
Citeseer
Complete Planet
Galaxy
GoogleScholar
Gutenberg
InfoPlease
IngentaConnect
Internet Public Library
KosersOSS
Looksmart Articles (*)
ResourceFinder
Scirus
SMEALsearch
US Patent Office
Usenet
Wikipedia
WIPO
So far these are fairly complementary. I can do broad topic searches in ‘coverage’, then do targeted searches on the resulting topics using ‘exhaustive’. Sure, ‘exhaustive’ gives many redundant results, but that is the point. Notice what I am going for in these: an indiscriminate collection of any mention of the keywords (coverage) versus a spidered (“deep”) search of content providers for highly specific key words and phrases.
A note on exhaustive: most of my research has to do with computer science, biomedical/bioinformatics, engineering and mathematics, and the engine selection reflects this. My choices for a Literature ‘exhaustive’ search, for example, would be a bit different. If I ever have time to study Lit again I’ll post those.
I am still toying with my content searches. These are pretty domain-specific, so it is hard to recommend out-of-the-box lists.
For CS stuff, which is hugely publish-or-perish, I am thinking of the following:
ACM (*)
Citeseer
IEEE (no plugin yet)
IngentaConnect
KosersOSS
SMEALsearch
US Patent Office
WIPO
I’d like to use the attached documents scanner on these, since I am interested in papers and most of these have attached PDF or PS documents, but a search for something like “shape analysis” on ACM yields 3 results with the linked document scanner and 23 results with no scanner (I have personally downloaded at least 10 PDF/PS documents from the ACM on this subject).
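One way to investigate a gap like this, offered as a guess rather than a diagnosis: if a scanner recognizes documents by file extension, then documents served through a script URL would be skipped. The sketch below only illustrates that idea; the URLs are hypothetical (not real ACM links) and `looks_like_document` is a made-up helper, not how DA's scanner is known to work.

```python
from urllib.parse import urlparse

DOC_EXTENSIONS = (".pdf", ".ps")


def looks_like_document(url):
    """True when the URL path (ignoring any query string) ends in a document extension."""
    return urlparse(url).path.lower().endswith(DOC_EXTENSIONS)


# Hypothetical result links, for illustration only.
links = [
    "http://example.org/papers/shape-analysis.pdf",
    "http://example.org/papers/shape-analysis.PS",
    "http://example.org/gateway.cgi?id=12345&type=pdf",  # a PDF behind a script URL
    "http://example.org/abstract.html",
]

matched = [u for u in links if looks_like_document(u)]
print(len(matched), "of", len(links), "links look like documents")
```

Running the same check over the links on an actual results page would show whether the missing papers sit behind extension-less URLs.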
In addition I am thinking of some source-code specific plugins to do fast searches, e.g. to scan existing python or ruby code for an example of solving a certain problem.
I’m still working on the ‘keeping track of trends/competitors/etc’ search sets; those are the most difficult to get manageable.
I have three plugins to add which I am using in the above search sets, labelled (*). I’ll paste them inline here; they are a mere first stab at proper plugins, and have had just enough work put in to make them minimally functional. To use, copy an existing plugin and replace everything between (and including) the dict tags.
LookSmart Web:
<dict>
<key>Description</key>
<string>LookSmart (http://search.looksmart.com/) search engine (web index)</string>
<key>EngineUrl</key>
<string>http://search.looksmart.com/p/search?qt=_agentQuery_&amp;sb=web&amp;sn=_agentOffset_</string>
<key>Identifier</key>
<string>localhost.looksmart.plugin</string>
<key>Info</key>
<string>Looksmart Web Plugin (1.0)</string>
<key>Name</key>
<string>LookSmart Web</string>
<key>Operators</key>
<integer>27</integer>
<key>OffsetPerPage</key>
<integer>10</integer>
<key>ResultsPerPage</key>
<integer>10</integer>
<key>Start</key>
<integer>0</integer>
<key>Version</key>
<string>1.0</string>
</dict>
(I found LookSmart does better when you specify the search type such as Web or Article; its default All mode tries to be clever and discards useful results.)
Looksmart Articles:
<dict>
<key>Description</key>
<string>LookSmart (http://search.looksmart.com/) search engine (Articles database)</string>
<key>EngineUrl</key>
<string>http://search.looksmart.com/p/search?qt=_agentQuery_&amp;sb=art&amp;sn=_agentOffset_</string>
<key>Identifier</key>
<string>localhost.looksmart.plugin</string>
<key>Info</key>
<string>Looksmart Articles Plugin 1.0</string>
<key>Name</key>
<string>LookSmart Articles</string>
<key>OffsetPerPage</key>
<integer>10</integer>
<key>Operators</key>
<integer>27</integer>
<key>ResultsPerPage</key>
<integer>10</integer>
<key>Start</key>
<integer>0</integer>
<key>Version</key>
<string>1.0</string>
</dict>
ACM:
<dict>
<key>Description</key>
<string>Searches the Association for Computing Machinery Digital Library.</string>
<key>EngineUrl</key>
<string>http://portal.acm.org/results.cfm?query=_agentQuery_&amp;sn=_agentOffset_</string>
<key>Identifier</key>
<string>localhost.ACMSearch.plugin</string>
<key>Info</key>
<string>ACM Search Plugin 1.1 (BSD License)</string>
<key>Name</key>
<string>ACM Digital Library</string>
<key>Operators</key>
<integer>4121</integer>
<key>OffsetPerPage</key>
<integer>20</integer>
<key>ResultsPerPage</key>
<integer>20</integer>
<key>Start</key>
<integer>1</integer>
<key>Version</key>
<string>1.0</string>
</dict>
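For anyone who would rather generate these than hand-edit them, here is a minimal sketch using Python's standard `plistlib`, with the values copied from the LookSmart Web plugin above. Two things are assumptions on my part: that a plugin file is an ordinary XML property list wrapping this dict, and that the offset for result page i is Start + i × OffsetPerPage. The `page_urls` helper is hypothetical and only shows how the `_agentQuery_` and `_agentOffset_` placeholders might be expanded.

```python
import plistlib
from urllib.parse import quote_plus

# Values copied from the LookSmart Web plugin above.
plugin = {
    "Description": "LookSmart (http://search.looksmart.com/) search engine (web index)",
    "EngineUrl": "http://search.looksmart.com/p/search?qt=_agentQuery_&sb=web&sn=_agentOffset_",
    "Identifier": "localhost.looksmart.plugin",
    "Info": "Looksmart Web Plugin (1.0)",
    "Name": "LookSmart Web",
    "Operators": 27,
    "OffsetPerPage": 10,
    "ResultsPerPage": 10,
    "Start": 0,
    "Version": "1.0",
}

# plistlib takes care of XML escaping: a literal '&' is written as '&amp;'.
xml_text = plistlib.dumps(plugin).decode("utf-8")


def page_urls(plugin, query, pages):
    """ASSUMPTION: the offset for page i is Start + i * OffsetPerPage."""
    urls = []
    for i in range(pages):
        offset = plugin["Start"] + i * plugin["OffsetPerPage"]
        url = plugin["EngineUrl"].replace("_agentQuery_", quote_plus(query))
        urls.append(url.replace("_agentOffset_", str(offset)))
    return urls


for url in page_urls(plugin, "shape analysis", 3):
    print(url)
```

Pasting the printed URLs into a browser is a quick way to check that the offset parameter really pages through results before loading the plugin into DA.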
Thoughts for future directions in DT: the ability to use more than one scanner, the ability to download linked files directly into DT (sending the link and downloading from within DT is a little hit-or-miss), and the ability to repaste arbitrary CGI parameters inside a plugin (e.g. when supplied the name of the param containing the session token, the plugin will repaste the param and value each time). Actually, having the plugins prompt the user for details (by invoking an AppleScript, maybe?) might not be too bad, though it breaks scheduled searches.
Well, that should wrap up this Russian novel of a post.
–Eric
PS Thx to boneskull for prodding me to actually write all this 