Can DA be "neat" without XML?

Thoroughly intimidated by the construction of an XML plug-in, I took a few shortcuts. I got close, but I need a little extra guidance.

USE CASE: I regularly use bizjournals.com. The site requires registration (free, but still required). Then, I enter a search for a topic, including a few parameters (number of results, time span) and get a page that contains:

  • A directory of links to articles
  • Advertising, site menus, etc.

I noticed that the URL of the Index page contains userID and password, as well as the terms of the search (looks like my-UserID-here:my-password-here@ … &am=&r=200 - Edited for generality).

So I copied the URL into the “Site” panel of a new Search Set configuration. Knowing that (a) I am starting from the right page and (b) the links are really article names, which in most cases bear little relation to the contents, I followed advice on the Forum and entered an asterisk ‘*’ for both “Default Query” and the box below “Follow links” (no name for the box?). In any case, I got 414 pages in the result set.

PROBLEMS:

  1. There were only 200 links to valid articles. The rest are site menus, side captions, paid news, etc. --> More than 50% junk
  2. Most of the articles are multi-page (i.e., they have additional links for Page 2, Page 3, etc.), which is a pain because it forces me to be online when I read them (which I typically do on the train or in a boat, without Internet). IDEALLY, I would be able to add a little intelligence and ask DA NOT to keep the page found by following the link, but rather the one reached by following the “Printable version” link on that same page --> Approximately one third of the content I was looking for.

So, 50%+ junk and only 35% of the content I needed.

OTHER ATTEMPTS:

  • Using “Printable version” in the “unnamed box” under “follow links” - No result (because the “Index page” doesn’t contain it)
  • Increasing number of levels - Gets the Printable version pages, but also LOTS of additional junk (the “Printable versions” don’t have anything in common that I can use to post-search filter the results to contain the Printable versions pages)

Any hints that don’t contain the words “Write a plug-in” and/or “XML”?

ADDITIONAL QUESTIONS:

  • What is the precedence between the “Settings panel” and Plug-ins? Are they “OR’ed” or “AND’ed”, or does it depend on the plug-in? In other words, if I select a Plug-in AFTER doing what I described above, will it narrow, expand or who-knows-what the scope of the results?

Since I have been responsible for optimising more than 80 plugins, maybe I should expand on this topic a bit and use your site as an example on how to create a plugin. It’s not difficult, but you need to spend some time massaging the plugin parameters for getting the best results.

Your approach was good but let me show you how to optimise the plugin. I will do this in several sections that will take some of the parameters you can set in the xml file into account. First of all, copy an existing plugin and rename it to “bizjournals.xml”. Then open it with PropertyListEditor.

[b][size=134]Setup of new xml file[/size][/b]

You want to change the following values:

  • Name: bizjournals
  • Info: bizjournals Plugin bla bla bla
  • Description: something that makes people understand what it does
  • Version: 1.0
  • Identifier: your.domain.bizjournals.plugin (no spaces allowed, should be valid URL string)
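
For readers who would rather script this step than click through an editor, here is a minimal sketch in Python. It assumes the plugin file is an ordinary XML property list (it opens in PropertyListEditor, so that seems likely) and that all five values are stored as strings. The key names come from the list above; the values and the helper name [i]write_plugin[/i] are placeholders of mine.

```python
import plistlib

# Metadata keys as named in the list above. The values are placeholders,
# and storing Version as a string is an assumption.
plugin = {
    "Name": "bizjournals",
    "Info": "bizjournals Plugin",
    "Description": "Searches bizjournals.com stories",
    "Version": "1.0",
    "Identifier": "your.domain.bizjournals.plugin",
}


def write_plugin(path, settings):
    """Serialise the settings dictionary to `path` as an XML property list."""
    with open(path, "wb") as handle:
        plistlib.dump(settings, handle, fmt=plistlib.FMT_XML)


# Render to a string first so the result can be inspected before saving.
xml_text = plistlib.dumps(plugin, fmt=plistlib.FMT_XML).decode("utf-8")
print(xml_text)
```

Calling [i]write_plugin("bizjournals.xml", plugin)[/i] would save the file. Copying an existing plugin, as described above, is still the safer route because it keeps any keys this sketch does not know about.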

[b][size=134]Searching[/size][/b]

You need to go to the site (logging in if necessary), enter a search term and run the search. Hopefully you will get some results. I used “Apple iPod” as a search term. Here’s the URL I ended up with:


http://www.bizjournals.com/search/bin/search?t=eastbay&am=eastbay&q=Apple+iPod&f=story&am=30_days&r=20

Also play with the advanced search facility on the site, change the values and make some new searches. Save all of these search URLs in TextEdit. Then it becomes easier to detect what is different.
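
If comparing the saved URLs by eye gets tedious, a few lines of Python can list the query parameters that differ between two searches. This is only a sketch: the second URL is made up for illustration, and the function name is my own.

```python
from urllib.parse import parse_qsl, urlsplit


def changed_params(url_a, url_b):
    """Return {name: (values_in_a, values_in_b)} for parameters that differ."""

    def collect(url):
        # Group values by name, because a name such as "am" can occur twice.
        grouped = {}
        query = urlsplit(url).query
        for name, value in parse_qsl(query, keep_blank_values=True):
            grouped.setdefault(name, []).append(value)
        return grouped

    a, b = collect(url_a), collect(url_b)
    return {
        name: (a.get(name, []), b.get(name, []))
        for name in sorted(set(a) | set(b))
        if a.get(name, []) != b.get(name, [])
    }


first = ("http://www.bizjournals.com/search/bin/search"
         "?t=eastbay&am=eastbay&q=Apple+iPod&f=story&am=30_days&r=20")
# Hypothetical second search: a different query and more results per page.
second = ("http://www.bizjournals.com/search/bin/search"
          "?t=eastbay&am=eastbay&q=Microsoft&f=story&am=30_days&r=200")

for name, (before, after) in changed_params(first, second).items():
    print(name, before, "->", after)
```

Parameters that change when you change the search term or the number of results are the ones worth turning into plugin placeholders.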

Looking at the “XML Plugin Documentation”, we can massage this search URL for the plugin:

  • [i]q=Apple+iPod[/i] is my search query, change this to:

```
q=_agentQuery_
```

  • [i]r=20[/i] is the number of results on the list, this becomes:

```
r=_agentResults_
```

:bulb: On this site, a single page can return a large number of results, so we don’t have to worry about another important variable: _agentOffset_. You would use it to let DA step through successive pages of results.
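
The same substitution can be scripted, which helps when a search URL carries many parameters. The sketch below swaps the values of [i]q[/i] and [i]r[/i] for placeholders and leaves everything else untouched. I am assuming the placeholders are written with surrounding underscores, as in [i]r=_agentResults_[/i]; the function name is my own.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def make_engine_url(sample_url, placeholders):
    """Replace the values of selected query parameters with placeholders."""
    parts = urlsplit(sample_url)
    pairs = [
        # Keep order and duplicates; only swap the values we were asked to.
        (name, placeholders.get(name, value))
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(pairs)))


sample = ("http://www.bizjournals.com/search/bin/search"
          "?t=eastbay&am=eastbay&q=Apple+iPod&f=story&am=30_days&r=20")
engine_url = make_engine_url(
    sample,
    {"q": "_agentQuery_", "r": "_agentResults_"},  # assumed placeholder spelling
)
print(engine_url)
```

Working on the parsed parameters rather than doing a plain text replace avoids accidentally touching another parameter that happens to contain the same text.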

Set these parameters in the plugin xml:

  • [b]EngineURL[/b]:

```
http://www.bizjournals.com/search/bin/search?t=eastbay&am=eastbay&q=_agentQuery_&f=story&am=30_days&r=_agentResults_
```

  • [b]ResultsPerPage[/b]:

```
200
```

:bulb: Use the maximum allowed by the website.


:bulb: Try to check the site's "Searching Help" pages to see what kind of operators are allowed. This helps to determine what value to use for the [b]Operators[/b] parameter. For most sites, the default value is ok, but if you really want to optimise your plugin, setting the proper value here will help you with efficient complex searches.

[b][size=134]Filter search results pages links[/size][/b]

Now look at the source of the search results page, because there are three approaches to filtering results:

1. [b]LinksMatching[/b]: a list of links that DA should follow
2. [b]LinksNotMatching[/b]: a list of links that DA should skip
3. [b]LinksStart[/b] and [b]LinksEnd[/b]: defines an area in the page where DA should look for links.

You don't have to use all three of these. In my experience though, the more you specify the better your results will be even when the site changes its layout. Let's analyse this site in order to set as many of these parameters as possible.

[i]Since I can't login to this site I can only provide an indication of how to deal with the links.[/i]

Find out what the links to articles have in common. Sometimes they are all different and then you can't use this. Let's say they all have "[i]xxx.articles?id=[/i]" in common. Then you can use this with a wildcard as a parameter: [i]*xxx.articles?id=*[/i].

You use the same approach for links that should be skipped. Check a few links on the page that you do not want to follow and see what they have in common. Each of these can be entered in the parameter.
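
To try candidate patterns before putting them in the plugin, you can run them against a handful of links copied from the page. The sketch below is not DA's own matcher. It assumes that [i]*[/i] stands for any run of characters, that every other character is literal, and that a skip pattern wins over a follow pattern. The example links and the two skip patterns are made up.

```python
import re


def wildcard_to_regex(pattern):
    """Compile a pattern where '*' is a wildcard and all else is literal."""
    pieces = (re.escape(piece) for piece in pattern.split("*"))
    return re.compile(".*".join(pieces), re.DOTALL)


def keep_link(url, matching, not_matching):
    """Decide whether a link survives both lists (skip list checked first)."""
    if any(wildcard_to_regex(p).fullmatch(url) for p in not_matching):
        return False
    # An empty follow list is treated here as "follow everything".
    return not matching or any(
        wildcard_to_regex(p).fullmatch(url) for p in matching
    )


links = [  # hypothetical links
    "http://www.example.com/xxx.articles?id=101",
    "http://www.example.com/advertise.html",
    "http://www.example.com/xxx.articles?id=102&page=2",
]
links_matching = ["*xxx.articles?id=*"]
links_not_matching = ["*advertise*", "*page=*"]  # hypothetical skip patterns

kept = [u for u in links if keep_link(u, links_matching, links_not_matching)]
print(kept)
```

If a pattern keeps junk or drops real articles in this dry run, it will do the same in the plugin, so this is a cheap way to iterate.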

And last but not least, you can indicate where DA should look for links. View the source and look for the string that is shown on every search results page and is close to the result links. Here, "[i]required to view the stories below.[/i]" is the closest. Note that it doesn't contain formatting so the chances are good that it will be completely intact in the HTML. When we search the source, it is confirmed that this is true.

So where do the links end? If you look at the bottom of the page you see a link called "Home". Searching in the source, we see that there is a bunch of nonsense before that, but we also see an HTML comment: "[i]<!--end main content area-->[/i]". This is great! Now we have the area that we will use to instruct DA to search for links.
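
You can check the two markers against a saved copy of the results page before touching the plugin. The sketch below cuts out the text between them and lists the links found there. The HTML snippet is invented, the link extraction is a crude regular expression, and how DA itself behaves when a marker is missing is not something I can confirm; here a missing start marker simply yields no links.

```python
import re


def links_in_area(html, links_start, links_end):
    """Return href values found between the start and end markers."""
    start = html.find(links_start)
    if start == -1:
        return []  # assumption: no start marker means nothing to follow
    start += len(links_start)
    end = html.find(links_end, start)
    area = html[start:] if end == -1 else html[start:end]
    # Crude extraction that only handles double-quoted href attributes.
    return re.findall(r'href="([^"]+)"', area)


page = (  # made-up page source for illustration
    '<a href="/menu">Menu</a>'
    "<p>Registration is required to view the stories below.</p>"
    '<a href="/story/1.html">One</a><a href="/story/2.html">Two</a>'
    "<!--end main content area-->"
    '<a href="/">Home</a>'
)

found = links_in_area(
    page,
    "required to view the stories below.",
    "<!--end main content area-->",
)
print(found)
```

The menu link before the start marker and the "Home" link after the end marker are both left out, which is exactly the effect we are after.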

Since I don't have a login for the site I will only use these parameters:

  • [b]LinksStart[/b]:

```
required to view the stories below.
```

  • [b]LinksEnd[/b]:

```
<!--end main content area-->
```

Now is a good time to save your changes and use this plugin in DA. Then use it to search for something and start to fine-tune it. You could see what kind of links are junk and have a common component and this would be a candidate for the [b]LinksNotMatching[/b] list. Ditto for [b]LinksMatching[/b]. Writing a plugin is an iterative process!

[b][size=134]Filter result pages[/size][/b]

Sometimes all the result pages have a common format. [i]Only then can you use parameters in items 1-3![/i]


1. [b]TitleStart[/b] and [b]TitleEnd[/b]: extract the title of the article
2. [b]TextStart[/b] and [b]TextEnd[/b]: extract the article text
3. [b]DateStart[/b] and [b]DateEnd[/b]: extract the article's date
4. [b]NoTopics[/b]: suppress useless topics that always occur for any search on a site


In order to determine values for these parameters, you will need to look at the source of a couple of articles and try to find strings that can be used to extract these values. This is similar to the [b]LinksStart[/b] and [b]LinksEnd[/b] extraction process. The [b]NoTopics[/b] list can be set when looking at the Topics pane in DA after you've done your searches. Any word that you think should not be included you can add to this list.
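
Candidate start and end strings can be tested on a saved article in the same way. The sketch below pulls out title, date and text with one small helper and then drops unwanted topics. Everything concrete in it is hypothetical: the marker strings, the sample article and the topic words are invented, and the case-insensitive comparison for [b]NoTopics[/b] is my assumption.

```python
def between(text, start, end):
    """Return the text between the first `start` and the next `end`."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return None if j == -1 else text[i:j].strip()


MARKERS = {  # hypothetical start/end pairs for one site
    "title": ("<h1>", "</h1>"),
    "date": ('<span class="date">', "</span>"),
    "text": ("<!--story-->", "<!--/story-->"),
}

article = (  # made-up article source
    '<h1>iPod sales climb</h1><span class="date">March 3</span>'
    "<!--story-->Sales rose again this quarter.<!--/story-->"
    "<p>Subscribe now</p>"
)

fields = {name: between(article, s, e) for name, (s, e) in MARKERS.items()}
print(fields)

# Dropping unwanted topics, compared case-insensitively (an assumption).
no_topics = {"subscribe", "login"}
topics = ["iPod", "Subscribe", "sales"]
filtered = [t for t in topics if t.lower() not in no_topics]
print(filtered)
```

A pair of markers that returns [i]None[/i] on some articles is a sign that the pages do not share a common format, in which case those parameters are best left unset.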

[b][size=134]DEVONagent Search Sets panel[/size][/b]

If you add your plugin and select it on the plugins pane, you can set a different value for the number of returned results. This will always override the value you entered in your plugin.

Also, the Settings panes values will override the values you set in your plugin(s).

[b][size=134]Conclusion[/size][/b]

It would be possible to write much more about these plugins, but I hope that what I wrote here will help you clear some hurdles when writing your own. And remember, if you think your plugin would be useful to others, share it with us and we can optimise it together!