How do I create a crawler set?

For example, let’s say I want to create a search setting that searches the Life Extension Foundation website. Their url is lef.org.

If all I do is add lef.org, nothing happens: I get zero results. So obviously I need to do more than that. The user doc has a sentence about adding a string to use the website’s built-in search functionality, but the doc is, well, less than helpful.

can anyone suggest how i do this?

rmathes 23:

[1] Select File > New Search to open a search window.

[2] Click on the ‘magnifying glass’ to the left of the search field. Scroll down and click on Edit Search Sets. Click on the + symbol to add and name it, perhaps “Life Extension”.

[3] Enter a default search string. Example: Life. Click on Follow Links and enter a string there too (the same one, for example).

[4] I usually check all the filters.

[5] Click on the Sites tab. At the bottom of the pane, enter your URL. Then click on the + symbol to add it.

[6] That’s it. Any time you want to run it, select this set from the list and hit the Search button.

Hope this helps.

hey Bill…did all that, except for entering a default search string (never quite sure of the value in that since i’d never run a search without entering a search term). The query runs and I get nothing.

for example, i created a new search set called “medical sites”. set it to follow max links. added the site “http://www.lef.org”. ran the query on the term ‘osteoarthritis’.

nothing.

i’m missing something.

You must enter a default search term when creating a new search set.

Tried to check it, but the lef site is down for maintenance at the moment.

i went back and added the default search term ‘osteoarthritis’.

nothing changed.

I took a quick look at the site and I believe you may need to create an XML plug-in. Rather than linking articles off of the main URL (lef.org), the site has its own internal search engine. That’s why the “Follow Links” option on your search set is not giving you the hits that you expect.
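
To illustrate the difference, here is a minimal Python sketch of a link-following crawl over a made-up site. The page names and link structure are assumptions for illustration only, not the actual layout of lef.org, and this is not how DEVONagent is implemented. It only shows why pages that are served solely as search results never turn up when a crawler just follows links from the home page.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
# The "/abstracts/..." pages are only served as search results,
# so no other page links to them.
TOY_SITE = {
    "/": ["/products", "/about"],
    "/products": ["/products/vitamin-d"],
    "/products/vitamin-d": [],
    "/about": ["/"],
    "/abstracts/osteoarthritis": [],
    "/abstracts/fibromyalgia": [],
}


def crawl_by_links(site, start="/"):
    """Breadth-first crawl that only discovers pages via hyperlinks."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in site.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


reachable = crawl_by_links(TOY_SITE)
unreachable = sorted(set(TOY_SITE) - reachable)
print("found by following links:", sorted(reachable))
print("never found:", unreachable)
```

In this toy layout the two abstract pages are never found, however many links the crawl is allowed to follow.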

that’s what i figured it was, but the user doc seems to indicate there’s a way to construct the url in the sites pane such that it will leverage the site’s search functionality. i just can’t figure out the user doc instructions on this.

also, how do I determine which sites need this and which don’t? is the mere presence of a search field sufficient to make that determination?

Just checked the URL at 3:41 PM Central Time – the site is down for maintenance still, so no search strategy is going to work!

My test set used ‘life’ for both the default term and the follow links term. Earlier today, the site was up and I got 62 hits. Even though the site has its own search function, DA was able to act like a human viewer of the site would, and to find pages that contained the search string.

Bill, I appreciate your help, sir.

i’m confused. i realized one thing i wasn’t doing was populating the follow links field with a default search term. one problem with the screen layout is it doesn’t make it clear just what the heck that field is. that should be remedied in a future version. it just says follow links with a blank field underneath.

now when i put ‘life’ into both fields, then i get some hits but it’s all basically product returns. it doesn’t access the meat of their research.

the thing I’m finding REALLY confusing is if I change the default search terms from ‘life’ to ‘test’, i get only one link returned. so that begs the question, how does DA use the default search term? i figured it was a placeholder that would be substituted with the entered search term, but that’s clearly not the case. this doesn’t make any sense to me.

Now, with ‘life’ as the default search terms, if I go to their search field and enter “fibromyalgia”, i get a ton of hits. At the bottom of the page I can narrow the query to just return results from their medical abstracts. The URL of this native search is…

search.lef.org/src-cgi-bin/MsmFi … =X&NO_DL=X

so it still seems like i’m not getting the same returns i’d get as if i manually crawled the website. not sure where to go from here.
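
For what it's worth, the general idea behind leveraging a site's own search is usually to take the URL that the site's search produces and swap the search term into it. A rough Python sketch of that substitution follows. The template URL, the `q` parameter, and the `build_search_url` helper are all hypothetical; the real lef.org search URL is truncated above and is not reconstructed here, and this says nothing about the syntax DA itself expects in the Sites pane or in a plug-in.

```python
from urllib.parse import quote_plus

# Hypothetical template; a real site will use its own path and parameter names.
SEARCH_TEMPLATE = "https://search.example.com/find?q={query}"


def build_search_url(term, template=SEARCH_TEMPLATE):
    """Insert a URL-encoded search term into the template."""
    return template.format(query=quote_plus(term))


print(build_search_url("fibromyalgia"))
print(build_search_url("vitamin d"))
```

The encoding step matters for multi-word terms, since spaces and punctuation cannot go into a query string as-is.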

is there a way to use DA to leverage a site’s internal search functionality?


For this, you need plugins that “tell” DEVONagent how to access the site’s search function.
I am trying to find out precisely how this works right now, but am having some difficulties as well … I actually posted a question on this in another board. Will keep you posted.

Best,
Ben

I realize this is now an old discussion, but I was going through exactly the same process and just discovered the solution. I too wanted to get DA to crawl a particular site and grab every page from it. I discovered that you need to enter an asterisk [i.e., Shift-8] in BOTH the default query and the box under “Follow Links”. Once I entered that second asterisk and hit Go, DA took off and happily downloaded the entire website for me. I’m happy. Hope this helps the next person trying to get DA to crawl sites.

Mind you, it’s now up to 608 pages, so I will clearly need to introduce some filtering and restrictions!
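
A small Python sketch of why an asterisk would behave this way, and of the kind of restriction mentioned above. This is an assumption about the general idea (a wildcard term that matches any page, plus a cap on the number of pages), not a description of DA's internals; the page texts and the `matches` and `collect` helpers are made up for illustration.

```python
# Hypothetical page texts keyed by path.
PAGES = {
    "/": "Welcome to the foundation",
    "/products": "Life extension products",
    "/research": "Research on osteoarthritis",
    "/magazine": "Life and health magazine",
}


def matches(term, text):
    """A lone asterisk matches every page; otherwise do a
    case-insensitive substring test."""
    if term == "*":
        return True
    return term.lower() in text.lower()


def collect(pages, term, max_pages=None):
    """Return matching paths in sorted order, optionally capped at max_pages."""
    hits = [path for path in sorted(pages) if matches(term, pages[path])]
    return hits if max_pages is None else hits[:max_pages]


print(collect(PAGES, "*"))               # every page
print(collect(PAGES, "life"))            # only pages mentioning the term
print(collect(PAGES, "*", max_pages=2))  # wildcard with a page cap
```

A specific term narrows the result to pages that contain it, while the wildcard returns everything unless a cap is applied.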