Good search results appear in Google, but not in DEVONagent

I’ve been considering buying the package that includes both DEVONthink Pro and DEVONagent. The price difference isn’t that great, and so far DEVONagent has appeared to be quite capable.

However, still a bit in doubt, I decided to run a test case. To my surprise, a result that appeared at the top of the page in Google did not appear in DEVONagent at all. I used the fast search.

Now here’s the thing. Even though I have no doubt that the top result might not have been great content-wise, it did cover the topic perfectly, and the title was exactly the search string!

If anyone wants to replicate this: I searched for ‘Stilte in Taize’, which happens to be a post on my personal weblog. I hope nobody considers this blog spamming; it’s Dutch, so it won’t be of interest to most of you.

This ‘problem’ really bothers me, in particular because I don’t understand what goes wrong. Again, I by no means consider the information great, but it’s just odd that a top Google result wouldn’t show up in DEVONagent at all…

If someone could clear this up, I could go and buy the two programs :wink:

What’s the URL of the blog/the page you’re referring to? Did you search for ‘Stilte in Taize’ with quotes (looking for the phrase) or without (looking for all words)?

In addition, if you’re using the “Internet (Fast Scan)” default set, then you might disable the filter for similar pages.
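The difference between the two query styles asked about above can be sketched as follows. This is only an illustration of phrase matching versus all-words matching in general; the function names are hypothetical, and it makes no claim about how DEVONagent or Google actually evaluate a query.

```python
import re


def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def matches_all_words(query, text):
    """True if every query word occurs somewhere in the text, in any order."""
    words = set(tokens(text))
    return all(word in words for word in tokens(query))


def matches_phrase(query, text):
    """True only if the query words occur consecutively and in order."""
    q = tokens(query)
    t = tokens(text)
    n = len(q)
    return n > 0 and any(t[i:i + n] == q for i in range(len(t) - n + 1))


# Hypothetical page text with the same words in a different order.
reordered = "Taize in stilte"
print(matches_all_words("Stilte in Taize", reordered))  # all-words query
print(matches_phrase("Stilte in Taize", reordered))     # quoted phrase query
```

A page can therefore pass an all-words search and still fail a quoted phrase search, which is why the distinction matters when comparing results.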

The address is munandu.com.

I tried most combinations of search parameters, both with parentheses and the ‘and’ operator. I also looked in the log.

Even searching for the term ‘munandu’ gives no results. Furthermore, searching for specific terms in a post does bring the correct URL up in the log, but lists it as ‘no match’.

The link does show in the log. I searched for the words Stilte and Taize.

As far as I can see there is a difference in language. When searching as a Dutch client, the munandu site is the first hit in Google. In the English client, as used by DA, the link is further down in the log for the Google page.

But why doesn’t the site make it to the digest page? As seen in the log, the words “Stilte” and “Taize” are marked bold.

I also tried another post, “The Life of Brian”, which is completely in English. It also didn’t work there.

I can understand some influence of language, but to this degree?

Furthermore, my blog is not ‘classified’ anywhere as being Dutch.

OK, I might be wrong in my previous post. So better ignore it for now.

When I search for “Stilte in Taize” (without quotes) in DA, the page for Google in the log is the same as the page would be in Safari. It is the top page in the log and the first entry on the page is the one for the munandu site. On the Yahoo page it is the third link. All filters are off and even the Settings are set to Dutch (Nederlands).

So DA does pick it up, but the question remains: why doesn’t the munandu page go to the digest?

I wonder… could it be that it somehow doesn’t read the content? Some bizarre formatting error (I have been messing with the code, so it’s by no means standard WordPress).

It’s this and some other experiences that make me doubt DEVONagent. Quite a conundrum, as I would buy both it and DEVONthink Pro if I could be convinced it really does search all that much better. Could anyone reassure me that it really does work as great as advertised?

Still, the problem remains.

I believe the problem is related to the accented character in the word “Taizé”. When I search on the page in question for “Taizé”, Safari finds the word, but when I search using “Taize” it does not. That is most likely why the content filter is removing the page.

In my experience, there is always an answer if a page shows up in a Google search but not in the DA index. I have checked this many times because I also need to know that the search is reliable, and I am satisfied. Perhaps there does need to be a small adjustment in the code here?
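To illustrate the accent mismatch described above, here is a minimal sketch of accent-insensitive matching using Python’s standard `unicodedata` module. The helper name `fold_accents` is made up for this example, and this is not DEVONagent’s actual filter code, only one common way such an adjustment could be made.

```python
import unicodedata


def fold_accents(text):
    """Return a lowercase copy of text with accents removed (é -> e)."""
    # NFD splits "é" into "e" plus a separate combining accent character.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


page_text = "Stilte in Taizé"

# A literal comparison misses the accented spelling...
print("Taize" in page_text)
# ...while comparing the folded forms finds it.
print(fold_accents("Taize") in fold_accents(page_text))
```

With folding applied to both the query and the page text, “Taize” and “Taizé” compare as equal, so neither spelling would cause a page to be dropped.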

I thought of that though, so I tried all my searches for both taizé and taize.

Furthermore, the post’s slug (the URL) does not have the é character.

And this doesn’t explain why the issue also crops up for ALL my other searches inside the munandu domain (I did this for some English posts to test if it was a language issue).

It’s almost like the complete site is blacklisted from DEVONagent…

Well, I just tried doing some phrase searches on Google using text cut and pasted from your page, and Google doesn’t find them either, so something is up with this page, I am sure.

Ok, I just tried this for a recent post (Yahoo, as Google was inaccessible to me):

the life of brian “to my surprise” “not only hilarious”

Yahoo lists my site as a top result, yet it does not appear in DEVONagent.

Again, it could be that DA is so clever that it considers my writing on the subject crap, but it doesn’t seem right that such a targeted search term wouldn’t display this result.

Or is DA just not very good at targeted searching? Isn’t that what it’s made for?

Hilko, there seems to be something strange about your blog page. I tried out sgmiller’s experiment of copying a phrase from your blog page and searching for it in Google.

Google can’t find it on your page. Nor can DEVONagent.

So you might want to take a look at how your blog is set up.

Sure. But you should have a look at the following options and maybe disable them:

  1. Ignore Umlauts
  2. Filter similar pages (this is much more aggressive than for example Google’s filtering of similar pages which is more or less only filtering duplicates)
  3. Filter junk

Well, depending on what text I enter as a search string, some pages do show up in Google. It does suspiciously often link to the front page rather than the actual posts though - I can imagine that throwing DEVONagent off (rightly so).

I don’t think this is the place to address the issue of my site being indexed ‘wrongly’ (would you know where I could go with such inquiries?), so I’ll leave it at that.

Well, thanks to the quick help and your faith in DA (and the nice Information Worker’s bundle), I’m happy to inform you that I ordered DTpro and DA, and my registration key just arrived!

You’ll be hearing from me again, no doubt :wink:

Another difference is of course that DEVONagent downloads all pages and checks them whereas the index of any search engine is not always up-to-date.

The problem seems to be that this page is not XHTML compliant and contains RDF tags too. But the next release will handle this.
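As a rough illustration of why non-compliant markup with embedded RDF can trip up a parser, the sketch below feeds the same fragment to a strict XML parser and to a lenient HTML parser from Python’s standard library. The fragment is a made-up example, not the actual munandu.com markup, and the code does not reflect how DEVONagent’s own parsers work.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical fragment: unclosed <p> and <br> tags, an HTML entity,
# and an RDF block in the middle of the page body.
PAGE = """<html><body>
<h1>Stilte in Taiz&eacute;</h1>
<p>First paragraph
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="http://example.com/post" />
</rdf:RDF>
<p>Second paragraph<br>
</body></html>"""


class TextCollector(HTMLParser):
    """Lenient parser that keeps only the visible text chunks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def strict_parse_ok(markup):
    """True if the markup is well-formed enough for a strict XML parser."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def lenient_text(markup):
    """Extract the text content while tolerating malformed markup."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


print(strict_parse_ok(PAGE))  # strict parsing rejects the page
print(lenient_text(PAGE))     # lenient parsing still recovers the text
```

A parser that insists on well-formed input ends up with no content at all, and a page with no readable content cannot match any search term; a tolerant parser recovers the text despite the broken container.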

Surely a search engine shouldn’t be touchy about bad markup? After all, it’s all about the content; a ‘seeker’ generally wouldn’t care if the ‘container’ is shoddy, would he?

I think the point here is that DA does an excellent job of filtering pages but that does mean, by definition, that not all pages will make it past the filter. In almost every case, the filter does what it is supposed to do which is rule out pages that don’t have relevant information. In this case, there appears to be something off on the page itself which is also giving Google a problem.

I think a much more important issue arises when a user wants all pages regardless of whether or not the search term(s) is found. For example, what if I want to know about pages that are no longer available but are contained in the Google cache? That is why I have suggested a “switch” to allow the user to turn off the content filter.

As DEVONagent’s HTML and feed parsers already contain lots of workarounds, you shouldn’t ask me that question :wink:

Hehe, as an aside, don’t consider this an attack or even criticism of DA per se. I’m just anxious to figure out what goes wrong and why.

As for my site being the problem: this may be so in part, but the search string ‘christian charity munandu’ gives a perfect result in Google. It’s not much of a real-life scenario to search for this, but it still baffles me why DA doesn’t find the page in this way.

Furthermore, my site uses WordPress, and isn’t altered all that much. So I can’t quite think of what exactly could have changed the content so radically that it isn’t indexed.

I think part of the indexing problem on Google’s side is just the very low traffic and PageRank, not necessarily content.