How to search for linked RSS article contents?

padillac · October 7, 2021, 8:26pm

I have created a search set with a single RSS feed, set to crawl mode: https://blog.testdouble.com/index.xml

I want to search the contents of the articles present on that feed.

Example: https://blog.testdouble.com/posts/2021-09-09-how-to-build-a-search-engine-with-ruby-on-rails/ contains the term “postgres” multiple times.

However, if I search for postgres, that article doesn’t come up. It appears that DA only searches the RSS preview content rather than the page content itself.

I have also tried setting “follow links” to no avail.

How can I set up a search set that will crawl all of the links in an RSS feed, and search the contents?
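For reference, the behaviour being asked for can be sketched outside DEVONagent. This is only an illustration of the idea, not how DA works internally. It assumes an RSS 2.0 feed, and the function names (`feed_links`, `search_linked_pages`, `http_fetch`) are made up for this example:

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import Callable


def feed_links(rss_xml: str) -> list[str]:
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


def search_linked_pages(rss_xml: str, term: str,
                        fetch: Callable[[str], str]) -> list[str]:
    """Return the item URLs whose fetched page body contains `term`.

    The match is a plain case-insensitive substring test on the page
    body, not on the feed's preview text.
    """
    needle = term.lower()
    return [url for url in feed_links(rss_xml) if needle in fetch(url).lower()]


def http_fetch(url: str) -> str:
    """Hypothetical helper: download a page and return it as text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `search_linked_pages(http_fetch(feed_url), "postgres", http_fetch)` would then return the article URLs whose full page mentions the term, even when the feed preview does not.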

cgrunenberg · October 8, 2021, 9:47am

After enabling the following of links (one level), this should work as desired.

padillac · October 8, 2021, 5:01pm

That’s what I thought, but it doesn’t, which is why I’m posting here. Here’s the search set:

Blog RSS.agentSet.zip (882 Bytes)

cgrunenberg · October 11, 2021, 2:43pm

Thank you! Currently following of links is only applied to matching pages in case of feeds; the next release will improve this.

padillac · October 19, 2021, 1:34am

Great to hear! I think it would be useful to get similar behavior with a sitemap.xml or sitemap.txt as well.

appoli · December 4, 2021, 2:19am

Would you be able to provide some more detail regarding what you meant by this?

“Currently following of links is only applied to matching pages in case of feeds”

Specifically:

What are the criteria for a link to be a “matching page”?
Are feeds the only instance of this restriction, or does it hold true for every page searched by DEVONagent that happens to be an .xml file (for example)?

I’m curious how this logic affects the end search results, and whether I should be aware that DA doesn’t go any further down when searching a URL that is (or is like) an XML RSS feed, so that I don’t assume all related pages have been exhaustively searched when X is found in the results.

Thanks so much for the help! Def gonna be looking forward to the next update

cgrunenberg · December 6, 2021, 10:58am

A page is a “matching page” if it matches the primary (or secondary) search term. Depending on the settings, the title, text, keywords, description, URL and/or objects are used for the comparison.

This restriction is only relevant for feeds.
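One way to read the rule described above is sketched below. It is not DEVONagent’s actual implementation: the `Page` type, the field names and the substring test are assumptions made for illustration, and “objects” are left out. The idea is that a page’s links are followed only if the page itself matches the term in one of the enabled fields.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a crawled page or feed."""
    url: str
    title: str = ""
    text: str = ""
    keywords: str = ""
    description: str = ""


def page_matches(page: Page, term: str,
                 fields: tuple[str, ...] = ("title", "text")) -> bool:
    """True if `term` occurs (case-insensitively) in any enabled field."""
    needle = term.lower()
    return any(needle in getattr(page, name).lower() for name in fields)


def links_to_follow(page: Page, term: str, links: list[str],
                    fields: tuple[str, ...] = ("title", "text")) -> list[str]:
    """Follow a page's links only when the page itself matches."""
    return list(links) if page_matches(page, term, fields) else []
```

Under this reading, a feed whose preview text never mentions “postgres” would not match, so none of its article links would be followed. That would be consistent with the behaviour reported at the top of the thread.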

padillac · December 14, 2021, 8:55pm

I hope you’ll consider it for sitemap.xml as well, since that’s probably the most reliable and comprehensive index of a site’s contents. Site searches often fail because of search engine rate-limiting. If I could plug in a sitemap (XML or txt) and have it treated as a list of search engine results, that would be great. DA would then do the work of crawling each link and comparing it against my query.
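To make the request concrete, here is a rough sketch of treating a sitemap as a list of candidate results. It is not an existing DA feature. It assumes the sitemap is either XML in the standard sitemaps.org namespace or a plain-text file with one URL per line, and the names `sitemap_urls` and `match_sitemap` are invented for this example:

```python
import xml.etree.ElementTree as ET
from typing import Callable

# Namespace used by sitemap.xml files that follow the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(body: str) -> list[str]:
    """Extract URLs from a sitemap.xml or sitemap.txt body."""
    text = body.strip()
    if text.startswith("<"):
        # XML sitemap: every URL is inside a <loc> element.
        root = ET.fromstring(text)
        return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")
                if loc.text and loc.text.strip()]
    # Text sitemap: one URL per line, blank lines ignored.
    return [line.strip() for line in text.splitlines() if line.strip()]


def match_sitemap(body: str, term: str,
                  fetch: Callable[[str], str]) -> list[str]:
    """Return the sitemap URLs whose fetched page contains `term`.

    `fetch` is supplied by the caller and maps a URL to the page text.
    """
    needle = term.lower()
    return [url for url in sitemap_urls(body) if needle in fetch(url).lower()]
```

A sitemap index file also uses `<loc>` elements, but there they point at further sitemaps rather than pages, so a fuller version would need to recurse into those.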