Plugin using ArXiv API

rpenco · April 17, 2024, 2:24pm

Hello, I am trying to develop a plugin that searches the ArXiv using their API. I am aware that there are issues crawling the main site, but my understanding is that the “export” version should work for this purpose. I am using the following code:

<plist version="1.0">
<dict>
	<key>CrawlDelay</key>
	<real>0.125</real>
	<key>Description</key>
	<string>Search bulk version of ArXiv</string>
	<key>EngineUrl</key>
	<string> https://export.arxiv.org/api/query?search_query=all:_agentQuery_&amp;start=_agentOffset_&amp;max_results=_agentNumber_</string>
	<key>Identifier</key>
	<string>export.arxiv.org.plugin</string>
	<key>Info</key>
	<string>ArXiv (bulk) Plugin</string>
	<key>LinksMatching</key>
	<string>*arxiv.org/abs*</string>
	<key>Name</key>
	<string>ArXiv (bulk)</string>
	<key>Operators</key>
	<integer>34867</integer>
	<key>ParseLinks</key>
	<true/>
	<key>Version</key>
	<string>1.0</string>
</dict>
</plist>

When I test it, however, it returns zero links, even though opening the queried addresses in a browser shows results in Atom format. I have removed arxiv from the excluded domains, but this didn’t solve the problem. Can anyone please help me troubleshoot this? Thanks!

cgrunenberg · April 18, 2024, 11:23am

There’s an unnecessary space right before https:. In addition, the results page is actually not HTML but XML (Atom), which requires the plugin to specify the keys ResultsKeyPath and LinksKeyPath, and optionally e.g. DescriptionKeyPath and TitleKeyPath. At least in theory: in this case it doesn’t work due to the MIME type. A future release should improve this.
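
Both issues mentioned above can be caught before loading a plugin. The following is a minimal sketch, not part of the application: the function name `check_plugin` and the trimmed-down sample plist are assumptions made for illustration, and the list of required keys is taken only from what this thread states.

```python
import plistlib

# Assumption: per this thread, these keys are needed when results are XML/JSON.
REQUIRED_FOR_XML = ("ResultsKeyPath", "LinksKeyPath")


def check_plugin(plist_bytes):
    """Return a list of human-readable problems found in a plugin plist."""
    plugin = plistlib.loads(plist_bytes)
    problems = []

    # A stray space before "https:" is preserved by the plist parser.
    url = plugin.get("EngineUrl", "")
    if url != url.strip():
        problems.append("EngineUrl has leading or trailing whitespace")

    for key in REQUIRED_FOR_XML:
        if key not in plugin:
            problems.append(f"missing key: {key}")
    return problems


# Trimmed-down copy of the originally posted plugin (illustration only).
ORIGINAL = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
	<key>EngineUrl</key>
	<string> https://export.arxiv.org/api/query?search_query=all:_agentQuery_&amp;start=_agentOffset_&amp;max_results=_agentNumber_</string>
	<key>Name</key>
	<string>ArXiv (bulk)</string>
</dict>
</plist>"""

for problem in check_plugin(ORIGINAL):
    print(problem)
```

Run against the originally posted plugin, this reports the whitespace in EngineUrl and the two missing key paths.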

cgrunenberg · April 18, 2024, 11:47am

Actually I was wrong, it is already possible:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>CrawlDelay</key>
	<real>0.125</real>
	<key>Description</key>
	<string>Search bulk version of ArXiv</string>
	<key>DescriptionsKeyPath</key>
	<string>summary</string>
	<key>EngineUrl</key>
	<string>https://export.arxiv.org/api/query?search_query=all:_agentQuery_&amp;start=_agentOffset_&amp;max_results=_agentNumber_</string>
	<key>Identifier</key>
	<string>export.arxiv.org.plugin</string>
	<key>Info</key>
	<string>ArXiv (bulk) Plugin</string>
	<key>LinksKeyPath</key>
	<string>link[0].href</string>
	<key>Name</key>
	<string>ArXiv (bulk)</string>
	<key>Operators</key>
	<integer>34867</integer>
	<key>ResultsKeyPath</key>
	<string>entry</string>
	<key>TitlesKeyPath</key>
	<string>title</string>
	<key>Version</key>
	<string>1.0</string>
</dict>
</plist>
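
To see what the key paths in this plugin point at, here is a rough Python analogy. It is only a sketch: how the application resolves key paths internally is not described in this thread, the sample feed is a made-up, heavily shortened Atom response, and `extract_results` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace

# Hypothetical, shortened response; real arXiv entries carry more fields.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Example paper</title>
    <summary>An example abstract.</summary>
    <link href="http://arxiv.org/abs/0000.00000v1" rel="alternate" type="text/html"/>
    <link href="http://arxiv.org/pdf/0000.00000v1" rel="related" type="application/pdf"/>
  </entry>
</feed>"""


def extract_results(feed_xml):
    """Mimic the plugin's key paths on an Atom feed (illustration only)."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.findall(ATOM + "entry"):  # ResultsKeyPath: entry
        links = entry.findall(ATOM + "link")
        results.append({
            "link": links[0].get("href"),  # LinksKeyPath: link[0].href
            "title": entry.findtext(ATOM + "title"),  # TitlesKeyPath: title
            "description": entry.findtext(ATOM + "summary"),  # DescriptionsKeyPath: summary
        })
    return results


for result in extract_results(SAMPLE_FEED):
    print(result["link"], "|", result["title"])
```

In this sample, the first link element of the entry is the abstract page, so `link[0].href` picks that one and not the PDF link.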

rpenco · April 18, 2024, 2:32pm

Thank you for your prompt reply and your help. I would never have figured this out by myself!

I had read the documentation before posting this question, but hadn’t appreciated that the keys ResultsKeyPath and LinksKeyPath were required for XML – they are listed together with DescriptionKeyPath and TitleKeyPath in the “JSON AND XML-SPECIFIC KEYS” section. I am wondering if it would make sense to move them to the section “NECESSARY KEYS” instead, with the added caveat that they are necessary for XML.

I will play around with your code and might post some follow-up questions if I think of improvements that I am unable to implement by myself.

Thanks again!

cgrunenberg · April 18, 2024, 2:37pm

These keys are indeed required for XML/JSON whereas DescriptionKeyPath & TitleKeyPath are optional. I have forwarded your suggestion.