Extracting PDFs from a website

I have just started using DEVONthink and DEVONagent, and I am trying to figure out exactly what is possible and the best way to use them.

Up until now I have extracted annual reports and presentations from company websites by downloading each one manually. I have just started to use DEVONthink to download all PDFs from a company website and put them straight into DEVONthink. This is a great time saver, but I am wondering if I can make the process more efficient by using DEVONagent.

Is it possible to have DEVONagent select all PDFs from a website with the words “annual report” or “10-K” in the title and then have all these PDFs imported into DEVONthink?

Also, where can I get more information about plugins for DEVONagent? I have read the manual, but I think I need more information before I can fully understand how plugins work. I noticed on this forum (http://www.devon-technologies.com/scripts/userforum/viewtopic.php?f=21&t=13148) that someone else was trying to use a plugin for the Edgar database. This would be useful for searching for, say, all the 10-K filings for a particular company. The plugin given in the example will find a company, but it does not seem to be able to find particular documents on the site. I am happy to try to amend it but need to be pointed in the right direction to learn exactly how to do it.

This should be scriptable. Do you want to download these PDF documents from one webpage or from a complete website (lots of webpages)? Could you please post a sample URL? Thanks!

The only available information right now is the manual (“Creating Your Own Plugins”, “XML Keys”).

From a complete website, for example: http://www.aigcorporate.com/investors/financial_reports.html

For example, on the AIG website above, I would like to download all 10-Q and 10-K filings from the Current tab, 2010, 2009 … 2005.

Many thanks for your help.

This seems to be just one dynamic page, so you could use this script:


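tell application "DEVONagent"
	try
		-- The script works on the page shown in the first browser window
		if not (exists browser 1) then error "No browser windows are open."
		set theSource to source of browser 1
		set theURL to URL of browser 1
		-- Collect the links matching "Form 10" (this covers both Form 10-K and Form 10-Q)
		set theLinks to get links of theSource base URL theURL containing "Form 10"
		-- Queue each matching link for download
		repeat with theLink in theLinks
			add download theLink
		end repeat
	on error error_message number error_number
		-- Error -128 means the user cancelled, so no alert is shown for it
		if the error_number is not -128 then display alert "DEVONagent" message error_message as warning
	end try
end tell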
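
For anyone who wants to try the same filtering idea outside DEVONagent, here is a minimal Python sketch. It is not DEVONagent's API and not how the script above is implemented; it only illustrates picking out PDF links whose link text contains words such as “10-K” or “annual report”. The names LinkCollector and matching_pdf_links, the sample HTML, and the example.com URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute URL, link text) pairs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def matching_pdf_links(html, base_url, keywords):
    """Return PDF URLs whose link text contains any keyword (case-insensitive)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    wanted = [k.lower() for k in keywords]
    return [
        url
        for url, text in collector.links
        if url.lower().endswith(".pdf") and any(k in text.lower() for k in wanted)
    ]


# Hypothetical page content, for illustration only
SAMPLE_HTML = """
<a href="docs/2010_10k.pdf">Form 10-K 2010</a>
<a href="docs/2010_q1.pdf">Form 10-Q Q1 2010</a>
<a href="docs/proxy.pdf">Proxy Statement</a>
<a href="/investors/index.html">Form 10-K overview</a>
"""

links = matching_pdf_links(
    SAMPLE_HTML,
    "http://www.example.com/investors/reports.html",
    ["10-K", "10-Q", "annual report"],
)
for link in links:
    print(link)
```

The sketch matches on the visible link text and on a “.pdf” ending in the URL. A real page may link its documents differently (for example through a download script without a “.pdf” ending), so the filter would need adjusting for each site.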

Another solution is to open the “Objects” pane of DEVONagent’s browser window and select the “Documents” scanner. Then Option-click each PDF document that you want to download.

Thank you. I will give it a try.

With regard to your suggestion of using DEVONagent, I don’t seem to get the results that I was expecting. Could you please describe in more detail how to do this? For example, what search criteria do you use? When I select the “Settings” tab (I assume that is what you mean), Option-clicking PDF documents does not seem to do anything different from just clicking them.

I realise these are probably basic questions, but I am new to DEVONagent.

Thank you for your help.

Open the page aigcorporate.com/investors/f … ports.html in a browser window, then choose View > Objects or press Cmd-Shift-B. Now select the “Documents” scanner; it’s the one selected in the screenshot, with 87 objects:

[Screenshot: Bildschirmfoto 2011-06-01 um 13.29.37.png]

Option-clicking on the documents adds them to the download manager.

Thank you. I can now see how it works. That is going to be really useful.