Extracting PDFs from a website

pkoak · May 22, 2011, 6:05pm

I have just started using DEVONthink and DEVONagent, and I am trying to figure out exactly what is possible and the best way to use them.

Up until now I have extracted annual reports and presentations from company websites by downloading each one manually. I have just started to use DEVONthink to download all PDFs from a company website and put them straight into DEVONthink. This is a great time saver, but I am wondering whether I can make the process more efficient by using DEVONagent.

Is it possible to have DEVONagent select all PDFs on a website with the words “annual report” or “10-K” in the title and then have all of these PDFs imported into DEVONthink?

Also, where can I get more information about plugins for DEVONagent? I have read the manual, but I think I need more information before I can fully understand how plugins work. I noticed on this forum (http://www.devon-technologies.com/scripts/userforum/viewtopic.php?f=21&t=13148) that someone else was trying to use a plugin for the EDGAR database. This would be useful for searching for, say, all the 10-K filings for a particular company. The plugin given in the example will find a company, but it does not seem to be able to find particular documents on the site. I am happy to try to amend it but need to be pointed in the right direction to learn exactly how to do so.

cgrunenberg · May 31, 2011, 10:06am

This should be scriptable. Do you want to download these PDF documents from one webpage or from a complete website (lots of webpages)? Could you please post a sample URL? Thanks!

The only available information right now is the manual (“Creating Your Own Plugins”, “XML Keys”).

pkoak · May 31, 2011, 5:38pm

From a complete website, for example: http://www.aigcorporate.com/investors/financial_reports.html

For example, on the AIG website above, I would like to download all 10-Q and 10-K filings from the Current tab and from 2010, 2009 … 2005.

Many thanks for your help.

cgrunenberg · June 1, 2011, 10:02am

This seems to be just one dynamic page, so you could use this script:


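tell application "DEVONagent"
	try
		-- The script reads the page currently loaded in browser 1
		if not (exists browser 1) then error "No browser windows are open."
		set theSource to source of browser 1
		set theURL to URL of browser 1
		-- Collect the links on the page that match "Form 10" (covers 10-K and 10-Q)
		set theLinks to get links of theSource base URL theURL containing "Form 10"
		-- Add each matching link as a download
		repeat with theLink in theLinks
			add download theLink
		end repeat
	on error error_message number error_number
		-- Error -128 means the user cancelled, so no alert is shown for it
		if the error_number is not -128 then display alert "DEVONagent" message error_message as warning
	end try
end tell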
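
If the links need to match more than one label, for example “annual report” as well as “10-K”, a possible variation is to loop over a list of search strings and reuse the same two commands. This is an untested sketch: the strings in theKeywords are assumptions about how the links on a given page are labelled, and whether the match is case-sensitive is not documented here, so adjust them to the page's actual wording. It also assumes that each returned link can be coerced to a string.

```applescript
tell application "DEVONagent"
	try
		if not (exists browser 1) then error "No browser windows are open."
		set theSource to source of browser 1
		set theURL to URL of browser 1
		-- Assumed search strings; change them to match the link labels on the page
		set theKeywords to {"Annual Report", "10-K"}
		-- Links already added, so a link matching two keywords is only added once
		set theQueued to {}
		repeat with theKeyword in theKeywords
			set theLinks to get links of theSource base URL theURL containing (theKeyword as string)
			repeat with theLink in theLinks
				-- Assumption: each link is a URL that can be coerced to a string
				set linkURL to theLink as string
				if theQueued does not contain linkURL then
					set end of theQueued to linkURL
					add download linkURL
				end if
			end repeat
		end repeat
	on error error_message number error_number
		-- Error -128 means the user cancelled, so no alert is shown for it
		if the error_number is not -128 then display alert "DEVONagent" message error_message as warning
	end try
end tell
```

As with the script above, run it while the page is open in a DEVONagent browser window.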

Another solution is to open the “Objects” pane of DEVONagent’s browser window and to select the “Documents” scanner. Then Option-click on all PDF documents which should be downloaded.

pkoak · June 1, 2011, 10:06am

Thank you. I will give it a try.

pkoak · June 1, 2011, 10:20am

With regard to your suggestion of using DEVONagent, I don’t seem to get the results that I was expecting. Could you please describe in more detail how to do this? For example, what search criteria do you use when you select the “settings” tab (I assume that is what you mean)? Option-clicking PDF documents did not seem to make any difference compared with just clicking them.

I realise these are probably basic questions but I am new to DevonAgent.

Thank you for your help.

cgrunenberg · June 1, 2011, 11:31am

Open the page aigcorporate.com/investors/f … ports.html in a browser window, then choose View > Objects or press Cmd-Shift-B. Now select the “Documents” scanner; it’s the one selected in the screenshot, with 87 objects:

[Screenshot: Bildschirmfoto 2011-06-01 um 13.29.37.png]

Option-clicking on the documents adds them to the download manager.

pkoak · June 1, 2011, 11:49am

Thank you. I can now see how it works. That is going to be really useful.