Web scraping catalog sites

I need to get images and their corresponding information fields from several catalog sites and put them into a database or spreadsheet. My plan is to import the sites, extract the images embedded in the HTML and import the image files, then extract the text from each product page. I expect this is doable if the pages share a standard structure, which is likely. I have a vague intuition that DTPro and AppleScript could help with this, but I’m not familiar with the kind of processing they allow. I would appreciate any tips from those who use DTPro in a similar way, and any pointers to AppleScripts or plugins that could be used or modified to perform this task in DTPro in a more automated fashion than I could manage with a simple web browser. Thanks in advance for any insights.

Extracting embedded images is easy; everything else is probably much more complicated, depending on the sites and your actual goals:


tell application id "DNtp"
	set theURL to "http://..."
	set theSource to download markup from theURL
	set theLinks to get embedded images of theSource base URL theURL
		
	repeat with thisLink in theLinks
		-- Insert code here
	end repeat
end tell

Thanks! This will get me started, for sure.

I tried various things but am unable to import the images.
“download URL thisLink” only puts the data into the AppleScript result. How do I save it as a JPG?
“import URL thisLink” requires a POSIX path, which does not exist; I only have a URL. I don’t know how to get the images into DTP or how to specify where they are saved. Any help appreciated.

For example, this script will add all of the images on DEVONtechnologies’ home page to the DEVONthink download queue (Window > Download Manager). From there you can choose the items you want and their destination, either in DEVONthink or in the Download folder in the file system.

tell application id "DNtp"
	set theURL to "http://www.devontechnologies.com"
	set theSource to download markup from theURL
	set theLinks to get embedded images of theSource base URL theURL
	
	repeat with thisLink in theLinks
		set theResult to add download thisLink
	end repeat
end tell
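
If you would rather skip the Download Manager, something along the following lines might work. It is only an untested sketch: it assumes that the data returned by “download URL” can be assigned to the “data” property of a newly created picture record, and that the current group is an acceptable destination. The variable names (theGroup, linkText, theName) are made up for illustration, and the record name is simply the last path component of the image URL, so it may include a query string on some sites.

```applescript
tell application id "DNtp"
	set theURL to "http://www.devontechnologies.com"
	set theSource to download markup from theURL
	set theLinks to get embedded images of theSource base URL theURL
	-- Assumption: store the images in whatever group is currently selected
	set theGroup to current group
	
	repeat with thisLink in theLinks
		set linkText to thisLink as string
		
		-- Use the last path component of the URL as the record name
		set AppleScript's text item delimiters to "/"
		set theName to last text item of linkText
		set AppleScript's text item delimiters to ""
		
		try
			-- Fetch the raw image data instead of queueing a download
			set theData to download URL linkText
			-- Assumption: a picture record accepts the downloaded data as-is
			set theRecord to create record with {name:theName, type:picture, URL:linkText} in theGroup
			set data of theRecord to theData
		on error errMsg
			-- Record the failure in DEVONthink's log and continue with the next image
			log message "Could not import " & linkText info errMsg
		end try
	end repeat
end tell
```

Keeping the source URL on each record should make it easier to match an image to its product page later, when you extract the text fields.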