Download website with subpages as PDF

Hi,

I know I can import websites with their own file structure, but that’s not what I want, as the result has poor usability (lots of folders and files).

I want to download a page and its subpages (2 levels) as single PDF files. I can do this manually by surfing inside DT3 and downloading each page via the wheel menu.

Is there already a way to automate this out of the box or do I have to write an AppleScript?

Another possibility might be to download the pages via the download manager and then convert them to PDF.

Thanks. It does not work, as the download manager does not seem to retain cookies and sessions.

The session of your default browser or the session inside DEVONthink?

Inside DEVONthink. And there are other problems, like stylesheet paths that are not translated to the offline file system.

Is it possible to use AppleScript inside DT3 to navigate through the site and download the webviews as PDF or should I automate Safari for this?

Why not use curl or wget? They’re made for just this: downloading webpages recursively.
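
For reference, a minimal sketch of that route, driven from AppleScript via do shell script. It assumes wget is installed; the Homebrew path and the URL are placeholders:

-- mirror a site two levels deep into ~/Downloads, rewriting links for offline use
do shell script "/opt/homebrew/bin/wget --recursive --level=2 --page-requisites --convert-links --directory-prefix=$HOME/Downloads https://example.com/"

wget can also reuse exported cookies via --load-cookies, though neither tool executes JavaScript.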


Can they deal with logins, sessions, and cookies? The login of my CMS does not use HTTP auth.

Did you try to log in inside DEVONthink and then use the download manager?

Yes. The CMS seems to be problematic, as SiteSucker has the same problems. So I’ll try it with AppleScript.

But how do I save the PDF object of a window (it shows a browser)? This does not work:

tell application id "DNtp"
	tell viewer window 1
		set theTitle to name
		set contentAsPDF to PDF
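		-- this fails: 'path to downloads folder' returns an alias, so '&' with a string
		-- builds a list rather than a path, and 'save' does not accept raw PDF data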
		set theFile to (path to downloads folder) & theTitle & ".pdf"
		save contentAsPDF in theFile
	end tell
end tell

When I use create PDF document from URL, I’m not logged in.

As this script could only be used while browsing on your own, what is or should be the advantage compared to using Data > Capture > PDF (also available via the navigation bar)?

Well, in theory I can parse the HTML for subpages, open them, and save them automatically.
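
Conceptually something like this — a sketch based on the open tab for command and the loading property of tabs in DEVONthink’s dictionary; the link list is a placeholder:

tell application id "DNtp"
	-- placeholder list standing in for the subpage URLs parsed from the HTML
	set theLinks to {"https://example.com/a", "https://example.com/b"}
	repeat with theURL in theLinks
		set theTab to open tab for URL (theURL as string) in viewer window 1
		repeat while loading of theTab -- naive wait until the page has loaded
			delay 0.5
		end repeat
		set thePDF to PDF of theTab -- raw data of the rendered page, still to be saved
		close theTab
	end repeat
end tell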

I’ve found a solution using plain AppleScript.

tell application id "DNtp"
	tell viewer window 1
		set theTitle to name
		set contentAsPDF to PDF -- raw data of the rendered page
		set theURL to URL
		-- write the raw data with StandardAdditions' File Read/Write commands
		set theFile to open for access ((path to downloads folder as string) & theTitle & ".pdf") with write permission
		try
			write contentAsPDF to theFile
			close access theFile
		on error
			close access theFile
		end try
	end tell
end tell

Is there a way to write directly into the database (current group), or do I have to import the file and delete it from the source folder?

E.g. like this:

tell application id "DNtp"
	set theGroup to current group
	tell think window 1
		set theTitle to name
		set contentAsPDF to PDF
		set theURL to URL
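		-- create the record directly in the target group, then fill it with the raw PDF data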
		set theRecord to create record with {name:theTitle, type:PDF document, URL:theURL} in theGroup
		set data of theRecord to contentAsPDF
	end tell
end tell

Thanks a lot for your live support. That’s really great.


Well, I’ve found the reason why the web downloader fails. The HTML source does not contain the links, as the content seems to be generated by JavaScript after the URL is loaded. This seems to be a general problem in DT, as in AppleScript the source property also does not contain the current DOM (so I can’t parse it for links). That’s interesting, as the PDF does contain the full content generated by JavaScript.

If I understand you correctly, the PDF exported to DT from the browser is OK, while the website downloaded with AppleScript in DT and then converted to PDF is missing the dynamically generated elements?
In that case, wget or curl wouldn’t help either. You need a browser to execute the JS code. AppleScript is so old, it has no idea of JavaScript or the DOM. You might have more luck with AppleScriptObjC, perhaps using a WebView.
Or writing a bookmarklet in JavaScript? Although I don’t quite see how all that would work.

Yes, you understand it correctly. I now temporarily save the web archive (which is also complete), get its source, and delete the record. Then I can extract all links.
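
Roughly like this — a sketch assuming the web archive property of think windows, the webarchive record type, and the get links of command from DEVONthink’s dictionary (the temporary record name is arbitrary):

tell application id "DNtp"
	tell think window 1
		set theURL to URL
		set theArchive to web archive -- raw data of the page as rendered, including JS output
	end tell
	-- store the archive in a temporary record so its source can be read
	set tempRecord to create record with {name:"temp", type:webarchive, URL:theURL} in current group
	set data of tempRecord to theArchive
	set theSource to source of tempRecord
	delete record tempRecord
	-- extract all links from the rendered HTML
	set theLinks to get links of theSource base URL theURL
end tell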

As I load all pages into the view before saving them as PDF, I can also delete unneeded elements with JavaScript. That’s great, and the results are better than expected.
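
For example, stripping navigation chrome before the capture — a sketch using DEVONthink’s do JavaScript command, where the selectors are site-specific assumptions:

tell application id "DNtp"
	-- remove elements that shouldn't end up in the PDF
	do JavaScript "document.querySelectorAll('nav, footer, .sidebar').forEach(function (e) { e.remove(); });" in think window 1
	tell think window 1
		set contentAsPDF to PDF -- capture the cleaned-up page
	end tell
end tell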