Download website with subpages as PDF

Hi,

I know I can import websites with their own file structure, but that’s not what I want, as the result has poor usability (lots of folders and files).

I want to download a page and its subpages (2 levels) as single PDF files. I can do this manually by surfing inside DT3 and downloading each page via the wheel menu.

Is there already a way to automate this out of the box or do I have to write an AppleScript?

Another possibility might be to download the pages via the download manager and then convert them to PDF.

Thanks. It does not work, as the download manager does not seem to retain cookies and sessions.

The session of your default browser or the session inside DEVONthink?

Inside DEVONthink. And there are other problems, like stylesheet paths that are not translated to the offline file system.

Is it possible to use AppleScript inside DT3 to navigate through the site and download the webviews as PDF or should I automate Safari for this?

Why not use curl or wget? They’re made for just this: downloading webpages recursively.
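
For reference, a minimal sketch of that route, driven from AppleScript via do shell script. It assumes wget is installed; the Homebrew path and the URL are placeholders:

-- mirror a site two levels deep into ~/Downloads, rewriting links for offline use
do shell script "/opt/homebrew/bin/wget --recursive --level=2 --page-requisites --convert-links --directory-prefix=$HOME/Downloads https://example.com/"

wget can also reuse exported cookies via --load-cookies, though neither tool executes JavaScript.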


Can they deal with logins, sessions, and cookies? The login of my CMS does not use HTTP auth.

Did you try to log in inside DEVONthink and then use the download manager?

Yes. The CMS seems to be problematic, as SiteSucker has the same problems. So I’ll try it with AppleScript.

But how do I save the PDF object of a window (it shows a browser)? This does not work:

tell application id "DNtp"
	tell viewer window 1
		set theTitle to name
		set contentAsPDF to PDF
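		-- this fails: 'path to downloads folder' returns an alias, so '&' with a string
		-- builds a list rather than a path, and 'save' does not accept raw PDF data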
		set theFile to (path to downloads folder) & theTitle & ".pdf"
		save contentAsPDF in theFile
	end tell
end tell

When I use create PDF document from URL, I’m not logged in.

As this script could only be used while browsing on your own, what is or should be the advantage compared to using Data > Capture > PDF (also available via the navigation bar)?

Well, in theory I can parse the HTML for subpages, open them, and save them automatically.
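
Conceptually something like this — a sketch based on the open tab for command and the loading property of tabs in DEVONthink’s dictionary; the link list is a placeholder:

tell application id "DNtp"
	-- placeholder list standing in for the subpage URLs parsed from the HTML
	set theLinks to {"https://example.com/a", "https://example.com/b"}
	repeat with theURL in theLinks
		set theTab to open tab for URL (theURL as string) in viewer window 1
		repeat while loading of theTab -- naive wait until the page has loaded
			delay 0.5
		end repeat
		set thePDF to PDF of theTab -- raw data of the rendered page, still to be saved
		close theTab
	end repeat
end tell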

I’ve found a solution using plain AppleScript.

tell application id "DNtp"
	tell viewer window 1
		set theTitle to name
		set contentAsPDF to PDF -- raw data of the rendered page
		set theURL to URL
		-- write the raw data with StandardAdditions' File Read/Write commands
		set theFile to open for access ((path to downloads folder as string) & theTitle & ".pdf") with write permission
		try
			write contentAsPDF to theFile
			close access theFile
		on error
			close access theFile
		end try
	end tell
end tell

Is there a way to write directly into the database (current group), or do I have to import the file and delete it from the source folder?

E.g. like this:

tell application id "DNtp"
	set theGroup to current group
	tell think window 1
		set theTitle to name
		set contentAsPDF to PDF
		set theURL to URL
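		-- create the record directly in the target group, then fill it with the raw PDF data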
		set theRecord to create record with {name:theTitle, type:PDF document, URL:theURL} in theGroup
		set data of theRecord to contentAsPDF
	end tell
end tell

Thanks a lot for your live support. That’s really great.


Well, I’ve found the reason why the web downloader fails. The HTML source does not contain the links, as the content seems to be generated by JavaScript after the URL is loaded. This seems to be a general problem in DT, as in AppleScript the source property also does not contain the current DOM (so I can’t parse it for links). That’s interesting, as the PDF does contain the full content generated by JavaScript.

If I understand you correctly, the PDF exported to DT from the browser is OK, while the website downloaded with AppleScript in DT and then converted to PDF is missing the dynamically generated elements?
In that case, wget or curl wouldn’t help either. You need a browser to execute the JS code. AppleScript is so old, it has no idea of JavaScript or the DOM. You might have more luck with AppleScriptObjC, perhaps using a WebView.
Or writing a bookmarklet in JavaScript? Although I don’t quite see how all that would work.

Yes, you understand it correctly. I now temporarily save the web archive (which is also complete), get its source, and delete the record. Then I can extract all links.
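
Roughly like this — a sketch assuming the web archive property of think windows, the webarchive record type, and the get links of command from DEVONthink’s dictionary (the temporary record name is arbitrary):

tell application id "DNtp"
	tell think window 1
		set theURL to URL
		set theArchive to web archive -- raw data of the page as rendered, including JS output
	end tell
	-- store the archive in a temporary record so its source can be read
	set tempRecord to create record with {name:"temp", type:webarchive, URL:theURL} in current group
	set data of tempRecord to theArchive
	set theSource to source of tempRecord
	delete record tempRecord
	-- extract all links from the rendered HTML
	set theLinks to get links of theSource base URL theURL
end tell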

As I load all pages into the view before saving them as PDF, I can also delete unneeded elements with JavaScript. That’s great, and the results are better than expected.
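
For example, stripping navigation chrome before the capture — a sketch using DEVONthink’s do JavaScript command, where the selectors are site-specific assumptions:

tell application id "DNtp"
	-- remove elements that shouldn't end up in the PDF
	do JavaScript "document.querySelectorAll('nav, footer, .sidebar').forEach(function (e) { e.remove(); });" in think window 1
	tell think window 1
		set contentAsPDF to PDF -- capture the cleaned-up page
	end tell
end tell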