Is capturing PDFs from developer.apple.com broken?

A delay might help in some cases. But it’s not a guarantee that the page is fully loaded, and it did not help against lazily loaded parts of the page.
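One alternative to a fixed delay is to poll the page height and continue only once it has stopped changing. The following is a minimal sketch under that assumption, not something anyone in this thread is using; `heightIsStable` and `waitForStableHeight` are hypothetical names, and a stable height is still no proof that every lazily loaded part has arrived.

```javascript
// Hypothetical helper: true when the last `needed` height samples are identical.
function heightIsStable(samples, needed = 3) {
  if (samples.length < needed) return false;
  const tail = samples.slice(-needed);
  return tail.every(h => h === tail[0]);
}

// Browser-only sketch: sample the document height until it settles or we give up.
async function waitForStableHeight({ intervalMs = 500, maxTries = 20 } = {}) {
  const samples = [];
  for (let i = 0; i < maxTries; i++) {
    samples.push(document.body.scrollHeight);
    if (heightIsStable(samples)) return true;
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  return false; // still changing; the caller decides whether to export anyway
}
```

The interval and retry count are arbitrary starting values and would likely need tuning per site.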


Maybe to give you an idea: here’s what I’m doing to capture LinkedIn pages. I run the following AppleScript (triggered from a Python script that works through a list of the profiles I want to capture, read from a .csv file). That script calls a JavaScript at some point to ‘clean’ and scroll the LinkedIn page before running “Export as PDF” from Safari. That’s what delivers the best results so far… (I’ll be creating a GitHub repository for these)

The command-shift-option-s keystroke in that script is my shortcut for “Export as PDF”, which at some point seemed to work more reliably than clicking the menu bar (but that might have been due to other issues).

on run argv
	set contactInfo to missing value
	set theCurrentDirectory to do shell script "pwd"
	set theScriptFile to (POSIX file (theCurrentDirectory & "/save_profiles.js") as alias)
	set theScript to read theScriptFile
	
	tell application "Safari"
		activate
		if front window exists then
			close tabs of front window
		end if
		open location (item 1 of argv)
		delay 5
		set bounds of front window to {0, 0, 1400, 1000}
		
		tell front document
			-- set readyState to ""
			
			set triedState to 0
			repeat
				set readyState to do JavaScript "document.readyState"
				if readyState is "complete" or readyState is "interactive" then exit repeat
				if triedState is greater than 20 then tell me to error "Page is not loading" number 1
				set triedState to triedState + 1
				delay 1
			end repeat
			
			do JavaScript theScript
			
			set triedInfo to 0
			repeat
				set contactInfo to do JavaScript "document.contactInfo"
				if contactInfo is not missing value and contactInfo is not equal to "empty" then exit repeat
				if triedInfo is greater than 5 then tell me to error "Contact info not found" number 1
				set triedInfo to triedInfo + 1
				delay 2
			end repeat
			
			set triedExport to 0
			repeat
				set readyExport to do JavaScript "document.readyExport"
				if readyExport is "complete" then exit repeat
				if triedExport is greater than 30 then tell me to error "Something went wrong cleaning the page" number 1
				set triedExport to triedExport + 1
				delay 1
			end repeat
			
		end tell
	end tell
	
	tell application "Safari"
		activate
	end tell
	
	delay 0.5
	
	tell application "System Events"
		tell process "Safari"
			tell group 2 of toolbar 1 of front window to ¬
				repeat until exists (first button where its accessibility description = "Reload this page")
					delay 0.5
					keystroke "." using {command down}
				end repeat
			delay 0.5
			keystroke "s" using {command down, shift down, option down}
			repeat until exists sheet 1 of window 1
				delay 1
			end repeat
			if (count of argv) is equal to 3 then
				keystroke "g" using {command down, shift down}
				repeat until exists sheet 1 of sheet 1 of window 1
					delay 0.02
				end repeat
				delay 0.5
				tell sheet 1 of sheet 1 of window 1
					set value of combo box 1 to (item 3 of argv)
					click button "Go"
				end tell
				delay 2
			end if
			
			delay 3
			
			tell sheet 1 of window 1
				set value of text field 1 to (item 2 of argv)
				-- click button "Save"
			end tell
			delay 0.5
			key code 36
			delay 5
		end tell
	end tell
	
	tell application "Safari"
		close tabs of front window
	end tell
	
	return contactInfo
	
end run

And this is the JavaScript used (save_profiles.js):

async function sleep(ms) {
	return new Promise(resolve => setTimeout(resolve, ms));
}

async function processPage() {
	document.readyExport = "waiting";
	document.contactInfo = "empty";

	// acceptCookies
	const buttons = Array.from(document.querySelectorAll('button'));
	const cookieButton = buttons.find(e => e.innerText.includes('Accept cookies'));
	if (cookieButton) {
	    cookieButton.click();
	}

	// cleanPage
	ads = document.getElementsByClassName('scaffold-layout__ad')[0]; if (ads) { ads.remove() }
	ad = document.getElementsByClassName('ad-banner-container')[2]; if(ad){ad.style.display='none'}
	layout = document.getElementsByClassName('scaffold-layout__row scaffold-layout__content scaffold-layout__content--main-aside')[0]; if(layout){layout.className = 'scaffold-layout__content--main'}
	layout = document.getElementsByClassName('scaffold-layout__inner scaffold-layout-container')[0]; if(layout){layout.className = 'scaffold-layout__content--main'}
	footer = document.getElementsByClassName('global-footer')[0]; if(footer){footer.style.display='none'}
	overlay = document.getElementById('msg-overlay'); if(overlay){overlay.style.display = 'none'}
	nav = document.getElementById('global-nav'); if(nav){nav.style.display = 'none'}
	sticky = document.getElementsByClassName('pv-profile-sticky-header'); if(sticky){ for(i=0;i<sticky.length;i++) { sticky[i].style.display = 'none'} }
	aside = document.getElementsByClassName('scaffold-layout__aside')[0]; if(aside){aside.style.display = 'none'; }
	main = document.getElementById('main'); if(main){main.style.width = '100%' }
	outlet = document.getElementsByClassName('authentication-outlet')[0]; if(outlet){outlet.style.paddingTop = "0px"; }
	bottom = document.getElementsByClassName('scaffold-layout scaffold-layout--breakpoint-none scaffold-layout--main-aside scaffold-layout--static')[0]; if(bottom){bottom.style.marginBottom = "0px"; }

	// getContactInfo
	document.contactInfo = "empty";
	contact_info = { 'email': [], 'phone': [], 'website': [], 'twitter': [], 'connected': "", 'photo_url': "" };

	ci_link = document.querySelector('a[href*="/detail/contact-info"]');
	await sleep(500);
	if(ci_link) {
		ci_link.click();
		await sleep(1500);

		el_emails = document.querySelectorAll('section[class*="email"] a[class*="pv-contact-info__contact-link"]');
		for(i=0; i < el_emails.length; i++) {
			email = el_emails[i].href.substring(7);
			contact_info['email'].push(email);
		}

		el_phones = document.querySelectorAll('section[class*="ci-phone"] li[class*="pv-contact-info"] span[class~="t-black"]');
		for(i=0; i < el_phones.length; i++) {
			phone = el_phones[i].innerText.trim();
			contact_info['phone'].push(phone);
		}
		
		el_websites = document.querySelectorAll('section[class*="ci-websites"] li[class*="link"] a[class*="pv-contact-info__contact-link"]');
		for(i=0; i < el_websites.length; i++) {
			website = el_websites[i].href;
			contact_info['website'].push(website);
		}

		el_twitters = document.querySelectorAll('section[class*="ci-twitter"] li[class*="pv-contact-info"] a[class*="pv-contact-info__contact-link"]');
		for(i=0; i < el_twitters.length; i++) {
			twitter = el_twitters[i].href;
			contact_info['twitter'].push(twitter);
		}

		el_connected = document.querySelector('section[class*="ci-connected"] span[class~="t-black"]');
		if(el_connected) {
			contact_info['connected'] = el_connected.innerText;
		}
	
		el_photo_url = document.querySelector('img[class*="pv-top-card__photo"]'); 
		if(el_photo_url) {
			contact_info['photo_url'] = el_photo_url.src;
		}
			
		ci_close = document.querySelector('button[data-test-modal-close-btn]');
		if(ci_close) {
			ci_close.click();
		}
	
		document.contactInfo = JSON.stringify(contact_info);
	}

	// scrollPage
	for(i=0; i < 7; i++) {
		window.scrollTo(0,i*document.body.scrollHeight/6);
		await sleep(500);
	}


	// expandSections
	elements = document.querySelectorAll(`button[aria-expanded="false"][aria-controls*="expandable-content"]:not([class*="global-nav"])`);
	for (i=0; i < elements.length; i++) {
		elements[i].click();
		await sleep(350);
	}

	// expandProfiles
	elements = document.querySelectorAll(`button[aria-expanded="false"][class*="-section"],button[aria-expanded="false"][aria-controls*="recommendation-list"],button[aria-expanded="false"][aria-controls*="skill-categories"]`);
	// elements = document.querySelectorAll(`button[aria-expanded="false"][class*="pv-profile-section__see-more"],button[aria-expanded="false"][class*="pv-skills-section__additional-skills"]`);
	while(elements.length > 0) {
		for (i=0; i < elements.length; i++) {
			elements[i].click();
			await sleep(350);
		}
		elements = document.querySelectorAll(`button[aria-expanded="false"][class*="-section"],button[aria-expanded="false"][aria-controls*="recommendation-list"],button[aria-expanded="false"][aria-controls*="skill-categories"]`);
	}

	// scrollPage
	for(i=0; i < 6; i++) {
		window.scrollTo(0,i*document.body.scrollHeight/10);
		await sleep(500);
	}

	// expandSeeMoreText
	elements = document.querySelectorAll(`button[aria-expanded="false"][class*="inline-show-more-text__button"]:not([class*="global-nav"])`);
	for (i=0; i < elements.length; i++) {
		elements[i].click();
		await sleep(100);
	}

	window.scrollTo(0,document.body.scrollHeight);
	await sleep(500)
	window.scrollTo(0,0);

	document.readyExport = "complete";

	return contact_info
}

processPage();

OK :laughing: Great work!

Something like this shouldn’t be necessary for Apple Developer URLs. I just looked it up: I have over 500 PDFs, and I definitely wouldn’t have captured them the way I do if it had been that difficult in the past.

Did you try printing the HTML doc to PDF? That might be more reliable than capturing them with DT’s tool.

Yes, it of course works but the result doesn’t look as good as the „real“ PDF

The next release will improve clipping paginated & single-page PDF documents, the results should be better especially in case of dynamic websites.


Excellent to hear @cgrunenberg - looking forward to the improvements!

I’ve been meaning to write that I’ve also started using the Export to PDF facility in Safari, after discovering recently that it started working again to produce unpaginated PDFs [1]. The output has been excellent, and the approach has the advantage of saving what you actually see in your Safari window (e.g., allowing ad blockers to do their thing). Your method adds scrolling the page in Safari before the export, which is a good idea for getting lazy-loaded elements to appear.
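The scroll-before-export idea can be sketched roughly as follows. This is an illustration only, not the script posted above: `scrollStops` and `scrollThroughPage` are hypothetical names, and the step count and pause are assumptions.

```javascript
// Hypothetical helper: evenly spaced scroll offsets from the top (0) to the bottom.
function scrollStops(pageHeight, steps = 6) {
  const stops = [];
  for (let i = 0; i <= steps; i++) {
    stops.push(Math.round((i * pageHeight) / steps));
  }
  return stops;
}

// Browser-only sketch: visit each stop, pause so lazy content can load,
// then return to the top before exporting.
async function scrollThroughPage(pauseMs = 500) {
  for (const y of scrollStops(document.body.scrollHeight)) {
    window.scrollTo(0, y);
    await new Promise(resolve => setTimeout(resolve, pauseMs));
  }
  window.scrollTo(0, 0);
}
```

The page height is read once at the start, so a page that grows while scrolling may need a second pass.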

[1] I used to use the Safari export, then for some years had to stop because it no longer produced single-page PDFs – in fact, for a while I used a free utility called Paparazzi to get single-page PDFs, back in the days when I used Evernote (and I even wrote a utility to automate the process). Then at some point, the capability in Safari must have been restored, because I noticed only recently that Export to PDF works as hoped, at least on macOS 10.13.6 (which I know is ancient, and I don’t know if it works the same on later macOS versions). I haven’t changed my DEVONthink code yet, though.

I’m looking forward to trying out the new capabilities in the upcoming new version of DEVONthink mentioned by @cgrunenberg !

URLs of problematic dynamic websites not requiring a login would be great, thanks.

Could someone please check whether Apple Developer URLs display correctly in DEVONthink? (Don’t want to create a new thread and the question is somehow related to this one)

  1. Create a bookmark for https://developer.apple.com/documentation/foundation/nsstring?language=objc

  2. Open bookmark

What do you see?

That’s with and without JS activated (I generally have JS off in DT; turning it on and refreshing made no difference, the result remained the same).

Thanks! Seems like it’s necessary to restart DEVONthink before the change takes effect.

Here’s what I’ve been seeing for some time now. When the page loads it’s white for a short time and then I see this:

If I scroll down I get this:

I’ve no idea what happens.

After capturing via Safari stopped working reliably (I suspect Apple’s change to the site’s layout around that time broke it), I developed a very nice script that I used for some weeks to capture from within DEVONthink. This worked perfectly, it was almost too good. Now it’s useless as I can’t properly view Apple URLs anymore. No idea why.

I can capture the website you named, both as HTML and as a beautifully formatted PDF (both from Safari 14.1.2 using sorter


The problem is that capturing Apple Developer documentation via Safari doesn’t work reliably (anymore) over here. I used to capture this way and it almost always worked, but at some point it became very unreliable. Sometimes it took five or more attempts until I got a proper PDF :frowning:

No idea why you can view the site and I can’t. I just rebooted and didn’t open any app but DEVONthink. Still the same result.

(Sorry, yes, I’ve just scrolled up and read through this post, so I realise capturing successfully once is not worth much). For what it’s worth, I’ve just restarted DT (for you - only for you, Pete - I’ll have to reopen all my databases and enter all my passwords :stuck_out_tongue_winking_eye: [Edit: oh heavens, there’s a risk you won’t understand my humour - nobody ever does - so I’ve replaced “…” with “:stuck_out_tongue_winking_eye:”]) with JS active, and the bookmark loads no problem:

Is something blocking things at your end? Pi-hole etc.?

Again, for what it’s worth, I’m on macOS 11.5.1


Wow, thank you! No worries, understood it :grinning:

There‘s nothing blocking as I turned off AdGuard‘s „start after login“.

I have a suspicion, but it’s probably very unrealistic. The script does the following:

  • extracts all links from a PDF
  • searches if a URL is already in the database
    • if yes: replicate the record into a new group
    • if no: create a bookmark in this new group

This way I can select e.g. the NSString class’s PDF and get everything I already have and everything I could capture. I then look through the created bookmarks and capture what looks promising and afterwards delete the group.
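The decision step of that workflow can be sketched as below. This is only an illustration of the logic, written in JavaScript for consistency with the code earlier in the thread; the actual script is not shown here, `planCaptures` is a hypothetical name, and looking up URLs in a DEVONthink database is left out entirely.

```javascript
// Hypothetical sketch: split a PDF's links into URLs that are already in the
// database (to be replicated) and URLs that are not (to become bookmarks).
function planCaptures(links, knownUrls) {
  const known = new Set(knownUrls);
  const seen = new Set();
  const plan = { replicate: [], bookmark: [] };
  for (const url of links) {
    if (seen.has(url)) continue; // skip duplicate links within the same PDF
    seen.add(url);
    (known.has(url) ? plan.replicate : plan.bookmark).push(url);
  }
  return plan;
}

// Placeholder URLs, for illustration only.
const plan = planCaptures(
  ["https://example.com/a", "https://example.com/b", "https://example.com/a"],
  ["https://example.com/b"]
);
console.log(plan.replicate.length, plan.bookmark.length);
```

Exact string comparison is an assumption; URLs that differ only in a query parameter or fragment would count as different here.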

The problem now could be (but again, probably very unrealistic) that I not only delete the group when I’m done. Quite often I deleted it just to get an updated state (in order to prevent capturing the same URL twice), i.e. see replicants of newly captured PDFs instead of the bookmarks. As this worked so well I captured some hundred PDFs and created a lot more bookmarks in the process. Did I maybe violate the terms of use? Doesn’t make any sense, does it? I can view the site in Safari …

Open the bookmark… in DEVONthink?

Light Mode

Dark Mode - Use dark background for documents enabled

Dark Mode - Use dark background for documents disabled

Do you have anything specified in Preferences > Web > Style Sheet?


PDF Captured from Safari 14.1.2 in Catalina and Big Sur…

Yes. It displays correctly in Safari and shows this strange behavior in DEVONthink.

No.

Only difference between your (and Blanc‘s) setup and mine seems to be

  • the macOS as I‘m still on Mojave.
  • I created a lot of bookmarks in a short time

I have confirmed the behavior on macOS Mojave.

I’m curious why you’re not upgrading your OS.


Thank you very much!

I didn’t upgrade as I’m still using a mid-2012 MacBook Pro (you used one too, I think) and Catalina was not really what I wanted to use. I’ll get a new Mac, but not now, as I don’t like to use the first generation (M1) of anything. I’m also hoping that a coming MacBook will support multiple external monitors without additional hardware.
