A delay might help in some cases, but it doesn’t guarantee that the page is fully loaded, and it didn’t help with lazily loaded parts of the page.
Maybe to give you an idea: here’s what I’m doing to capture LinkedIn pages. I run the following AppleScript (triggered from another Python script that works through a list of profiles to capture from a .csv file). That script calls some JavaScript at one point to ‘clean’ and scroll the LinkedIn page before running “Export as PDF” from Safari. That’s what delivers the best results so far… (I’ll be creating a GitHub repository for these)
The command-shift-option-s keystroke in that script is my shortcut for “Export as PDF”, which seemed to work more reliably than clicking the menu bar at some point (but that might have been due to other issues).
on run argv
  set contactInfo to missing value
  set theCurrentDirectory to do shell script "pwd"
  set theScriptFile to (POSIX file (theCurrentDirectory & "/save_profiles.js") as alias)
  set theScript to read theScriptFile
  tell application "Safari"
    activate
    if front window exists then
      close tabs of front window
    end if
    open location (item 1 of argv)
    delay 5
    set bounds of front window to {0, 0, 1400, 1000}
    tell front document
      -- set readyState to ""
      set triedState to 0
      repeat
        set readyState to do JavaScript "document.readyState"
        if readyState is "complete" or readyState is "interactive" then exit repeat
        if triedState is greater than 20 then tell me to error "Page is not loading" number 1
        set triedState to triedState + 1
        delay 1
      end repeat
      do JavaScript theScript
      set triedInfo to 0
      repeat
        set contactInfo to do JavaScript "document.contactInfo"
        if contactInfo is not missing value and contactInfo is not equal to "empty" then exit repeat
        if triedInfo is greater than 5 then tell me to error "Contact info not found" number 1
        set triedInfo to triedInfo + 1
        delay 2
      end repeat
      set triedExport to 0
      repeat
        set readyExport to do JavaScript "document.readyExport"
        if readyExport is "complete" then exit repeat
        if triedExport is greater than 30 then tell me to error "Something went wrong cleaning the page" number 1
        set triedExport to triedExport + 1
        delay 1
      end repeat
    end tell
  end tell
  tell application "Safari"
    activate
  end tell
  delay 0.5
  tell application "System Events"
    tell process "Safari"
      tell group 2 of toolbar 1 of front window to ¬
        repeat until exists (first button where its accessibility description = "Reload this page")
          delay 0.5
          keystroke "." using {command down}
        end repeat
      delay 0.5
      keystroke "s" using {command down, shift down, option down}
      repeat until exists sheet 1 of window 1
        delay 1
      end repeat
      if (count of argv) is equal to 3 then
        keystroke "g" using {command down, shift down}
        repeat until exists sheet 1 of sheet 1 of window 1
          delay 0.02
        end repeat
        delay 0.5
        tell sheet 1 of sheet 1 of window 1
          set value of combo box 1 to (item 3 of argv)
          click button "Go"
        end tell
        delay 2
      end if
      delay 3
      tell sheet 1 of window 1
        set value of text field 1 to (item 2 of argv)
        -- click button "Save"
      end tell
      delay 0.5
      key code 36
      delay 5
    end tell
  end tell
  tell application "Safari"
    close tabs of front window
  end tell
  return contactInfo
end run
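The three `repeat` loops in the script all follow the same pattern: check a condition, give up after a fixed number of tries, otherwise wait and try again. For anyone who would rather keep that logic on the JavaScript side, here is a minimal sketch of the pattern. The names `waitFor`, `tries` and `intervalMs` are made up for illustration and are not part of the script above.

```javascript
// Hypothetical helper (not in the original script): poll `check` until it
// returns a truthy value, or throw once the attempts are used up.
async function waitFor(check, { tries = 20, intervalMs = 1000 } = {}) {
  for (let attempt = 0; attempt < tries; attempt++) {
    const value = await check();
    if (value) return value;
    // Wait before the next attempt.
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Condition not met after ${tries} tries`);
}
```

In a page this could be used as `await waitFor(() => document.readyState === "complete")`, which mirrors the first loop but only accepts the "complete" state.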
And this is the JavaScript used (save_profiles.js):
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function processPage() {
  // Flags that the AppleScript polls via `do JavaScript`
  document.readyExport = "waiting";
  document.contactInfo = "empty";

  // acceptCookies
  const buttons = Array.from(document.querySelectorAll('button'));
  const cookieButton = buttons.find(e => e.innerText.includes('Accept cookies'));
  if (cookieButton) {
    cookieButton.click();
  }

  // cleanPage
  const ads = document.getElementsByClassName('scaffold-layout__ad')[0];
  if (ads) { ads.remove(); }
  const ad = document.getElementsByClassName('ad-banner-container')[2];
  if (ad) { ad.style.display = 'none'; }
  let layout = document.getElementsByClassName('scaffold-layout__row scaffold-layout__content scaffold-layout__content--main-aside')[0];
  if (layout) { layout.className = 'scaffold-layout__content--main'; }
  layout = document.getElementsByClassName('scaffold-layout__inner scaffold-layout-container')[0];
  if (layout) { layout.className = 'scaffold-layout__content--main'; }
  const footer = document.getElementsByClassName('global-footer')[0];
  if (footer) { footer.style.display = 'none'; }
  const overlay = document.getElementById('msg-overlay');
  if (overlay) { overlay.style.display = 'none'; }
  const nav = document.getElementById('global-nav');
  if (nav) { nav.style.display = 'none'; }
  const sticky = document.getElementsByClassName('pv-profile-sticky-header');
  for (let i = 0; i < sticky.length; i++) { sticky[i].style.display = 'none'; }
  const aside = document.getElementsByClassName('scaffold-layout__aside')[0];
  if (aside) { aside.style.display = 'none'; }
  const main = document.getElementById('main');
  if (main) { main.style.width = '100%'; }
  const outlet = document.getElementsByClassName('authentication-outlet')[0];
  if (outlet) { outlet.style.paddingTop = "0px"; }
  const bottom = document.getElementsByClassName('scaffold-layout scaffold-layout--breakpoint-none scaffold-layout--main-aside scaffold-layout--static')[0];
  if (bottom) { bottom.style.marginBottom = "0px"; }

  // getContactInfo
  const contact_info = { 'email': [], 'phone': [], 'website': [], 'twitter': [], 'connected': "", 'photo_url': "" };
  const ci_link = document.querySelector('a[href*="/detail/contact-info"]');
  await sleep(500);
  if (ci_link) {
    ci_link.click();
    await sleep(1500);
    const el_emails = document.querySelectorAll('section[class*="email"] a[class*="pv-contact-info__contact-link"]');
    for (let i = 0; i < el_emails.length; i++) {
      const email = el_emails[i].href.substring(7); // strip the leading "mailto:"
      contact_info['email'].push(email);
    }
    const el_phones = document.querySelectorAll('section[class*="ci-phone"] li[class*="pv-contact-info"] span[class~="t-black"]');
    for (let i = 0; i < el_phones.length; i++) {
      const phone = el_phones[i].innerText.trim();
      contact_info['phone'].push(phone);
    }
    const el_websites = document.querySelectorAll('section[class*="ci-websites"] li[class*="link"] a[class*="pv-contact-info__contact-link"]');
    for (let i = 0; i < el_websites.length; i++) {
      const website = el_websites[i].href;
      contact_info['website'].push(website);
    }
    const el_twitters = document.querySelectorAll('section[class*="ci-twitter"] li[class*="pv-contact-info"] a[class*="pv-contact-info__contact-link"]');
    for (let i = 0; i < el_twitters.length; i++) {
      const twitter = el_twitters[i].href;
      contact_info['twitter'].push(twitter);
    }
    const el_connected = document.querySelector('section[class*="ci-connected"] span[class~="t-black"]');
    if (el_connected) {
      contact_info['connected'] = el_connected.innerText;
    }
    const el_photo_url = document.querySelector('img[class*="pv-top-card__photo"]');
    if (el_photo_url) {
      contact_info['photo_url'] = el_photo_url.src;
    }
    const ci_close = document.querySelector('button[data-test-modal-close-btn]');
    if (ci_close) {
      ci_close.click();
    }
    // Only set when the contact-info link exists; otherwise the flag stays "empty"
    document.contactInfo = JSON.stringify(contact_info);
  }

  // scrollPage
  for (let i = 0; i < 7; i++) {
    window.scrollTo(0, i * document.body.scrollHeight / 6);
    await sleep(500);
  }

  // expandSections
  let elements = document.querySelectorAll(`button[aria-expanded="false"][aria-controls*="expandable-content"]:not([class*="global-nav"])`);
  for (let i = 0; i < elements.length; i++) {
    elements[i].click();
    await sleep(350);
  }

  // expandProfiles
  elements = document.querySelectorAll(`button[aria-expanded="false"][class*="-section"],button[aria-expanded="false"][aria-controls*="recommendation-list"],button[aria-expanded="false"][aria-controls*="skill-categories"]`);
  // elements = document.querySelectorAll(`button[aria-expanded="false"][class*="pv-profile-section__see-more"],button[aria-expanded="false"][class*="pv-skills-section__additional-skills"]`);
  while (elements.length > 0) {
    for (let i = 0; i < elements.length; i++) {
      elements[i].click();
      await sleep(350);
    }
    elements = document.querySelectorAll(`button[aria-expanded="false"][class*="-section"],button[aria-expanded="false"][aria-controls*="recommendation-list"],button[aria-expanded="false"][aria-controls*="skill-categories"]`);
  }

  // scrollPage
  for (let i = 0; i < 6; i++) {
    window.scrollTo(0, i * document.body.scrollHeight / 10);
    await sleep(500);
  }

  // expandSeeMoreText
  elements = document.querySelectorAll(`button[aria-expanded="false"][class*="inline-show-more-text__button"]:not([class*="global-nav"])`);
  for (let i = 0; i < elements.length; i++) {
    elements[i].click();
    await sleep(100);
  }

  window.scrollTo(0, document.body.scrollHeight);
  await sleep(500);
  window.scrollTo(0, 0);
  document.readyExport = "complete";
  return contact_info;
}

processPage();
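One caveat about the `while` loop under `// expandProfiles`: if a button never flips its `aria-expanded` attribute after being clicked, the loop never ends and `document.readyExport` never becomes "complete". Below is a minimal sketch of a bounded variant. The names `clickUntilGone`, `findButtons`, `maxRounds` and `pauseMs` are hypothetical and not part of save_profiles.js.

```javascript
// Hypothetical helper: keep clicking whatever `findButtons` returns until it
// returns nothing, but give up after `maxRounds` passes.
async function clickUntilGone(findButtons, { maxRounds = 10, pauseMs = 350 } = {}) {
  for (let round = 0; round < maxRounds; round++) {
    const buttons = Array.from(findButtons());
    if (buttons.length === 0) return true; // nothing left to expand
    for (const button of buttons) {
      button.click();
      await new Promise(resolve => setTimeout(resolve, pauseMs));
    }
  }
  return false; // gave up; some buttons are still collapsed
}
```

In the page it could be called as `await clickUntilGone(() => document.querySelectorAll(selector))`, with `selector` being the same selector string the loop already uses.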
OK Great work!
Something like this shouldn’t be necessary for Apple Developer URLs. I just looked it up: I have over 500 PDFs, and I definitely wouldn’t have captured them the way I’m doing it if it had been that difficult in the past.
Did you try printing the HTML doc to PDF? That might be more reliable than capturing them with DT’s tool.
Yes, of course it works, but the result doesn’t look as good as the “real” PDF.
The next release will improve clipping paginated & single-page PDF documents, the results should be better especially in case of dynamic websites.
I’ve been meaning to write that I’ve also started using the Export to PDF facility in Safari, after discovering recently that it started working again to produce unpaginated PDFs [1]. The output has been excellent, and the approach has the advantage of saving what you actually see in your Safari window (e.g., allowing ad blockers to do their thing). Your method adds scrolling the page in Safari before the export, which is a good idea for getting lazy-loaded elements to appear.
[1] I used to use the Safari export, then for some years had to stop because it no longer produced single-page PDFs – in fact, for a while I used a free utility called Paparazzi to get single-page PDFs, back in the days when I used Evernote (and I even wrote a utility to automate the process). Then at some point, the capability in Safari must have been restored, because I noticed only recently that Export to PDF works as hoped, at least on macOS 10.13.6 (which I know is ancient, and I don’t know if it works the same on later macOS versions). I haven’t changed my DEVONthink code yet, though.
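On the scrolling idea: a fixed number of scroll steps can miss content on long pages, because lazy loading makes the page grow as you scroll. Here is a rough sketch of an alternative that keeps jumping to the bottom until the height stops changing. The names `scrollUntilStable`, `getHeight` and `scrollTo` are hypothetical; in Safari they would wrap `document.body.scrollHeight` and `window.scrollTo`. I haven’t tested this against LinkedIn or any particular site.

```javascript
// Hypothetical helper: scroll to the current bottom until the page height
// stops growing, or until `maxPasses` is reached.
async function scrollUntilStable(getHeight, scrollTo, { maxPasses = 20, pauseMs = 500 } = {}) {
  let lastHeight = -1;
  for (let pass = 0; pass < maxPasses; pass++) {
    const height = getHeight();
    if (height === lastHeight) return height; // nothing new was loaded
    lastHeight = height;
    scrollTo(height); // jump to the current bottom
    // Give lazily loaded content time to arrive.
    await new Promise(resolve => setTimeout(resolve, pauseMs));
  }
  return lastHeight;
}
```

In a page: `await scrollUntilStable(() => document.body.scrollHeight, y => window.scrollTo(0, y))`.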
I’m looking forward to trying out the new capabilities in the upcoming new version of DEVONthink mentioned by @cgrunenberg !
URLs of problematic dynamic websites not requiring a login would be great, thanks.
Could someone please check whether Apple Developer URLs display correctly in DEVONthink? (Don’t want to create a new thread and the question is somehow related to this one)
- Create a bookmark for https://developer.apple.com/documentation/foundation/nsstring?language=objc
- Open the bookmark
What do you see?
That’s with and without JS activated (I generally have JS off in DT; turning it on and refreshing made no difference, the result remained the same).
Thanks! Seems like it’s necessary to restart DEVONthink before the change takes effect.
Here’s what I’ve been seeing for some time. When the page loads, it’s white for a short time and then I see this:
If I scroll down I get this:
I’ve no idea what happens.
After capturing via Safari stopped working reliably (I suspect because Apple changed the site’s layout around that time), I developed a very nice script that I used for some weeks to capture from within DEVONthink. It worked perfectly; it was almost too good. Now it’s useless, as I can’t properly view Apple URLs anymore. No idea why.
I can capture the website you named, both as HTML and as a beautifully formatted PDF (both from Safari 14.1.2 using sorter
The problem is that capturing Apple Developer documentation via Safari doesn’t work reliably (anymore) over here. I used to capture this way and it almost always worked, but at some point it became very unreliable. Sometimes it took five or more tries until I got a proper PDF.
No idea why you can view the site and I can’t. I just rebooted and didn’t open any app but DEVONthink. Still the same result.
(Sorry, yes, I’ve just scrolled up and read through this post, so I realise capturing successfully once is not worth much). For what it’s worth, I’ve just restarted DT (for you - only for you, Pete - I’ll have to reopen all my databases and enter all my passwords [Edit: oh heavens, there’s a risk you won’t understand my humour - nobody ever does - so I’ve replaced “…” with “”]) with JS active, and the bookmark loads no problem:
Is something blocking things at your end? Pi-hole etc.?
Again, for what it’s worth, I’m on macOS 11.5.1
Wow, thank you! No worries, understood it
There’s nothing blocking, as I turned off AdGuard’s “start after login”.
I have a suspicion, but it’s probably very unrealistic. The script does the following:
- extracts all links from a PDF
- searches if a URL is already in the database
- if yes: replicate the record into a new group
- if no: create a bookmark in this new group
This way I can select e.g. the NSString class’s PDF and get everything I already have and everything I could capture. I then look through the created bookmarks and capture what looks promising and afterwards delete the group.
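The steps above amount to partitioning the extracted links by whether the database already knows them. Here is a minimal JavaScript sketch of just that decision logic, without any DEVONthink calls. It is an illustration, not the actual script, and `planCapture`, `links` and `knownUrls` are hypothetical names.

```javascript
// Hypothetical sketch: decide per link whether to replicate an existing
// record or create a new bookmark.
function planCapture(links, knownUrls) {
  const known = new Set(knownUrls);
  const plan = { replicate: [], bookmark: [] };
  for (const url of new Set(links)) { // de-duplicate the links first
    (known.has(url) ? plan.replicate : plan.bookmark).push(url);
  }
  return plan;
}
```

The two lists would then drive the "replicate into the new group" and "create a bookmark in the new group" steps, respectively.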
The problem now could be (but again, probably very unrealistic) that I not only delete the group when I’m done. Quite often I deleted it just to get an updated state (in order to prevent capturing the same URL twice), i.e. see replicants of newly captured PDFs instead of the bookmarks. As this worked so well I captured some hundred PDFs and created a lot more bookmarks in the process. Did I maybe violate the terms of use? Doesn’t make any sense, does it? I can view the site in Safari …
Open the bookmark… in DEVONthink?
Light Mode
Dark Mode - Use dark background for documents enabled
Dark Mode - Use dark background for documents disabled
Do you have anything specified in Preferences > Web > Style Sheet?
PDF Captured from Safari 14.1.2 in Catalina and Big Sur…
Yes. It displays correctly in Safari and shows this strange behavior in DEVONthink.
No.
The only differences between your (and Blanc’s) setup and mine seem to be:
- the macOS version, as I’m still on Mojave
- I created a lot of bookmarks in a short time
I have confirmed the behavior on macOS Mojave.
I’m curious why you’re not upgrading your OS.
Thank you very much!
I didn’t upgrade as I’m still using a mid-2012 MacBook Pro (you used one too, I think) and Catalina was not really what I wanted to use. I’ll get a new Mac, but not now, as I don’t like to use the first generation (M1) of anything. I’m also hoping that a coming MacBook will support multiple external monitors without additional hardware.