It seems it’s no longer possible to reliably capture PDFs. It sometimes works, sometimes fails. Sometimes it works on the second try. Or the third. Or the fourth. I’ve even had to capture a PDF more than five times before getting a proper result.
This isn’t the first time this has happened; I’ve been seeing it for some weeks now. Sometimes restarting DEVONthink is enough. Sometimes even rebooting doesn’t help - or it seems to have helped, but then it fails again after some time. It’s super frustrating.
This happens with Clip to DEVONthink and also with PDFs created via AppleScript.
That doesn’t really explain why it sometimes works for all Apple Developer URLs and sometimes doesn’t, or only works after several tries. Nor why restarting DEVONthink often solves it.
I open those sites but don’t interact with them, i.e. I just capture them. So what I do is always the same, but the results differ. It also doesn’t explain why it’s most often the title and the declaration that are missing. And if it were down to dynamic content: why does it sometimes work and sometimes not? I really do nothing other than open the URL; I always read the PDF in DEVONthink, not the site in Safari.
apart from a bunch of broken CSS styles
So if one of these functions didn’t run, or if the server were slow when you capture, it might well be that the document is not complete and only part of it gets captured. I’m wondering if “print to DT” might give better results, supposing that the browser’s built-in print function waits until the page is fully loaded - something DT, for obvious reasons, can’t know about.
@cgrunenberg would have to respond on the AppleScript mechanism as I don’t know if it’s something similar to curl or wget or what.
From what I know of the browser extension, there is nothing that checks if the page is fully loaded. And even if it did, that’s still no guarantee as there are plenty of pages online that don’t load figures until you’ve scrolled to them.
The task actually waits until no more data is loaded (plus a small delay). For dynamic websites that load content on demand or on user interaction, this can still fail, though (and some dynamic websites can even scroll endlessly).
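For illustration only, the “no more data plus a small delay” heuristic can be sketched like this (a hypothetical model, not DEVONthink’s actual implementation): each observed network event restarts a quiet-window timer, and capture fires once the window elapses without new events. It also shows exactly how lazy-loaded content gets missed.

```javascript
// Hypothetical sketch of a quiet-window capture heuristic (my own names,
// not DEVONthink's code). Given the times (ms, sorted ascending) at which
// network events are observed and the quiet-window length, return the
// time at which the capture would fire.
function captureTime(eventTimes, quietMs) {
  let fireAt = quietMs; // with no traffic at all, fire after one window
  for (const t of eventTimes) {
    if (t < fireAt) fireAt = t + quietMs; // each event restarts the timer
    // events arriving after fireAt are too late: capture already happened
  }
  return fireAt;
}

// A resource that only starts loading at t=900 (e.g. a lazy-loaded figure)
// is missed, because the page looked "quiet" from t=100 until t=600:
captureTime([100, 900], 500); // → 600, so the request at t=900 is not captured
```

This is why the result can differ from run to run: whether the capture is complete depends on whether the slow or on-demand requests happen to land inside the quiet window.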
I’ve just spent a few nights figuring out how to capture LinkedIn pages (also dynamically loaded content) to PDF. It’s basically a mess, and you’ll have to resort to hacks to get some kind of result (and even that isn’t reliable 100% of the time). I’ll share my scripts when I’m happy with them, but you might want to take a look at @mhucka’s PDF capturing scripts (hacks) that incorporate some scrolling to get better PDF output: devonthink-hacks/auto-convert-web-page-to-PDF at main · mhucka/devonthink-hacks · GitHub
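The core trick in that style of hack looks roughly like this (a hedged sketch in the spirit of those scripts, with my own names and parameters, not mhucka’s actual code): step the window down in viewport-sized increments so lazy-loaded images start fetching, then return to the top before triggering the capture.

```javascript
// Hypothetical auto-scroll helper to coax lazy-loaded content into
// loading before a PDF capture. Takes the window object as a parameter
// so it can be exercised outside a browser as well.
async function autoScroll(win, pauseMs = 250) {
  const step = win.innerHeight;
  const bottom = () => win.document.body.scrollHeight - win.innerHeight;
  for (let y = 0; y <= bottom(); y += step) {
    win.scrollTo(0, y);                             // trigger lazy loading
    await new Promise(r => setTimeout(r, pauseMs)); // give images a moment
  }
  win.scrollTo(0, bottom()); // make sure the very bottom was reached
  win.scrollTo(0, 0);        // capture should start from the top of the page
}
```

Even this isn’t bulletproof: `pauseMs` is a guess, and pages that keep growing as you scroll (endless feeds like LinkedIn’s) defeat the fixed-bottom assumption, which matches the “not reliable 100% of the time” experience above.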
Hmmm … I’m not so sure. If rendering a web page requires loading a lot of resources, some of which come from different domains than the parent page and from different CDNs, then no matter what mechanism you use to access the page, timing issues can come into play, because they stem from factors outside your computer. Also, different users around the world may see elements appear at different times depending on their geographic location (which affects which CDN servers are reached), which would explain why different people see different results when they try to reproduce the problem.