It seems it’s no longer possible to reliably capture PDFs. It sometimes works, sometimes fails. Sometimes it works on the second try. Or the third. Or the fourth. I’ve even had to capture a PDF more than five times before getting a proper result.
This isn’t the first time it has happened; I’ve been seeing it for some weeks now. Sometimes it’s sufficient to restart DEVONthink. Sometimes even rebooting doesn’t help - or it seems to have helped, but the capture fails again after some time. It’s super frustrating.
This happens with Clip to DEVONthink and also with PDFs created via AppleScript.
Edit: It seems to work reliably if the URL is first opened in DEVONthink. I captured the same URL successfully three times from within DEVONthink, while it failed three times when captured from Safari.
Edit 2: It works from within DEVONthink with Paginated PDF but fails with Clip to DEVONthink.
Dynamic websites that load their contents on demand (e.g. after user interaction like scrolling or mouse movement) might not clip the desired result (e.g. missing images or content).
That doesn’t really explain why it sometimes works for all Apple Developer URLs and sometimes doesn’t, or works only after several tries. Nor does it explain why restarting DEVONthink often solves it.
I open those sites but don’t interact with them; I just capture them. So what I do is always the same, yet the results differ. It also doesn’t explain why it’s most often the title and the declaration that are missing. And if it were because of dynamic content, why does it sometimes work and sometimes not? I really do nothing other than open the URL, and I always read the PDF in DEVONthink, not the site in Safari.
Timing issues? I’m not saying that the problem is related to dynamically loaded/generated content. But in the case of the page you quoted, I can see this (apart from a bunch of broken CSS styles): if you look at the web “page” proper, you’ll see that it is only a bunch of empty div elements. They get filled in “later” (i.e. at DOMContentLoaded or load or whenever …) by JavaScript functions.
So if one of these functions didn’t run, or if the server was slow when you captured, it might well be that the document is not complete and only part of it gets captured. I’m wondering if “print to DT” might give better results, supposing that the browser’s built-in print function waits until the page is fully loaded – which DT, for obvious reasons, doesn’t know about.
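The empty-div pattern described above can be sketched as a simple timing race. This is purely illustrative JavaScript: the property names and content are made up, and none of it is DEVONthink’s or Apple’s actual code.

```javascript
// Hedged sketch of the race between "capture" and the page's own scripts.
// The server delivers empty containers; a later JS step fills them in.
// A capture that runs before the fill step sees only the empty structure.

function servePage() {
  // What the HTML looks like on arrival: empty placeholders.
  return { title: "", declaration: "" };
}

function runPageScripts(page) {
  // What the page's JavaScript does at DOMContentLoaded/load
  // (hypothetical content, for illustration only).
  page.title = "someFunction(_:)";
  page.declaration = "func someFunction(of: Int) -> Int";
}

function capture(page) {
  // A naive capture just serializes whatever is there right now.
  return `${page.title}\n${page.declaration}`;
}

const page = servePage();
const tooEarly = capture(page);   // title and declaration still missing
runPageScripts(page);
const complete = capture(page);   // full content
console.log(tooEarly === complete); // false
```

Which would neatly explain why it’s precisely the title and the declaration that go missing: they simply hadn’t been injected yet at capture time.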
Since the source of the current HTML page in the browser is used, maybe it depends on the current state of the webpage in the browser? So far I couldn’t reproduce this (on Big Sur).
I’m not seeing an issue with creating a PDF with the browser extension - paginated or one page - for that URL.
The browser extension reloads the page before clipping so some variability would not be unexpected.
And the AppleScript command is also downloading content, so I don’t find the behavior unusual.
@cgrunenberg would have to respond on the AppleScript mechanism as I don’t know if it’s something similar to curl or wget or what.
From what I know of the browser extension, there is nothing that checks if the page is fully loaded. And even if it did, that’s still no guarantee as there are plenty of pages online that don’t load figures until you’ve scrolled to them.
The task actually waits until no more data is loaded (plus a small delay). In the case of dynamic websites that load content on demand or after user interaction this might fail, though (and some dynamic websites can even scroll endlessly).
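That “wait until no more data is loaded, plus a small delay” strategy can be modeled roughly like this. It’s an illustrative sketch only; the sampling approach and thresholds are my assumptions, not DEVONthink’s implementation.

```javascript
// Hedged sketch: decide when a page counts as "loaded" by watching a
// fake in-flight request counter. pendingSamples holds the number of
// pending requests observed at each polling tick; we declare the page
// loaded once it has been quiet for a few consecutive ticks.
function waitUntilIdle(pendingSamples, quietTicksNeeded = 3) {
  let quiet = 0;
  for (let tick = 0; tick < pendingSamples.length; tick++) {
    quiet = pendingSamples[tick] === 0 ? quiet + 1 : 0;
    if (quiet >= quietTicksNeeded) return tick; // idle long enough: capture now
  }
  return -1; // never went idle: endless-scrolling pages can behave this way
}

console.log(waitUntilIdle([3, 2, 0, 0, 0, 0])); // 4
console.log(waitUntilIdle([1, 0, 2, 0, 1, 0])); // -1
```

The second call shows the failure mode: a page that keeps firing requests on demand never satisfies the idle condition, so any capture is a gamble on timing.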
Yes, I understand that. But look at the captures in my first post. It’s the title and the declaration that are missing, both at the top of the site. Isn’t it strange that they’re missing?
Yes, but the Apple Developer sites aren’t such sites, are they? I don’t see any dynamic content there, or is responsive design (I’m thinking of resizing elements) considered dynamic content?
If I have the DT window fairly small when I click the import button, DT won’t capture the full page. The cure is to quit DT, restart, size the window “wide enough”, and then click the import button.
When that happens, I see a partial page as in the example in this thread.
I’ve just spent a few nights figuring out how to capture LinkedIn pages (also dynamically loaded content) to PDF. It’s a mess, basically, and you’ll have to resort to hacks to get some kind of result (and even that isn’t reliable 100% of the time). I’ll share my scripts when I’m happy with them, but you might want to take a look at @mhucka’s PDF capturing scripts (hacks) that incorporate some scrolling to get better PDF output: devonthink-hacks/auto-convert-web-page-to-PDF at main · mhucka/devonthink-hacks · GitHub
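The scrolling idea behind such hacks boils down to sweeping every part of the page into view before printing, so lazy loaders fire. Here’s a small, DOM-free sketch that just computes the scroll offsets such a script would visit; the step size and the actual scrolling/pausing logic are my assumptions, not code from those repositories.

```javascript
// Hedged sketch: compute the sequence of scroll offsets needed to sweep
// a page of a given height, so every lazily loaded section has been
// brought into view at least once. A real script would perform these
// scrolls in the browser with a pause between steps, then scroll back
// to the top before triggering the PDF capture.
function scrollOffsets(pageHeight, viewportHeight, stepPx) {
  const maxOffset = Math.max(0, pageHeight - viewportHeight);
  const offsets = [];
  for (let y = 0; y < maxOffset; y += stepPx) offsets.push(y);
  offsets.push(maxOffset); // make sure the very bottom is reached
  return offsets;
}

console.log(scrollOffsets(3000, 800, 600)); // [0, 600, 1200, 1800, 2200]
```

Even this isn’t bulletproof: content that loads in response to the scroll can grow the page, so robust scripts re-measure the height as they go.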
Hmmm … I’m not so sure. If rendering a web page requires loading a lot of resources, some of which come from different domains than the parent page and from different CDNs, then it seems to me that no matter what mechanism you use to access the page, timing issues can come into play because they are due to factors outside of your computer. Also, different users around the world may see different elements appear at different times depending on their geographic location (which affects which CDN servers are reached), which can explain why different people see different results when they try to reproduce the problem.