Is capturing PDFs from developer.apple.com broken?

It seems it’s no longer possible to reliably capture PDFs. It sometimes works, sometimes fails. Sometimes it works on the second try. Or the third. Or the fourth. I have even had to capture a PDF more than five times before getting a proper result.

This isn’t the first time this has happened; I’ve been seeing it for some weeks now. Sometimes restarting DEVONthink is sufficient. Sometimes even rebooting doesn’t help, or it seems to have helped but the problem returns after some time. It’s super frustrating.

This happens with Clip to DEVONthink and also with PDFs created via AppleScript.

Example URL:

https://developer.apple.com/documentation/foundation/url_loading_system?language=objc

Example result:

Expected result:

Edit: It seems to work reliably if the URL is first opened in DEVONthink. I captured the same URL successfully three times from within DEVONthink, while all three captures from Safari failed.

Edit 2: It works from within DEVONthink with Paginated PDF but fails with Clip to DEVONthink.


Dynamic websites which load the contents on demand (e.g. after user interaction like scrolling or mouse movement) might not clip the desired result (e.g. missing images or contents).

That doesn’t really explain why it sometimes works for all Apple Developer URLs and sometimes doesn’t, or works only after several tries. Nor why restarting DEVONthink often solves it.

I open those sites but don’t interact with them, i.e. I just capture them. So what I do is always the same, but the results differ. It also doesn’t explain why it’s most often the title and the declaration that are missing. And if it were because of dynamic content: why does it sometimes work and sometimes not? I really do nothing other than open the URL; I always read the PDF in DEVONthink, not the site in Safari.

Which browser and macOS version do you use? And how do you activate the Sorter? Via the Clip to DEVONthink extension?

Timing issues? I’m not saying that the problem is related to dynamically loaded/generated content. But in the case of the page you quoted – apart from a bunch of broken CSS styles :frowning: – if you look at the web “page” proper, you’ll see that it is only a bunch of empty div elements. They get filled in “later” (i.e. at DOMContentLoaded or load or whenever …) by JavaScript functions.
So if one of these functions didn’t run or if the server were slow when you capture, it might well be that the document is not complete and only part of it gets captured. I’m wondering if “print to DT” might give better results, supposing that the browser’s built-in print function will wait until the page is fully loaded – which DT for obvious reasons doesn’t know about.
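The filled-in-later pattern can be sketched roughly like this in JavaScript; the field names, contents, and delay are purely illustrative, not taken from Apple’s actual page scripts:

```javascript
// Illustrative sketch only: a "page" whose fields start empty and are
// filled in asynchronously, the way the empty divs get populated by
// JavaScript after load. Field names and timing are made up.
const page = { title: "", declaration: "" };

// Simulates the script that fills the empty elements some time after load.
function fillContentLater(delayMs) {
  return new Promise((resolve) =>
    setTimeout(() => {
      page.title = "URL Loading System";
      page.declaration = "Framework: Foundation";
      resolve();
    }, delayMs)
  );
}

// Simulates a capture that snapshots whatever is rendered at that moment.
const capture = () => ({ ...page });

async function demo() {
  const filling = fillContentLater(50);
  const early = capture(); // taken before the fill: title still empty
  await filling;
  const late = capture(); // taken after the fill: complete
  return { early, late };
}

demo().then(({ early, late }) => {
  console.log("early title:", JSON.stringify(early.title)); // ""
  console.log("late title:", JSON.stringify(late.title)); // "URL Loading System"
});
```

A capture that fires before the fill completes sees exactly the symptom described above: the page structure is there, but the title and declaration are empty.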


Yes, via the extension. But it also happens when I use the share menu in Xcode. And it happens with an AppleScript in a Smart Rule.

Safari 14.1.1
macOS 10.14.6

I would say it’s possible, but that wouldn’t explain why it also happens via AppleScript.

As the source of the current HTML page from the browser is used, maybe it depends on the current state of the webpage in the browser? So far I couldn’t reproduce this (on Big Sur).

Don’t think so. After it started to fail in the past weeks I have often kept the tab open. If that were the cause, one of the next tries should work, I think.

I’m not seeing an issue with creating a PDF with the browser extension - paginated or one page - for that URL.

The browser extension reloads the page before clipping so some variability would not be unexpected.
And the AppleScript command is also downloading content, so I don’t find the behavior unusual.

Is there no mechanism built into the extension and the AppleScript command that makes sure that the content is fully loaded?

@cgrunenberg would have to respond on the AppleScript mechanism as I don’t know if it’s something similar to curl or wget or what.

From what I know of the browser extension, there is nothing that checks if the page is fully loaded. And even if it did, that’s still no guarantee as there are plenty of pages online that don’t load figures until you’ve scrolled to them.

The task actually waits until no more data is loaded (plus a small delay). In the case of dynamic websites that load content on demand or after user interaction this might fail though (and some dynamic websites can even scroll endlessly).
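That “no more data, plus a small delay” strategy could look roughly like the following JavaScript sketch. This is my reconstruction for illustration only, not DEVONthink’s actual code; the quiet period and poll interval are assumptions:

```javascript
// Illustrative reconstruction of a "wait until no more data is loaded,
// plus a small delay" strategy. Not DEVONthink's actual implementation.
let activeRequests = 0;

// Simulates a network request that finishes after durationMs.
function simulateRequest(durationMs) {
  activeRequests++;
  setTimeout(() => { activeRequests--; }, durationMs);
}

// Resolves once no request has been active for at least quietMs.
function waitForNetworkIdle(quietMs, pollMs = 10) {
  return new Promise((resolve) => {
    let quietSince = Date.now();
    const timer = setInterval(() => {
      if (activeRequests > 0) {
        quietSince = Date.now(); // still loading: restart the quiet clock
      } else if (Date.now() - quietSince >= quietMs) {
        clearInterval(timer);
        resolve(); // idle long enough: safe(ish) to capture now
      }
    }, pollMs);
  });
}

// A request triggered by user interaction *after* the idle window closes
// would still be missed -- exactly the failure mode described above.
simulateRequest(40);
waitForNetworkIdle(60).then(() => {
  console.log("idle, capturing; active requests:", activeRequests);
});
```

The weakness is visible in the sketch itself: anything the page loads only after later interaction (scrolling, endless feeds) arrives after the idle window has already closed.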

Yes, I understand that. But look at the captures in my first post: it’s the title and the declaration that are missing, both at the top of the site. Isn’t it strange that exactly those are missing?

Yes, but the Apple Developer sites are no such sites, are they? I don’t see any dynamic content there; or does responsive design (I’m thinking of resizing elements) count as dynamic content?

The latest one is dynamic (see @chrillek’s reply). E.g. disable JavaScript and you will only get an error message:


If I have the DT window fairly small when I click on the import button, DT won’t scan a full page. The cure is to quit DT, restart, size the window “wide enough”, and then click the import button.

When that happens, I see a partial page as in the example in this thread.

I’ve just spent a few nights figuring out how to capture LinkedIn pages (also dynamically loaded content) to PDF. It’s a mess, basically, and you’ll have to resort to hacks to get some kind of result (and even that’s not reliable 100% of the time). I’ll share my scripts when I’m happy with them, but you might want to take a look at @mhucka’s PDF capturing scripts (hacks) that incorporate some scrolling to get better PDF output: devonthink-hacks/auto-convert-web-page-to-PDF at main · mhucka/devonthink-hacks · GitHub


Hmmm … I’m not so sure. If rendering a web page requires loading a lot of resources, some of which come from different domains than the parent page and from different CDNs, then it seems to me that no matter what mechanism you use to access the page, timing issues can come into play because they are due to factors outside of your computer. Also, different users around the world may see different elements appear at different times depending on their geographic location (which affects which CDN servers are reached), which can explain why different people see different results when they try to reproduce the problem.


Makes sense. What I don’t get is that it depends on the method used:

I used the same URL and did not reload it. It worked every time via Paginated PDF and failed via Clip to DEVONthink.

Edit: Ah, I think I got it:

Does that mean Paginated PDF takes the current DEVONthink tab’s source, i.e. it doesn’t reload? That would explain the difference.

@cgrunenberg If it’s a timing issue, wouldn’t an AppleScript option that allows for a “delay” make sense?
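AppleScript does have a built-in delay command, so a fixed pause would be easy to script; another option is retrying until the result looks complete. Here is a hedged JavaScript sketch of that retry idea; the function names, the completeness check, and the backoff values are my own, not a DEVONthink API:

```javascript
// Illustrative retry wrapper: capture, check whether the result looks
// complete, and back off before trying again. Not a real DEVONthink API.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function captureWithRetry(capture, isComplete, maxTries = 5) {
  for (let attempt = 1; attempt <= maxTries; attempt++) {
    const result = await capture();
    if (isComplete(result)) return { result, attempt };
    await sleep(100 * attempt); // grow the delay before the next try
  }
  throw new Error(`capture still incomplete after ${maxTries} tries`);
}

// Demo: a flaky capture that only returns a title on the third call.
let calls = 0;
const flakyCapture = async () => ({
  title: ++calls >= 3 ? "URL Loading System" : "",
});

captureWithRetry(flakyCapture, (r) => r.title !== "").then(({ result, attempt }) => {
  console.log("succeeded on attempt", attempt); // succeeded on attempt 3
  console.log(result.title); // URL Loading System
});
```

Checking for a non-empty title (or declaration) would catch precisely the partial captures shown earlier in the thread, instead of guessing at a delay that is long enough.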