Why are some static webpages not clipped completely in PDF format?

Hi everyone, today when I was clipping some static webpages from the University of Tübingen website, I noticed that some pages are not saved completely in PDF format.

Some of the problematic pages are:

When clipping, I performed the following procedure:

  1. Clip the webpage from Safari as a bookmark using the Sorter
  2. Open the bookmark in DTP’s built-in browser
  3. Check that the rendering looks the same as in Safari
  4. Clip by clicking the gear icon next to the address bar and choosing PDF (one page)

Normally, if I understand correctly, the saved PDF should look basically the same as in the browser. But not this time. The pages were not clipped completely, and some of the layout was wrong.

Note that all linked pages are basically static. They render normally when JavaScript is disabled.

I also tried first saving the pages as webarchives and then converting to PDF (via the context menu) or clipping as PDF (via the gear icon), to no avail.

Sample incomplete pdf for Fachschaften | University of Tübingen
Fachschaften - University of Tübingen.pdf (146.8 KB)

As you can see, the original page is laid out in two columns, while the clip has only one column. And the page is incomplete.

Why can’t these static pages be clipped completely? Is there a fix?

There are a number of threads discussing (various aspects of) this issue - for example this one.

Hi @mksBelper, thanks for replying. I just read the linked thread, but my case is different. All the pages that I’m having problems capturing are basically static, with no content loaded dynamically via JS. I can confirm all 3 pages render correctly when JavaScript is disabled (in DTP and Safari). So it’s not a problem of dynamic loading.

Welcome, @naitree :slight_smile:

Understood. If you search for ‘web clipping’ you’ll find a number of parallel discussions, the totality of which leads me to believe that improvements across the board in this area/functionality are in the offing in future releases.

Good luck!

Thank you, I’ll look into more threads, trying to find some clues to this.

If I take DT out of the equation with, say, the Fachschaften | University of Tübingen link and Export it as a PDF from Safari, I get the same result as you do in your (146.8K) file.

That makes me wonder whether perhaps any export/conversion to PDF will always result in the format you’re seeing.

Perhaps some resource (CSS, image), as opposed to a (Java)script, is failing to load?

How did you export as PDF from Safari? I just tried to export from Safari via File > Export as PDF menu item, and it produced a complete page, except most links are not clickable anymore (probably a glitch of Safari PDF export).

Fachschaften | University of Tübingen via Safari.pdf (120.6 KB)

Yes - by File > Export as PDF.

The page displayed in the same way as yours did: a single column instead of the two columns etc.

All this is a function of the tool which converts the web page to PDF. I vaguely recall that Adobe Acrobat could save as PDF with active links. I no longer have that product in use or installed on my Mac, so I cannot test it for you. I found Converting web pages to PDF, Adobe Acrobat, which suggests that it does this along with a lot of other things. Perhaps give it a try if retaining links is important to you. Adobe Acrobat surely can be integrated with DEVONthink.

If by “static” you mean “not depending on JavaScript”, you’re right. But the guys and gals in Tübingen went a step further: They added media queries to their style sheets. So that when you print e.g. Admission of International Students | University of Tübingen the layout is drastically changed (for example, there’s only one column now).

Although the PDF output does not look like the page on screen, it probably looks exactly like Uni Tübingen’s web developers wanted it to look in print. In my opinion, this has nothing to do with DT. You could try to disable the media queries in the style sheet. Or save as HTML or MD.
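
For anyone who wants to try disabling the media queries, here is a rough Python sketch that removes print-only `@media` blocks from a locally saved copy of a style sheet. It assumes you have already saved the page and its CSS to disk. The names `strip_print_media` and `targets_only_print` are made up for this illustration, and the brace matching is deliberately naive: it is not a real CSS parser, it ignores braces inside strings or comments, and it does not look inside `@media` blocks that it keeps.

```python
import re

# Matches the start of an @media rule up to its opening brace.
MEDIA_RE = re.compile(r"@media\b([^{]*)\{")


def targets_only_print(condition: str) -> bool:
    """True if every query in the comma-separated list starts with 'print'."""
    queries = [q.strip().lower() for q in condition.split(",")]
    return all(re.match(r"(only\s+)?print\b", q) for q in queries)


def strip_print_media(css: str) -> str:
    """Return css without the @media blocks that apply to print only."""
    out = []
    pos = 0
    while True:
        m = MEDIA_RE.search(css, pos)
        if m is None:
            out.append(css[pos:])  # no more @media rules, keep the rest
            break
        # Walk forward to the brace that closes this @media block.
        depth = 1
        i = m.end()
        while i < len(css) and depth > 0:
            if css[i] == "{":
                depth += 1
            elif css[i] == "}":
                depth -= 1
            i += 1
        if targets_only_print(m.group(1)):
            out.append(css[pos:m.start()])  # keep text before, drop the block
        else:
            out.append(css[pos:i])  # keep the block unchanged
        pos = i
    return "".join(out)
```

Blocks such as `@media screen, print` are left alone on purpose, because removing them would also change the on-screen layout.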

Thanks for the suggestion. I just tried the Acrobat way of doing this. It… kinda works, with various other major/minor annoyances (e.g., images are lost for whatever reason), just like every other half-baked Acrobat feature :wink:

My hunch is that, as per @chrillek, Uni Tübingen’s web developers are the first port of call if you want the output to match your expectations.

Which is? I do not really understand how you get from web page to Acrobat, but then I do not really know this software.

Adobe Acrobat has features to pull web pages into PDF. I sort of recall it can do entire web sites too, but I’m not sure and can’t check. Nor am I interested!!!

On a related note: Trying to capture HTML 1:1 in print is a hopeless idea. Think of animated gifs. Think of dynamically generated content. Think of transparency. CSS animations… whatever. Print has fixed dimensions, screens have … not really, given that you can shrink/grow browser windows, change font sizes etc.

Screen is screen and print is print. That’s what some people accept and that’s why they provide print style sheets.

One thing that was drilled into me when I first learned HTML and CSS (which hurt my head) was that “HTML is a suggestion to the browser as to how to render it.”

Acrobat has a Create PDF feature that downloads the webpage at a given URL and then converts it to PDF.

I think that this idea has changed quite a bit. Of course, it is still possible for users to change their preferred font and font size and have them override the ones set by a style sheet. But browsers nowadays are very much following the orders of CSS and HTML. What (hopefully) has changed is the idea that a “pixel-accurate layout” is possible at all.

I agree with you and @rmschne that a media query (@media print) does affect the browser’s Print and DT’s Clipping as Paginated PDF. In fact, they both output the same page layout and honor the media query as expected.

But I’m not so sure that a media query should affect DT’s Clipping as One Page PDF. As I understand it, it should generate a PDF just like Safari’s File > Export as PDF, which always generates a PDF that looks identical to how the webpage is rendered on screen, like a snapshot.

Because in this case Safari’s Export as PDF does generate PDFs that are complete and identical to the on-screen rendering, I’m wondering whether there’s a bug in DT’s implementation of Clipping as One Page PDF.

Maybe @cgrunenberg could provide some insight.

The differences between “Export as PDF” and “Print … to PDF” were already discussed seven years ago:

It seems that Apple decided to build something into their browser that is … let’s say “peculiar”? If one prints to PDF, one can be fairly sure that the result does not depend on the browser. With this “Export as PDF” thingy, all bets are off. Does it ignore media conditions in style sheets? All of them? Some of them? Who knows. One of these Apple black boxes, it seems.

Pagination is not really the question here, given that many browsers do not implement all @page requests correctly anyway.
