Clipping to DEVONthink renders images missing or black

I have come across a curious problem that occurs on some Web pages. When clipping these pages to DT as PDF or HTML, the text is fine but all the images are missing. The preview in the Sorter or clipping tool (which I think uses WebKit) shows the Web page correctly with all its images, but the final PDF or HTML file in the DT Inbox has no images, only white areas where the images should be.

If I render to a paginated PDF, these white areas turn black. If I open such a PDF in PDF Expert or Apple Preview, the images are still missing. This happens in all the Web browsers I use: Firefox, Chrome, and Safari. If I print these Web pages to a PDF file and import it into DT, the file is fine and all the images render correctly. On other Web sites I can clip pages with many images without any problem.

I have also tried saving the URL as a bookmark in DT and doing the conversion within DT, but with exactly the same negative result. So it seems to me that there is a problem with how the images are delivered by these Web pages and how the rendering engine handles them.

I should add that I am on an Intel MacBook Pro from 2018 with 16 GB of RAM running Ventura 13.6.7, the last macOS supported on this machine. It is therefore quite possible that this problem is linked to my older hardware and OS.

However, I am curious to know whether somebody has an explanation for why this happens, or whether I have simply not configured DT correctly. I see that similar problems have been reported previously, but are things better today, for example on modern hardware with Apple silicon?

Posting a URL that exhibits the issue might help solve the problem faster.

1 Like

Here is the URL: Gjennombrudd i forskningen på inflammatorisk tarmsykdom (IBD) – NRK Trøndelag – Lokale nyheter, TV og radio (in English: "Breakthrough in research on inflammatory bowel disease (IBD)" – NRK Trøndelag – Local news, TV and radio)
It is a Web page in Norwegian on the servers of the Norwegian Broadcasting Corporation (NRK). I think there is no region lock on this material, but I might be wrong.

Many popular sites use dynamic content delivery, with images and sometimes even the text being pulled from remote sources. Such behavior can affect clipping web content since more servers are involved. Also, some content is lazily loaded, so you need to scroll through the page to load it. Scrolling to the bottom of a page before clipping may help.
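
If a page is long, scrolling it by hand can be tedious. Here is a minimal sketch of the idea, pasted into the browser's developer console before clipping; this is just an illustration, not a DEVONthink feature, and the step size and delay are arbitrary guesses that may need tuning for slow connections.

// Scroll the page in steps so lazily loaded images get a chance to load,
// then jump back to the top before clipping.
async function scrollThroughPage(stepPx = 600, delayMs = 400) {
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
  let previous = -1;
  while (window.scrollY !== previous) {
    previous = window.scrollY;
    window.scrollBy(0, stepPx);   // advance one step down the page
    await sleep(delayMs);         // give the lazy loader time to fetch images
  }
  window.scrollTo(0, 0);          // return to the top before clipping
}
scrollThroughPage();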

Also, the quality of the network you’re on, responsiveness of the remote server, etc. all can have an effect on clipping.

Try this: add a bookmark of the page to DEVONthink, then use Tools > Capture on it.

BRAVO! Tools > Capture works and renders a 30 MB PDF+Text document. So Tools > Capture must work differently from context-clicking the bookmark and choosing Convert > to PDF. Many thanks for your help. If I run into such Web pages again, I now know how to handle them. Perfect!

1 Like

You’re welcome!
The Clip To DEVONthink extension works far more often than it doesn’t, but as you can see we have other tools to assist in the process. :slight_smile:

Here’s what they send as their first image (the lovely lady):

<img sizes="(min-width:1180px) 767px, (min-width:720px) 67vw, 100vw" srcset="https://gfx.nrk.no/LyqbViBmhf8PFOnYhkJ8AQoH9xJZRJqHXd0Y6v2vFOtg.jpg 80w, https://gfx.nrk.no/LyqbViBmhf8PFOnYhkJ8AQzdlrs7GQPhPd0Y6v2vFOtg.jpg 160w, https://gfx.nrk.no/LyqbViBmhf8PFOnYhkJ8AQZskBSNeXWiLd0Y6v2vFOtg.jpg 350w, https://gfx.nrk.no/LyqbViBmhf8PFOnYhkJ8AQ8wTvwEu1zY_d0Y6v2vFOtg.jpg 450w, https://gfx.nrk.no/LyqbViBmhf8PFOnYhkJ8AQYvzh9CCPfX7d0Y6v2vFOtg.jpg 650w, https://gfx.nrk.no/LyqbViBmhf8PFOnYhkJ8AQJD1pQXWP-wLd0Y6v2vFOtg.jpg 1000w, https://gfx.nrk.no/LyqbViBmhf8PFOnYhkJ8AQ9PJYp65E_xDd0Y6v2vFOtg.jpg 1200w, https://gfx.nrk.no/LyqbViBmhf8PFOnYhkJ8AQBTwYWr3cE13d0Y6v2vFOtg.jpg 1600w, https://gfx.nrk.no/LyqbViBmhf8PFOnYhkJ8AQPJ8YuOBKri_d0Y6v2vFOtg.jpg 2000w" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" alt="Petra Cyvin Storås står ute i en park og ser inn i kamera" title="Foto: Kirsti Kringstad / NRK">

Having a src attribute contain a 1x1 pixel GIF that is delivered as a data URI is crap. And wrapping an img in a div inside a figure is at least bad design. For the fun of it, I ran the page through the W3C’s validator:

https://validator.w3.org/nu/?doc=https%3A%2F%2Fwww.nrk.no%2Ftrondelag%2Fgjennombrudd-i-forskningen-pa-inflammatorisk-tarmsykdom-_ibd_-1.16939194

Borken HTML, too. As well as borken CSS (51 errors):

https://jigsaw.w3.org/css-validator/validator?profile=css3&warning=0&uri=https%3A%2F%2Fwww.nrk.no%2Ftrondelag%2Fgjennombrudd-i-forskningen-pa-inflammatorisk-tarmsykdom-_ibd_-1.16939194

Nine CORS errors in the Firefox console (neither Chrome nor Safari report those, though).

Interestingly, DT removes the srcset attribute completely from the img element. In my opinion, it shouldn’t be doing that when capturing an HTML document, @cgrunenberg? Anyway, what remains of the image is crap: a 1x1 GIF in a data URI.
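
If anyone wants to work around that before clipping, one option is to bake the URL the browser actually chose from srcset (exposed as currentSrc) into the plain src attribute, so even a capture that strips srcset still points at a real image. A rough console sketch, assuming the placeholder is the data URI shown above:

// For every image with a srcset, copy the browser's chosen candidate into src,
// replacing the 1x1 data-URI placeholder that would otherwise survive the capture.
document.querySelectorAll("img[srcset]").forEach((img) => {
  const chosen = img.currentSrc;                 // URL the browser resolved from srcset
  if (chosen && !chosen.startsWith("data:")) {
    img.setAttribute("src", chosen);
  }
});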

Even though you have found a way around the issue, these people should fix their site.

4 Likes

You would be forgiven for assuming that the website of a large and legitimate broadcasting corporation must be blessed with quality code – at the least, better code than that of an average content farm. That is surprisingly often not the case.

I was in fact shocked by the amount of (probably useless or harmful) JS code they include literally in the document. No link element, just script. And they use that to load CSS as well as other JS. It makes my toenails curl. Then they use invisible iframes that secretly load even more JS and at least one invisible SVG. This is just a huge WTF.
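
For anyone who hasn't seen that pattern before, it boils down to something like this. This is a generic sketch of script-injected CSS and JS, not their actual code, and the URLs are made up:

// Instead of a <link rel="stylesheet"> in the HTML, a bundled script creates
// and injects the stylesheet (and further scripts) at runtime. Nothing renders
// with proper styling until this code has executed.
const css = document.createElement("link");
css.rel = "stylesheet";
css.href = "https://example.com/bundle.css";        // hypothetical URL
document.head.appendChild(css);

const more = document.createElement("script");
more.src = "https://example.com/more-code.js";      // hypothetical URL
document.head.appendChild(more);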

OTOH, if it’s a public broadcasting company, they might be under-funded.

2 Likes

Another possibility is that they simply don’t care about codebase quality or website performance. They are likely aware that they should have a website that (1) is capable of displaying content, (2) yields them ad revenue, and (3) looks sufficiently modern (no news site wants to be associated with outdatedness). As long as these goals are met, they sign contracts with the IT firms charged with building the site.

Moldy CSS? Excessive JavaScript? Those are no one’s problem – as long as the content still (miraculously) loads in a browser.

@meowky and @chrillek I am as shocked as you. I would at least have expected their Web framework to be of a professional standard, so that the users, the journalists, would just need to dump their text and images onto the system without any fiddling or “clever” tricks. I wonder if their Web pages are home grown or if they have farmed out the design to some cheap sweatshop somewhere. That they are underfunded is of course true. They run on subscription funding and some government funding, the details of which I must admit I don’t know. But many thanks for the analysis, which shows why the DT clipping failed. The capture, on the other hand, hoovered up every single byte and rendered it perfectly. The PDF gets quite large of course, but at least every single comma is there.

@FrodeW I don’t want to tell you nonsense or things you already know. But the question is: do you really need a PDF? Importing websites as web archives usually works better for me, and I can add text (comments) or other things very easily - much easier than with PDF.

Given some of the variables involved in just viewing a web page, I sometimes think it’s near miraculous clipping works at all :wink:

1 Like

I could save web archives, and I did in the past, but for many years now I have used PDF files and I find that suits me fine. Usually I only save clutter-free text; saving pages with images happens less often, although on scientific pages with diagrams etc. I have to save complete pages. I hardly ever make notes or annotations on these pages. Notes I make only in my note-taking application or in Scrivener, where I refer to the Web page and add quotes or extracts if needed. With PDFs I can easily extract a few pages and send them to people, while with a web archive I would need to extract the text and possibly some images and use Word or some other program to turn it into a PDF, which most of my research colleagues prefer. However, I think it all depends on how you work.

Totally agree with this, and honestly, DT’s web clipping to PDF (and other formats) is so good that even if I were to switch products, I would install DT just to get the web clipper!!!

Not that I could switch even if I wanted to; DTTG is so good when coupled with DT. But I really want to single out how good that web clipper really is.

Welcome @JasonIron

Thanks for the kind words. They’re appreciated. :slight_smile:

1 Like

I agree with you that DT’s web clipping is superb, and that it works so well on the majority of web sites is quite amazing - especially as the web creators seem to bend over backwards to make clipping their content as difficult as physically possible. Today I stumbled on a web page on MuckRock where even DT’s Tools > Capture > PDF fails to pick up one of the images. For those of you who are curious to see and try, here is the URL:

The article is quite interesting and shows the problem of keeping and protecting our heritage for future generations.

Thanks! This works fine using the latest internal build.

And this will work too.

I’m missing the tape reel image when converting the bookmark to PDF within DT on Ventura.

Converting to a formatted note also misses the image, but there it is less apparent.