Problematic webpage to store in DT

tja · November 10, 2024, 8:51pm

I am trying to get the following description of ripping PAL DVDs to h264 into DEVONthink.

I tried on the Mac, both with Safari’s save and export options and with DEVONthink (paginated and non-paginated PDFs, with and without clutter), and I also tried to store the page on an iPad, as this sometimes gives the best result.

Whatever I try, I cannot get it right.
The PDFs are either black in large parts or (for clutter free) lack the images.

The “best” version was from Export to PDF within Safari on the Mac, but this is not paginated and is difficult to use. It also did not get most of the images for the second page (converting to h265, which is linked at the beginning).

Any idea how to get a straight PDF from this?

msteffens · November 10, 2024, 9:25pm

Have a look at the PrintFriendly browser extension that’s available for Chrome, Firefox & Safari. From my experience, it usually produces workable results when converting web pages to paginated PDFs.

troejgaard · November 10, 2024, 9:50pm

Looks like a website with some bad/weird coding.

For example, it doesn’t use proper header elements for some reason. And instead of a <details>/<summary> element, it uses <span> and some <div> with custom classes.

Safari’s reader mode parses it mostly okay, but loses most of the images. Sometimes the images do come through, but most of the time they don’t for me. I never understood why… but looking with the Web Inspector, it looks like all of the images have these attributes: loading="lazy" decoding="async" – I suspect that is the reason. (It’s just that, even when I load all the images in the reader view, they still don’t print to PDF.)
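If someone wants to test that suspicion on a saved copy of the page, here is a minimal Python sketch that removes the two attributes from every <img> tag. The function name is made up for illustration, and whether removing the attributes actually changes what Safari prints is untested; it only makes the experiment easy to run.

```python
import re

# Matches loading="lazy" or decoding="async" (single or double quotes),
# together with the whitespace in front of the attribute.
LAZY_ATTR = re.compile(
    r'''\s+(?:loading\s*=\s*["']lazy["']|decoding\s*=\s*["']async["'])''',
    re.IGNORECASE,
)
# Matches a whole <img ...> tag so that other elements are left alone.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_lazy_attributes(html: str) -> str:
    """Remove lazy-loading hints from <img> tags only (hypothetical helper)."""
    return IMG_TAG.sub(lambda m: LAZY_ATTR.sub("", m.group(0)), html)
```

Run it over the HTML source of a saved page, then open or print the result to see whether the images behave differently.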

For problematic pages, I clip as markdown, clean it up manually if necessary, and convert from markdown to PDF. For this page, the markdown result from DEVONthink’s web clipper is mostly okay. Problem points:

Not all <br>s are parsed correctly. (Notably the “To convert to… CLICK HERE!” lines and the section “THE SHORT VERSION”.)
All the images appear twice, with the first being some empty container of sorts – I often see this.
The alt text is parsed as figcaption, where the actual page has none. This is also common. I would delete it, or correct it to actually be alt text.
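The doubled images are the kind of thing that is quick to clean up with a script. Below is a rough Python sketch, under the assumption that each image sits on its own line as ![alt](url) and that both copies point at the same URL, with at most blank lines between them. The real clipper output may look different (the first copy is described above as an empty container), so treat the pattern as a starting point. It keeps the first copy and does not touch the figcaption text.

```python
import re

# A line that contains nothing but one Markdown image; group 1 is the URL.
IMAGE_LINE = re.compile(r"^\s*!\[[^\]]*\]\(([^)\s]+)[^)]*\)\s*$")


def drop_repeated_images(markdown: str) -> str:
    """Drop an image line that repeats the previous image's URL (hypothetical helper)."""
    kept = []
    last_url = None  # URL of the latest image, reset as soon as other text appears
    for line in markdown.splitlines():
        match = IMAGE_LINE.match(line)
        if match:
            if match.group(1) == last_url:
                continue  # same image again, skip the repeat
            last_url = match.group(1)
        elif line.strip():
            last_url = None  # real content in between, so a later copy is intentional
        kept.append(line)
    return "\n".join(kept)
```

Note that the blank line in front of a dropped image stays in place, which Markdown renderers ignore.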

Brett Terpstra’s Marky the Markdownfier does a bit better. It correctly parses the <br> tags, and it doesn’t use the alt image text. But every image is still double, and now they both actually render. He recently gave Marky an overhaul for a 2.0 version. Blog post with details: Marky 2.0 - BrettTerpstra.com.

If everything else fails, I select all the content I want to clip and use the DEVONthink 3: Create Markdown Document System Service. This always requires some cleanup. I do this often enough that I have developed some RegEx habits to speed up the cleaning. And I have actually stopped converting much of it to PDF, since I like the ability to reflow the content. I can always convert to PDF when I need it.

One more thing:

This is not a second page, but a different article/post, published a year later. You need to clip it separately no matter the approach.

tja · November 10, 2024, 11:30pm

Thanks, I am going to check this out!

tja · November 10, 2024, 11:35pm

Thanks, this is from the “Marked 2” people - I know this from earlier times.
Will check out too!

About the “second page”, I meant the second page that I wanted to archive.
I tend to buy more and more DVDs (again) and want to rip them.

I tried DT Markdown too, but that did not look good to me …

troejgaard · November 11, 2024, 12:47am

Alright, I wasn’t sure about the second page, so I thought it was worth pointing out.

I’m pretty sure the only person working on Marked 2 is Brett.

In what way did DEVONthink’s markdown clipping not look good? I can’t tell if we get different results. Did you just look at the rendered output, or did you look at the actual markdown?

As I said, there are some small parsing errors in my DT result. But the markdown itself is pretty quick and easy to clean up. At least the <br> tags are actual line breaks – sometimes they disappear completely. Only a single space is missing at the end of the line.
(If you want to replicate the coloured text on the website, there is a little more work to do, but it’s doable.)
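For context: Markdown turns a line ending into a hard line break only when the line ends with two or more spaces. Assuming the clipped lines end with exactly one space (that is my reading of the result described above, not something verified), a small Python sketch like this adds the missing one. The function name is just for illustration.

```python
import re

# One space at the end of a line, directly preceded by a non-space character.
# Lines that already end with two or more spaces do not match.
SINGLE_TRAILING_SPACE = re.compile(r"(?<=\S) $", re.MULTILINE)


def fix_hard_breaks(markdown: str) -> str:
    """Pad a single trailing space to two so it renders as a hard line break."""
    return SINGLE_TRAILING_SPACE.sub("  ", markdown)
```

This touches every line that ends with a single space, so stray trailing spaces elsewhere in the document also become line breaks. It also expects plain "\n" line endings.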

I have never ripped a DVD, so I can’t help with that. But if you care about the metadata/MP4 tags, I think you will like the application Subler. It can even fetch the metadata for many films and TV series.

troejgaard · November 11, 2024, 1:09am

Because I often want markdown instead of PDF, I forgot that I already have the PrintFriendly bookmarklet in my browser. Thanks for reminding me. I can report that it works very nicely with this site!

cgrunenberg · November 11, 2024, 6:45am

This works for me:

Close all banners (cookies etc.). Scroll down to the end of the page in Safari. This ensures that all dynamic contents are loaded.
Print the page to DEVONthink.

troejgaard · November 11, 2024, 10:48am

Like @tja, I also just got a lot of black and no text using the standard print command. But now I see… It’s because I had “Print Backgrounds” enabled in the print settings! If I disable that, the result is much better. (It still includes the sidebar – the PrintFriendly extension msteffens shared removes this.)

It’s always good to check how “Print Backgrounds” affects the result. I forgot to this time. When disabled, many elements either completely disappear, or have too low contrast to read. Sometimes that is desirable because the site doesn’t have any print styling in the CSS – so it “removes” a bunch of stuff you want to remove. But sometimes part of the content you want disappears too.

This is the first time I have seen a background completely override the text.

Another tip: If a page has lots of extra elements (or just a sidebar), it can make a big difference to adjust the window size before you print. Make it smaller/narrower (enough for the responsive layout to change), and some pages print much nicer. For this page, if you make the window narrow enough, the sidebar also disappears from the print output.

rmschne · November 11, 2024, 12:07pm

In addition to what @cgrunenberg says above, I also show the pages in Safari’s Reader view, then print that. For the web pages I read, it normally works well. Sometimes not all the graphics come through, but then I just revert to the regular view, or decide to just not save it.

I save too much, actually.

I also use the app “PDF Squeezer” to get 99% (for pages with big image files) or less compression. Works well. I don’t need “perfect” images, and if I do, I just “weaken” the compression.

troejgaard · November 11, 2024, 2:16pm

I save too much, actually.

Guilty as charged. I try to move much of it to a “Junk Drawer” database, but my Global Inbox is a bit too large.

I used Safari’s Reader View for a long time, sometimes still do. But if there are images, my experience is – more often than not – that most of them don’t come through in print, even if I make sure to load them in the reader view. Plus, after fine-tuning my own CSS, I prefer the PDF output from that. In comparison I find Safari’s font-size a bit too large, and the page margins a bit too narrow.

Like you I make sure to compress the PDFs. Sometimes the file size is absolutely ridiculous! It can be easy to save 10-50 MB (or more, even), and it does stack up. This is another reason to like markdown, the files are much smaller. (Of course images still take space).

I have looked at PDF Squeezer, but I don’t know if it’s worth the investment for me. I use PDF Expert; its compression works quite well, even if the options are minimal. Last time I checked, the difference in file size wasn’t that much. To me, the main attraction of Squeezer is the automation options.