Capture Markdown from faz.net

Hello everybody,

I got the following problem:

When trying to clip/capture content from the faz.net website, I receive only some messed up lines of text like in the picture.

This happens almost exclusively on faz.net, whether the article is behind the paywall or not.

Is there any way to fix this, or is there an explanation for why content from faz.net behaves like this?

OS: Catalina 10.15.7
Safari 14.0.1
DT3: 3.6.1

Regards

Websites are not created equally. In fact, there are many ways people design and code them. This means a capture, especially one trying to extract an article as the clutter-free option does, isn’t always feasible.

Does it capture as a PDF without the clutter-free option?

I tried MD, MD/clutter-free and PDF (cluttered), all to no avail. The two Markdown options gave exactly the same result as posted before (though on another page; this is basically surrounding noise, not the article itself). The PDF was simply blank.

This is from the free site, so no paywall, no login required.
The relevant text seems to be wrapped in an article element, as is appropriate. As much as I agree that not all websites are created equal, extracting the text content from an article element should be fairly basic stuff. What can be seen in the MD snippets is actually the content of an inline style element contained in a div (yes, I’d consider that very bad style). Wouldn’t it be sensible to skip inline styles in any case?
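To illustrate what I mean, here is a minimal sketch using Python's standard html.parser. It is only an illustration of the idea, not how DEVONthink's clipper works internally; the sample markup and the names ArticleTextExtractor and extract_article_text are made up for this example.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect text inside <article>, ignoring <style> and <script> content."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__()
        self.article_depth = 0  # > 0 while inside an <article> element
        self.skip_depth = 0     # > 0 while inside <style> or <script>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is inside the article
        # and not inside a style/script element.
        text = data.strip()
        if text and self.article_depth and not self.skip_depth:
            self.chunks.append(text)


def extract_article_text(html):
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page: an inline <style> sits inside a <div> within the article.
sample = """
<html><body>
<nav>Menu</nav>
<article>
  <h1>Headline</h1>
  <div><style>.x{color:red}</style><p>First paragraph.</p></div>
</article>
</body></html>
"""

print(extract_article_text(sample))
```

With this sample, the CSS rule inside the style element and the navigation text are dropped, and only the headline and paragraph remain.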

I think it would depend on the particular situation.

I didn’t develop the clutter-free mechanism so I can’t speak too deeply on the underlying tech.

Print to PDF (instead of clipping) works with faz.net, Reader Mode active and not.

Clipping works with the print version of a FAZ page. When the printer icon is visible—in my case: not visible in Safari but in Firefox—just click it before clipping. If not, add ?service=printPreview to the URL.
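If you would rather script the URL change, the following is a rough sketch in Python. Only the service=printPreview parameter comes from what I described above; the helper name print_preview_url and the example URL are placeholders, and I have not checked that every FAZ page honours the parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def print_preview_url(url):
    """Return the URL with service=printPreview set in its query string."""
    parts = urlsplit(url)
    # Keep existing parameters, but drop any earlier "service" value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "service"
    ]
    query.append(("service", "printPreview"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder URL, not a real article.
print(print_preview_url("https://www.faz.net/aktuell/example.html"))
```

The resulting URL can then be opened in the browser and clipped as usual.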

It’s a pain that there is no single way to fetch the content of different webpages. At the moment I can’t get the content of zeit.de on iOS/iPadOS except via the tortuous Share to Print routine, which loses the URL and thus leads to even more steps of copying and pasting the URL between Safari and DTTG. No, shortcuts are of no help.

Thank you. I will, or rather, I already do.

I really like the Markdown feature because of the file size. A PDF is significantly larger and not as handy to edit…

I was just hoping there might be some top-secret hidden setting I didn’t know about that would fix the problem.

As I understand the discussion above between chrillek and bluefrog, there is nothing to be done about it.

Too bad. But thank you for your time.

If you use Keyboard Maestro you could try this macro

Here’s how it looks

The problem (rendering the inline style element) is not related to clutter-free mode, it exists for both MD variants.

Just to prevent any misunderstandings: My complaint was not about DEVONthink’s clutter-free technology.

All clutter-free view modes I know about (in feed readers like Newsify or Reeder, and also the reader modes of different browsers) hit a brick wall at some point with some web pages. There is no unified markup or structure to web pages that makes the actual content reliably distinguishable from other page elements such as advertisements. Those are, of course, the main reason why web page providers have no interest in putting any effort into making this distinction clear.

My guess is that the developers, apart from using inline styles, have set the HTML to be minified (that is, whitespace, line breaks and the like are removed). This is a way to reduce load times, and while it is common for JavaScript and CSS files, it’s less common for HTML. This can also happen when JavaScript is used to output the HTML, perhaps from a JS-based content management system or static site generator.

It does not look like that. And whitespace is not relevant in HTML anyway (with the exception of text elements of course).

Removing whitespace reduces file size, so yes, whitespace is relevant in HTML when it comes to minification: https://www.imperva.com/learn/performance/minification/

I amend my statement to “whitespace is not relevant to rendering HTML”.

Could you be more specific about which KM macro you’re recommending? There are lots of different ones in that thread, and none of them come from the handle @houthakker.

This macro.