Clipper no longer clips content of Guardian pages

Hi DEVONians,

Just here to moan! When I clip pages from The Guardian (theguardian.com/), nothing other than a textual link is captured, whether I use web archives or Markdown, with or without ‘clutter free layouts’.

Not sure how long this has been happening (it will have played havoc with my archiving!) but it may have to do with a redesign they seemed to be incrementally releasing?

I love the clipper in DT. I hope it’s an easy fix. (DTPO 2.11.2)

Thanks

Dave

Unfortunately, because of the way content is delivered, especially on subscription-based or click-bait sites, the data you’re seeing is often not present in the actual page. The clutter-free option attempts to find only the content in the page, but if it cannot, you’ll get the results you’re seeing. Currently, the only option is not using clutter-free. Other options may become available in the future, as time allows. Thanks for your patience and understanding.

Thanks for the follow-up, Jim - however, the result is the same with clutter-free as without it.

The only format that captures the content is PDF, which obviously is a massive waste of space and computational resources!

I guess the culprit is that (standard) pages are wrapped in an element which isn’t named content, and I’d assume that the clipper is looking for a div named “content”?

If so, I’ll flag this up as an issue with the paper.

I clipped this page and got the content, with a few missing images: theguardian.com/environment … ing-report

I also just clipped it clutter-free and got the essential text, no images.

What browser are you using?

Hmm. Yes, I clipped that to confirm, and it came through fine (even using ‘use clutter-free layout’).

Maybe it was just a few of the articles I was looking at that it missed; I had gotten the impression that a few items I’d clipped had not come through as robustly as the clipper had led me to expect. Maybe it’s just me capturing the wrong type of online posts…!

Thanks for checking up, Jim! (glad to see you’re reading the same upbeat stories I am…)

(FYI I’m using Safari with uBlock Origin 1.16.0 and 1Blocker)

This problem remains unsolved. I tested with a couple of URLs from the Guardian (posted below); here’s what I found:

In Safari:

  • capturing an HTML page works, but the result is full of clutter (including a cookie notice and an overlay request to support the Guardian).
  • capturing a Web Archive works (but it still contains a lot of extra content, which will mess up search in DEVONthink, because there are a lot of links to other articles, including the ten most read articles on the day it was captured)
  • any attempt to capture as a clutter-free web archive in Safari leads to only a small bit of HTML containing a link to the article

In Chrome, any attempt to capture the first URL listed below leads to a page about:blank#blocked, even if I disable ALL other extensions (it was painful to see what the web looks like these days without an adblocker).

Attempts to capture the other URLs in Chrome were consistent with what happened in Safari, except for two minor details:

  • some of the black text had a dark blue background color, which makes that part of the text rather unreadable.
  • it asked me for permission to open DEVONthink every single time I captured a page

I understand how hard it is to make a web clipper, that’s why I suggest building on other people’s efforts, e.g. I found that when I use the Mercury Reader extension on Chrome, the article looks just fine. Since Mercury is Open Source, and also has a CLI web parser (see mercury.postlight.com/web-parser/), this might be a worthwhile thing to look at.
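To illustrate the general idea behind a "reader" style parser, here is a rough sketch in Python using only the standard library. It is not Mercury's algorithm and not how the DEVONthink clipper works; the list of tags treated as clutter, the class name and the sample markup (including the made-up wrapper class) are all my own assumptions. The point is only that it keeps paragraph text wherever it sits in the page, instead of relying on a wrapper element with a particular name.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as clutter in this sketch (an assumption).
SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}


class ParagraphExtractor(HTMLParser):
    """Collects the text of <p> elements that are not inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0       # > 0 while inside any SKIP_TAGS element
        self.in_paragraph = False
        self.current = []         # text fragments of the paragraph being read
        self.paragraphs = []      # finished paragraphs

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self.current = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self.current).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.current.append(data)


def extract_text(html):
    """Return the list of paragraph strings found outside clutter tags."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical page: the wrapper is not named "content" at all.
sample = """
<html><body>
<nav><p>Most read</p></nav>
<div class="made-up-wrapper">
  <p>First paragraph of the   article.</p>
  <p>Second <a href="#">linked</a> paragraph.</p>
</div>
<footer><p>Support us</p></footer>
</body></html>
"""

print(extract_text(sample))
```

On the sample, the paragraphs inside the nav and footer are dropped and the two article paragraphs are kept. A real parser like Mercury does a lot more than this (scoring candidates, handling images, bylines and so on), which is exactly why reusing it seems more attractive than starting from scratch.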

Here’s the URLs I tested:

  • theguardian.com/commentisfree/2019/oct/30/changing-world-better-economics-honest-humane
  • theguardian.com/world/2019/oct/24/western-liberalism-failed-post-communist-eastern-europe
  • theguardian.com/environment/2019/feb/04/a-third-of-himalayan-ice-cap-doomed-finds-shocking-report

(for some reason I can’t seem to include links in my post, so I had to mark them up differently)
