How to Cleanly Capture a Difficult Webpage?

kappabear · December 1, 2019, 6:46pm

I’m very much interested in learning the best, easiest way to capture a page like what I’ve included via my iPhone or iPad (using iOS 13.x). How would you go about doing so in a clean, ideally ad/clutter-free manner? (The format, e.g. PDF, Rich Text, Markdown, etc., doesn’t matter to me, though I’d prefer it not be a multi-page PDF.)

BLUEFROG · December 1, 2019, 10:45pm

There is no singular, bullet-proof way to capture every webpage. Sometimes it requires fiddling about; some resist capture except by a static format like PDF.

like what I’ve included via my iPhone or iPad (using iOS 13.x).

? What have you included?

kappabear · December 1, 2019, 10:47pm

Hey Jim, I included the link to the SFGate website, for capture testing. Hopefully you see it in this post.

BLUEFROG · December 1, 2019, 10:59pm

Ahh… yes.
I captured the page as a single-page PDF, without the clutter-free option, and got most of it.

It’s possible to capture the page in HTML format and edit it too. See:

It may not clean up all the underlying JavaScript, etc., but you’d be surprised.

kappabear · December 1, 2019, 11:09pm

Yeah, mine too but the ads drive me crazy. When I do clutter-free I get an almost completely blank page. But, I guess it’s better than nothing.

I posted the same question in DTTG, wondering how best to capture from iOS 13, as it’s even more restrictive.

BLUEFROG · December 1, 2019, 11:12pm

Unfortunately, iOS is a much more limited platform for operations like the manual cleanup, but if you did capture the page in iOS, you could still clean it up in DEVONthink on the Mac.

kappabear · December 1, 2019, 11:13pm

Is there any way to view and edit the HTML code in DT3? I haven’t figured out how to, if there is.

BLUEFROG · December 1, 2019, 11:19pm

Yes. For HTML and webarchive files, you can use View > Document Display > Source, if you’re so inclined.

RobH · December 2, 2019, 5:32pm

For hard-to-grab sites like this, I’ll just make a bookmark and grab it later when I’m sitting at my desktop Mac. For this site, it still required a lot of steps to capture and remove clutter, but mostly because it was such a long web page.

BLUEFROG · December 2, 2019, 5:37pm

Agreed. It’s unusually long. A bit of a bad design IMHO.

kappabear · December 2, 2019, 6:03pm

I’m trying to figure out the best way to use DT3 as a replacement for Pocket and Apple’s Reading List. I’d like to be able to easily grab interesting articles and archive them in a neat, clean and easy fashion. It’s nice to be able to then search for a keyword or two, when I want to later find the article vs sifting through a bunch of bookmarks.

I saved this article as HTML, then manually cleaned it up using TextSoap, and converted it to a Rich Text document, but all that took forever. I hope to find a better solution.
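
Part of the manual cleanup described above could probably be scripted. What follows is only a minimal sketch, not how DEVONthink or TextSoap work: it uses Python's standard-library `html.parser` to drop `script`, `style`, `iframe` and `noscript` elements from a saved HTML file and keeps everything else as is. The names `DROP`, `ClutterStripper` and `strip_clutter` are made up for illustration, and the list of elements to drop is an assumption; real pages usually need more than this.

```python
from html.parser import HTMLParser

# Elements removed together with everything inside them.
# This list is an assumption; extend it for a particular site.
DROP = {"script", "style", "iframe", "noscript"}


class ClutterStripper(HTMLParser):
    """Re-emits the HTML it is fed, minus the elements named in DROP."""

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a dropped region.
        if tag not in DROP and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        # Preserve the doctype; comments are dropped by default.
        self.out.append(f"<!{decl}>")


def strip_clutter(html: str) -> str:
    """Return html without the elements listed in DROP."""
    parser = ClutterStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

To try it, read the saved file into a string, pass it to `strip_clutter`, and write the result to a new file that can be imported again. An unclosed `script` or `iframe` tag would swallow the rest of the page, so it is worth checking the output before deleting the original.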

BLUEFROG · December 2, 2019, 6:21pm

Pocket and Reading List just store Bookmarks, so why not just clip the pages as Bookmarks, then add them to DEVONthink’s Reading List?

Here is something I’ve been running quietly (though I just added the other capture formats and an Add to Reading List action)…

This yields a group like this, here targeted in the Global Inbox…

And also adds to the Reading List sidebar.

PS: From my graphic arts and printing background, I tend to compartmentalize data like this group structure. It’s not required, but it’s a good example of using the File action with placeholders.

Also notice it’s very specific, with the condition Date Added is This Hour. This leaves items that are already in the targeted location alone and acts only on newly clipped items.

And if anyone wants it, or to play about with it:
Organize Web Clippings.dtSmartRule.zip (1.4 KB)

kappabear · December 3, 2019, 4:59pm

I wasn’t aware that Pocket was just saving a bookmark. I’d actually emailed them a few days ago, asking them what format they were storing articles in (bookmark or web archive). I was curious how they dealt with changes, updates or deletions to articles that had been saved in Pocket. I’ve yet to hear from them.

BLUEFROG · December 3, 2019, 5:05pm

I wasn’t aware that Pocket was just saving a bookmark.

As far as I’m aware, that’s what they’re up to. I used it briefly for support some time ago, but I don’t use services like that except when I have to for work.

kappabear · December 3, 2019, 5:07pm

I use it because it’s really easy and convenient, but I’d prefer to keep a readable version of the document in DT3.

BLUEFROG · December 3, 2019, 5:13pm

I just clip into DEVONthink, Bookmarks and all.

kappabear · December 3, 2019, 5:18pm

@BLUEFROG: Makes sense, but I want it to be “pretty” (clutter-free), especially since many ads attempt to track you. Christian mentioned that he uses Services -> DEVONthink: Take Rich Note. I like the idea, as you’re only capturing the text and images that you want, but it’s a bit tedious and time-consuming. Unfortunately, Services isn’t available in all browsers, e.g. Firefox, and not all of the options are available in every browser.

Safari:

Brave:

Here you see that in Brave, you can only Take & Append a Plain Note, as opposed to Rich Text, or Markdown. Any suggestions on how to get a Rich Text document in Brave (or Chrome)? Also, what are Summarize and Lookup in DT3?

BLUEFROG · December 3, 2019, 5:40pm

Services are dependent on what the current application is reporting. This is not under the control of the service developer.

From Help > Documentation > In & Out > Services…

kappabear · December 3, 2019, 5:51pm

Thanks Jim. You ALWAYS have the answers, and I appreciate it!

BLUEFROG · December 3, 2019, 6:04pm

You’re very welcome and thanks for the kind comments. They’re much appreciated!