Bulk convert html to clutter free pdf (or bulk clip clutter free pdf from list of links)

talundbl · December 16, 2024, 11:52am

Hello,

I have a collection of html web news articles in my database that I scraped from a trade news publication using “downloadthemall” with Firefox. I’d like them to be clutter free pdf (without all the other webpage text and images). I can open them one at a time and reclip them, but that is too tedious. Is there a way to bulk convert the html pages to clutter free pdf through automation?

The other option is to clip them all properly at source, but that is also too tedious.

Alternatively, I’d be open to a different workflow to start for scraping/clipping a list of links to news sources. I used the “downloadthemall” Firefox extension because I was familiar with it, but there may be a better way using devonthink or some other tool that gets me to a list of clutter free pdf representations of those articles (but that may be a better question for the MPU forum).

Thanks,

meowky · December 16, 2024, 1:12pm

Edit:

See @cgrunenberg's post below

Assuming that the URL fields of the HTML documents are proper links, select the documents and run the following script:

tell application id "DNtp"
	set theRecords to selected records
	-- PDFs go to the incoming group by default; set another group here if preferred
	set destinationGroup to incoming group
	repeat with theRecord in theRecords
		-- Skip records whose URL is not a web address (e.g. local files)
		set theURL to URL of theRecord
		if theURL starts with "http" then
			-- "with readability" requests the clutter-free rendering
			create PDF document from theURL in destinationGroup name (name of theRecord) with readability
		end if
	end repeat
end tell

Note: This script has not been tested. You might want to try it on a few documents first.

cgrunenberg · December 16, 2024, 1:16pm

See also Scripts > Download > As Clutter-Free PDF Documents (One Page) and Scripts > Download > As Clutter-Free PDF Documents (Paginated)

meowky · December 16, 2024, 2:02pm

Thanks. I just realized that those Download scripts apply to more than bookmarks. Nice work there!

BLUEFROG · December 16, 2024, 2:31pm

but there may be a better way using devonthink or some other tool that gets me to a list of clutter free pdf representations of those articles.

Why aren’t you just using our browser extension and clipping pages to clutter-free PDF as you’re viewing them?

(but that may be a better question for the MPU forum)

MPU is not the better place to ask DEVONtech-related questions.

talundbl · December 17, 2024, 1:01am

Sorry if I wasn’t clear. I could do that, but the trade publication that I subscribe to and am trying to “scrape” has a smart search: a query such as “Canada & CUSMA” returns a page of 100 links to articles. They are nearly all relevant to my work, so I would like a workflow to download all the linked articles on the page without having to open each one individually. “Downloadthemall” allows that, but only as html. I still have to find a way to automate the conversion, or else I am converting from html one by one anyway. If DT offers a way to do that sort of automation (i.e. open and download all linked articles and convert them to clutter free pdf), then I’m all ears. But, as you say, the best option may be to just open each linked web article and clip it to clutter free pdf manually.

I totally agree. I meant only that if the workflow to automate the downloading and conversion of linked webpages is better done outside of DT3, I did not think it fair to expect the DT3 community to provide me with that workflow. I did hope that someone here would have a solution, but just wanted to make it clear that I didn’t think this community should be expected to solve all my tech-related problems.
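For the link-gathering step described above (one results page with about 100 article links), here is a minimal sketch of how the links could be pulled out of a saved copy of that page outside DEVONthink. It uses only the Python standard library. The names `LinkCollector` and `article_links`, the `path_prefix` filter, and the example URLs are all hypothetical, and it assumes the article links sit on the same host as the results page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links into absolute ones
            self.links.append(urljoin(self.base_url, href))


def article_links(html, base_url, path_prefix="/"):
    """Return unique http(s) links on the same host as base_url whose
    path starts with path_prefix, in the order they appear on the page."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    host = urlparse(base_url).netloc
    seen, result = set(), []
    for link in parser.links:
        parts = urlparse(link)
        # Drop "#fragment" so the same article is not listed twice
        clean = parts._replace(fragment="").geturl()
        if (parts.scheme in ("http", "https")
                and parts.netloc == host
                and parts.path.startswith(path_prefix)
                and clean not in seen):
            seen.add(clean)
            result.append(clean)
    return result
```

Calling `article_links(saved_html, results_page_url, "/news/")` would give a de-duplicated list of article URLs, which could then be written to a text file, one per line. This only gathers the links. It does not log in to the site, so whatever fetches the articles afterwards still needs access to the paywalled content.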

talundbl · December 17, 2024, 2:04am

Okay, I thought I had figured it out. I don’t even need “downloadthemall”, as you implied. The script Download/Links of Page downloads all the articles as html.

Then, I tried to highlight all the html files and run the script, Download/As clutter free pdf. I thought this was working to convert them all to clutter free pdf+text, but upon inspection it was cutting off the article (see attached). I’ve logged into my account for the periodical both in safari and within the DT browser, but it still isn’t working - maybe it is still a paywall issue which of course is not DT3’s fault.

canadian-envoy-usmca-could-provide-forum-carbon-talks.html.pdf (39.0 KB)

BLUEFROG · December 17, 2024, 2:47am

Are you logged into the site and clipping from within DEVONthink?

Also, the clutter-free mechanism happens outside a browser or DEVONthink so you may have to give up on that option for this site.

talundbl · December 17, 2024, 3:11am

Hi, thanks for responding.
Yes, I’m logged in on the DT3 browser. If I open the truncated pdf generated by the script in DT3 and click through the link it opens the original webpage in safari and it clips fine from there.
It is just the script doing the bulk pdf clipping from the html files/links within DT3 that isn’t working through the site.

I did check with a non paywalled source and this workflow worked fine.

I suppose it’s not that big a deal to do it manually - it was nice to find the “downloadthemall” functionality from within DT3! I just seemed so close to being able to automate it…!

Maybe this is just another one of those cases where just doing it manually would have been faster than trying to play around with figuring out some way to automate it! But, I still learned something.

BLUEFROG · December 17, 2024, 4:10am

click through the link it opens the original webpage in safari and it clips fine from there.

Using clutter-free or not?

But, I still learned something.

I’d call that a win!

PS: you’re welcome.

talundbl · December 17, 2024, 11:08am

Yes, using clutter free works through safari on the paywalled website when I’m logged in to the website.