I have a collection of HTML news articles in my database that I scraped from a trade news publication using “downloadthemall” with Firefox. I’d like them to be clutter-free PDFs (without all the other webpage text and images). I can open each one, one at a time, and re-clip it, but that is too tedious. Is there a way to bulk convert the HTML pages to clutter-free PDFs through automation?
The other option is to clip them all properly at source, but that is also too tedious.
Alternatively, I’d be open to a different workflow from the start for scraping/clipping a list of links to news articles. I used the “downloadthemall” Firefox extension because I was familiar with it, but there may be a better way, using DEVONthink or some other tool, that gets me to a list of clutter-free PDF versions of those articles (though that may be a better question for the MPU forum).
Assuming that the URL fields of the HTML documents are proper links, select the documents and run the following script:
tell application id "DNtp"
	set theRecords to selected records
	set destinationGroup to incoming group -- or specify a destination for the PDFs created
	repeat with theRecord in theRecords
		-- only records whose URL field points to a web page can be re-downloaded
		if (URL of theRecord) starts with "http" then
			-- fetch the page again and render it as a clutter-free (readability) PDF
			create PDF document from (URL of theRecord) in destinationGroup name (name of theRecord) with readability
		end if
	end repeat
end tell
Note: This script has not been tested. You might want to try it on a few records first.
Sorry if I wasn’t clear. I could do that, but the trade publication I subscribe to and am trying to “scrape” allows a smart search for (for example) “Canada & CUSMA”, which returns a page of 100 links to articles. Nearly all of them are relevant to my work, so I would like a workflow that downloads all the linked articles on that page without my having to open each one individually. “Downloadthemall” allows that, but only as HTML, so I still have to find a way to automate the conversion, otherwise I am converting from HTML one by one anyway. If DT offers a way to do that sort of automation (i.e. open and download all linked articles and convert them to clutter-free PDFs), then I’m all ears. But, as you say, the best option may be to just open each linked web article and clip it to a clutter-free PDF manually.
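In case it helps, here is roughly what I am imagining, as an untested sketch. I am guessing at DEVONthink’s download markup from and get links of commands, and the search URL and the "article" filter string are just placeholders for whatever the publication actually uses:
tell application id "DNtp"
	set searchURL to "https://example.com/search?q=Canada+CUSMA" -- placeholder for the smart-search results page
	set destinationGroup to incoming group -- or a specific destination group
	-- download the results page and collect the links it contains
	set pageSource to download markup from searchURL
	set theLinks to get links of pageSource base URL searchURL containing "article" -- filter string is a guess
	repeat with theLink in theLinks
		-- one clutter-free PDF per linked article, as in the script above
		create PDF document from theLink in destinationGroup with readability
	end repeat
end tell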
I totally agree. I meant only that, if the workflow to automate the downloading and conversion of linked webpages is better done outside of DT3, I did not think it fair to expect the DT3 community to provide it. I did hope that someone here would have a solution, but I just wanted to make it clear that I don’t think this community should be expected to solve all my tech-related problems.
Okay, I thought I had figured it out. I don’t even need “downloadthemall”, as you implied. The script Download/Links of Page downloads all the articles as HTML.
Then I selected all the HTML files and ran the script Download/As clutter-free PDF. I thought it was converting them all to clutter-free PDF+text, but on inspection the articles were being cut off (see attached). I’ve logged into my account for the periodical both in Safari and within the DT browser, but it still isn’t working; maybe it is still a paywall issue, which of course is not DT3’s fault.
Hi, thanks for responding.
Yes, I’m logged in on the DT3 browser. If I open the truncated PDF generated by the script in DT3 and click through its link, the original webpage opens in Safari and clips fine from there.
It is just the script doing the bulk PDF clipping from the HTML files/links within DT3 that isn’t getting through the site.
I did check with a non-paywalled source, and this workflow worked fine.
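Since the article text is already inside the downloaded HTML records, one workaround I might try is converting those stored records directly instead of re-fetching each URL through the paywall. This is only a sketch and assumes DEVONthink’s convert command accepts PDF document as a target; it would not apply the clutter-free/readability cleanup, it would just render the page as already captured:
tell application id "DNtp"
	set theRecords to selected records
	repeat with theRecord in theRecords
		if type of theRecord is html then
			-- render the stored HTML itself as a paginated PDF;
			-- no web request is made, so the paywall session does not matter
			convert record theRecord to PDF document
		end if
	end repeat
end tell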
I suppose it’s not that big a deal to do it manually - it was nice to find the “downloadthemall” functionality from within DT3! It just seemed like I was so close to being able to automate it…!
Maybe this is just another one of those cases where doing it manually would have been faster than playing around trying to figure out a way to automate it! But I still learned something.