Need help determining correct product - need to scrape entire forum

Authrom · January 31, 2022, 6:37am

Hello all! Hoping someone can point me in the right direction here. I’ve heard that DEVONthink is great archiving software, but I am a bit confused as to what I need to purchase to do what I want.

I would like to use DEVONthink to make an archive of a forum. I would also like it to scrape all images that are posted on the forum.

If it can also give me a way to search for all of the hyperlinks on the pages once they are archived, and to export those hyperlink URLs, that would be great.

Am I in the right spot? I appreciate any input before I decide to start the trial (assuming there still is one).

Thanks!

cgrunenberg · January 31, 2022, 7:41am

Actually this sounds like a typical task for a web scraper which is not really DEVONthink’s primary usage scenario. Does the forum require a login?

Authrom · January 31, 2022, 8:38am

Thanks for following up. Yes, the forum does require a login.

I did go ahead and download the trial, and I got it to scrape a single page via the download manager, but I was hoping it would follow the links on the page rather than the subdirectories on the web server.

chrillek · January 31, 2022, 8:41am

Seriously? If I put a link to a Google search in the forum, would DT then have to follow that link and download the Google index?

I agree with @cgrunenberg: use curl or wget or something like that to mirror the forum. Those are made for this task.

Authrom · January 31, 2022, 8:56am

Obviously it would not be able to do that without some type of additional logic. This application supports scripting, so I figured it may be possible. Someone posted a script that actually did grab the PDFs off of a page, so if that is possible then it’s not too much of a stretch to think it could grab hyperlinks instead.

Thanks for the suggestion, though. I’ll look into those tools.

chrillek · January 31, 2022, 9:08am

If you find a way to reliably parse HTML documents and analyse URLs, it just might. Though it will probably take a lot more work than with the appropriate tools.

Also, you might run into problems with dynamically generated content, regardless of the technique you use.
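
To give a rough idea of what "parsing HTML and analysing URLs" involves, here is a minimal Python sketch using only the standard library. It collects the targets of `<a href>` and `<img src>` tags from a page that has already been saved. The class and function names and the `forum.example.com` address are made up for illustration, and it only sees what is in the static HTML, so dynamically generated content will be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndImageCollector(HTMLParser):
    """Collect absolute URLs from <a href> and <img src> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative URLs
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def extract_urls(html, base_url):
    """Return (links, images) found in an HTML string."""
    collector = LinkAndImageCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links, collector.images


if __name__ == "__main__":
    # Hypothetical page content and address, for demonstration only.
    sample = '<p><a href="/t/topic/1">Topic</a> <img src="img/a.png"></p>'
    links, images = extract_urls(sample, "https://forum.example.com/latest")
    print(links)
    print(images)
```

The two lists could then be written to a text or CSV file to get the exported hyperlink URLs asked about in the first post.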

rmschne · January 31, 2022, 9:15am

As suggested by others, curl or wget (available on your Mac) are more appropriate. And your Mac has tremendously capable scripting/programming languages available (shell scripts, Python, Perl, etc.) so that you can bend curl and wget (or the language’s libraries) to your will.
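
As a hedged starting point, a small Python wrapper can assemble a wget mirroring command. This sketch assumes wget is installed, and that you have logged in with your browser and exported the session cookies to a Netscape-format `cookies.txt` file. The function name, file names, and forum address are placeholders, and the right set of options will depend on the forum.

```python
import shlex


def build_wget_command(start_url, cookies_file, output_dir, wait_seconds=2):
    """Return an argv list for mirroring a site with wget."""
    return [
        "wget",
        "--mirror",                 # recursive download with timestamping
        "--page-requisites",        # also fetch images, CSS, and scripts
        "--convert-links",          # rewrite links for offline browsing
        "--adjust-extension",       # save HTML pages with an .html suffix
        "--no-parent",              # never ascend above the starting directory
        f"--wait={wait_seconds}",   # pause between requests
        "--random-wait",            # vary the pause to be gentler on the server
        f"--load-cookies={cookies_file}",    # reuse an existing login session
        f"--directory-prefix={output_dir}",  # where the mirror is stored
        start_url,
    ]


if __name__ == "__main__":
    # Placeholder values; replace with the real forum address and paths.
    cmd = build_wget_command("https://forum.example.com/", "cookies.txt", "forum-archive")
    print(shlex.join(cmd))
    # To actually start the download: subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to review or tweak the options before anything is downloaded.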

alastor933 · January 31, 2022, 11:08am

I have used SiteSucker a few times, with good results.

mhucka · January 31, 2022, 5:09pm

Hello – I have some experience in this area. What you described (often known as web archiving) is a deceptively difficult problem. When we see web sites in our browsers, we often think it should be straightforward to save a copy of what we see, but unfortunately, the software technologies used behind the scenes can make faithful archiving difficult or impossible.

For some sites, people have written specialized archiving tools. You didn’t mention the forum in question, but it might be worth searching for terms such as “how archive” combined with the name of the forum, to see if anyone has made something available. When specialized tools are not available, sometimes people post scripts or recipes for using common tools to do the job (e.g., someone last year posted a suggestion for how to archive a Discourse site using wget). Otherwise, the options range from trying general-purpose tools to writing custom web scrapers. The options will differ in terms of the completeness and fidelity of the archives they create, their ability to handle site rate limits and/or authentication, the time it takes you to implement the solution, and more.

Some general pointers:

The simplest and best “high fidelity” archiving system that I’m aware of is Conifer. It’s interactive (thus letting it deal with password-protected sites) and stores archives in a standard format.
Some commercial services exist, e.g., ParseHub, Web Scraper, OctoParse, and probably others. (This is not an endorsement, and I have no affiliation with them.)
If you’re not interested in archiving a site as faithfully as possible and merely want to extract certain things like the URLs on the pages, then:
1. If you have programming experience, you can find high-level scraping frameworks for a number of languages and may be able to write a tool more quickly than if you started with basic networking API libraries. E.g., for Python there is Scrapy.
2. If you don’t want to write a program, then as others have mentioned above, you may be able to use SiteSucker or similar applications, or if you prefer the command line, wget, curl, or similar command-line tools.

I don’t think DEVONthink is ideal for this purpose. I’ve used SiteSucker a couple of times and it worked reasonably well. When SiteSucker fails, my go-to starting point is wget. (I wrote a simple script to create a sitemap using wget, which I mention here only because you might be able to see the wget arguments used in that one as a starting point for writing a script of your own.) I’ve written a number of very specialized scrapers in Python for extracting URLs (e.g., eprints2archives) and it’s not difficult, but it is time consuming to create a robust, thorough tool, especially if you want to produce something that other people can use.
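
For anyone going the custom-scraper route, the core of a link-following crawler is fairly small. The sketch below is an illustration, not a robust tool: it walks pages breadth-first, stays on the starting host (so a link to a Google search is skipped), and ignores `#fragment` differences. The `get_links` argument is a hypothetical function you would supply yourself to fetch a page and return its `href` values. Login handling, rate limiting, and error handling are left out.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit


def crawl(start_url, get_links, max_pages=100):
    """Breadth-first walk over same-host pages; returns URLs in visit order.

    get_links(url) is a caller-supplied function returning the href values
    found on the page at url.
    """
    host = urlsplit(start_url).netloc
    start = urldefrag(start_url).url
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop any #fragment part.
            target = urldefrag(urljoin(url, href)).url
            if urlsplit(target).netloc != host:
                continue  # skip off-site links such as search engines
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


if __name__ == "__main__":
    # A tiny in-memory "site" standing in for real HTTP requests.
    fake_site = {
        "https://forum.example.com/": ["/t/1", "/t/2", "https://www.google.com/search?q=x"],
        "https://forum.example.com/t/1": ["/t/2", "/t/1#post-3"],
        "https://forum.example.com/t/2": ["/"],
    }
    for page in crawl("https://forum.example.com/", lambda u: fake_site.get(u, [])):
        print(page)
```

Keeping the page fetching behind `get_links` means the traversal logic can be tested without touching the network, and `max_pages` gives a simple safety limit while experimenting.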

Authrom · January 31, 2022, 5:30pm

Thank you for your detailed response. I’ll find out more about the forum; it is a private forum, but I understand that you are referring to the backend forum software that is being used.

After reading the suggestions I think using curl or wget may get me close to what I’m looking for.

I have some scripting experience with PowerShell but not Python, but I’m confident I can reverse engineer something if I need to. I will also look into your script!

Thank you so, so much everyone, especially mhucka!!
