Need help determining correct product - need to scrape entire forum

Hello all! Hoping someone can point me in the right direction here. I’ve heard that DEVONthink is great archiving software, but I am a bit confused as to what I need to purchase to do what I want.

I would like to use DEVONthink to make an archive of a forum, and I would also like it to scrape all images that are posted on the forum.

If it can also give me a way to find all of the hyperlinks on the pages after they are archived and export those hyperlink URLs, that would be great.

Am I in the right spot? I appreciate any input before I decide to start the trial (assuming there still is one).

Thanks!

Actually, this sounds like a typical task for a web scraper, which is not really DEVONthink’s primary usage scenario. Does the forum require a login?

Thanks for following up. Yes, the forum does require a login.

I did go ahead and download the trial, and I got it to scrape a single page via the download manager, but I was hoping it would follow the links on the page rather than the subdirectories on the web server.

Seriously? If I put a link to a Google search in the forum, would DT then have to follow that link and download the Google index?

I agree with @cgrunenberg: use curl or wget or something like that to mirror the forum. Those are made for this task.

Obviously it would not be able to do that without some type of additional logic. This application supports scripting, so I figured it may be possible. Someone posted a script that actually did grab the PDFs off of a page, so if that is possible, then it’s not too much of a stretch to think it could grab hyperlinks instead.

Thanks for the suggestion, though. I’ll look into those products.

If you find a way to reliably parse HTML documents and analyse URLs, it just might. Though it will probably take a lot more work than with the appropriate tools.
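To illustrate the kind of URL analysis this would take, here is a minimal Python sketch (standard library only) of a filter that decides whether a link stays on the forum or leads off-site, which is what would keep a crawler from wandering into a Google search. The forum address and the function name `is_same_site` are made up for the example, and a real forum may need extra rules (subdomains, logout links, and so on).

```python
from urllib.parse import urljoin, urlparse


def is_same_site(link: str, base_url: str) -> bool:
    """Return True if `link`, resolved against `base_url`, stays on the same host."""
    absolute = urljoin(base_url, link)
    target = urlparse(absolute)
    base = urlparse(base_url)
    # Only follow web links; skip mailto:, javascript:, and similar schemes.
    if target.scheme not in ("http", "https"):
        return False
    return target.hostname == base.hostname


# Hypothetical forum page and the links found on it.
base = "https://forum.example.com/t/some-topic/42"
links = [
    "/t/another-topic/7",
    "https://www.google.com/search?q=x",
    "mailto:a@b.c",
    "page2",
]

# Keep only the links that stay on the forum, as absolute URLs.
to_follow = [urljoin(base, link) for link in links if is_same_site(link, base)]
print(to_follow)
```

Comparing host names is the simplest possible rule; it treats `www.example.com` and `forum.example.com` as different sites, which may or may not be what you want.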

Also, you might run into problems with dynamically generated content, regardless of the technique you use.

As suggested by others, curl (preinstalled on macOS) or wget (which you may need to install first, for example via Homebrew) are more appropriate. Your Mac also has tremendously capable scripting and programming languages available (shell scripts, Python, Perl, etc.), so you can bend curl and wget (or the languages’ libraries) to your will.
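As a rough sketch of what driving wget from a script could look like, the Python snippet below assembles a mirroring command and, by default, only prints it. It assumes wget is installed and that the login cookies have been exported from a browser into a Netscape-format `cookies.txt` file; the forum address, the file name, and the function names are placeholders for the example. Whether these options are enough depends on the forum software.

```python
import shlex
import subprocess


def build_wget_command(start_url, cookies_file, wait_seconds=1):
    """Assemble a wget command line for mirroring a site behind a login."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # also fetch images, CSS, and other page assets
        "--convert-links",     # rewrite links so the local copy is browsable
        "--adjust-extension",  # save HTML pages with an .html suffix
        "--no-parent",         # never ascend above the starting directory
        f"--wait={wait_seconds}",           # pause between requests
        f"--load-cookies={cookies_file}",   # reuse an exported login session
        start_url,
    ]


def mirror_forum(start_url, cookies_file, dry_run=True):
    """Print the command; run it only when dry_run is False."""
    command = build_wget_command(start_url, cookies_file)
    print(shlex.join(command))
    if not dry_run:
        subprocess.run(command, check=True)
    return command


# Hypothetical forum URL and cookie file; nothing is downloaded in a dry run.
command = mirror_forum("https://forum.example.com/", "cookies.txt")
```

By default wget's recursive mode stays on the starting host, so an off-site link (such as a Google search) is not followed unless host spanning is switched on explicitly.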

I have used SiteSucker a few times, with good results.

Hello – I have some experience in this area. What you described (often known as web archiving) is a deceptively difficult problem. When we see web sites in our browsers, we often think it should be straightforward to save a copy of what we see, but unfortunately, the software technologies used behind the scenes can make faithful archiving difficult or impossible.

For some sites, people have written specialized archiving tools. You didn’t mention the forum in question, but it might be worth searching for terms such as “how archive” combined with the name of the forum, to see if anyone has made something available. When specialized tools are not available, sometimes people post scripts or recipes for using common tools to do the job (e.g., someone last year posted a suggestion for how to archive a Discourse site using wget). Otherwise, the options range from trying general-purpose tools to writing custom web scrapers. The options will differ in terms of the completeness and fidelity of the archives they create, their ability to handle site rate limits and/or authentication, the time it takes you to implement the solution, and more.

Some general pointers:

  1. The simplest and best “high fidelity” archiving system that I’m aware of is Conifer. It’s interactive (thus letting it deal with password-protected sites) and stores archives in a standard format.
  2. Some commercial services exist, e.g., ParseHub, Web Scraper, OctoParse, and probably others. (This is not an endorsement, and I have no affiliation with them.)
  3. If you’re not interested in archiving a site as faithfully as possible and merely want to extract certain things like the URLs on the pages, then:
    1. If you have programming experience, you can find high-level scraping frameworks for a number of languages and may be able to write a tool more quickly than if you started with basic networking API libraries. E.g., for Python there is Scrapy.
    2. If you don’t want to write a program, then as others have mentioned above, you may be able to use SiteSucker or similar applications, or if you prefer the command line, wget, curl, or similar command-line tools.
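For the "just extract the URLs" case in point 3, a small Python sketch using only the standard library might look like the following. It collects link and image addresses from an HTML page that has already been saved, then writes the links out as CSV text. The class name `LinkCollector`, the helper `links_to_csv`, and the sample HTML are invented for illustration; a framework such as Scrapy would add the crawling, retry, and rate-limiting machinery around this.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            # Resolve relative links against the page's own address.
            self.links.append(urljoin(self.base_url, attributes["href"]))
        elif tag == "img" and attributes.get("src"):
            self.images.append(urljoin(self.base_url, attributes["src"]))


def links_to_csv(links):
    """Return the URLs as CSV text with a single 'url' column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["url"])
    for url in links:
        writer.writerow([url])
    return buffer.getvalue()


# Stand-in for a page saved from the forum (hypothetical content).
sample_html = """
<html><body>
  <a href="/t/topic/1">First topic</a>
  <a href="https://example.org/paper.pdf">A PDF</a>
  <img src="/uploads/photo.jpg" alt="photo">
</body></html>
"""

collector = LinkCollector("https://forum.example.com/")
collector.feed(sample_html)
print(links_to_csv(collector.links))
print(collector.images)
```

This only sees links present in the saved HTML; anything the page loads later through JavaScript will be missing.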

I don’t think DEVONthink is ideal for this purpose. I’ve used SiteSucker a couple of times and it worked reasonably well. When SiteSucker fails, my go-to starting point is wget. (I wrote a simple script to create a sitemap using wget, which I mention here only because you might be able to use the wget arguments in that one as a starting point for writing a script of your own.) I’ve written a number of very specialized scrapers in Python for extracting URLs (e.g., eprints2archives) and it’s not difficult, but it is time-consuming to create a robust, thorough tool, especially if you want to produce something that other people can use.

Thank you for your detailed response. I’ll find out more about the forum, as it is a private forum, but I understand that you are referring to the backend forum software that it uses.

After reading the suggestions, I think using curl or wget may get me close to what I’m looking for.

I have some scripting experience with PowerShell but not Python, but I’m confident I can reverse engineer something if I need to. I will also look into your script!

Thank you so, so much everyone, especially mhucka!!
