A not so perfect better web clipping ideas and welcome improvements

mhucka · November 30, 2021, 5:23pm

The problem of saving web pages faithfully in any format is something that occupies many researchers in academia and industry today. The fact that DEVONthink doesn’t do a great job shouldn’t be held against it; as @rkaplan alluded to above, many kinds of web pages today contain dynamic content that is very difficult to capture using a single universal approach, and basically nothing does a perfect job.

In some cases, you can get better results if you know something about the software running the site (e.g., Discourse has special features for printing to PDF that make it possible to overcome its dynamic loading/unloading behavior). Failing that, another approach to capturing pages that load content dynamically as the user scrolls (e.g., Twitter, some product web sites) is to simulate user behavior such as scrolling down the page. My approach in DEVONthink has been a hack: use a script that runs some JavaScript commands to scroll the page down, in an attempt to force content to be loaded, before saving the page as PDF. An explanation and code can be found here:
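
As a rough illustration of the scroll-before-saving idea (this is a sketch, not the script referenced above), the JavaScript below keeps scrolling to the bottom of the page until the page height stops growing. The function names `shouldScrollAgain` and `scrollUntilStable`, the one-second delay, and the 30-step limit are all assumptions made for the example; the limit is there because endless feeds never stop growing.

```javascript
// Hypothetical helper: scroll again only while the page is still growing
// and the step budget has not been used up.
function shouldScrollAgain(previousHeight, currentHeight, stepsTaken, maxSteps) {
  return stepsTaken < maxSteps && currentHeight > previousHeight;
}

// Hypothetical helper: scroll a window-like object to the bottom repeatedly,
// pausing after each scroll so lazily loaded content has time to appear.
// Returns the number of scrolls performed.
async function scrollUntilStable(win, { delayMs = 1000, maxSteps = 30 } = {}) {
  let previousHeight = -1;
  let steps = 0;
  for (;;) {
    const currentHeight = win.document.documentElement.scrollHeight;
    if (!shouldScrollAgain(previousHeight, currentHeight, steps, maxSteps)) {
      break;
    }
    win.scrollTo(0, currentHeight);
    previousHeight = currentHeight;
    steps += 1;
    // Assumed delay; slow sites may need longer before new content loads.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return steps;
}

// In a browser context, one would call: scrollUntilStable(window)
// and save the page as PDF only after the returned promise resolves.
```

A fixed delay is a blunt instrument: too short and content is missed, too long and capture is slow. Pages that unload earlier content as you scroll will still not be captured completely this way.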

Some other past DEVONthink discussions related to this topic: