Generating PDFs from Web Archives

The Behavior:

Converting a Web Archive file to a PDF in DT3 produces unpredictable and often subpar results. Printing the record to a PDF frequently gives a much better conversion. (Presumably this has to do with the CSS rules applied by the print media query at the time of the conversion.)

As an example, here is a web archive file with sidebars on the left and right:

If I convert the Web Archive to a PDF I get this, which is suboptimal in many respects:

I get a very similar outcome when I use ⌘P to print the Web Archive to a PDF:

Here is where it gets interesting. Now go back to the Web Archive file and manually reduce the width of the window until the sidebars vanish:

When I use Convert to PDF in DT3 I get the same result as before. Changing the display does not affect the conversion:

BUT when I print the new display of the Web Archive to a PDF in DEVONthink I get what I want. The following…

…gives me a PDF that looks like this:

The Question(s):

Is there a straightforward way to automate (script) printing (NOT converting) a DT3 record to a PDF inside DT3? I could use a tool like Keyboard Maestro to run through selected DT3 records, print them as PDF files, then read them back into DT3, but that is kludgy.

Or (better!!!) is there a way to control parameters at the time a file is converted to a PDF? This clearly all has to do with the CSS print media query that is applied during the conversion, so it would be very useful to be able to specify those parameters myself.

Going a step further, it would be nice to have web developer tools available for web archives inside DT3.

Or should I punt entirely and do Web → PDF conversions outside of DT3 altogether?

Did you actually use Data > Convert, which converts one or more selected items to the desired format, or Tools > Capture? In the first case, whatever document happens to be visible doesn't matter, whereas capturing uses the currently displayed web page.

Yes, I understand the different options. I have played with just about all the permutations I can think of, including downloading the PDF in a web browser then modifying @media print {...} by hand. That works. I can set up a print style sheet. But it makes the downloading process itself tedious.

I played around a little bit with breaking into the DT3 Web Archive itself and modifying the CSS, but it wasn’t worth the trouble.
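
For anyone who wants to try that route anyway, here is a minimal sketch of the idea in Python. It assumes the file is a standard Safari-style .webarchive (a property list whose WebMainResource entry holds the page HTML in WebResourceData) and that the HTML is in an ASCII-compatible encoding such as UTF-8. The function name inject_print_css and the selectors in PRINT_CSS are hypothetical placeholders, not anything DT3 provides, and whether DT3's converter actually honors the injected rule is untested.

```python
import plistlib

# Hypothetical print rules: "aside" and ".sidebar" are placeholders for
# whatever selectors the archived page really uses for its sidebars.
PRINT_CSS = (
    b"<style>@media print { aside, .sidebar "
    b"{ display: none !important; } }</style>"
)


def inject_print_css(archive_bytes: bytes, css: bytes = PRINT_CSS) -> bytes:
    """Return a copy of a .webarchive with extra CSS in its main HTML resource.

    Assumes the main resource is HTML in an ASCII-compatible encoding.
    """
    archive = plistlib.loads(archive_bytes)
    main = archive["WebMainResource"]
    html = main["WebResourceData"]

    # Insert just before </head> when it exists; otherwise prepend.
    idx = html.lower().find(b"</head>")
    if idx == -1:
        html = css + html
    else:
        html = html[:idx] + css + html[idx:]

    main["WebResourceData"] = html
    return plistlib.dumps(archive, fmt=plistlib.FMT_BINARY)
```

To use it, read the exported archive with open(path, "rb"), pass the bytes through inject_print_css, and write the result to a new file, so the original archive stays untouched.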

My workflow:

  1. Gather a list of URLs that I want to bring together into a single PDF.

  2. Automatically download each of the URLs. The best option that I have found is saving the pages in Web Archive format using create web document from. (Yes, I have tried just going directly to PDF here using create PDF document from and consistently gotten poor results.)

  3. Automatically convert each of the Web Archive files into a PDF. This is the focus of the current question. There are two basic ways to approach that process:

  • Use DT3’s script command, convert record <myRecord> to PDF, which is the equivalent of the manual Actions → Convert → to PDF (paginated). As shown above, the output of that is often… poor.

  • Use the OS Print → Save as PDF to DEVONthink 3 service. That is not guaranteed to work well, but it often works better and can be nudged by changing how the Web Archive is displayed in the DT3 window.

  4. Stack the PDFs with merge records.

Since posting, I wrote an automation that uses UI scripting to get DT3 to automatically print PDFs to itself. Once I have the list of URLs, I can go through the whole process with a single command. But the quality of the conversion is still hit and miss, and, as with most things System Events, it's a kludge that could break at any minute.

What I was really hoping was that someone would come back and say, “Sure, you can get into Web Archives with web developer tools, and here is how you can set up your own style sheets for conversion to PDFs.”

Developer tools are available in DEVONagent’s browser. Just in case you didn’t know.
