Script to convert HTML with embedded images to PDF

Hey there,
I migrated my notes from Evernote to DT3. Most of my database consisted of single files (JPG/PDF), so those were straightforward to import and convert to PDF.

However, I have about 300 notes that were exported as HTML with embedded pictures. Usually this is because I either added multiple pictures to a single note or added some comments alongside the picture(s). This is a shitty format (textbundle would be better), but Evernote didn’t give me any choice. So HTML it is.

Now, I’d like to convert all these HTML notes to PDF, so I can OCR them.
I could just convert them to PDF as is, but this is not exactly what I want: it would keep the layout of the HTML, which has unnecessary margins, and would force-embed my text comments within the PDF, which I don’t want.

So for now I’m doing it manually like this:

  1. extract the picture(s) from the HTML note
  2. merge all the extracted pictures into a single PDF, keeping the original picture order
  3. copy any text from the HTML note into “Finder Comments”
  4. set the creation date of the new PDF to the original creation date of the HTML file
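
For illustration, steps 1 and 3 could be sketched outside DT in plain Python (this is not a DT script, just a rough outline of the logic). It assumes the pictures sit in `<img>` tags whose `src` is a `data:image/...` URI, and that any other visible text is a comment. The names `NoteParser` and `parse_note` are made up for this example, and skipping the `<title>` text is an assumption about what should count as a comment.

```python
from html.parser import HTMLParser


class NoteParser(HTMLParser):
    """Collect embedded image sources and visible text, in document order."""

    def __init__(self):
        super().__init__()
        self.image_sources = []  # data: URIs, in the order they appear
        self.text_chunks = []    # visible text fragments
        self._skip_depth = 0     # > 0 while inside <script>, <style> or <title>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style", "title"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src") or ""
            # Only embedded pictures are of interest, not remote links.
            if src.startswith("data:image/"):
                self.image_sources.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style", "title") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_chunks.append(data.strip())


def parse_note(html):
    """Return (image data URIs in order, comment text) for one HTML note."""
    parser = NoteParser()
    parser.feed(html)
    parser.close()
    return parser.image_sources, "\n".join(parser.text_chunks)
```

Because the parser walks the HTML from top to bottom, the list of sources keeps the original picture order, which is what step 2 needs. Merging the pictures into a PDF and setting the creation date are not covered here; they would need something beyond the Python standard library.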

I tried to fiddle a bit with a DT script, but this is too advanced for me.
Is there any DT automation pro who could help me automate these steps?

Thanks!

  • How are you extracting them?
  • Can you ZIP and post an example file?

To extract: I drag and drop each picture from the HTML note (in the preview pane) to the file pane, which creates a new unknown.jpg/.png file. It seems the pictures are base64-encoded in the HTML.

Here’s an example:
my note.html.zip (104.4 KB)

“It seems the pictures are base64-encoded in the HTML.”

That is correct.
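
Since the pictures are base64 data, each one can in principle be turned back into a file without dragging it out by hand. Here is a minimal Python sketch of that decoding step, assuming each picture is stored as a `data:image/<type>;base64,<payload>` URI. The function name `decode_data_uri` and the extension table are made up for this example, and the table only covers a few common types.

```python
import base64
import re

# Matches e.g. "data:image/png;base64,iVBORw0..." (assumed layout of the export).
DATA_URI = re.compile(r"^data:(image/[\w.+-]+);base64,(.*)$", re.DOTALL)

# Illustrative mapping only; unknown types fall back to ".bin".
EXTENSIONS = {"image/jpeg": ".jpg", "image/png": ".png", "image/gif": ".gif"}


def decode_data_uri(uri):
    """Return (mime type, file extension, raw bytes) for one embedded picture."""
    match = DATA_URI.match(uri.strip())
    if match is None:
        raise ValueError("not a base64-encoded image data URI")
    mime_type, payload = match.groups()
    mime_type = mime_type.lower()
    # Drop any whitespace in case the payload is wrapped over several lines.
    payload = re.sub(r"\s+", "", payload)
    data = base64.b64decode(payload, validate=True)
    return mime_type, EXTENSIONS.get(mime_type, ".bin"), data
```

The returned bytes can be written to a file with the matching extension, so each picture gets a proper name instead of unknown.jpg/.png.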