Script to convert HTML with embedded images to PDF

julienma · February 16, 2020, 5:02pm

Hey there,
I migrated my notes from Evernote to DT3. Most of my DB consisted of single files (JPG/PDF), so those were straightforward to import and convert to PDFs.

However, I have about 300 notes which were exported as HTML with embedded pictures. Usually this is because I either added multiple pictures to a single note, or added some comments with the picture(s). This is a shitty format (textbundle would be better), but Evernote didn’t give me any choice. So HTML is it.

Now, I’d like to convert all these HTML notes to PDF, so I can OCR them.
I could just convert them to PDF as is, but this is not exactly what I want: it would keep the layout of the HTML, which has unnecessary margins, and would force-embed my text comments within the PDF, which I don’t want.

So for now I’m doing it manually like this:

1. Extract the picture(s) from the HTML note.
2. Merge all the extracted pictures into a single PDF, keeping the original picture order.
3. Copy any text from the HTML note into “Finder Comments”.
4. Set the creation date of the new PDF to the original creation date of the HTML file.
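
For illustration only, the text-copy step could be sketched outside DT in Python with nothing but the standard library. This is a minimal sketch, assuming the note's comments are ordinary visible text in the HTML; `NoteTextExtractor` and `note_text` are hypothetical names, and the sketch neither talks to DT nor writes the Finder comment itself.

```python
from html.parser import HTMLParser


class NoteTextExtractor(HTMLParser):
    """Collect the visible text of a note, ignoring <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []       # text fragments, in document order
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and drop empty fragments.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def note_text(html):
    """Return the note's text, one line per text fragment (hypothetical helper)."""
    parser = NoteTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Note that a `<title>` element, if the export contains one, would be included in the result; whether that is wanted depends on the notes.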

I tried to fiddle a bit with a DT script, but this is too advanced for me.
Is there any DT automation pro who could help me automate these steps?

Thanks!

BLUEFROG · February 16, 2020, 6:11pm

How are you extracting them?
Can you ZIP and post an example file?

julienma · February 16, 2020, 9:07pm

To extract: I drag and drop each picture from the HTML note (in the preview pane) to the file pane, which creates a new unknown.jpg/.png file. It seems the pictures are base64-encoded in the HTML.

Here’s an example:
my note.html.zip (104.4 KB)

BLUEFROG · February 17, 2020, 1:33pm

“It seems the pictures are base64-encoded in the HTML.”

That is correct.
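
Since the pictures are base64-encoded, they can in principle be pulled out without drag and drop. The following is a minimal sketch in Python (standard library only), assuming each picture is an `<img>` tag whose `src` is a `data:image/...;base64,...` URI; the attached example was not inspected for this, and `EmbeddedImageExtractor` and `extract_images` are hypothetical names, not part of DT.

```python
import base64
from html.parser import HTMLParser


class EmbeddedImageExtractor(HTMLParser):
    """Collect base64-embedded <img> pictures in document order."""

    def __init__(self):
        super().__init__()
        self.images = []  # list of (extension, raw_bytes)

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        header, sep, payload = src.partition(",")
        header = header.lower()
        # Only handle data URIs such as "data:image/png;base64,<payload>".
        if not sep or not header.startswith("data:image/"):
            return
        if not header.endswith(";base64"):
            return
        subtype = header[len("data:image/"):].split(";")[0]
        ext = "jpg" if subtype == "jpeg" else subtype
        # Strip any line breaks inside the payload before decoding.
        raw = base64.b64decode("".join(payload.split()))
        self.images.append((ext, raw))


def extract_images(html):
    """Return [(extension, raw_bytes), ...] for every embedded picture."""
    parser = EmbeddedImageExtractor()
    parser.feed(html)
    parser.close()
    return parser.images
```

Writing the results out with zero-padded names (for example `001.jpg`, `002.png`) would keep the original picture order for a later merge into a PDF; the merge itself and the creation date are not covered by this sketch.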