Wrong text (layer) when capture PDF from viewer window

mdbraber · March 7, 2022, 2:29pm

That was my original assumption. Maybe DT isn’t doing the OCR, but I’d guess too ‘internally’ PDFkit is doing something along the lines of what you’re laying out here.

As you might have noticed from my other posts, I’ve been doing quite some research into capturing PDFs and my conclusion so far is that no single way is perfect:

Capturing via “Export to PDF” in Safari gives great PDFs visually, but often links don’t work and it’s much harder to automate (although possible)
Capturing via DT and create PDF from document works great, but logged in sites don’t work
Capturing via DT and PDF of (viewer window 1) works well for logged in sites, but creates font / text problems as described in this thread
Capturing via clutter-free or print options sometimes delivers ‘clean’ PDFs - but also often scrambles layout to such extent it doesn’t provide enough quality for long-term storage (much context goes missing)
Capturing via image and OCR breaks links and delivers huge PDFs which makes it unusable for my case (long term storage)

So basically I either have to manually assess each option and (re)capture based on what works or go for the solution with the least amount of problems (in my case option 3). Any tips on how to improve are welcome - hopefully one day a perfect option will exist