Wrong text (layer) when capturing PDF from viewer window

I’m using a script to capture PDFs using the following commands (similar to what @mhucka uses):

tell application id "DNtp"
	set captureWindow to open window for record captureRecord with force
	set contentAsPDF to get PDF of captureWindow
end tell

When copying text from the resulting PDF, to my surprise there were errors in the text, as if the content had been OCR’ed rather than taken directly from the webpage. Which is weird, because all links work - so some text/properties are clearly available when creating the PDF. These text errors don’t happen when using the Safari extension, for example.

Are pages captured from the record window actually OCR’ed? Or is something else happening? This is the page in question: RVS: huidig stelsel frustreert domeinoverstijgende samenwerking - Skipr - check e.g. the words “Creatieve financiering” (at the bottom), which end up as “Creatieve Ananciering” in the PDF from the window.


No. OCR is only performed on demand, not via the command you’re referring to.
Also, OCR wouldn’t be performed on a file clipped from a website like the one you’re referring to.

However, I have confirmed an issue when using a construct like the one you’re using.

tell application id "DNtp"
	set sel to (selected record 1)
	
	-- set newPDF to (get PDF of (viewer window 1)) -- Yields the error
	set newPDF to (get paginated PDF of (viewer window 1)) -- Also, yields the error
	
	set newRec to create record with {name:"PDF Test", type:PDF document} in current group
	set data of newRec to newPDF
end tell

@cgrunenberg would have to assess this but bear in mind he’s on a well-deserved long weekend.

Thanks - that was what I expected. Any idea, then, why the text results are different?

I’m not so sure they even are. If you look at the original on the website, you’ll see a ligature “fi”, i.e. the f is run together with the i. Now, in your PDF, the “A” you’re seeing might simply be a badly rendered ligature “fi”. To check, copy this “Ananciering” and paste it into a normal text field (e.g. a newly created text document in DT). What do you see there?

I’m seeing exactly that: the A, when pasted into a normal text field - that’s how I noticed the difference. See the attached PDF.

RVS- huidig stelsel frustreert domeinoverstijgende samenwerking - Skipr-3.pdf (112.3 KB)

Just seeing this update. Thanks for confirming. Looking forward to @cgrunenberg’s assessment later!

You’re welcome 🙂

Right. I tried it myself and saw the same thing. There seem to be occasional problems with ligatures in PDF files. But here they are not even ligatures; they are simply “f” and “i” that the font decides to run together.

A known limitation of AppleScript, not one of DEVONthink, unfortunately. Immediately specifying the data via the create record command should work.
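
For example, a minimal sketch of that workaround (untested, using the same placeholder names as the snippet above):

tell application id "DNtp"
	set newPDF to (get PDF of (viewer window 1))
	-- Passing the data directly at creation time avoids the limitation above
	set newRec to create record with {name:"PDF Test", type:PDF document, data:newPDF} in current group
end tell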

PDF documents retain only the layout, not necessarily the exact original text, as the conversion to PDF and back to (indexed) text is not always lossless.

Thanks - I didn’t expect that, but it’s interesting to understand. And is this different for the two methods: capturing a PDF from a window vs. capturing a PDF from a URL via the create PDF document from command?

The results might vary again, as the first approach uses the currently rendered web page, whereas the second downloads & renders the page on its own (and especially in the case of dynamic websites a lot of things might be different).
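
For comparison, the second approach in its simplest form might look like this (the URL is just a placeholder):

tell application id "DNtp"
	-- DEVONthink downloads & renders the page itself, independent of any open window
	set newRec to create PDF document from "https://www.example.com" in current group
end tell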

If it had something to do with dynamically loaded parts I might understand it, but as @chrillek notes, these are just different letters - nothing special happening there as far as I can see. But there might be some font trickery happening that I don’t understand. Some searching reveals this might be happening more often, especially with combinations involving ‘f’: macos - Disable automatic "ligature" handling in PDF/Preview on El Capitan - Ask Different:

When text appears in a TextEdit or Word document, the underlying data maintains the fi pair of characters, but displays them as one glyph: the ligature. If the font doesn’t have the ligature character, then you see both letters separately.

When a PDF is made, the display glyph is used, but the underlying pair of ‘real letters’ is not maintained within the data. PDF was designed as a description of displayed content.

Any PDF that contains an alternative glyph or ligature may not display the correct data when the text is copied and pasted, unless there’s a hidden text layer that contains the ‘correct’ lettering.

After having had some time to think about it, I’d like to amend my previous assessment. The starting point is an HTML document with a certain visual appearance. Let’s just stay with the aspect of “font” here.

When the browser displays the document, it loads the font (either from a local or a remote location) and uses it to draw the glyphs. Now someone says “hey, I want a PDF of that document.” How would the browser (or any other software) proceed to achieve this?

First, the font in question is not part of the standard PostScript/PDF repertoire, so it might have to be embedded into the resulting document. That shouldn’t be a big technical hurdle. But: does the font’s license permit embedding it into a PDF, or does it perhaps restrict the font’s usage to the Web? There’s no way for the browser to find out. So what I guess (!) might be the safest way is to create an image of the HTML document and add that as-is to the PDF (no problem at all here: it’s only an image, so the font is no longer used to draw the text; it’s basically a photo taken of the HTML document).
But of course the user wants the text of the document, not just a visually pleasing image. So in the next step, the converter performs OCR on this image. And this might in fact explain why the “fi” ligature in the HTML document becomes “A” in the PDF.

All this is just a wild guess, of course. And: if you print the HTML to PDF from the browser, everything seems to be fine, at least in my Firefox (e.g. the fi ligature is recognized in the PDF as “fi”). Actually, there’s another argument for printing to PDF instead of clipping the document: websites might provide a print style sheet that takes care of page breaks at appropriate places, removes unnecessary elements of the HTML document (like navigation), etc.

That was my original assumption. Maybe DT isn’t doing the OCR, but I’d guess that internally PDFKit is doing something along the lines of what you’re laying out here.

As you might have noticed from my other posts, I’ve been doing quite a bit of research into capturing PDFs, and my conclusion so far is that no single method is perfect:

  1. Capturing via “Export to PDF” in Safari gives great PDFs visually, but links often don’t work and it’s much harder to automate (although possible)
  2. Capturing via DT and create PDF document from works great, but logged-in sites don’t work
  3. Capturing via DT and PDF of (viewer window 1) works well for logged-in sites, but creates the font/text problems described in this thread (see the sketch after this list)
  4. Capturing via clutter-free or print options sometimes delivers ‘clean’ PDFs, but also often scrambles the layout to such an extent that it doesn’t provide enough quality for long-term storage (much context goes missing)
  5. Capturing via image and OCR breaks links and delivers huge PDFs, which makes it unusable for my case (long-term storage)
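
For reference, the sketch mentioned in option 3, combining it with the create record workaround from earlier in this thread (loosely based on @mhucka’s script; the loading loop and the names are my own, untested assumptions):

tell application id "DNtp"
	set captureRecord to selected record 1 -- e.g. a bookmark to capture
	set captureWindow to open window for record captureRecord with force
	repeat while loading of captureWindow
		delay 0.5 -- wait until the page has finished loading
	end repeat
	set contentAsPDF to get PDF of captureWindow
	-- Specify the data directly in create record (see the workaround above)
	set newRec to create record with {name:(name of captureRecord), type:PDF document, data:contentAsPDF} in current group
	close captureWindow
end tell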

So basically I either have to assess each page manually and (re)capture based on what works, or go for the option with the fewest problems (in my case option 3). Any tips for improving this are welcome - hopefully one day a perfect option will exist 🙂


Yep, that’s what I’m doing at the moment. Tedious!

As what doc type do you capture with DT, and where do you run the PDF of (viewer window 1) AppleScript?

Not sure what you mean here? I mostly capture bookmarks to PDF. See @mhucka’s script for an example of running the AppleScript command mentioned: https://github.com/mhucka/devonthink-hacks/tree/main/auto-convert-web-page-to-PDF

Thanks for doing this research and summarizing the results. This kind of thing is really helpful.
