When I convert web pages to PDF, some words have errors

Aiden · June 12, 2020, 1:40am

When I copy this line, I get “Luhmann was extremely proli3c”: “prolific” becomes “proli3c”.

When I copy this line, I get “Theprincipleofatomicity:ThetermwascoinedbyChristianTietze.” The spaces between the words are missing.

English is not my first language, so I often need to look up words, and this error prevents me from using the Apple dictionary directly. There are many more such errors in PDF files. Why does this happen?

cgrunenberg · June 12, 2020, 7:37am

Maybe it’s an issue with PDFKit’s selection & text handling, or with the way the PDF documents were created, e.g. OCR errors (probably not in this case). How exactly did you convert the web page?

Aiden · June 12, 2020, 7:56am

I first saved the page in HTML format to DEVONthink, then converted it to PDF in DEVONthink (click action and then PDF), and copied the text in DEVONthink as well.

Here’s the URL of that page, if you want to test it.

BTW, isn’t the PDF generated from the web page a text PDF? Why would OCR be needed?

cgrunenberg · June 12, 2020, 9:27am

Not sure whether it’s an issue with the WebKit and/or PDFKit frameworks of macOS, but I could also reproduce this by printing the page to a PDF in Safari and opening it in Preview. Unfortunately, only Apple can fix this.

JohnAtl · June 12, 2020, 4:29pm

If I may, this is usually caused by ligatures in the PDF. A ligature is a single glyph that represents multiple characters, and is intended to make the text look better.
The usual suspects are ‘ff’, ‘fi’, ‘fl’, etc.

[Image from fonts.com illustrating ligatures]
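
For anyone who would rather script the cleanup, here is a minimal sketch in Python of expanding ligatures back into plain letters. It assumes the copied text contains the actual Unicode ligature code points (U+FB00 to U+FB06); the names `LIGATURES` and `expand_ligatures` are made up for this illustration. It will not repair cases where the PDF maps the ligature glyph to an unrelated character, as in the “proli3c” example earlier in the thread, because the copied text then no longer says which ligature was meant.

```python
# Hypothetical helper: expand Unicode Latin ligatures into their component letters.
# Assumes the text really contains the ligature code points U+FB00..U+FB06.
LIGATURES = {
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
    "\ufb05": "st",   # LATIN SMALL LIGATURE LONG S T
    "\ufb06": "st",   # LATIN SMALL LIGATURE ST
}

# str.translate needs a table keyed by code point; maketrans builds it from the dict.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Return text with each known ligature replaced by its plain letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    copied = "Luhmann was extremely proli\ufb01c"
    print(expand_ligatures(copied))
```

An explicit table is used here rather than `unicodedata.normalize("NFKC", text)`, which also expands these ligatures but rewrites many other characters (superscripts, full-width forms, and so on) that you may want to keep.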

To illustrate how I’ve dealt with it, here is a TextSoap cleaner I created to help with this: