Wrong Unicode code points in captured PDFs

When I capture a page as PDF, some Chinese characters end up with the wrong Unicode code points.

The same Chinese character has a different code point in the HTML and in the PDF.

The HTML code point is on the left and the PDF code point is on the right.

西(\u897f)---- ⻄(\u2ec4)
见(\u89c1)---- ⻅(\u2ec5)
首(\u9996)---- ⾸(\u2fb8)
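
To confirm which characters are affected, the code points and their Unicode names can be printed side by side. This is a minimal Python sketch using only the standard `unicodedata` module; `describe` is a hypothetical helper name, and the sample pairs are the three listed above:

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

# (HTML character, PDF character) pairs reported above
pairs = [("\u897f", "\u2ec4"), ("\u89c1", "\u2ec5"), ("\u9996", "\u2fb8")]

for html_ch, pdf_ch in pairs:
    print(f"{html_ch} {describe(html_ch)}  vs  {pdf_ch} {describe(pdf_ch)}")
```

The characters on the right are reported as radical characters rather than unified ideographs, which is why searching for the ordinary character finds nothing.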

The HTML is encoded as UTF-8.

This error causes me a lot of trouble when searching and copying text.

How can I avoid this problem? Thanks! :heart:

Are you capturing as paginated or single page PDF?

Single page, Jim.

As far as I can test, both methods produce the problem and there seems to be no difference.

Is this a problem caused by PDFKit on macOS?

(Document info: Encoding software: macOS Version 12.1 Quartz PDFContext)

Yes, this is most likely due to PDFKit.
Do you have an example PDF with the other encoding?

Example PDF.pdf (465.2 KB)

All of the text I’ve highlighted has the wrong Unicode code points (not the same as in the HTML), and there is more that I haven’t marked up.

A strange phenomenon is:

When I open the PDF with Adobe Acrobat, PDF Expert, or PDFpenPro, they all show the same wrong Unicode.

When I open the PDF with DEVONthink or Preview, the vast majority of the wrong characters are back to normal, but some are still wrong.

So… is this an unsolvable problem? :broken_heart:

If it’s a PDFKit issue, then Apple would have to resolve it.

What’s the URL for that captured page?

The characters in the left column come from the CJK Unified Ideographs block; their look-alikes in the right column come from the CJK Radicals Supplement and Kangxi Radicals blocks…

This program may prove of some use:

When copying and pasting text from a PDF file, depending on the PDF, kanji characters such as “見” and “高” are often garbled into similar but different characters (e.g. special characters such as Kangxi Radical and CJK Radicals Supplement). This tool fixes such PDF that raises the garbled text extraction and generates a PDF that does not raise the garbled.

Unfortunately, the more extensive documentation (including the technical explanation of the problem) is in Japanese.
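
For text that has already been copied or extracted from such a PDF, the radicals can also be mapped back to ordinary ideographs afterwards. The following is a minimal Python sketch, not the method used by the program above. It assumes that Kangxi Radicals (U+2F00 to U+2FD5) can be restored through their Unicode compatibility mappings, and it uses a small hand-written table for the CJK Radicals Supplement characters, most of which have no such mapping. `RADICAL_FIXES` and `fix_radicals` are hypothetical names, and the table holds only the two pairs reported in this thread, so it would need extending for real use:

```python
import unicodedata

# Hand-written fixes for CJK Radicals Supplement characters (assumption:
# only the pairs reported in this thread; extend as needed).
RADICAL_FIXES = {
    "\u2ec4": "\u897f",  # ⻄ -> 西
    "\u2ec5": "\u89c1",  # ⻅ -> 见
}

def fix_radicals(text: str) -> str:
    """Replace radical look-alikes with the ideographs they stand in for."""
    out = []
    for ch in text:
        if "\u2f00" <= ch <= "\u2fd5":
            # Kangxi radicals carry a compatibility mapping to an ideograph.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(RADICAL_FIXES.get(ch, ch))
    return "".join(out)

print(fix_radicals("\u2ec4\u2ec5\u2fb8"))  # prints 西见首
```

This only cleans up text after extraction; the PDF itself still contains the radical code points, so searching inside the PDF is not affected.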

In fact, the wrong Unicode problem occurs on any page that contains Chinese text.

Thanks, I’ll go check it out!
