Unicode errors in captured PDFs

Jason30 · December 23, 2021, 5:58am

When capturing a page as PDF, some Chinese characters come out with the wrong Unicode code points.

The same Chinese character has a different code point in the HTML than in the PDF.

The HTML code point is on the left, the PDF code point is on the right.

西（\u897f）---- ⻄（\u2ec4）
见（\u89c1）---- ⻅（\u2ec5）
首（\u9996）---- ⾸（\u2fb8）

The HTML is encoded as UTF-8.

This error causes me a lot of trouble when searching and copying text.

How can I avoid this problem? Thanks!

BLUEFROG · December 23, 2021, 12:06pm

Are you capturing as paginated or single page PDF?

Jason30 · December 23, 2021, 12:15pm

Single page, Jim.

As far as I can tell from testing, both methods produce the problem, and there seems to be no difference.

Is this a problem caused by PDFKit on macOS?

(Document info: Encoding software: macOS Version 12.1 Quartz PDFContext)

BLUEFROG · December 23, 2021, 12:34pm

Yes, this is most likely due to PDFKit.
Do you have an example PDF with the other encoding?

Jason30 · December 23, 2021, 12:43pm

Example PDF.pdf (465.2 KB)

The text I’ve highlighted all has the wrong Unicode code points (not the same as in the HTML), and there are more occurrences that I haven’t marked.

One strange thing:

When I open the PDF in Adobe Acrobat, PDF Expert, or PDFpenPro, they all show the same wrong Unicode.

When I open the PDF in DEVONthink or Preview, the vast majority of the wrong characters are back to normal, but some are still wrong.

Jason30 · December 26, 2021, 3:59pm

So... is this an unsolvable problem?

BLUEFROG · December 26, 2021, 5:07pm

If it’s a PDFKit issue, then Apple would have to resolve it.

What’s the URL for that captured page?

jerwin · December 26, 2021, 8:49pm

The left column is drawn from the CJK Unified Ideographs block; the right column is drawn from look-alike characters in the CJK Radicals Supplement block (U+2EC4, U+2EC5) and the Kangxi Radicals block (U+2FB8)…
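
For anyone who wants to check which block a given character falls in, a short Python sketch along these lines may help. The block ranges are hard-coded from the Unicode charts, and `block_of` and `describe` are names made up for this example, not part of any library.

```python
import unicodedata

# Only the three blocks relevant to this thread; ranges are from the Unicode charts.
BLOCKS = [
    (0x2E80, 0x2EFF, "CJK Radicals Supplement"),
    (0x2F00, 0x2FDF, "Kangxi Radicals"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]

def block_of(ch: str) -> str:
    """Return the name of the block containing ch, or 'other'."""
    cp = ord(ch)
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "other"

def describe(ch: str) -> str:
    """Format the code point, block, and character name for one character."""
    return f"U+{ord(ch):04X} {block_of(ch)}: {unicodedata.name(ch, 'unnamed')}"

if __name__ == "__main__":
    # HTML character on the left, PDF character on the right.
    pairs = [("\u897f", "\u2ec4"), ("\u89c1", "\u2ec5"), ("\u9996", "\u2fb8")]
    for html_ch, pdf_ch in pairs:
        print(describe(html_ch), "->", describe(pdf_ch))
```

Running it on the three pairs from the first post shows the HTML characters in the ideograph block and the PDF characters in the two radical blocks.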

jerwin · December 26, 2021, 10:34pm

This program may prove of some use:

When copying and pasting text from a PDF file, depending on the PDF, kanji characters such as “見” and “高” are often garbled into similar but different characters (e.g. special characters such as Kangxi Radical and CJK Radicals Supplement). This tool fixes such PDF that raises the garbled text extraction and generates a PDF that does not raise the garbled.

Unfortunately, the more extensive documentation (including the technical explanation of the problem) is in Japanese.
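
A different, much cruder approach than the tool quoted above is to leave the PDF alone and clean up the text after copying or extracting it. The Python sketch below is only an illustration under some assumptions: `fix_radicals` and `RADICAL_FIXES` are hypothetical names, and the hand-made table covers just the two CJK Radicals Supplement characters from this thread. Kangxi Radicals such as U+2FB8 have a compatibility decomposition, so NFKC normalization maps them back to the ordinary ideograph; as far as I know, most CJK Radicals Supplement characters do not, which is why they need an explicit table.

```python
import unicodedata

# Hand-made table for radicals that NFKC leaves untouched.
# Only the two pairs reported in this thread; not exhaustive.
RADICAL_FIXES = {
    "\u2ec4": "\u897f",  # ⻄ -> 西
    "\u2ec5": "\u89c1",  # ⻅ -> 见
}
_TABLE = str.maketrans(RADICAL_FIXES)

def _is_radical(ch: str) -> bool:
    """True for CJK Radicals Supplement and Kangxi Radicals (U+2E80..U+2FDF)."""
    return 0x2E80 <= ord(ch) <= 0x2FDF

def fix_radicals(text: str) -> str:
    """Replace radical look-alikes with ordinary ideographs where a mapping is known."""
    out = []
    for ch in text.translate(_TABLE):
        # Normalize only radical characters so that fullwidth punctuation
        # and other compatibility characters are left unchanged.
        out.append(unicodedata.normalize("NFKC", ch) if _is_radical(ch) else ch)
    return "".join(out)
```

Normalization is restricted to the radical ranges on purpose: running NFKC over the whole string would also rewrite fullwidth punctuation and similar characters. This only repairs text that has already been copied out; searching inside the PDF itself is not affected.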

Jason30 · December 27, 2021, 3:00am

In fact, any page that contains Chinese text triggers the Unicode problem.

Jason30 · December 27, 2021, 3:00am

Thanks, I’ll go check it out!