Jason30
December 23, 2021, 5:58am
1
When capturing a page as PDF, some Chinese characters end up with the wrong Unicode code points.
The same character has a different code point in the HTML and in the PDF.
The HTML code point is on the left, and the PDF code point is on the right.
西(\u897f)---- ⻄(\u2ec4)
见(\u89c1)---- ⻅(\u2ec5)
首(\u9996)---- ⾸(\u2fb8)
The HTML is encoded as UTF-8.
This error causes me a lot of trouble when searching and copying text.
How can I avoid this problem? Thanks!
Are you capturing as paginated or single page PDF?
Jason30
December 23, 2021, 12:15pm
3
Single page, Jim.
As far as I can tell from testing, both methods produce the problem and there seems to be no difference.
Is the problem caused by PDFKit on macOS?
(Document properties: Encoding software: macOS Version 12.1 Quartz PDFContext)
Yes, this is most likely due to PDFKit.
Do you have an example PDF with the other encoding?
Jason30
December 23, 2021, 12:43pm
5
Example PDF.pdf (465.2 KB)
The text I’ve highlighted all has the wrong Unicode code points (not the same as in the HTML), and there are more instances that I haven’t marked.
One strange thing:
When I open the PDF with Adobe Acrobat, PDF Expert, or PDFpenPro, they all show the same wrong Unicode.
When I open the PDF with DEVONthink or Preview, the vast majority of the wrong code points are back to normal, but some are still wrong.
Jason30
December 26, 2021, 3:59pm
6
So… is this an unsolvable problem?
If it’s a PDFKit issue, then Apple would have to resolve it.
What’s the URL for that captured page?
jerwin
December 26, 2021, 8:49pm
8
The left column is drawn from the CJK Unified Ideographs block; the right column is drawn from the equivalents in the CJK Radicals Supplement block (⾸, U+2FB8, is in the Kangxi Radicals block)…
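If you want to check this for yourself, here is a minimal Python sketch that prints the code point and Unicode name of each pair from the first post. The `PAIRS` list and the `describe` helper are just names I made up for illustration; only the standard `unicodedata` module is used.

```python
import unicodedata

# Pairs from the first post: (character in the HTML, character in the PDF).
# Written as escapes because the glyphs are nearly indistinguishable.
PAIRS = [
    ("\u897f", "\u2ec4"),  # 西 vs ⻄
    ("\u89c1", "\u2ec5"),  # 见 vs ⻅
    ("\u9996", "\u2fb8"),  # 首 vs ⾸
]

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

for html_ch, pdf_ch in PAIRS:
    print(f"{describe(html_ch)}  vs  {describe(pdf_ch)}")
```

The names make the difference visible: the HTML characters are reported as CJK UNIFIED IDEOGRAPH-xxxx, while the PDF characters are reported as CJK RADICAL or KANGXI RADICAL characters.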
jerwin
December 26, 2021, 10:34pm
9
This program may prove of some use:
When copying and pasting text from a PDF file, depending on the PDF, kanji characters such as “見” and “高” are often garbled into similar but different characters (e.g. special characters such as Kangxi Radical and CJK Radicals Supplement). This tool fixes such PDF that raises the garbled text extraction and generates a PDF that does not raise the garbled.
Unfortunately, the more extensive documentation (including the technical explanation of the problem) is in Japanese.
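If you only need to clean up text after copying it out of the PDF, something along these lines might be enough. This is a rough sketch, not what the tool above does (that one repairs the PDF itself). It assumes the text is already extracted into a Python string. `RADICAL_TO_IDEOGRAPH` and `fix_extracted_text` are hypothetical names, and the table holds only the two CJK Radicals Supplement pairs reported in this thread, so it would need extending for other characters. Kangxi Radicals are handled with NFKC, since those characters carry compatibility decompositions to the ordinary ideographs.

```python
import unicodedata

# Hypothetical lookup table, filled only with the pairs reported in this thread.
# Most CJK Radicals Supplement characters have no compatibility decomposition,
# so NFKC leaves them alone and they need an explicit mapping.
RADICAL_TO_IDEOGRAPH = {
    "\u2ec4": "\u897f",  # ⻄ -> 西
    "\u2ec5": "\u89c1",  # ⻅ -> 见
}

def fix_extracted_text(text: str) -> str:
    """Replace radical look-alikes with the ordinary CJK ideographs."""
    out = []
    for ch in text:
        if ch in RADICAL_TO_IDEOGRAPH:
            out.append(RADICAL_TO_IDEOGRAPH[ch])
        elif "\u2f00" <= ch <= "\u2fd5":
            # Kangxi Radicals block: NFKC maps each one to its ideograph.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            # Everything else is left untouched.
            out.append(ch)
    return "".join(out)

print(fix_extracted_text("\u2ec4\u2ec5\u2fb8"))
```

NFKC is applied only to the Kangxi Radicals range on purpose: running it over the whole string would also rewrite unrelated characters such as full-width letters.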
Jason30
December 27, 2021, 3:00am
10
In fact, the wrong-Unicode problem occurs on any page that contains Chinese text.
Jason30
December 27, 2021, 3:00am
11
Thanks, I’ll go check it out!