After installing the latest OS on all my devices, I fiddled with search in DT and DTTG. To my horror, the search function appears to be completely broken when it comes to Chinese texts.
A simple explanation on the peculiarities of Chinese script
Fundamentally, Chinese text is an assembly of individual characters. There are no spaces between characters; in fact, spaces are not required anywhere in Chinese writing.
Here is a simple sentence, consisting of 7 characters:
美国是一个大国
which roughly translates into:
The U.S. (美国) is (是) a (一个) large country (大国)
As you can see, individual characters (e.g. 美 and 国) can be joined together into a single phrase (e.g. 美国) that acts as one semantic unit.
Here comes the interesting thing: Each individual Chinese character carries its own meaning, hence is capable of being a semantic unit on its own. In sentences, however, there is no space (or any other marking) indicating whether a character should be joined to a neighbor or act as an independent. It all depends on the context.
Now let’s return to the example (美国是一个大国).
美 means beautiful. 国 means country. When these two characters are side by side, they can mean, depending on nothing other than the context, either beautiful country or the U.S.
国 means country. 是 means is. In certain contexts these two characters join into a single phrase, 国是, which dates back to ancient times and roughly means national affairs.
That is how flexible the Chinese script is. Consequently, Chinese texts are open to liberal interpretation.
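To make the ambiguity concrete, here is a minimal Python sketch that lists every way the example sentence can be split. The vocabulary is a toy one I made up for illustration (each single character plus the four phrases mentioned in this post); a real dictionary would be far larger.

```python
def segmentations(text, vocab):
    """Yield every way to split `text` into consecutive entries of `vocab`."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            for rest in segmentations(text[end:], vocab):
                yield [head] + rest


SENTENCE = "美国是一个大国"

# Hypothetical toy vocabulary: every single character of the sentence,
# plus the multi-character phrases discussed in this post.
VOCAB = set(SENTENCE) | {"美国", "国是", "一个", "大国"}

all_splits = list(segmentations(SENTENCE, VOCAB))
for split in all_splits:
    print(" / ".join(split))
```

Even with this tiny vocabulary the same seven characters can be split in a dozen ways, including both 美国 / 是 and 美 / 国是. Nothing in the text itself says which split is the right one.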
Searching in Chinese texts
With this flexibility in mind, text processors should not dictate whether or how characters are joined. Better to leave that to the human reader, who presumably is the authority on context and interpretation.
Instead, programs should treat every single Chinese character as a unit in search, in the same way they do every English word.
“United States” counts as two words. “美国” counts as two characters. As simple as that, isn’t it?
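Here is a minimal Python sketch of the behavior I would expect. It assumes that unquoted search terms are combined with AND regardless of their order, the way separate English words usually are. The function names are mine and have nothing to do with how DT or the OS is actually implemented.

```python
def char_terms(text):
    """Treat every non-whitespace character as its own search term."""
    return {ch for ch in text if not ch.isspace()}


def char_search(query, document):
    """Match when every character of the query occurs somewhere in the document."""
    terms = char_terms(query)
    return bool(terms) and terms <= char_terms(document)


DOCUMENT = "美国是一个大国"

print(char_search("美 国", DOCUMENT))           # both characters occur in the document
print(char_search("国是一个大国", DOCUMENT))    # every character occurs in the document
print(char_search("日本", DOCUMENT))            # neither character occurs in the document
```

Under this model, a query succeeds whenever all of its characters occur in the document, with or without spaces between them.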
To my amusement, that is not how it works in macOS 14 and iOS 17!
The presumed problem: Either DT or the OS, without authorization, joins characters with prejudice, and returns search results based on that prejudiced interpretation.
We’re going to use once again the sentence 美国是一个大国, which is included in the Markdown file below.
Chinese test 中文测试.zip (624 Bytes)
Apparently my computer has decided not to treat each character as a separate entity. Rather, it “slices” the sentence into five parts, (美国)(是)(一)(个)(大国), and performs the search on exactly those parts. As a result I get the following (somewhat confusing) search results:
1. 美国是一个大国 ✓
2. 美国是一个大 国 ✕
3. 美国是一个大 ✓✕ The file shows up, but nothing is highlighted.
4. 美国是一个 大 ✓
5. 国是一个大国 ✕
6. 国 是一个大国 ✕
7. 美国是一 ✓
8. 美国 是 一 ✓
9. 美国 ✓
10. 美 国 ✕
11. 国是 ✕
12. 国 是 ✓
13. 美 大国 ✕
14. 美国 大国 ✓
15. 国大个一是国美 ✕ Reversing the sentence, understandably, confuses the computer…
16. 国 大 个 一 是 国 美 ✕
17. 大国个一是美国 ✓ But if I reverse only the order of the slices, the file shows up, without any highlight.
Searches 1 to 14 are parts of the sentence in its normal order, so none of them should fail, yet many of them do. Presumably, 美国 is treated as a single, inseparable phrase, even though for search purposes it should be treated as two separate characters.
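To illustrate my guess, here is a rough Python sketch of a search that slices text before matching. Everything in it is an assumption on my part: the two-entry phrase dictionary, the greedy longest-match slicer, and the idea that the query is sliced the same way as the document. I have no knowledge of what DT or the OS really does.

```python
# Hypothetical phrase dictionary; a real tokenizer would use a far larger one.
PHRASES = {"美国", "大国"}


def slice_text(text, phrases=PHRASES, max_len=2):
    """Greedy longest-match slicing; unknown characters become single tokens."""
    tokens = []
    for chunk in text.split():
        i = 0
        while i < len(chunk):
            # Try the longest candidate first, falling back to one character.
            for size in range(min(max_len, len(chunk) - i), 0, -1):
                piece = chunk[i:i + size]
                if size == 1 or piece in phrases:
                    tokens.append(piece)
                    i += size
                    break
    return tokens


def sliced_search(query, document):
    """Match only when every slice of the query is also a slice of the document."""
    doc_tokens = set(slice_text(document))
    query_tokens = slice_text(query)
    return bool(query_tokens) and all(t in doc_tokens for t in query_tokens)


DOCUMENT = "美国是一个大国"
print(slice_text(DOCUMENT))
for query in ["美国", "美 国", "国是一个大国", "大国个一是美国"]:
    print(query, sliced_search(query, DOCUMENT))
```

This toy model slices the sentence into the same five parts and agrees with most of the results above: 美国 and 大国个一是美国 succeed, while 美 国 and 国是一个大国 fail. It does not explain everything, though. It rejects 美国是一个 大 and 国 是, both of which the real search accepted, so the actual logic must involve more than plain slicing.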
The results are consistent across DT and DTTG (latest version for both).
In Apple Notes, on the other hand, all 17 searches return the sentence correctly.
To sum up: search in Chinese text is completely broken at the moment. I guess the culprit is the automatic “slicing” of sentences, the result of which is erroneously used as the basis for search.
I don’t know whether the problem already existed in previous OS versions. This is the first time I have personally played around with search in DT.
I appreciate that the DT team has been keeping up with the latest OS release. My apologies if my concern adds to an already substantial pile of emerging problems.