Broken Search in Chinese texts in macOS 14 and iOS 17

meowky · September 28, 2023, 12:03pm

After installing the latest OS on all my devices, I fiddled with search in DT and DTTG. To my horror, the search function appears to be completely broken when it comes to Chinese text.

A simple explanation on the peculiarities of Chinese script

Fundamentally, Chinese text is a sequence of individual characters. There are no spaces between characters, and in fact spaces are not required anywhere in Chinese writing.

Here is a simple sentence, consisting of 7 characters:

美国是一个大国

which roughly translates into:

The U.S. (美国) is (是) a (一个) large country (大国)

As you can see, individual characters (e.g. 美 and 国) can be joined together into a single phrase (e.g. 美国) that acts as one semantic unit.

Here comes the interesting thing: Each individual Chinese character carries its own meaning, hence is capable of being a semantic unit on its own. In sentences, however, there is no space (or any other marking) indicating whether a character should be joined to a neighbor or stand on its own. It all depends on the context.

Now let’s return to the example (美国是一个大国).

美 means beautiful. 国 means country. When these two characters are side by side, they can mean, depending on nothing other than the context, either beautiful country or the U.S.

国 means country. 是 means is. In certain contexts these two characters join into a single phrase 国是, which dates back to ancient times and roughly means national affairs.

That is how flexible the Chinese script is. Consequently, Chinese texts are open to liberal interpretation.
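To make the ambiguity concrete, here is a minimal Python sketch that lists every way the example sentence can be split into dictionary entries. The `LEXICON` set and the `segmentations` function are my own illustration, not anything taken from DT or the OS, and a real dictionary would be far larger.

```python
# Hypothetical mini-lexicon for illustration only.
LEXICON = {"美", "国", "是", "一", "个", "大", "美国", "国是", "一个", "大国"}


def segmentations(text, lexicon=LEXICON):
    """Return every way to split `text` into consecutive lexicon entries."""
    if not text:
        return [[]]  # one way to split the empty string: no tokens
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this head and recurse on the remainder.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


all_splits = segmentations("美国是一个大国")
for split in all_splits:
    print(" | ".join(split))
```

Even with this tiny lexicon there are 12 valid splits of a 7-character sentence, including both 美国 | 是 | … and 美 | 国是 | …. Nothing in the text itself says which one is intended.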

Searching in Chinese texts

With this flexibility in mind, text processors should not dictate whether or how characters are joined. It is better to leave that to the human reader, who presumably is the authority on context and interpretation.

Instead, programs should treat every single Chinese character as a unit in search, in the same way they do every English word.

“United States” counts as two words. “美国” counts as two characters. As simple as that, isn’t it?
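As a rough sketch of what I mean by treating every character as a unit, the Python below indexes a text character by character and offers two kinds of match: all characters present in any order (like searching for several English words), and characters appearing consecutively (like a phrase search). The function names are hypothetical and this is only an illustration of the idea, not how DT, DTTG, or any Apple framework is implemented.

```python
def char_tokens(text):
    """Treat every non-whitespace character as its own search unit."""
    return [ch for ch in text if not ch.isspace()]


def contains_all(document, query):
    """True if every character of the query occurs somewhere in the document."""
    q = set(char_tokens(query))
    return bool(q) and q <= set(char_tokens(document))


def contains_phrase(document, query):
    """True if the query's characters occur consecutively in the document."""
    doc, q = char_tokens(document), char_tokens(query)
    if not q:
        return False
    return any(doc[i:i + len(q)] == q for i in range(len(doc) - len(q) + 1))


sentence = "美国是一个大国"
print(contains_phrase(sentence, "国是"))          # True
print(contains_phrase(sentence, "国大个一是国美"))  # False (reversed order)
print(contains_all(sentence, "国大个一是国美"))     # True (order ignored)
```

Either way, no decision about which characters "belong together" is ever made, so a query like 国是 cannot be lost to a segmentation choice.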

To my amusement, that is not how it works in macOS 14 and iOS 17!

The presumed problem: Either DT or the OS, without authorization, joins characters with prejudice, and returns search results based on that prejudiced interpretation.

We’re going to use once again the sentence 美国是一个大国, which is included in the Markdown file below.

Chinese test 中文测试.zip (624 Bytes)

Apparently my computer has decided not to treat each character as a separate entity. Rather, it “slices” the sentence into five parts, (美国)(是)(一)(个)(大国), and performs the search on exactly that. As a result I get the following (somewhat confusing) search results:

1. 美国是一个大国 ✓
2. 美国是一个大国 ✕
3. 美国是一个大 ✓✕ File shows up but no highlight.
4. 美国是一个大 ✓
5. 国是一个大国 ✕
6. 国是一个大国 ✕
7. 美国是一 ✓
8. 美国是一 ✓
9. 美国 ✓
10. 美国 ✕
11. 国是 ✕
12. 国是 ✓
13. 美大国 ✕
14. 美国大国 ✓
15. 国大个一是国美 ✕ Reversing the sentence, understandably, confuses the computer…
16. 国大个一是国美 ✕
17. 大国个一是美国 ✓ But if I only reverse the sliced sentence, file would show up without highlight.

Searches 1 to 14 are parts of the sentence in normal sequence, so they must not fail, but many of them in fact do. Presumably, 美国 is treated as a single, inseparable phrase, even though it should for all purposes be treated as two separate characters.
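Here is a toy Python model of the behaviour I suspect. It is purely my guess at the mechanism, not DT's or the OS's actual code: the `LEXICON`, the greedy longest-match `segment` function, and `token_search` are all hypothetical. The document and the query are both sliced into words, and a query only matches if every one of its slices is also a slice of the document.

```python
# Hypothetical multi-character words the segmenter "knows".
LEXICON = {"美国", "大国"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text):
    """Greedy longest-match slicing; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in LEXICON:
                tokens.append(piece)
                i += length
                break
    return tokens


def token_search(document, query):
    """Match only if every slice of the query is also a slice of the document."""
    index = set(segment(document))
    return all(token in index for token in segment(query))


sentence = "美国是一个大国"
print(segment(sentence))               # ['美国', '是', '一', '个', '大国']
print(token_search(sentence, "美国"))   # True
print(token_search(sentence, "国是"))   # False: a lone 国 was never indexed
```

This model slices the sentence into the same five parts and shows how 国是 and 美大国 can fail while 美国大国 and 大国个一是美国 succeed. It does not reproduce every row in my list (for instance 美国是一个大), so the real behaviour is clearly more complicated than this sketch.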

The results are consistent across DT and DTTG (latest version for both).

On the other hand, in Apple Notes, all 17 searches return the sentence correctly.

To sum up: Search in Chinese text is completely broken at the moment. I guess the culprit is the automatic “slicing” of sentences, the result of which is erroneously used as the basis for search.

I don’t know whether the problem already existed in previous OS versions. This is the first time I have personally played around with search in DT.

I appreciate that the DT team has been keeping up with the latest OS release. My apologies if my concern adds to an already substantial pile of emerging problems.

BLUEFROG · September 28, 2023, 12:16pm

This is in the next release of DEVONthink To Go…

Good?

meowky · September 28, 2023, 12:21pm

That would be great! I’d also like to see the same fix applied to DT for Mac.

Thank you.

BLUEFROG · September 28, 2023, 12:27pm

Try text:~"人間相互" on the Mac.

meowky · September 28, 2023, 12:43pm

The tilde doesn’t seem to work. I tried to match the 2nd and 3rd characters (国是) in my sentence 美国是一个大国, without success.

However, if I mark up 国是 (with **strong**, for example) in my note, it works.

My guess: the markup forces the program to “rethink” how to “slice” the sentence by separating 美 (first character) and 国, which it would otherwise semantically join together. The problem is, the program should never attempt to semantically join characters together in the first place.

chrillek · September 28, 2023, 1:05pm

Thanks a lot for this detailed explanation. Very enlightening.

BLUEFROG · September 28, 2023, 2:44pm

I’m not seeing any issue with the documents I have here.

Hold the Option key and choose Help > Report bug to start a support ticket. Please attach the document you are trying to find and the search term you’re using.

Ripple · October 16, 2023, 1:59pm

Searching Chinese text in DTTG is completely unusable.

BLUEFROG · October 16, 2023, 2:11pm

Providing concrete information, especially with example documents and search terms – screen captures included – would be much more helpful. You can select ? > Contact Us to start a support ticket from DEVONthink To Go’s databases screen.

Ripple · October 16, 2023, 5:23pm

Here is a bookmark in my database. Its name is

三分钟搞懂iPhone 12发布后的设计尺寸调整-iphone12设计尺寸规范

—-
Here are the queries, and whether they find the bookmark:

设计 (without pressing button )
设计 (pressed button )
name:设计
name:设计*

BLUEFROG · October 16, 2023, 6:03pm

I am seeing no issue with the exact name and search string you provided.

Are you running DEVONthink To Go 3.7.6?

Ripple · October 16, 2023, 6:29pm

Yes, I’m on DTTG 3.7.6, iOS 17.0.3

Ripple · October 31, 2023, 4:26am

The bug was fixed in version 3.7.7. Searching in Chinese now seems to work well. Thank you!

BLUEFROG · October 31, 2023, 5:24am

Excellent and thanks for the confirmation!

meowky · November 1, 2023, 2:34am

Unfortunately, DTTG 3.7.7 did not fix my issue. String “美国是一个大国” in markdown content (not document name) still refuses to match search string “国是” in any way.