A bug with Japanese word indexing

I’ve recently started using DT, and it seems really promising. While I was figuring out whether I could use it as my primary information-collecting software, I encountered some bugs with the Japanese language.

The DT development team might be aware of this issue, but I couldn’t find any information in the forum, so I’ll report it here.

What I’ve noticed is that DT’s “See Also” and “Classify” don’t work well with Japanese documents, although they work fine with English ones. When I try to use these functions, the list shows only a very limited number of related documents or groups written in Japanese.

DT seems to index Japanese words only when the individual words are separated by single-byte white space and/or line breaks. I’m assuming that DT’s word-recognition logic is based on Western languages, which delimit words with single-byte white space, periods, commas, etc.

The Wiki-style auto-linking function has the same problem, because DT cannot tell where a word ends in Japanese. StickyBrain seems to have exactly the same issue with Wiki-style links.

In Asian languages like Japanese, Chinese, and Korean, we don’t use white space to separate words. We just keep writing without any spaces (with occasional periods, commas, etc.). This is probably the core reason why DT can’t correctly index Japanese words.
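A quick sketch of the mismatch described above: a tokenizer that splits on white space (a hypothetical stand-in for DT’s suspected logic, not its actual code) separates an English sentence into words, but returns a space-free Japanese sentence as one giant token.

```python
def naive_tokenize(text: str) -> list[str]:
    """Split on white space, the way a Western-oriented indexer might."""
    return text.split()

english = "DT indexes these words separately"
japanese = "私は東京に住んでいます"  # "I live in Tokyo" -- no spaces between words

print(naive_tokenize(english))   # ['DT', 'indexes', 'these', 'words', 'separately']
print(naive_tokenize(japanese))  # ['私は東京に住んでいます'] -- the whole sentence as one "word"
```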

  • Global search works totally fine (by selecting “phrase” in the search options, as suggested in your FAQ).
  • Character count works fine with Japanese (double-byte characters).
  • Writing documents in Japanese works totally fine.

“See Also” and “Classify” are probably what make DT so unique. I would be really happy if the DT team could modify the word-indexing logic to make it compatible with Japanese characters. There are probably ways to index Asian double-byte words and documents.

In any case, thanks a million for a fabulous piece of software!
I’d love to see DT become 100% compatible with Asian languages in the future.

I’ve noticed the same problem with both Chinese and Japanese documents, and unfortunately it seems to be a difficult one to solve. With languages like these, which do not use spaces to mark word boundaries, DT seems able to recognize punctuation, but treats anything between two punctuation marks as a single word. This is of course practically useless for Classify and See Also, unless you have documents that contain exactly the same sentence or phrase between, say, two commas.
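As an illustration of the behaviour described above (a hypothetical sketch, not DT’s actual implementation): if the only recognised boundaries are punctuation marks, each whole clause comes out as a single “word”.

```python
import re

def punctuation_tokenize(text: str) -> list[str]:
    """Split only at Japanese/Western punctuation, keeping everything in between intact."""
    return [t for t in re.split(r"[、。,.]", text) if t]

text = "今日は晴れです、明日は雨でしょう。"  # "It is sunny today; it will probably rain tomorrow."
print(punctuation_tokenize(text))
# ['今日は晴れです', '明日は雨でしょう'] -- two clause-length "words"
```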

At the user level there seems to be little or nothing to do. At the developer level, one way to partially solve the problem would be to have DT work in conjunction with an external or internal dictionary, in order to properly recognize what a Japanese or Chinese “word” is. This might work quite well for Chinese (where morphology is virtually nonexistent) but would cause big headaches for Japanese (which has a rich variety of inflections, complicated by the use of colloquial and polite forms at the morphological level, etc.).
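The dictionary idea can be sketched as a greedy longest-match segmenter over a user-supplied word list. Real segmenters (MeCab, ICU, and the like) also handle inflection and ambiguity, which this toy version ignores; the word list here is invented for the example.

```python
# A tiny word list standing in for the external dictionary.
DICTIONARY = {"私", "は", "東京", "に", "住んで", "います"}

def segment(text: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-match segmentation; unknown graphs fall back to length 1."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest substring starting at position i that is in the dictionary.
        for length in range(len(text) - i, 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(segment("私は東京に住んでいます", DICTIONARY))
# ['私', 'は', '東京', 'に', '住んで', 'います']
```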

Another, and easier, possibility would be to lower the minimum length requirement for what DT considers a word. If you look under Tools > Concordance, you will notice that the list contains words of at least 3 letters for Roman-alphabet languages, and at least 3 graphs for Chinese and Japanese. At the bottom of the Concordance window there is a box to enter the length of words, but the lower bound cannot be less than 3; if you enter 1 or 2, it automatically reverts to 3. For users of Chinese and Japanese documents, it would be helpful to optionally ignore both the punctuation and the minimum-length requirements, leaving DT free to list every single graph that appears in the document and to use that list for Classify and See Also.

There are three drawbacks, however, that immediately come to mind with this approach. The first is that DT would then also index words like “of”, “at”, and “to” in an English document, but this could be solved by limiting the “one letter/graph” option to double-byte languages.

The second, and more serious, drawback is related to the nature of the Japanese and Chinese languages, where there is a difference between a word and the graph(s) used to write it. For example, if one of your documents contains the word “Tokyo” written in Japanese, and you use Classify and See Also to find other documents containing this word, the search would return any document that contains the graphs “to” and “kyo”, whether they are used in the compound word “Tokyo” or in other words. This would definitely not help, unless DT is supported by a dictionary that helps it understand that “to” followed by “kyo” makes “Tokyo”.

(An old Mac OS Classic concordance program, called CONC, was able to index documents based on a user-supplied list of words. The list could reside in an external file, so the idea of an external dictionary to use in conjunction with DT is perhaps not so extravagant.)

The third drawback is specific to the Japanese language. With the “single graph” option enabled, See Also and Classify would return a large number of irrelevant results based on the occurrence of kana, including grammatical words like “wa”, “no”, and so forth. Would it make sense to leave kana out of the word list? Probably not, or you would also skip foreign words written in katakana.

So, for us users of Chinese and Japanese documents, it’s better to forget these functions of DT and enjoy the rest. I don’t think we should call this a DT “bug”, but it certainly is an inconvenience.

This is my understanding of the problem, anyway. It would be useful if the DT people, when they have time, could offer further explanations, and perhaps let us know whether there are plans to go beyond this limitation.

I have been using DEVONthink for a couple of days, and it seems to work fine as far as extracting Japanese words goes.
(I am using DEVONthink 1.9.2 and Mac OS X 10.3.8.)

DEVONthink recognises Japanese words even though they are not separated by spaces, though some of them are ridiculously long for a word. I think this is acceptable, because it is hard for a computer to recognise Japanese words.

However, I think I have found another big problem.
DEVONthink won’t find Japanese zenkaku (2-byte, full-width) Roman letters and numbers, such as “iPod” or “599”, which are used very often on some Japanese newspaper websites.

When I try to find those words in the search dialogue, DEVONthink just says there is no such word.
The search engine won’t find a full-width “iPod” whether the keyword is typed in full-width or in half-width characters; both are simply ignored.

I hope this problem will be solved soon.


I am happy to see more people sharing my problems. I had a longer discussion about this issue with Christian last spring, and he has already done a lot to improve the functionality. Still, it has not come far enough to be usable for classification, which is a tremendous drawback.

I use DT with the same attitude that someone else expressed in this thread:

On the other hand, apps like CircusPonies Notebook can indeed separate words; they create great indices. I asked Christian to look into that, which he surely did, and I hope he will do so again as soon as version 1.9 is out. There are quite a few Chinese, Korean, and Japanese speakers who could make good use of this function. :slight_smile:


Thank you for all the feedback. We’re aware of the issues related to Asian languages, but a solution probably won’t be available before v2.0 (as that version will revise many things).

I live in Taiwan; about half of my files contain Chinese.

I just bought DT 3 days ago.

I didn’t really notice this problem until yesterday, when I tried to search for something and got nothing.

I hope that v2.0 will find a solution; even an option to search character by character in Chinese would be nice.

I do like this software anyway. Thanks!