Japanese in the English version of DevonThink

I live in Japan, and am delighted to see that there is now a Japanese version of DT Pro. I plan to continue using my English-language version, but it would be helpful if I could save some of my Japanese documents in DT Pro. In the past, however, DT usually mangled the Japanese text in my documents to the point where the text was unrecoverable, so I’ve restricted DT Pro to English only for the last year or more.

What is the situation now with Japanese text in the English language version of DT Pro?

DEVONthink Pro should save Japanese text (Unicode) without any problems. Searches can only be made using the Phrase operator as Japanese doesn’t use word delimiters and so searching for single words does not work at the moment.

What kind of problems did you encounter? If a document causes problems, please send it to us so that we can see whether we can fix them.

Eric.

Hi,
I also use DTP Office, mainly with English, German AND Japanese.
As many people know, Japanese websites are often horribly written (e.g. missing encoding tags in the HTML header and the like), so I often have to add a meta tag to my archived Japanese web pages.
While doing this, I wondered whether DTP’s contextual menu could be extended with an encoding sub-menu, or some other simple way to add an encoding line to the meta tags of a web document.
Also, as many people have wished in OCR-related postings, a Japanese (or generally Asian-language) OCR module would be greatly appreciated.
Joha
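
For anyone patching archived pages by hand in the meantime, the meta-tag fix can be scripted outside DT. What follows is only a minimal sketch in Python. The function name `add_charset_meta` is made up for illustration, it is not a DEVONthink feature, and it assumes you already know the page's real encoding (Shift_JIS is just the example here).

```python
import re

def add_charset_meta(html: str, charset: str) -> str:
    """Insert a charset declaration after <head> unless the page already has one.

    `charset` is an assumption supplied by the caller; nothing is detected here.
    """
    # Leave pages alone if any <meta ...> tag already mentions a charset.
    if re.search(r"<meta[^>]+charset\s*=", html, flags=re.IGNORECASE):
        return html

    tag = f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'

    # Add the tag right after the first opening <head> tag (not <header>).
    patched, count = re.subn(
        r"(<head(\s[^>]*)?>)",
        lambda m: m.group(1) + "\n" + tag,
        html,
        count=1,
        flags=re.IGNORECASE,
    )
    # Pages without any <head> get the tag prepended instead.
    return patched if count else tag + "\n" + html


page = "<html><head><title>t</title></head><body></body></html>"
patched_page = add_charset_meta(page, "Shift_JIS")
print(patched_page)
```

Running it a second time on the patched page changes nothing, so it should be safe to apply to a whole folder of clippings.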

Regarding the garbled encoding lines in Japanese HTML pages: We are trying to work around these things as well as possible. However, for bad code there will never be a “catch-all.”

As for the OCR module: IRIS, the maker of the OCR engine, delivers Asian languages as a separate product, so we are not able to offer this today. We are talking to them about Asian languages as well as Hebrew, though.

Hi,

For a year or so I have been getting along quite well with the Japanese (Chinese etc.) website problem: I only collect RTFs, clipped as UTF-8 from my browser.

I do not use the DT browser since there is no way to store encodings for certain sites as in OmniWeb, or to change encodings on the fly as in other browsers. So I have the correct text view in my focused RTF clip, and another problem is solved as well: the downloading of websites, even archived ones (at least DT behaves like that). The metadata for the clip shows the URL, so that I can always return to the website if necessary.

RTF clips instead of web pages make the database a lot faster, and there is no garbage from the rest of the page in the file. This system works quite well for my purposes.

Best,
Maria

When I make Japanese pages in DEVONthink and save them to an iPod, can they be saved in Shift JIS (with a proper tag) rather than UTF-16 or UTF-8, even if the original master is kept in Unicode?

This is because the iPod has a stupidly low 4096-byte limit, and Unicode files are significantly larger than SJIS files with the same content. Apple and others recommend using SJIS for this reason.
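
To put rough numbers on the size difference, here is a small Python sketch that encodes the same Japanese sentence three ways and checks it against the 4096-byte limit mentioned above. The helper names (`encoded_sizes`, `fits_in_note`) and the sample sentence are only illustrative assumptions; nothing here reflects how DEVONthink or the iPod actually writes files.

```python
IPOD_NOTE_LIMIT = 4096  # bytes; the limit quoted in the post above

def encoded_sizes(text, encodings=("shift_jis", "utf-8", "utf-16")):
    """Return the byte length of `text` in each encoding."""
    return {enc: len(text.encode(enc)) for enc in encodings}

def fits_in_note(text, encoding):
    """True if `text`, encoded as `encoding`, stays within the note limit."""
    return len(text.encode(encoding)) <= IPOD_NOTE_LIMIT

sample = "このプログラムの国際化は不十分である。"  # 19 characters
sizes = encoded_sizes(sample)
print(sizes)  # {'shift_jis': 38, 'utf-8': 57, 'utf-16': 40}

long_text = sample * 100
print(fits_in_note(long_text, "shift_jis"))  # True  (3800 bytes)
print(fits_in_note(long_text, "utf-8"))      # False (5700 bytes)
```

For pure kana and kanji, UTF-8 comes out about 1.5 times the size of Shift JIS, while UTF-16 is nearly the same size apart from the two-byte byte-order mark that Python's `utf-16` codec adds. Text with a lot of ASCII shifts the balance the other way for UTF-16.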

The search function of DEVONthink doesn’t work well with Japanese text. It doesn’t work well enough to be useful.

For example, if a plain text page contains a sentence like

このプログラムの国際化は不十分である。

and you search for 国際化, you won’t find that document.

I’m guessing that DEVONthink parses the text into words by looking at punctuation and whitespace, but this doesn’t work with Japanese text.

Unix users are familiar with free search engines that have handled this problem well for many years, such as namazu + kakasi. I hope DEVONthink incorporates these technologies for handling Japanese text.
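
The guess above is easy to reproduce in a few lines of Python. This is only an illustration of delimiter-based tokenizing in general, with a made-up delimiter set and function name; it is not DEVONthink's actual indexing code.

```python
import re

# Hypothetical delimiter set: whitespace plus common Japanese/Western punctuation.
DELIMITERS = r"[\s、。,.!?]+"

def split_on_delimiters(text):
    """Split text into 'words' wherever a delimiter occurs."""
    return [token for token in re.split(DELIMITERS, text) if token]

document = "このプログラムの国際化は不十分である。"
tokens = split_on_delimiters(document)

# The whole sentence comes back as one token, so a word-level lookup misses it,
# while a plain substring (phrase-style) match still finds it.
word_hit = "国際化" in tokens
phrase_hit = "国際化" in document
print(len(tokens), word_hit, phrase_hit)  # 1 False True
```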

I think you will find that it works if you change the Operator in the Search window to Phrase.

Rolf

Ok, in that particular case, it does.

But then if I search for 国際化 不十分 then no option seems to make it work…

This is because your search string is not a single phrase. DTP only searches for one phrase at a time. These searches are reliable, but I agree that there is a lot more that DT could do for its international / multilingual customers.

Maria
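
Outside DT, the multi-phrase case is simple to express: treat each space-separated piece of the query as its own phrase and require all of them to appear. A minimal Python sketch follows; the function name `contains_all_phrases` is hypothetical, and this describes a workaround, not anything DTP does internally.

```python
def contains_all_phrases(document: str, query: str) -> bool:
    """True if every whitespace-separated phrase in `query` occurs in `document`.

    str.split() with no arguments also splits on the full-width
    ideographic space (U+3000) that Japanese input methods often produce.
    """
    phrases = query.split()
    return bool(phrases) and all(phrase in document for phrase in phrases)

document = "このプログラムの国際化は不十分である。"
print(contains_all_phrases(document, "国際化 不十分"))   # True
print(contains_all_phrases(document, "国際化　不十分"))  # True (full-width space)
print(contains_all_phrases(document, "国際化 完璧"))     # False
```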

I agree, and it may be as simple as incorporating a mechanism for dividing the text into words. kakasi (a free software package) has been around for many years for this purpose in the case of Japanese.

I initially bought DTP for my research and other activities (in English) so I’m not unhappy with the product, but I strongly wish that it handled other languages a little better.

Besides searching, word counts are incorrect and the classify function doesn’t work either.
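
As a rough idea of what an index for unsegmented text can look like, here is a Python sketch that uses overlapping character pairs (bigrams) instead of dictionary-based word division. To be clear, this is a different and cruder technique than what kakasi does, all names in it are invented for the example, and it says nothing about DEVONthink's internals.

```python
from collections import defaultdict

def bigrams(text):
    """Return the set of overlapping two-character pieces of `text`."""
    chars = [c for c in text if not c.isspace()]
    return {"".join(chars[i:i + 2]) for i in range(len(chars) - 1)}

def build_index(docs):
    """Map each bigram to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in bigrams(text):
            index[gram].add(doc_id)
    return index

def search(index, docs, query):
    """Return ids of documents containing every space-separated query term."""
    terms = query.split()
    if not terms:
        return []
    candidates = set(docs)
    for term in terms:
        for gram in bigrams(term):  # one-character terms have no bigrams
            candidates &= index.get(gram, set())
    # Bigram hits are only candidates; confirm with a real substring check.
    return sorted(d for d in candidates if all(t in docs[d] for t in terms))

docs = {
    "a": "このプログラムの国際化は不十分である。",
    "b": "国際化は完了した。",
}
index = build_index(docs)
print(search(index, docs, "国際化 不十分"))  # ['a']
print(search(index, docs, "国際化"))         # ['a', 'b']
```

The final substring check matters because two documents can share all the bigrams of a term without containing the term itself.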

Ryuji,
could you contact Christian about that? He has been trying to improve the system, but I think he did not know about this possibility. I have never heard of this.

Thanks,
Maria

I am just beginning to evaluate DEVONthink, and since I use mixed roman and East Asian languages in my research, I’m very interested in your comments, which I have quoted below.

  1. I wonder if you could provide a link for kakasi, the search software program that you mentioned? I just spent a few moments trying to find it using Google (in both English and Japanese), but no luck.

In Mac OS 9 there is a wonderful program called mGrepApp, which searches a folder of files (following aliases), and when you double-click on one of the hits it opens the file at that very location in one’s text editor (Tex-Edit Plus in my case). So wonderful! But does anything like that exist for Mac OS X, or even Windows?

  2. Could you point me to any information that would help me evaluate DEVONthink? Note that I work on Chinese and Japanese Buddhism, so I have a lot of files in either UTF-8, Chinese Big5, or various Japanese encodings. I have tried experimenting a bit with loading files into my DEVONthink database, but I haven’t figured out the key to making them searchable. In some cases, the text files are imported with zero characters showing… Is there a way to fix that?

Best wishes,

– John

You can find out all about kakasi here:

kakasi.namazu.org/index.html.ja

Thanks for the link to the kakasi site. I have also checked out Namazu – one of the journals I cite regularly (J. of Indian and Buddhist Studies) uses Namazu for its search engine, now that I think of it.

I tried using TextWrangler on some Japanese and Chinese text files, but so far I can’t get it to produce any search results… I’ll keep at it.

More importantly, is there any simple answer to my questions about evaluating DT for mixed roman/CJK use?

I only speak Japanese and English (unless you count Engrish as another language) so I can’t help you on mixed use of Chinese and Japanese.

I’m a long-time BSD Unix/Linux/SunOS user and I use emacs on Mac OS X when I edit Japanese text. If you are not afraid of emacs, try Aquamacs. Recent versions are pretty good.

As for mixed Asian text (Chinese, Korean, and mainly Japanese) in DT, there are no limitations on creating, importing, exporting and searching. I can recommend DT in that respect without hesitation.

The sad side of the story is that you cannot use “see also”, automatic classification etc. because DT has no means of creating indexes for meanings in different languages. This is a problem of languages, not of writing systems, though. The same problem applies to synonyms within one language or to different transcriptions.

Maria

Has DTPO made any significant improvement in searching for multiple phrases in Japanese since I last brought up this issue above?

I had this problem initially too. The trick seems to be to first save the document in TextWrangler in the appropriate encoding (e.g., Shift JIS) or, better still, convert the text to Unicode.

Rolf
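
If there are too many files to re-save one by one in TextWrangler, the same conversion can be scripted. Below is a minimal Python sketch under the assumption that you know each file's original encoding (Shift JIS, EUC-JP, Big5 and so on); the function names are invented for the example, and guessing the encoding automatically is deliberately left out because it is unreliable.

```python
from pathlib import Path

def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from `source_encoding` and re-encode them as UTF-8.

    Decoding is strict, so a wrong `source_encoding` usually raises
    UnicodeDecodeError instead of silently producing garbage.
    """
    return raw.decode(source_encoding).encode("utf-8")

def convert_file(src: Path, dst: Path, source_encoding: str) -> None:
    """Write a UTF-8 copy of `src` to `dst`, leaving the original untouched."""
    dst.write_bytes(transcode_to_utf8(src.read_bytes(), source_encoding))

# In-memory example: Shift JIS bytes in, UTF-8 bytes out.
sjis_bytes = "国際化".encode("shift_jis")
utf8_bytes = transcode_to_utf8(sjis_bytes, "shift_jis")
print(utf8_bytes.decode("utf-8"))  # 国際化
```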