Importing/indexing Japanese files

I am still getting to grips with DP, so I hope I am not asking too elementary a question.

I have a large number of Japanese text files that I want to either import or index to facilitate searching. However, whether I import or index them, the files appear as garbled characters. I have tried changing the encoding and font options, but to no avail. What am I doing wrong?

TIA for any help,

Rolf

I suppose you’re using Mac OS X with English as the primary language. I sometimes deal with Japanese text, but have never run Mac OS X with Japanese as the primary language.

Can you see the contents of those files in TextEdit.app? Are those files really plain text files? Or rich text files? If you tried Preferences/Import/Encodings, that applies only to plain text files as far as I know.

Do you know what kind of encoding was used on those files? There are at least three other encodings in Japan, besides Unicode.
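If the encoding is unknown, one rough way to narrow it down is to try decoding a file’s bytes with each likely candidate and see which ones succeed. Here is a minimal Python sketch of that idea. The candidate list (UTF-8, Shift JIS, EUC-JP, ISO-2022-JP) is only my assumption about the encodings most likely to turn up, and `candidate_encodings` is a made-up helper name, nothing to do with DTP. A successful decode only means the bytes are valid in that encoding, not that it is the right one, so treat the result as a hint.

```python
# Candidate encodings to try, using Python's standard codec names.
# This list is an assumption; extend it for other languages.
CANDIDATES = ["utf-8", "shift_jis", "euc_jp", "iso2022_jp"]


def candidate_encodings(data, candidates=CANDIDATES):
    """Return the candidates under which `data` (bytes) decodes without error."""
    matches = []
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches
```

To use it, read a file in binary mode and pass the bytes in, for example `candidate_encodings(open("sample.txt", "rb").read())`, where `sample.txt` stands for one of your own files. Pure ASCII files will match every candidate, since all four encodings agree on ASCII.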

The files are plain text files in Shift JIS encoding and don’t display properly in TextEdit either. (I usually use LightWayText to view them in OS X.)

Rolf

Rolf, now I see your point. I’ve just tested a Shift-JIS encoded text file and tried to import it with the encoding set to Shift-JIS in Preferences/Import/Plain text file.

And it did NOT work.

You’re right. I’ve been dealing with Unicode only and didn’t realize this problem.

The reason I asked about opening the files in TextEdit.app is that I believed DTP’s text encoding/decoding engine was the same as TextEdit.app’s. You can open your files in TextEdit.app. Try this, if you don’t mind.

Run TextEdit.app
File/Open
Choose Shift JIS in Plain Text Encoding drop down menu
Select a file

I’ve tested this with Shift-JIS encoded text files and could view them. However, when I tried to import them with the same encoding in DTP, the result was illegible.

I’m Korean and we use several other encodings, too. After Shift-JIS, I tested a non-Unicode Korean encoding, but it didn’t work either. This appears to be a bug in DTP. I have no idea what Preferences/Import/Encoding is doing, given that TextEdit.app handles the same files correctly.

I’ll write a detailed follow-up soon. I’ll probably test some Western encodings as well. I hope someone in the development department can confirm this problem.
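For what it’s worth, illegible output like this is what you would expect whenever Shift-JIS bytes get decoded with some other encoding. A small Python sketch to illustrate; using Mac Roman as the wrong encoding is purely my assumption for the demonstration, since I don’t know what DTP actually does internally.

```python
text = "日本語"

# The bytes as they would be stored in a Shift JIS text file (two bytes per character here).
raw = text.encode("shift_jis")

# Decoding with the wrong encoding maps each byte to an unrelated character.
garbled = raw.decode("mac_roman")

# Decoding with the matching encoding recovers the original text.
restored = raw.decode("shift_jis")

print(garbled)
print(restored)
```

The point is that the file itself is intact: the same bytes are readable or unreadable depending only on which encoding the reader applies.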

Rolf, you didn’t do anything wrong.

Okay, here is the follow-up. Please consider this a bug report.

I’ve tested encodings as follows.

Korean EUC
Japanese Shift-JIS
Traditional Chinese

None of them was decoded correctly when imported with the matching encoding selected in Preferences/Import/Plain text file/Encoding. This could be a very serious problem for anyone who has to deal with text files in those encodings. Can someone in charge of this part of development look into and verify this problem? If you’d like test files, let me know where to send them. I’ll provide some samples.

I’m not sure about Chinese encodings, but I know for a fact that Korean EUC and Japanese Shift-JIS are still the dominant encodings, because they were in use long before Unicode and have remained so since. I believe in Unicode, but the reality is that Unicode is not yet a major player in CJK (China, Japan, Korea).

Thanks for the feedback and the confirmation that I wasn’t imagining the problem. And yes, you’re right, the files do open correctly in TextEdit if one chooses Shift JIS encoding from the drop-down menu.

Let’s hope this problem can be solved. Until then I suppose I’ll have to stick with MgrepApp for searching these files.

Rolf

It can be solved if the developers of DTP want to, hopefully once they become aware of this glitch in TEC (Text Encoding Converter). As far as I know, TEC is provided by Apple and used by many Mac OS X apps, including TextEdit.app. Until then, any Chinese, Korean, or Japanese user who wants to import text files in an encoding other than UTF-8 should be aware that DTP’s TEC is not working as it should. I hope this little thread saves you time and energy.

In the meantime, you could convert your existing text files to UTF-8 using Cyclone.

free.abracode.com/cyclone/

Japanese users might have a better tool than Cyclone, but it is the only one I know of, unless you want to use ‘iconv’ in Terminal.app.

This app can handle multiple files, though its user interface is not very comfortable.

Just for your information, if you want to try converting Shift-JIS to Unicode using Cyclone, use the following setup.

Under Input Text Encoding,
Standard - Other
Encoding - Japanese Shift-JIS
Variant - Basic

Under Output Text Encoding,
Standard - Unicode
Encoding - Unicode

This way you can search the files in Spotlight (and import them into DTP until the DTP guys fix the TEC problem).
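If you would rather script the conversion than click through Cyclone, the same Shift-JIS to UTF-8 step can be sketched in a few lines of Python. This is only a rough sketch under my own assumptions: the function names, the `*.txt` pattern, and the folder names in the usage note are hypothetical, and it assumes every file in the folder really is Shift-JIS. It writes converted copies to a separate folder and leaves the originals alone.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="shift_jis"):
    """Read `src` in `source_encoding` and write the same text to `dst` as UTF-8."""
    raw = Path(src).read_bytes()
    # Strict decoding: raises UnicodeDecodeError if the file is not in this encoding.
    text = raw.decode(source_encoding)
    # Writing bytes keeps the original line endings exactly as they were.
    Path(dst).write_bytes(text.encode("utf-8"))


def convert_folder(src_dir, dst_dir, source_encoding="shift_jis", pattern="*.txt"):
    """Convert every matching file in `src_dir` into `dst_dir`; originals are untouched."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    converted = []
    for path in sorted(Path(src_dir).glob(pattern)):
        target = out / path.name
        convert_to_utf8(path, target, source_encoding)
        converted.append(target)
    return converted
```

For example, `convert_folder("sjis_files", "utf8_files")` would convert a folder named `sjis_files` into a new folder `utf8_files`. If a file fails to decode, it may be in a variant encoding; Python’s `cp932` codec covers the Windows flavour of Shift-JIS and might be worth trying as `source_encoding`.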

Many thanks for reminding me of Cyclone. (I didn’t realise that there was an OS X version.) This should solve my problem for the time being.

Rolf

Thank you for the report, we will look into this. Please could you send some example files (if possible) where you state the encoding and maybe a PDF printout (so we know what it should look like) to our support email address (you can use the “Help->Support” menu item for this). Please use the subject: “bug #40”.

Thanks!

annard, I’ve sent example files regarding bug #40 to support. Thanks for your attention.

I was hoping that this bug might have been fixed in DTP 1.2, but it does not appear to have been. Is there any possibility of having it fixed in the future?

Rolf

I’ve been importing a lot of Chinese web archives. They’re usually encoded in GB. The pages import correctly but the titles show up garbled. DEVONThink could read the title of the page and then set the proper title.

I am also having this issue dealt with through email.

Could you send me an example archive or URL? Then I could check this over here - thank you!

Sure. I also emailed you a few minutes ago. Any GB link like this:

thebeijingnews.com/

I haven’t tried Big5, but I assume any non-Unicode Asian-language encoding would behave like this.

I’m using the bookmarklet to archive from Safari. I’m not sure if other methods would work the same.

This problem doesn’t seem to exist with Chinese files. I’ve just tried to import a TextEdit RTF file into DT four times, each time using different Unicode and non-Unicode fonts for both traditional and simplified graphs. DT displays all four files perfectly. Searches also work fine.

Then I tried pasting one of the four files (in the Apple Li Song font) into a Word .doc document and importing it into DT. It’s perfectly visible, although DT (or rather the TextEdit engine of DT) displays it in a different font.

Then I’ve tried to save one of the four TextEdit files as pure text (Format --> Make Plain Text) and to import again. This time DT imported the file, but it was entirely blank in the browser window.

In fact, it seems simply impossible to import pure text files in Chinese into DT, including, unfortunately, Nisus Classic files (of which I still have many hundreds on my computer; these files display as pure ASCII gibberish). The easiest options seem to be:

(1) Copy a text file (such as Nisus Classic) into TextEdit, and import that file into DT.

(2) Convert a TextEdit text file to “rich text” (Format --> Make Rich Text), save it, and import it into DT.

(3) Create a new “rich text” file within DT (Data --> New --> Rich Text) and paste your document into it.

If there are no footnotes or other special formatting, all these options seem to work fine with Chinese files. You might try to see if they also work with your Japanese files. In theory, they should.

I’ve tried saving that page as a web archive and dragging the archive into DT. I can read it fine, and the title is displayed properly.

Thanks for the suggestions. (I have a lot of Nisus Classic files too.) All three methods work with Japanese files as well. The only problem is that it would be rather time-consuming doing this with hundreds of individual files. Let’s hope that the import bug will be fixed in the not-too-distant future.

Rolf

Rolf and others with the same problem,

did you try Cyclone? It works fine for me, and it is scriptable. Years ago I had a small AppleScript application that could convert whole folders with Cyclone from one encoding to another. Send me a PM if you have trouble scripting Cyclone, and I will check my older CDs.

Best,
Maria

I did use Cyclone many years ago, but have never tried scripting it. For the time being I’m using TextWrangler to search multiple files of this type. It has the added advantage of showing the search term in context in the results window, an option I would love to see implemented in DTP.

Rolf

Has there been any progress on this issue? The last message to this thread seems to be from about a year ago.

I am currently evaluating DT for my own use, and the issue of importing Chinese, Japanese and other mixed roman/CJK-language files is of fundamental importance to me. (I work on Chinese and Japanese Buddhism.)

Incidentally, thanks for the intros to Cyclone and TextWrangler. The first I had not heard of and will try immediately; the second I have had on my computer for a while, but never really used.

– John