Text encoding issues

Hello,

Unfortunately for me, I have an enormous number of text files that use two basic encodings: Latin-1 and UTF-8. Which file uses which encoding is hard to say, since they are all mixed together and hard to separate.

In DTP I have the import text encoding set to “Automatic”. However, when I drag in a plain text file encoded in either Latin-1 or UTF-8, I see garbage characters in the resulting entry instead of accented characters. Which garbage characters appear depends on what the text file’s encoding really was.

You can test this with the following file:

johnwiegley.com/Test.txt.gz

Once that fails to import, try converting it to UTF-8 (I used TextMate) and drag and drop it again. It still doesn’t work! I thought that by setting the import encoding to “Automatic”, DTP would Do The Right Thing, no matter what the input encoding was.

As things stand, the only way I’ve found to preserve the accents is to convert the text file to UTF-8 manually outside of DTP, change my import text encoding to plain UTF-8, and then import the file.
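
For what it’s worth, this is roughly what that manual step looks like as a small Python sketch (the file names are just examples, and it assumes the source really is Latin-1):

```python
# Re-encode a single Latin-1 text file as UTF-8 before importing it into DTP.
# "Test.txt" / "Test-utf8.txt" are example names; a file that was already
# UTF-8 would come out mangled, so only run this on known Latin-1 files.
with open("Test.txt", "r", encoding="latin-1") as src:
    text = src.read()

with open("Test-utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(text)
```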

Any ideas?

John

As there are lots of different text encodings (more than 100), it’s unfortunately not always possible to recognize the proper encoding automatically. Only ASCII files and files with Unicode header bytes (a byte order mark) are recognized reliably.

But surely Latin-1 and UTF-8 are extremely common? Emacs identifies which of these encodings a file uses very reliably, and without a BOM. That is actually how I managed to convert all of my files over to UTF-8: by letting Emacs convert only the ones that needed it.
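
The same “only convert what needs it” idea can be sketched in a few lines of Python (my rough approximation, not what Emacs actually does internally): try to decode the bytes as UTF-8, and only fall back to Latin-1 when that fails.

```python
from pathlib import Path

# Anything that already decodes as UTF-8 is left alone; the rest is
# assumed to be Latin-1 (which always decodes) and re-encoded in place.
def to_utf8(path):
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")            # already valid UTF-8, nothing to do
        return
    except UnicodeDecodeError:
        text = data.decode("latin-1")   # Latin-1 decoding never fails
    Path(path).write_bytes(text.encode("utf-8"))
```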

John

Hi,

Even BBEdit has problems automatically identifying simple UTF-8 or Latin-1 files, at least in my experience. I don’t trust automatic detection and have set my environment to default to UTF-8 in any case. When I had to convert a bunch of files from other encodings to UTF-8, I used Cyclone, which is scriptable.
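
If you don’t have a scriptable converter at hand, the batch job can also be sketched in Python (the folder name is just an example, and like John’s approach it assumes anything that isn’t valid UTF-8 is Latin-1):

```python
from pathlib import Path

# Walk a folder and re-encode every .txt file that is not already valid UTF-8.
for path in Path("notes").rglob("*.txt"):
    data = path.read_bytes()
    try:
        data.decode("utf-8")            # already UTF-8, leave it alone
    except UnicodeDecodeError:
        path.write_bytes(data.decode("latin-1").encode("utf-8"))
```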

Maria