File encoding problem

chrillek · January 25, 2022, 2:43pm

Exporting as what? When I export an MD file with umlauts “as text”, I get a file selector that lets me set the file encoding explicitly.

In my case, it is set to “automatic”, which in fact results in the same broken output as in your case. However, if I set the encoding to “UTF-8”, everything is fine.

In fact “automatic” seems to mean “Macintosh-Roman” (at least that’s what the outcome is). Huh? Seriously? 2022? Who tf is even using this stuff anymore? I suppose that this is rubble from the olden days… In any case, if you select “UTF-8”, you should be on the safe side.
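To illustrate why Mac Roman output looks broken in a tool that expects UTF-8, here is a minimal Python sketch. The sample string is made up, and the reader is only assumed to decode as UTF-8; nothing here reproduces what DEVONthink does internally.

```python
# Hypothetical sample text with umlauts.
text = "Grüße aus Köln"

# Bytes as a Mac Roman export would write them (one byte per character).
mac_bytes = text.encode("mac_roman")

# A reader that assumes UTF-8 cannot decode these bytes cleanly.
try:
    mac_bytes.decode("utf-8")
    readable_as_utf8 = True
except UnicodeDecodeError:
    readable_as_utf8 = False

# With a lenient decoder the umlauts turn into U+FFFD replacement characters.
garbled = mac_bytes.decode("utf-8", errors="replace")

# The same text written as UTF-8 survives the round trip unchanged.
utf8_ok = text.encode("utf-8").decode("utf-8") == text
```

Plain ASCII text is byte-identical in both encodings, so the problem only shows up once non-ASCII characters such as umlauts are involved.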

The exact same thing happens, btw, when one exports the “Document”, but in that case the save dialog does not provide an encoding option.

Maybe setting the “Preferences/Files/Import/Encoding” to UTF-8 helps in both cases?

@cgrunenberg: According to the documentation, “automatic” in the preferences means “let DT choose the best encoding” – is that still “Mac Roman” and/or does the setting here influence the encoding used for export? And: Is it conceivable that the system-wide settings for language etc. influence this? It seems the OP as well as I are working in a German locale, whereas at least @Stephen_C and @rmschne are probably in a US or UK one.

OTOH: Why do you even export the file to use it in Obsidian? IIRC (and I may be wrong), the usual approach to having DT and Obsidian work together involves indexing the Obsidian vault in DT?

cgrunenberg · January 25, 2022, 2:46pm

The automatic export encoding isn’t static; the smallest one is preferred. It might be, e.g., ASCII, Mac Roman, UTF-8 or UTF-16.
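A rough way to picture a “smallest encoding” rule is the Python sketch below. The candidate list, its order, and the function name `smallest_encoding` are assumptions for illustration; this is not the actual macOS or DEVONthink logic.

```python
from typing import Optional

# Hypothetical candidate list, taken from the encodings named above.
CANDIDATES = ["ascii", "mac_roman", "utf-8", "utf-16"]

def smallest_encoding(text: str) -> Optional[str]:
    """Return the candidate that encodes `text` losslessly in the fewest bytes.

    Ties go to the earlier candidate; None means no candidate could encode it.
    """
    best_name, best_size = None, None
    for name in CANDIDATES:
        try:
            size = len(text.encode(name))
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the text
        if best_size is None or size < best_size:
            best_name, best_size = name, size
    return best_name
```

Under this rule, pure English text would come out as ASCII (which is also valid UTF-8), while German text with umlauts would come out as Mac Roman, because each umlaut needs one byte there and two in UTF-8. That would be consistent with the problem showing up only for some users.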

chrillek · January 25, 2022, 3:05pm

Is that really reasonable nowadays (and is Mac Roman really “smaller” than Latin-1)? If I understand you correctly, DT checks whether an encoding other than UTF-8 results in a smaller file size? That would, I think, break a lot of things (MD for example, but also HTML/XML). And why would it do that?

At the very least, I’d expect “automatic” to mean “unchanged” since there are options to change the encoding if that’s really necessary. And this “smallest” encoding would not explain why the OP and I see the characters changed whereas you and others don’t (or does it?).

BTW: If I convert the MD file to HTML, everything is fine (ignoring the outdated DOCTYPE declaration). There’s even a charset declaration for utf-8. But when I export that as text with “automatic” encoding, I get Mac Roman again. Whereas exporting the very same HTML file as document gets me an unmodified HTML file.

Let’s say that this behaviour is at least not very consistent? “Export as document” should not modify anything, and certainly not without giving the user a chance to prevent that. And implicitly changing the encoding of text files on export if the poor user decides to go with the seemingly innocent “automatic” is one of the surprises that might not be welcome.

cgrunenberg · January 25, 2022, 3:33pm

The encoding is actually determined by macOS; DEVONthink doesn’t try every possible encoding on its own. And this affects only File > Export > As plain text…, which of course lets you choose the desired encoding.

GoetzLi · January 25, 2022, 6:22pm

Exporting as what? When I export an MD file with umlauts “as text”, I get a file selector that lets me set the file encoding explicitly.

Yes, I am aware of the possibility to export a markdown file as text, including the choice of the desired file encoding. I tested this and it works fine. Selecting UTF-8 as the encoding generates a file that stores special characters (umlauts) correctly, and Obsidian recognizes this file properly. However, the drawback of this method is that I cannot set the file extension to “.md”; I am stuck with “.txt”, which is not what I want.

OTOH: Why do you even export the file to use it in Obsidian? IIRC (and I may be wrong), the usual approach to having DT and Obsidian work together involves indexing the Obsidian vault in DT?

I am actually in the process of deciding where to store my notes: internally in Devonthink’s database, or externally in the file system with the files indexed in Devonthink. Devonthink’s input options offer a fast way to create fleeting notes on the fly. It is often stated that everything stored in Devonthink can be gotten out again without loss, and this is what I tested in order to find a future-proof workflow. Now it seems that, at least for plain text files containing special characters, there might be a problem when trying to get these files out of Devonthink.

My workaround is to index the folder where my notes are stored in Devonthink. All new notes generated in Devonthink have to be generated in this indexed folder (and not e.g. in Devonthink’s input and later exported). In this way a file in the file system is created directly and this file is encoded as desired.

BLUEFROG · January 25, 2022, 11:57pm

What font are you using?
Are you using a custom stylesheet?

GoetzLi · January 26, 2022, 6:30am

I tested it both with and without a custom stylesheet - no difference.
Without a stylesheet the font is Times Roman.

GoetzLi · January 26, 2022, 6:42am

Another workaround: I create a markdown file within Devonthink, e.g. in the inbox using the Devonthink sorter. Exporting the file via the Export → Document command in the menu produces a file in the file system with corrupted special characters (umlauts). However, moving the file to the file system by drag and drop does not corrupt the special characters. The same can be achieved by moving the file into an indexed group within Devonthink.

cgrunenberg · January 26, 2022, 7:53am

This command does indeed not retain the current encoding. The next release will fix this. All other possibilities (File > Export > Files & Folders…, drag & drop, Data > Open With or indexing etc.) work as expected and File > Export > As Text… includes an option to choose the desired encoding.
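For files that were already exported with the affected command, a small repair script is one option. The following Python sketch assumes the broken files are in Mac Roman, as observed earlier in this thread; the function name `reencode_to_utf8` and its default are hypothetical. It leaves files alone that already decode as UTF-8.

```python
from pathlib import Path

def reencode_to_utf8(path: Path, source_encoding: str = "mac_roman") -> bool:
    """Rewrite `path` as UTF-8 if it is not already valid UTF-8.

    Returns True if the file was rewritten, False if it was left untouched.
    """
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8 (pure ASCII also lands here)
    except UnicodeDecodeError:
        pass
    # Assumption: anything that is not UTF-8 was written as `source_encoding`.
    text = raw.decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
    return True
```

Usage would be something like `reencode_to_utf8(Path("note.md"))`, ideally on a copy first, since the assumption about the source encoding cannot be verified from the bytes alone.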

GoetzLi · January 26, 2022, 8:36am

Thank you for this explanation and for your help. I have a workaround, so I can live with this.

Best regards,

Goetz

rfog · January 26, 2022, 9:22am

The encoding of a text is determined by some educated guesses when the document is opened; it is a “statistical” process. Normally it gets it right, but not always.
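Why such guesses can miss is easy to show with a much cruder approach than a statistical detector: an ordered fallback. The Python sketch below is only an illustration; the function name `guess_and_decode` and the fallback order are assumptions, not how macOS or DEVONthink detect encodings.

```python
def guess_and_decode(raw: bytes, fallbacks=("mac_roman", "latin-1")):
    """Try strict UTF-8 first, then single-byte fallbacks in order.

    Returns a (text, encoding_name) tuple.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    for name in fallbacks:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")

# Bytes that were really written as Latin-1 ...
latin1_bytes = "Grüße".encode("latin-1")
# ... are accepted by the first fallback without error and come out wrong.
text, guessed = guess_and_decode(latin1_bytes)
```

Single-byte encodings accept almost any byte sequence, so a wrong guess raises no error; it just silently produces different characters. UTF-8 is stricter, which is why checking it first is a common heuristic.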

chrillek · January 26, 2022, 9:25am

I can understand that. But on saving, the original encoding should be retained unless the user changes it, even if @cgrunenberg says that the “automatic” option behaves as designed.
Changing it behind the scenes without user interaction is bad design, IMHO.

cgrunenberg · January 26, 2022, 9:47am

File > Export > As Text… supports many file formats and multiple selected items, so there’s actually no such thing as an original encoding in this case. File > Export > Document… should of course retain the encoding (and will soon).

chrillek · January 26, 2022, 10:18am

Then maybe explaining the behaviour of “automatic” encoding in this case in the documentation might be helpful.

…and it is exporting all of them in one single merged text file (as per the documentation).

Which still makes me wonder why anyone would want that to be in Mac Roman instead of UTF-8 (or any Unicode variant). I love the minimum surprise rule

cgrunenberg · January 26, 2022, 10:36am

In that case I’d recommend simply choosing the desired encoding.