Sanitize / normalize file names

marekkowalczyk · October 21, 2020, 10:25pm

I prefer to have my file names normalized to ‘pandoc-types-types-for-representing-a-structured-document’ rather than the raw ‘pandoc-types: Types for representing a structured document’, which can cause a lot of problems.

So I put together a little script that normalizes file names, while saving the original name into the Finder Comments field.

tell application id "DNtp"
    repeat with thisRecord in (selection as list)
        set theName to the name of thisRecord as text
        -- keep the original name in the Finder comment
        set the comment of thisRecord to theName
        -- "quoted form of" escapes apostrophes and other shell metacharacters
        set theNewName to do shell script "~/go/bin/sanitize " & quoted form of theName
        set the name of thisRecord to theNewName
    end repeat
end tell

You can find the ‘sanitize’ command on my GitHub. It is written in Go.
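The source of ‘sanitize’ is not shown in this thread, so the following is only a rough sketch of what the ASCII part of such a normalization might look like. The function name slugify and its rules (lowercase, keep a-z and 0-9, collapse everything else into one hyphen) are assumptions inferred from the example above, not the actual implementation. Non-ASCII letters are not transliterated here; they are simply treated as separators.

```go
package main

import (
	"fmt"
	"strings"
)

// slugify is a hypothetical helper, not the real sanitize code.
// It lowercases the input, keeps ASCII letters and digits, and turns
// every run of other characters into a single hyphen.
func slugify(name string) string {
	var b strings.Builder
	pendingHyphen := false
	for _, r := range strings.ToLower(name) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			// Emit at most one hyphen between two kept characters,
			// and never at the very start of the result.
			if pendingHyphen && b.Len() > 0 {
				b.WriteByte('-')
			}
			pendingHyphen = false
			b.WriteRune(r)
		} else {
			pendingHyphen = true
		}
	}
	return b.String()
}

func main() {
	fmt.Println(slugify("pandoc-types: Types for representing a structured document"))
	// prints: pandoc-types-types-for-representing-a-structured-document
}
```

Because a hyphen is only written when another kept character follows, the result never starts or ends with a hyphen.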

chrillek · October 22, 2020, 7:56am

Out of curiosity: why do you only convert Ł/ł to their ASCII “equivalents”, not the other Polish diacritics like the n and e with cedilla? I suppose that nowadays all filesystems support Unicode, so why get rid of Ł/ł?

marekkowalczyk · October 22, 2020, 9:08am

Actually, all non-ASCII chars will be transformed to their ASCII equivalents, e.g.:

'Kąt na łące żre źrebię' → 'kat-na-lace-zre-zrebie'

This is achieved by the function

runes.Remove(runes.In(unicode.Mn))

which strips the [Mark, Nonspacing] characters from the Unicode runes they are combined with. E.g., ą in its decomposed form is actually a combined with

0328 Below_Attached # Mn ( ̨ ) COMBINING OGONEK

(see here for a complete list of Mn characters).

Curiously, however, unlike all other Polish diacritic characters, Ł and ł are not created by combining L or l with any [Mark, Nonspacing] character; they are characters of their own, with no decomposition. Therefore they need to be handled separately, as

 runes.Remove(runes.In(unicode.Mn))

will not work for them: runes.In(unicode.Mn) matches no rune in them, so there is nothing for runes.Remove() to remove.

I hope this is not too convoluted an explanation.

chrillek · October 22, 2020, 9:18am

Not at all, thanks for the explanation. I have no knowledge of Go, so I thought that you were handling only the upper/lowercase Ł. I suppose that one cannot combine L with any diacritical mark to get Ł, because there is no such mark (and please excuse me for using “cedilla”; I didn’t think of ogonek).
The other question, however, remains: Why change these Unicode characters to an ASCII equivalent? Does it have something to do with Apple’s weird decision to use combining characters instead of the precomposed ones?

marekkowalczyk · October 22, 2020, 9:48am

The idea was triggered by my obsessive-compulsive nature and a sense of aesthetics. I really hate those weird filenames.

Also, this should keep filenames safe for practically any existing or future filesystem (sans filename length, which is easy to fix), which matters for long-term archiving and portability.

The use of nonspacing combining marks is actually a feature of Unicode itself and as such is available in all filesystems supporting Unicode names, not just macOS.
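As a small illustration of why this matters for file names, the sketch below compares the two Unicode spellings of ą. The variable names are only for illustration. Both strings render identically, yet they are different code point sequences, so a byte-wise comparison treats them as different names.

```go
package main

import "fmt"

var (
	precomposed = "\u0105"  // ą as one code point, LATIN SMALL LETTER A WITH OGONEK
	decomposed  = "a\u0328" // a followed by U+0328 COMBINING OGONEK
)

func main() {
	fmt.Println(precomposed == decomposed)                         // false
	fmt.Println(len(precomposed), len(decomposed))                 // 2 3 (UTF-8 bytes)
	fmt.Println(len([]rune(precomposed)), len([]rune(decomposed))) // 1 2 (code points)
}
```

Stripping everything down to ASCII sidesteps the question of which of the two forms a given filesystem or tool happens to store.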

chrillek · October 22, 2020, 10:10am

Ah, it’s all in the eye of the beholder: I actually like spaces and special characters in file names. Always found ASCII too limited.
I’m aware that combining characters are part of Unicode. But afaik only Apple chose those for its file names, which leads to some interesting behavior if one tries to use such files in WordPress on a Linux system. Or: “28 years after Unicode, we still can’t handle accents: PDF + macOS + URL = chaos” (The Eclectic Light Company).
Quite entertaining.