Re: [Nmh-workers] Non-ASCII Characters in bodies and subjects

2014-06-17 15:11:52
if not for file names?

The Unix kernel stores filenames as a run of bytes, not including `/'
and NUL.

That's not universally true anymore.  Some newer filesystems mandate
that filenames be UTF-8 and enforce normalization rules (Mac OS X and
Solaris are two notable examples).  Obviously some charset conversion is
happening for non-UTF-8 locales.  I think that's inevitable, given the
issues with composed and decomposed characters.

For example, let's say you see this:

% ls
Résumé.txt      Résumé.txt

How can that be?  Well, they aren't the same sequence of bytes.  In the
first one the “é” is U+00E9.  In the second, it's U+0065 U+0301 (a regular
“e” followed by a combining acute accent).  The only way of resolving
this is to apply the Unicode normalization rules and do filename
lookup on the normalized names.  Mac OS X actually rewrites all of the
filenames using Normalization Form D (all characters in decomposed form,
which means the base character followed by the combining accents).  I think
that sucks, but they didn't ask me.  Solaris is better: the original bytes
are preserved, but lookup is done using normalized names, so you can't
have two filenames that differ only in how their characters are composed.
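
To make the composed/decomposed distinction concrete, here is a minimal
sketch in Python using the standard unicodedata module.  It only
illustrates the comparison; it is not how any particular filesystem
implements lookup, and the helper name same_name is made up for this
example.

```python
import unicodedata

# The two names from the ls listing above, spelled out as code points.
composed = "R\u00e9sum\u00e9.txt"        # "é" as the single code point U+00E9
decomposed = "Re\u0301sume\u0301.txt"    # "e" followed by U+0301 (combining acute)

# They render identically but are different strings and different bytes.
print(composed == decomposed)
print(composed.encode("utf-8"))
print(decomposed.encode("utf-8"))

def same_name(a, b, form="NFD"):
    """Hypothetical helper: compare two names after Unicode normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Compared after normalization, the two names are equal.
print(same_name(composed, decomposed))
```

A filesystem that only normalizes at lookup time would keep whichever
byte sequence was used at creation and apply a comparison along these
lines when searching; one that rewrites names would store the normalized
form instead.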

--Ken

_______________________________________________
Nmh-workers mailing list
Nmh-workers(_at_)nongnu(_dot_)org
https://lists.nongnu.org/mailman/listinfo/nmh-workers