
Re: [Nmh-workers] Non-ASCII Characters in bodies and subjects

2014-06-18 06:08:39
Hello Ken,

The Unix kernel stores filenames as a run of bytes, not including
`/' and NUL.

That's not universally true anymore.  Some newer filesystems are
mandating that filenames are UTF-8 and enforcing normalization rules
(MacOS X and Solaris are two notable examples).

Thanks, I didn't know.  Haven't used Solaris in years, and never bought
Apple.

The only way of resolving this is to use the normalization rules for
Unicode and do filename searching that way;

Sure.

MacOS X actually rewrites all of the filenames using Normalization
Form D (all characters in decomposed form, which means the regular
character followed by the combining accents) and I think that sucks,
but they didn't ask me.

I think I agree with you.
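
For anyone following along, here is a rough Python sketch of what that rewriting does to a name. It only uses the standard unicodedata module; the variable names are mine, and it illustrates Normalization Form D itself, not the Mac OS X filesystem code.

```python
import unicodedata

# Precomposed spelling: U+00E9 is "e with acute" as one code point.
name = "r\u00e9sum\u00e9"

# NFD splits each accented letter into base letter + combining accent.
nfd = unicodedata.normalize("NFD", name)

# NFC goes the other way and recombines them.
nfc = unicodedata.normalize("NFC", nfd)

print(len(name), len(nfd))             # 6 8
print([f"{ord(c):04X}" for c in nfd])  # 0072 0065 0301 0073 ...
print(nfc == name)                     # True
print(name.encode("utf-8") == nfd.encode("utf-8"))  # False
```

The last line is the part that matters to a kernel that treats names as bytes: the two spellings look identical on screen but are different byte strings.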

Solaris is better; the original bytes are preserved, but lookup is
done using normalized names so you can't have two filenames with the
same characters.
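
If I understand the description correctly, the behaviour is roughly that of the toy directory below: the name is stored as given, but comparison happens on a normalized key. This is only a sketch; the class and method names are made up, and the choice of NFC for the key is my assumption, since I don't know which form Solaris really compares in.

```python
import unicodedata


class NormalizingDirectory:
    """Toy directory: stores names as given, compares them normalized."""

    def __init__(self):
        self._entries = {}  # normalized key -> name as originally given

    @staticmethod
    def _key(name):
        # Assumption: NFC as the comparison form.
        return unicodedata.normalize("NFC", name)

    def create(self, name):
        key = self._key(name)
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = name

    def lookup(self, name):
        # Returns the original spelling, whichever spelling was asked for.
        return self._entries[self._key(name)]


d = NormalizingDirectory()
decomposed = "re\u0301sume\u0301"
composed = "r\u00e9sum\u00e9"

d.create(decomposed)
print(d.lookup(composed) == decomposed)  # True

try:
    d.create(composed)
except FileExistsError:
    print("second spelling rejected")
```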

What about globbing, especially on Mac OS X?  Given your two examples on
Linux with bash,

    $ touch résumé résumé
    $ ls r?sum?
    résumé
    $ ls r?sum? | recode ..dump
    UCS2   Mne   Description

    0072   r     latin small letter r
    00E9   e'    latin small letter e with acute
    0073   s     latin small letter s
    0075   u     latin small letter u
    006D   m     latin small letter m
    00E9   e'    latin small letter e with acute
    000A   LF    line feed (lf)
    $
    $ ls r??sum??
    résumé
    $

Do you think NFKC would be better, so that ? often matches what appears
as a single rune, and fi matches the fi ligature (U+FB01)?
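
To make the question concrete, here is a small Python sketch of the idea, with the standard fnmatch module standing in for the shell's globbing. The helper nfkc_match is hypothetical, and normalizing the pattern as well as the name is my own assumption about how it would have to work.

```python
import fnmatch
import unicodedata


def nfkc_match(name, pattern):
    """Match name against a glob pattern after NFKC-normalizing both."""
    norm_name = unicodedata.normalize("NFKC", name)
    norm_pattern = unicodedata.normalize("NFKC", pattern)
    return fnmatch.fnmatchcase(norm_name, norm_pattern)


decomposed = "re\u0301sume\u0301"  # e + combining acute, twice
ligature = "\ufb01le"              # U+FB01 fi ligature, then "le"

# Without normalization, ? sees the combining accent as its own character.
print(fnmatch.fnmatchcase(decomposed, "r?sum?"))    # False
print(fnmatch.fnmatchcase(decomposed, "r??sum??"))  # True

# With NFKC, one ? covers the accented letter, and "fi" finds the ligature.
print(nfkc_match(decomposed, "r?sum?"))  # True
print(nfkc_match(ligature, "fi*"))       # True

# "Often", not "always": x + combining acute has no precomposed form.
print(nfkc_match("x\u0301", "?"))        # False
```

The last case is why I say "often": when Unicode has no precomposed character for a combination, it stays as two code points even after NFKC.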

Cheers, Ralph.

_______________________________________________
Nmh-workers mailing list
Nmh-workers(_at_)nongnu(_dot_)org
https://lists.nongnu.org/mailman/listinfo/nmh-workers
