nmh-workers

Re: [Nmh-workers] Non-ASCII Characters in bodies and subjects

2014-06-17 14:52:09
Hi Norm,

> So you are saying that "normal unix commands", such as grep, wc, tr
> etc, do or someday the GNU versions will, know about UTF-8, at least
> for file contents,

Yes, they do, today.  And have done for quite a while.  You need your
locale environment variables (LANG, LC_CTYPE, or LC_ALL) set up properly
so `locale' reports UTF-8 (or `utf8').  Then...

    $ grep -i roman chars
    Roman numerals Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ Ⅼ Ⅽ Ⅾ Ⅿ
    $ grep £ chars
    Currency £ € cent-¢
    $ grep -i roman chars | sed -r 's/.*(.)/\1/'
    Ⅿ
    $ grep -i roman chars | sed -r 's/.*(.)/\1/' | hd
    00000000  e2 85 af 0a                                       |....|
    00000004
    $ 
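
The byte-versus-character distinction can also be checked with wc.  This
is only a sketch: it assumes the script is saved as UTF-8, the variable
names are mine, and the count in the current locale depends on how your
environment is set up.

```shell
s='Ⅿ'    # U+216F, which is three bytes (e2 85 af) in UTF-8

# wc -c counts bytes, so the locale makes no difference here.
bytes=$(printf '%s' "$s" | wc -c | tr -d ' ')

# wc -m counts characters, so the answer depends on the locale.
chars_here=$(printf '%s' "$s" | wc -m | tr -d ' ')
chars_c=$(printf '%s' "$s" | LC_ALL=C wc -m | tr -d ' ')

echo "bytes=$bytes chars(current locale)=$chars_here chars(C locale)=$chars_c"
```

In a UTF-8 locale I would expect the middle figure to be 1; in the C
locale every byte is counted as a character of its own.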

> if not for file names?

The Unix kernel stores filenames as a run of bytes, any bytes except `/'
and NUL.  It places no interpretation on them itself.  Userspace is able
to do so, but two users might see different names for the same file just
as they might `see' the same text file differently if they think the
bytes represent different encodings.

    $ >pound-£
    $ ls
    pound-£
    $ LC_ALL=C ls
    pound-??
    $ 
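
To see that only the presentation differs and not the stored name, dump
the bytes rather than look at them on a terminal.  A rough sketch,
assuming the script is saved as UTF-8 and that ls passes names through
unaltered when writing to a pipe (GNU ls does, as far as I know); the
scratch directory and variable names are just for illustration.

```shell
# Use a scratch directory so nothing else is touched.
dir=$(mktemp -d)
: >"$dir/pound-£"

# Dump the name's bytes as hex, once in the current locale and once in C.
cur_view=$(ls "$dir" | od -An -tx1)
c_view=$(LC_ALL=C ls "$dir" | od -An -tx1)

echo "current locale:$cur_view"
echo "C locale:      $c_view"

rm -r "$dir"
```

Both lines should show the same bytes, with the pound sign stored as
c2 a3.  The `??' above is ls being careful about what it sends to a
terminal, one `?' per byte it can't print in the C locale.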

But really, these days, the whole world is UTF-8.  Unless it's Microsoft
with their backwards backwards-compatibility view of the world, and no
one cares about them.

Cheers, Ralph.

_______________________________________________
Nmh-workers mailing list
Nmh-workers(_at_)nongnu(_dot_)org
https://lists.nongnu.org/mailman/listinfo/nmh-workers
