I am not at all secure about how the standard GNU utilities will handle
non-ASCII characters. For example, 'wc -c' just counts bytes. True,
the man page talks about bytes, not characters, but I am still left
uncomfortable. Then there are the dozens of bash, python, and perl
scripts that I have accumulated over the years.
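For what it's worth, the byte/character split is visible in wc itself: POSIX gives it -c for bytes and -m for characters, and under a UTF-8 locale the two disagree on non-ASCII input. A quick illustration (assumes the script and terminal are UTF-8):

```shell
# "né" is three bytes in UTF-8 (n = 1 byte, é = 2 bytes) but two characters.
printf 'né' | wc -c    # bytes: 3
printf 'né' | wc -m    # characters: 2 under a UTF-8 locale (3 under the C locale)
```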
My experience has been that a modern system handles 8-bit characters just
fine.
Now, where things get a little tricky is with multibyte character sets
like UTF-8. Not everyone has broken from the paradigm that 1 byte == 1
character, as you noted (we had to do a bunch of work in the format
engine to fix that). But UTF-8 has the excellent property that
non-ASCII characters are encoded as sequences of 8-bit bytes that can
never be mistaken for ASCII (no surprise, since it was designed by two
of the original Unix geeks), so I haven't come across a program where it
truly breaks. I don't write Python, but Perl's support for UTF-8 is
excellent, and I would be shocked if the situation for Python weren't the
same.
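That design property is easy to demonstrate: every byte of a multibyte UTF-8 sequence has its high bit set, so a byte-oriented tool searching for or deleting ASCII characters can never match inside, or corrupt, a multibyte character. A small sketch (again assuming a UTF-8 terminal):

```shell
# tr here is byte-oriented: it deletes only the ASCII letters a-z.
# The multibyte ï and é pass through untouched, because none of
# their bytes fall in the ASCII range.
printf 'naïve café\n' | tr -d 'a-z'    # -> "ï é"

# Likewise, a byte-oriented grep for the two-byte sequence é
# matches it reliably and never falsely matches ASCII text.
printf 'naïve café\n' | grep -c 'é'    # -> 1
```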
I jumped whole-hog into UTF-8 a few years ago, and I haven't regretted
it one bit.
--Ken
_______________________________________________
Nmh-workers mailing list
Nmh-workers@nongnu.org
https://lists.nongnu.org/mailman/listinfo/nmh-workers