Hi,
Norm wrote:
I am not at all secure about how the standard GNU utilities will
handle non-ascii characters. For example, 'wc -c', just counts
bytes.
Christian has pointed out that -c has remained bytes, with --bytes as a
synonym, because otherwise too many things would break, and that -m,
AKA --chars, has been added to count multi-byte characters. tr(1)
remains resolutely single-byte, though the documentation talks of
growing multibyte support, with a -C complement option.
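To make the wc distinction concrete, here's a quick check; the -m
count assumes the shell is running in a UTF-8 locale:

```shell
# ← (U+2190) encodes to three bytes in UTF-8: 0342 0206 0220.
printf '←' | wc -c    # bytes: 3
printf '←' | wc -m    # characters: 1 in a UTF-8 locale, 3 under LC_ALL=C
```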
$ od -c <<<←
0000000 342 206 220 \n
0000004
$
$ tr \\220 \\221 <<<←
↑
$
Things like sed and grep work just fine in a UTF-8 world, though often
a bit more slowly; Unix moved to it some years ago.
$ sed 'y/\220/\221/' <<<←
←
$ sed y/←/x/ <<<←
x
$
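The same byte/character split shows up in what grep's `.` matches.
A quick check (the first command pins the locale, so its result is
deterministic; the second only prints 1 when run in a UTF-8 locale):

```shell
# Under the C locale, grep's `.` matches one byte, so the
# three-byte ← needs three dots:
printf '←\n' | LC_ALL=C grep -c '^...$'    # prints 1
# In a UTF-8 locale, `.` matches one character:
printf '←\n' | grep -c '^.$'               # 1 in a UTF-8 locale only
```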
For the odd occasion when I want to remove locale specifics, I use
~/bin/C as a shorthand.
$ cat ~/bin/C
#! /bin/sh
# LC_ALL has precedence over LANG according to POSIX[1], but we may as
# well stamp out any traces by setting LANG too.
# 1. The Open Group Base Specifications, Ch. 8 Environment Variables.
LC_ALL=C LANG=C exec "$@"
$
$ C sed 'y/←/x/' <<<←
sed: -e expression #1, char 8: strings for `y' command are different lengths
$ C sed 'y/←/xyz/' <<<←
xyz
$
Ken wrote:
But since UTF-8 has the excellent property that non-ASCII characters
look like just 8-bit characters but won't ever be mistaken for ASCII
(not a surprise, since it was designed by two of the original Unix
geeks)
Ken Thompson and Rob Pike. (Pike's not quite original, but nearly.)
Back in 2012, Rob recounted its creation, sketched out on a napkin in
a diner:
https://plus.google.com/+RobPikeTheHuman/posts/Rz1udTvtiMg
There's a comment by me there with a Google Streetview of the diner.
I jumped whole-hog into UTF-8 a few years ago, and I haven't regretted
it one bit.
No regrets here. You might find iconv(1) useful to convert existing
files from one encoding to another.
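For example (the filenames here are hypothetical; -f and -t name the
source and target encodings):

```shell
# Convert a Latin-1 file to UTF-8 (old.txt/new.txt are made-up names).
iconv -f ISO-8859-1 -t UTF-8 old.txt > new.txt
# iconv also works in a pipe; Latin-1 0351 is é, which becomes the
# two UTF-8 bytes 0303 0251:
printf '\351\n' | iconv -f ISO-8859-1 -t UTF-8 | od -c
```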
Cheers, Ralph.
_______________________________________________
Nmh-workers mailing list
Nmh-workers@nongnu.org
https://lists.nongnu.org/mailman/listinfo/nmh-workers