I have just read through the list archive, and noted that a few people
might have some doubts about how UTF-8 is used under Unix. They
apparently got confused by many of the features described in the Unicode
standard (BOM, line separator, etc.), and thereby completely forgot the
big UTF-8 prime directive under Unix:

    UTF-8 is ASCII compatible

Not only the encoding, but also the way it is used. So don't change
anything about how ASCII was used when introducing UTF-8, because only
then can UTF-8 truly replace ASCII in a realistic way.
This means the following:
- A UTF-8 Unix plain text file that contains only ASCII characters
(and this is the majority of files on Unix installations all over
the world) will *not* change a single bit.
- This means that there is never a BOM at the start of a file. BOMs could
be ignored by special new Unicode programs, but they are definitely
not ignored by the many existing ASCII programs. Adding a
BOM would break a tremendous number of things and would violate the
prime directive, as BOMs are definitely not ASCII compatible.
- This means that lines in UTF-8 plaintext files are terminated
in one and only one way: 0x0a = LF. Neither U+2028 (line separator,
introduced for use inside *.doc-style word processing binary files)
nor overlong UTF-8 sequences for LF such as 0xc0 0x8a must be accepted
as line terminators, otherwise we would get into the horrible
scenario that programs start to disagree about what exactly a line is
(with a whole load of new security risks associated). Programs
such as "wc -l" must work on UTF-8 files without any modification
whatsoever! There is no reason to change the Unix line semantics when
moving from ASCII to UTF-8. U+2028 is treated just like any other
character and has no special meaning in a Unix plaintext file.
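
To make the three points above concrete, here is a minimal Python
sketch. The function names are my own and purely illustrative, not
taken from any existing tool. It checks for a leading BOM, counts lines
the way "wc -l" does (by counting 0x0a bytes and nothing else), and
relies on Python's strict UTF-8 decoder to reject overlong sequences.

```python
BOM = b"\xef\xbb\xbf"  # U+FEFF encoded in UTF-8


def has_bom(data: bytes) -> bool:
    """True if the byte string starts with a UTF-8 encoded BOM."""
    return data.startswith(BOM)


def count_lines(data: bytes) -> int:
    """Count line terminators the way "wc -l" does: LF bytes only.

    This works on the raw bytes, because in UTF-8 every byte of a
    multi-byte sequence is >= 0x80, so 0x0a can only ever mean LF.
    """
    return data.count(b"\n")


def is_strict_utf8(data: bytes) -> bool:
    """True if data is well-formed UTF-8 (overlong forms are rejected)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    ascii_text = "hello, world\n"
    # An ASCII-only file is bit-for-bit the same file in UTF-8.
    print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))

    # U+2028 is just another character; only the final LF ends a line.
    sample = "one\u2028still line one\n".encode("utf-8")
    print(count_lines(sample))

    # The overlong two-byte form of LF is not valid UTF-8 at all.
    print(is_strict_utf8(b"\xc0\x8a"))
```

One thing to watch out for in Python specifically: str.splitlines()
does treat U+2028 as a line boundary, so a program that wants Unix line
semantics should split on "\n" explicitly or work on the bytes.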
How do applications find out that files are now in UTF-8? Simple
applications such as cat and echo do not have to. For them UTF-8 is
just like ASCII. However, programs which count characters, position
cursors, determine character classes, use regexp, etc. have to know
about the file encoding, and there are well-established mechanisms to do
that: they are told, preferably via the established POSIX locale
mechanisms (LC_CTYPE, LANG), or otherwise via command line switches.
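
As a rough illustration of "being told", the following Python sketch
looks up the character set named by the locale environment. The helper
names are hypothetical. It assumes the common locale name layout
language[_territory][.codeset][@modifier] and the POSIX precedence
LC_ALL, then LC_CTYPE, then LANG. Accepting a bare "UTF-8" value is my
own assumption, made only so that a setting without a language part is
also recognised. A real program would normally call setlocale() and ask
the C library instead of parsing the variables itself.

```python
import os


def charset_from_env(env=None):
    """Return the codeset named by the locale variables, or None."""
    if env is None:
        env = os.environ
    # First non-empty variable wins; unset and empty are both skipped.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            break
    else:
        return None
    value = value.split("@", 1)[0]  # drop a modifier such as "@euro"
    if "." in value:
        return value.split(".", 1)[1]
    # Assumption: also accept a bare codeset such as "UTF-8".
    if value.lower().replace("-", "") == "utf8":
        return value
    return None  # e.g. "C" or "POSIX": no codeset stated


def is_utf8_locale(env=None):
    """True if the environment asks for UTF-8."""
    codeset = charset_from_env(env)
    return codeset is not None and codeset.lower().replace("-", "") == "utf8"


def count_chars(data: bytes, charset: str) -> int:
    """Character count depends on the encoding; byte count does not."""
    return len(data.decode(charset))


if __name__ == "__main__":
    print(charset_from_env(), is_utf8_locale())
```

The same five-letter word with one accented letter is 6 bytes long in
UTF-8, so a program that counts characters gets 5 or 6 depending on
which encoding it has been told to assume.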
Ideally, all that should be necessary to turn a Unix installation into a
pure UTF-8 system is the addition of the line

    export LC_CTYPE=UTF-8

in /etc/profile, plus conversion of the existing ISO 8859, JIS, KOI8,
etc. files and file names. Editors and terminal emulators will then
activate their UTF-8 modes, email software will convert received
messages from the indicated MIME character set into UTF-8 before saving
them as a file, etc. We are not quite there yet, but that should be the
long-term goal.
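
The conversion of existing files mentioned above could look roughly
like the following Python sketch. Both function names are made up for
this example, and the caller has to know the legacy character set of
each file, since it cannot be detected reliably from the bytes alone.
File names would need the same treatment, which is not shown here.

```python
def to_utf8(data: bytes, source_charset: str) -> bytes:
    """Re-encode bytes from a legacy character set into UTF-8."""
    return data.decode(source_charset).encode("utf-8")


def convert_file(path: str, source_charset: str) -> bool:
    """Convert one file in place; return True if its content changed.

    Files that contain only ASCII come out byte-for-byte identical,
    so they are not rewritten at all.
    """
    with open(path, "rb") as f:
        original = f.read()
    converted = to_utf8(original, source_charset)
    if converted == original:
        return False
    with open(path, "wb") as f:
        f.write(converted)
    return True


if __name__ == "__main__":
    # 0xe9 is e-acute in ISO 8859-1 and becomes 0xc3 0xa9 in UTF-8.
    print(to_utf8(b"caf\xe9\n", "iso-8859-1"))
```

Note that running such a conversion twice on the same file would
corrupt it, so a real script should keep track of which files have
already been converted.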
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>