Correct use of UTF-8 under Unix

I have just read through the list archive, and noted that a few people
might have some doubts about how UTF-8 is used under Unix. They
apparently got confused by many of the features described in the Unicode
standard (BOM, line separator, etc.), and thereby completely forgot the
big UTF-8 prime directive under Unix:

  UTF-8 is ASCII compatible

Not only the encoding, but also the use of it. So don't change anything
about how ASCII was used when introducing UTF-8, because only then can
UTF-8 truly substitute for ASCII in a realistic way.

This means the following:

  - A UTF-8 Unix plain text file that contains only ASCII characters
    (and this is the majority of files on Unix installations all over
    the world) will *not* change a single bit.

  - This means that there is never a BOM at the start of a file. BOMs could
    be ignored by special new Unicode programs, but they are definitely
    not ignored by the many existing ASCII programs. Adding a
    BOM would break a tremendous number of things and would violate the
    prime directive, as BOMs are definitely not ASCII compatible.

  - This means that lines in UTF-8 plaintext files are terminated
    in one and only one way: 0x0a = LF. Neither U+2028 (line separator,
    introduced for use inside *.doc-style word processing binary files)
    nor overlong UTF-8 sequences for LF such as 0xc0 0x8a must be accepted
    as line terminators, otherwise we would get into the horrible
    scenario that programs start to disagree about what exactly a line is
    (with a whole load of new security risks associated). Programs
    such as "wc -l" must work on UTF-8 files without any modification
    whatsoever! There is no reason to change the Unix line semantics when
    moving from ASCII to UTF-8. U+2028 is treated just like any other
    character and has no special meaning in a Unix plaintext file.
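
The three points above can be checked mechanically. What follows is
only a rough Python sketch, not a reference implementation: the helper
name count_lines is made up for this illustration, and it assumes (as
"wc -l" does) that a line is whatever ends in the byte 0x0a.

```python
import codecs


def count_lines(data: bytes) -> int:
    """Count lines the way "wc -l" does: one per 0x0a byte, nothing else."""
    return data.count(b"\x0a")


# 1. Pure ASCII text is bit-for-bit identical when stored as UTF-8.
ascii_text = "hello world\n"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# 2. A UTF-8 BOM is three non-ASCII bytes (0xef 0xbb 0xbf), so an
#    ASCII-only program sees junk in front of the first line.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert all(byte >= 0x80 for byte in codecs.BOM_UTF8)

# 3. U+2028 is just another character; only LF terminates a line.
text = "first\u2028still first\nsecond\n"
assert count_lines(text.encode("utf-8")) == 2

# An overlong encoding of LF (0xc0 0x8a) contains no 0x0a byte, and a
# strict UTF-8 decoder refuses it instead of turning it into a newline.
overlong_lf = b"\xc0\x8a"
assert count_lines(overlong_lf) == 0
try:
    overlong_lf.decode("utf-8")
    overlong_rejected = False
except UnicodeDecodeError:
    overlong_rejected = True
assert overlong_rejected
```

By contrast, Python's own str.splitlines() does break on U+2028, which
is a small illustration of how programs can come to disagree about
what a line is.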

How do applications find out that files are now in UTF-8? Simple
applications such as cat and echo do not have to. For them UTF-8 is
just like ASCII. However, programs which count characters, position
cursors, determine character classes, use regexp, etc. have to know
about the file encoding, and there are well-established mechanisms for
that: they are told, preferably via the POSIX locale variables
(LC_CTYPE, LANG), or otherwise via command line switches.
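
As an illustration of how a program might be "told", here is a small
Python sketch that looks at the locale environment variables. It is a
simplification under stated assumptions: the function names are
hypothetical, LC_ALL is included because POSIX lets it override
LC_CTYPE and LANG, and a bare value such as "UTF-8" is treated as a
codeset name. A real program would more likely call
setlocale(LC_CTYPE, "") and then ask nl_langinfo(CODESET).

```python
import os


def ctype_locale_name(env) -> str:
    """Return the locale name that governs LC_CTYPE (POSIX precedence)."""
    # LC_ALL overrides LC_CTYPE, which overrides LANG; empty values count
    # as unset.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return "C"  # POSIX default when nothing is set


def codeset_of(locale_name: str) -> str:
    """Pick the codeset out of language[_territory][.codeset][@modifier]."""
    base = locale_name.split("@", 1)[0]
    if "." in base:
        return base.split(".", 1)[1]
    # Assumption: with no ".codeset" part, treat the whole name as the
    # candidate, so that a bare "UTF-8" is recognised.
    return base


def wants_utf8(env) -> bool:
    """True if the environment asks for UTF-8 character handling."""
    codeset = codeset_of(ctype_locale_name(env))
    return codeset.upper().replace("-", "") == "UTF8"


if __name__ == "__main__":
    print("UTF-8 mode:", wants_utf8(os.environ))
```

Keeping the decision in a function that takes the environment as an
argument makes it easy to try out with a plain dictionary.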

Ideally, all that should be necessary to turn a Unix installation into a
pure UTF-8 system is the addition of the line

  export LC_CTYPE=UTF-8

in /etc/profile, plus conversion of the existing ISO 8859, JIS, KOI8,
etc. files and file names. Editors and terminal emulators will then
activate their UTF-8 modes, email software will convert received
messages from the indicated MIME character set into UTF-8 before saving
them as a file, etc. We are not quite there yet, but that should be the
long-term goal.
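
The conversion step itself is simple once the source character set is
known. A minimal Python sketch, assuming the charset label (for
instance from a MIME header) is correct; the function name to_utf8 is
invented for this example:

```python
def to_utf8(data: bytes, source_charset: str) -> bytes:
    """Re-encode bytes from a legacy character set into UTF-8."""
    # Strict mode: bytes that are invalid in the source charset raise an
    # error rather than being silently replaced.
    return data.decode(source_charset).encode("utf-8")


greeting_de = "Gr\u00fc\u00dfe"                    # German, fits ISO 8859-1
greeting_ru = "\u041f\u0440\u0438\u0432\u0435\u0442"  # Russian, fits KOI8-R

assert to_utf8(greeting_de.encode("iso-8859-1"), "iso-8859-1") == \
    greeting_de.encode("utf-8")
assert to_utf8(greeting_ru.encode("koi8_r"), "koi8_r") == \
    greeting_ru.encode("utf-8")

# ASCII-only content passes through without changing a single bit.
assert to_utf8(b"plain ASCII\n", "iso-8859-1") == b"plain ASCII\n"
```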

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>