perl-unicode

Re: Correct use of UTF-8 under Unix

1999-11-02 07:04:28
Karlsson Kent - keka wrote on 1999-11-02 13:21 UTC:
(Note: I don't subscribe to perl-unicode(_at_)perl(_dot_)org, only to
linux-utf8(_at_)nl(_dot_)linux(_dot_)org, and I don't have Markus's original
message that is quoted below.)


:   - This means that lines in UTF-8 plaintext files are terminated
:     in one and only one way: 0x0a = LF.

That is not true.  "lines" in UTF-8 text files may be terminated by 
LINE FEED, CARRIAGE RETURN, CARRIAGE RETURN+LINE FEED, NEXT LINE,
or end-of-file, or be separated by LINE SEPARATOR or PARAGRAPH SEPARATOR
(which is in some sense 'stronger' than line separator).

The crucial bit of my original message that you missed was:

  I have just read through the list archive, and noted that a few people
  might have some doubts about how UTF-8 is used under Unix. They
  apparently got confused by many of the features described in the Unicode
  standard (BOM, line separator, etc.), and thereby completely forgot the
  big UTF-8 prime directive under Unix:

    UTF-8 is ASCII compatible

  Not only the encoding itself, but also its use. So do not change
  anything about how ASCII was used when you introduce UTF-8, because
  only then can UTF-8 truly substitute for ASCII in a realistic way:

  This means the following:

    - A UTF-8 Unix plain text file that contains only ASCII characters
      (and this is the majority of files on Unix installations all over
      the world) will *not* change a single bit.

  [...]

There are many nice ideas written up in the Unicode standard and the
associated technical reports, but they are not dogma: each idea has to
be critically reviewed before you even consider introducing it into an
existing environment. It should quickly become clear to the alert
reader of these documents that many of the mechanisms described there
(most notably the byte-order mark and the new-line semantics) are
irrelevant for the use of UTF-8 as a backwards-compatible migration
path for ASCII plaintext files on Unix systems.

Unix never had any new line ambiguity. It was always LF and only LF. It
would be really foolish for us to introduce a brand new new-line
ambiguity (via say the line separator) on Unix systems just because we
read about shiny new alternative ways in a Unicode technical report.

The original AT&T Bell Labs developers of Unix studied as far back as
1992 how ISO 10646 is best used on Unix-style systems. They concluded
that ASCII should be replaced completely by UTF-8 on their experimental
Unix successor Plan 9, and reported on the excellent practical
experience gained in this process in a now legendary USENIX paper,
which I am sure you are all well familiar with:

ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/UTF-8-Plan9-paper.ps.gz

If the outside world does something different (they always have, you
listed the three most popular other newline conventions CR, CRLF, and
NL, yourself), then we will continue to convert, either automatically or
manually, as appropriate.

The handling of newline ambiguity by C under Unix has always been a
no-op: C under Unix completely ignores the "b" mode flag of fopen().
The "b" flag is a hack for the rest of the world, to allow it to handle
its Unix incompatibilities.

I have nothing against introducing, alongside the normal "plain text"
format, a new text file format that we could call "unformatted plain
text". It would be a stream of characters interrupted by Unicode
paragraph separator characters. The PS and LS characters would play
exactly the same role as <P> and <BR> in HTML, or \par and
\hfil\break in TeX.

Such an additional file type notion would indeed be interesting to have
available, but it would not be used for formatted plain text files such
as 

  - software source code
  - configuration files
  - shell scripts
  - everything sent to standard output

etc. for obvious reasons of backwards compatibility. An unformatted
text format (and a whole range of new tools or new modes of existing
tools to support handling it) would however be very convenient for
file types such as

  - HTML/SGML/XML
  - TeX
  - nroff

where the formatting of the plain-text file is discarded anyway. It
would save us from having to press paragraph-reformat so frequently in
editors, and it would make diff output smaller, because paragraphs
would no longer contain formatting indicators such as LF that have to
be rearranged throughout the entire paragraph if you change just a
single word. For normal "plain text" files, the process writing a
paragraph fixes the positions of the line breaks; for "unformatted
plain text" files, the process reading the paragraphs is responsible
for placing line breaks. Just as in TeX, HTML, etc.

There is nothing wrong with having these notepad-style unformatted
plain text files well supported under Unix too, but it is important to
make clear that this is an entirely new file type with no relationship
to the existing plaintext notion.

Distinguishing the two file types is easy: if a file contains at least
one LF character, it is a normal plain text file; if it does not
contain a single LF character (but zero or more PS and/or LS
characters), then it is a new-style unformatted plaintext file. Either
way, you will find out soon enough when reading the file, at the end of
the first line (formatted) or paragraph (unformatted).

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>