ietf-822

Re: 10646, UTF-2, etc.

1993-02-07 07:14:11
In his presentation of the Plan Nine UTF-2 work at the recent Usenix
conference, Rob Pike made an interesting point that is quite relevant
to the assorted discussions about >8-bit character sets.  He said
(roughly) "the hard part is making the code understand that octets
and characters are not synonymous".  Once that is done, the details --
how the two are related, whether a character is 16 or 32 bits, etc. --
are very much secondary, particularly if libraries etc. are designed
...

Henry,
  A couple of speeches are appropriate here.  Since the first is often
repeated, I'll cite it by title -- "All operating systems are not UNIX
and all languages are not C" -- and assume that everyone can fill in
most of the details.  
  The second is that Unicode, in its original, published,
pre-integration-compromise version, is conceptually different from 10646:
there were no "combining sequences" or anything else that implied
variable-length characters in "flat" (pre UTF or UTF-2 or...) form.

  The implication of the second, as others have pointed out, is that,
with "full" 10646 it is quite hard, if not impossible, to tell where a
character ends by purely lexical means: at the cost of some other
difficulties (probably including forcing 32 bits) SC2 could have created
an arrangement by which certain bit masks could identify combining
sequences.  But they didn't, and, unless something has been introduced
subsequent to the 2nd DIS, knowing which is which requires reference to
the semantics of the coding tables.  And the coding tables had best be
current--if 10646.2 appears, one needs those tables too.
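  To make that dependence concrete, here is a minimal C sketch.  The
predicate is_combining() is my name, not anything from the standard; it
stands for a table that would have to be generated from the current
10646 code tables and regenerated whenever they change:

    #include <stddef.h>

    /* Hypothetical predicate, generated from the current 10646
       code tables; no bit mask on the codes themselves can answer
       this question. */
    extern int is_combining(unsigned long ucs);

    /* Extent, in codes, of the "character" starting at s[0]:
       the base code plus any trailing combining codes.
       Assumes len >= 1. */
    size_t char_extent(const unsigned long *s, size_t len)
    {
        size_t i = 1;

        while (i < len && is_combining(s[i]))
            i++;
        return i;
    }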
  Now, we could profile Internet use of 10646 to mean 10646.1 and no
combining characters, which would simplify things immensely and bring us
back toward the "Unicode, not 10646" discussions of a year ago.  But, as
soon as we do that, we lose the "universality" claim for the UCS and
there is a rational strawman alternative: stick with 8859-n and
2022-JP, and decide that seriously multilingual texts (not common in
the Internet mail I've seen) should be handled either by more 2022
profiles or within SGML or other high-end versions of the
Richtext concept, not as plain text.  That would, of course, be pretty
close to where MIME sits today, plus or minus Unicode/IETF-profiled-
10646, for plain text characters.

If one doesn't exclude combining sequences, then there are no obvious
ways to represent "character I don't understand" on the screen, either
by the traditional approach of having an "unavailable glyph" convention
or by switching to, e.g., a mnemonic representation--at least, again,
without having all of the semantics of the code tables available.
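Continuing the sketch above (all of these names are hypothetical): even
the fallback path needs char_extent(), and hence the tables, because
one "unavailable glyph" box has to swallow a whole combining sequence:

    extern int  have_glyph(const unsigned long *s, size_t n);
    extern void draw_glyph(const unsigned long *s, size_t n);
    extern void draw_box(void);          /* the "unavailable glyph" */

    void display(const unsigned long *s, size_t len)
    {
        size_t i = 0, n;

        while (i < len) {
            n = char_extent(s + i, len - i);  /* needs the tables */
            if (have_glyph(s + i, n))
                draw_glyph(s + i, n);
            else
                draw_box();         /* one box per whole sequence */
            i += n;
        }
    }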

We also need to keep in mind that the putative simplicity of C arises
partially by pushing all of the hard stuff off into libraries and header
files.  For languages that have string representations built in, the
operations of "header files and libraries were replaced, a big recursive
'make' was done to rebuild the software" are equivalent to "the compiler
was rebuilt and some of the language was redefined".  One can debate the
merits of the two approaches endlessly; I'm not going to repeat that
debate here.  Curiously, a number of high-performance C implementations
unfold the standard string library functions inline, making them
functionally indistinguishable from languages with built-in strings:
even if you can replace the string functions and associated header
files at all, the replacements lose that inline expansion, and
performance drops dramatically.
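For what it is worth, the replacement approach looks something like
this in C; CHAR and Strlen are my names, not Plan 9's.  A compiler
that unfolds the standard strlen() inline can do nothing of the sort
for the replacement unless inline expansion is also turned off (e.g.,
GCC's -fno-builtin), at which point every string operation pays for a
real function call:

    #include <stddef.h>

    typedef unsigned short CHAR;      /* a 16-bit character type */

    /* Replacement for strlen(), counting 16-bit characters rather
       than octets.  Unlike the standard strlen(), this is an
       ordinary function call, not an inline expansion. */
    size_t Strlen(const CHAR *s)
    {
        const CHAR *p = s;

        while (*p != 0)
            p++;
        return (size_t)(p - s);
    }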

The other language issue is that there are languages (e.g., C) that
handle strings on a delimiter and parsing basis, while others (e.g.,
Fortran or Pascal) handle them on a length basis.  There are, again,
arguments for both approaches, but, if you have to parse and count
anyway, variable-length things such as UTF-2 encodings are lots easier
to handle at an efficiency level consistent with other string activities
than if you have to impose parsing and counting on languages that are
designed around allocated-length strings.
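A small illustration of the point: in C, the scan-to-delimiter loop is
already there, and making it count characters rather than octets in a
UTF-2 string means skipping the continuation octets (those of the form
10xxxxxx), one extra test.  A length-based language has to bolt this
parse onto its existing length bookkeeping:

    #include <stddef.h>

    /* Count characters, not octets, in a NUL-delimited UTF-2
       string: every octet except the 10xxxxxx continuations
       begins a character. */
    size_t utf_strlen(const char *s)
    {
        size_t n = 0;

        for (; *s != '\0'; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }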

a little program ran around finding UTF disk files
and converting them in place.
   Interestingly, with essentially untyped files, I don't know how to do
this in UNIX except heuristically.
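The obvious heuristic is to ask whether the octets form well-formed
UTF-2 sequences, but it is only a heuristic: pure ASCII passes
trivially, and some 8859-n text will pass by accident.  A sketch,
handling only the one- to three-octet sequences for brevity:

    #include <stdio.h>

    int looks_like_utf(FILE *fp)
    {
        int c;
        int follow = 0;    /* continuation octets still expected */

        while ((c = getc(fp)) != EOF) {
            if (follow > 0) {
                if ((c & 0xC0) != 0x80)
                    return 0;      /* continuation octet missing */
                follow--;
            } else if ((c & 0x80) == 0) {
                ;                  /* plain ASCII */
            } else if ((c & 0xE0) == 0xC0) {
                follow = 1;        /* two-octet sequence */
            } else if ((c & 0xF0) == 0xE0) {
                follow = 2;        /* three-octet sequence */
            } else {
                return 0;          /* longer forms omitted here */
            }
        }
        return follow == 0;
    }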

He also noted that there was one visible benefit from switching to
UTF-2:  a lot of bugs disappeared.
   Any change in data representations followed by testing will focus the
mind and cause this type of improvement.  Changing the byte order in
which things are stored would have an even more intense therapeutic
effect.  Do you suggest that there is something else going on here?  If
it is that [original] UTF is bad news, I don't see anyone advocating its
use on this list anyway.

    --john
