ietf-822

Re: printable wide character (was "multibyte") encodings

1993-01-13 21:37:00
[regarding UTF-2]

> (1) It is still a variable-length encoding...

Indeed so.  I don't see any good way around that without breaking
compatibility.  In fact, I'm not sure I see much way around it at all,
since I greatly doubt that the English-language community is going to
want to double the volume of all its text.  We will never again
have fixed-width characters; the only real decision is whether we
want to force any single message to be fixed-width... which will
cause trouble in contexts involving more than one message.
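
For concreteness, here is roughly what such an encoder looks like.  This
is the FSS-UTF bit layout as I understand it (the scheme UTF-2 names);
the function name, the samples, and the 16-bit cutoff are mine, for
illustration only:

#include <stdio.h>

/*
 * Sketch of the variable-length scheme: ASCII stays one octet,
 * everything else grows.  Covers 16-bit values only; the real
 * scheme extends further with the same pattern.
 */
static int
utf2_encode(unsigned long c, unsigned char *buf)
{
    if (c < 0x80) {                 /* ASCII: one octet, unchanged */
        buf[0] = c;
        return 1;
    }
    if (c < 0x800) {                /* small alphabets: two octets */
        buf[0] = 0xC0 | (c >> 6);
        buf[1] = 0x80 | (c & 0x3F);
        return 2;
    }
    /* rest of the 16-bit space, ideographs included: three octets */
    buf[0] = 0xE0 | (c >> 12);
    buf[1] = 0x80 | ((c >> 6) & 0x3F);
    buf[2] = 0x80 | (c & 0x3F);
    return 3;
}

int
main(void)
{
    unsigned long samples[] = { 0x41, 0x3B1, 0x65E5 };  /* 'A', Greek alpha, an ideograph */
    unsigned char buf[3];
    int i;

    for (i = 0; i < 3; i++)
        printf("U+%04lX -> %d octet(s)\n", samples[i],
            utf2_encode(samples[i], buf));
    return 0;
}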

> (2) It is ASCII-optimized.  To the degree to which a character [sub]set
> (i.e., a 10646 "row") is close to ASCII, it gets a minimum number of
> octets per character (one for ASCII itself).  Character [sub]sets that
> are quite different, e.g., Asian ideographic character sets, are fairly
> severely penalized, ending up not in two octets but in three, four, or
> more...

Actually, the people who suffer most from this are the Europeans.  The
Asian ideographic sets are *already* going to have to use 2-3 octets
per character.  The ones most hurt by UTF-2 are those with small
non-ASCII alphabets.

However, given the massive dominance of ASCII in the computer world,
I don't think it's awful Anglocentrism to say that optimization for ASCII
is a reasonable idea.  Refusing to optimize for ASCII, in a spirit of
being "impartial", amounts to cutting off your nose to spite your face:
defining encodings that are equally painful for all.
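
To make the penalty structure explicit, here is the same scheme reduced
to a width table (back-of-the-envelope; I'm taking Latin-1 and the
EUC-style two-octet encodings as the status quo, and the function name
is again mine):

/*
 * Octets per character under UTF-2, versus rough status quo:
 *
 *   script                      today   UTF-2   penalty
 *   ASCII English                 1       1        0%
 *   Latin/Greek/Cyrillic etc.     1       2      100%
 *   CJK ideographs                2       3       50%
 *
 * Hence the point above: the small non-ASCII alphabets are hurt
 * proportionally more than the ideographic sets are.
 */
static int
utf2_width(unsigned long c)
{
    if (c < 0x80)  return 1;    /* ASCII */
    if (c < 0x800) return 2;    /* small alphabets */
    return 3;                   /* rest of the 16-bit space */
}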

> ... as the network becomes heavily used in Asia,
> it seems to doom us to a second transition, presumably to unencoded
> 10646.

I don't understand this -- are you saying that the Asian users are going
to become so overwhelmingly dominant in the network that the rest of us
will have to accept 100% overhead in everything we send?  I don't see
that second transition ever happening.
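
To spell out where the 100% figure comes from, a throwaway calculation
with a made-up message:

#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* a pure-ASCII message: one octet per character in UTF-2,
     * two per character in fixed 16-bit 10646 */
    const char *msg = "please do not double the volume of all my text";
    unsigned long n = (unsigned long)strlen(msg);

    printf("characters:            %lu\n", n);
    printf("octets, UTF-2:         %lu\n", n);
    printf("octets, fixed 16-bit:  %lu  (100%% overhead)\n", 2 * n);
    return 0;
}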

Moreover, if it does, it will be no more painful -- indeed, I'd say rather
less painful -- for being postponed.  We will have already solved most of
the problems of handling 10646, leaving the transmission encoding as a
relatively minor issue.  (Plan Nine started out with UTF-1 and has changed
to UTF-2, as I understand it, fairly painlessly.)
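
Part of why the encoding layer ends up a minor issue: unpacking UTF-2
back into 16-bit 10646 values is a few shifts and masks.  A sketch
(illustrative name, no error checking, 16-bit range only):

/* Inverse of the encoder above: reads one character starting at s,
 * stores its 10646 value in *c, and returns the number of octets
 * consumed. */
static int
utf2_decode(const unsigned char *s, unsigned long *c)
{
    if (s[0] < 0x80) {                      /* one octet: ASCII */
        *c = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {            /* two octets */
        *c = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    /* three octets */
    *c = ((unsigned long)(s[0] & 0x0F) << 12)
        | ((unsigned long)(s[1] & 0x3F) << 6)
        | (s[2] & 0x3F);
    return 3;
}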

>> They are properly thought of as peers, and it is reasonable and proper
>> to consider something like "10646-UTF-2" to be a character set...
> ...
> Again, I haven't seen the final formal document, but had understood
> that UTF-2 was in the "appendix not part of the standard" category...

Wrong twice, actually. :-)  First, UTF-1 is what's in the appendix, and
it's another poorly-designed encoding that I wouldn't recommend.  I don't
think UTF-2 is slated to appear in the same book at all; it's too recent.

The second wrongness is that I didn't *say* ISO had blessed UTF-2 as a
peer to fixed-width 10646, only that it is reasonable and proper to view
it that way.

> Perhaps like Steve, I don't believe that there is any path of no
> resistance, or even a path of very low resistance...

I would urge you to read Rob Pike's paper in the Usenix proceedings
(an early version of it was available for FTP on research.att.com as
part of the Plan Nine documentation; my apologies, I don't have all
the details on location etc. handy).  Until I read it, *I* thought
the transition was going to be agonizingly painful too.  I don't now;
it won't be fun, but *if* we use UTF-2, it won't be too nasty.  There
is no need to speculate on this; there is relevant experience already.
Learning from experience is supposed to be the Internet's forte...

                                         Henry Spencer at U of Toronto Zoology
                                          henry@zoo.toronto.edu   utzoo!henry