ietf-822

Re: 10646, UTF-2, etc.

1993-02-08 17:38:44
... For languages that have string representations built in, the
operations of "header files and libraries were replaced, a big recursive
'make' was done to rebuild the software" are equivalent to "the compiler
was rebuilt and some of the language was redefined"...

Basically true, although you can prepare for the possibility (for example,
by providing a configuration switch for 16/32-bit wide characters).  I'd
note, though, that you don't need to redefine the language unless you have
been foolish enough to embed implementation details into the language
definition.  The width of the wide-character datatype should be defined
as "wide enough to hold a character", not as "16 bits".
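A minimal C sketch of such a configuration switch (the WIDE_CHARS_32 macro and the wchar typedef are illustrative names, not anything from the original discussion); only the build-time switch decides the width, and client code sees just "wide enough to hold a character":

```c
#include <limits.h>

/* Hypothetical configuration switch: define WIDE_CHARS_32 at build
 * time to get at-least-32-bit wide characters; the default gives
 * at-least-16-bit ones.  Nothing in the interface says "16 bits". */
#ifdef WIDE_CHARS_32
typedef unsigned long wchar;    /* guaranteed >= 32 bits in C89 */
#else
typedef unsigned short wchar;   /* guaranteed >= 16 bits in C89 */
#endif
```

Code written against this typedef survives a change of width with nothing more than the recompile.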

The other language issue is that there are languages (e.g., C) that
handle strings on a delimiter and parsing basis, while others (e.g.,
Fortran or Pascal) handle them on a length basis.  There are, again,
arguments for both approaches, but if you have to parse and count
anyway, variable-length things such as UTF-2 encodings are much easier
to handle at an efficiency level consistent with other string
operations than if you must impose parsing and counting on languages
designed around allocated-length strings.
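To illustrate the parse-and-count point, here is a hedged C sketch (the function name is mine, not from any library): counting the characters in a UTF-2 (now called UTF-8) string costs only one extra mask-and-test per byte over the strlen-style scan a delimiter-based language does anyway, because continuation bytes are self-identifying:

```c
#include <stddef.h>

/* Count characters, not octets, in a NUL-terminated UTF-2 string.
 * Continuation bytes have the bit pattern 10xxxxxx; every other
 * byte begins a character. */
size_t utf_strlen(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```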

I think you've missed a point here.  Even in C, you will need changes
to your code to handle UTF-2 properly, in programs that need to know
the structure of strings.  But a great many programs do not need this.
Such programs, be they in C or in Pascal, need know nothing about UTF-2,
because to them the string is just a bunch of octets, and the fact that
there isn't a one-to-one mapping between octets and characters simply
isn't relevant.
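As an illustration (my sketch, not anything from the original message), a routine that copies a string byte for byte works on UTF-2 data without modification, because it never looks inside the octets:

```c
/* Copy a NUL-terminated string, octet by octet.  The routine needs
 * no knowledge of UTF-2: multi-byte characters are copied intact
 * simply because every one of their octets is copied. */
char *copy_string(char *dst, const char *src)
{
    char *p = dst;
    while ((*p++ = *src++) != '\0')
        ;       /* octets move; characters are never examined */
    return dst;
}
```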

A program that really needs to be intimate with the structure of strings,
be it in C or Pascal, will almost certainly end up converting from UTF-2
to a fixed-width representation at input time.
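A hedged sketch of what such input-time conversion might look like in C, assuming the original UTF-2 range of 16-bit values in at most three bytes; the function name and error convention are mine:

```c
/* Decode one UTF-2 sequence (1 to 3 octets) from s into a
 * fixed-width value at *out.  Returns the number of octets
 * consumed, or 0 on malformed input. */
int utf_decode(const unsigned char *s, unsigned long *out)
{
    if (s[0] < 0x80) {                       /* 0xxxxxxx: 7 bits */
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 &&             /* 110xxxxx 10xxxxxx */
        (s[1] & 0xC0) == 0x80) {
        *out = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 &&             /* 1110xxxx 10xxxxxx 10xxxxxx */
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
        *out = ((unsigned long)(s[0] & 0x0F) << 12)
             | ((unsigned long)(s[1] & 0x3F) << 6)
             | (s[2] & 0x3F);
        return 3;
    }
    return 0;                                /* malformed sequence */
}
```

Run once per character at input time, this yields an array of fixed-width values that the rest of the program can index and count directly.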

The only programs that will see an efficiency difference are a (I would
conjecture) small class where some understanding of string structure is
needed occasionally but full-scale conversion is not.

a little program ran around finding UTF disk files
and converting them in place.
  Interestingly, with essentially untyped files, I don't know how to do
this in UNIX except heuristically.

Please note that Plan Nine is not UNIX.  However, for this purpose, the
two are fairly similar, and I assume they did do something heuristic.
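One plausible form such a heuristic could take (this is my guess, not Plan 9's actual code): accept a file as UTF only if every byte with the high bit set belongs to a well-formed multi-byte sequence, a property random 8-bit text almost never has:

```c
#include <stddef.h>

/* Heuristic: a buffer "looks like" UTF-2 if all high-bit bytes
 * form well-formed 2- or 3-byte sequences.  Pure ASCII passes
 * trivially; typical Latin-1 or binary data fails quickly. */
int looks_like_utf(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        if (buf[i] < 0x80) { i++; continue; }     /* plain ASCII */
        if ((buf[i] & 0xE0) == 0xC0) {            /* 2-byte sequence */
            if (i + 1 >= len || (buf[i+1] & 0xC0) != 0x80)
                return 0;
            i += 2;
        } else if ((buf[i] & 0xF0) == 0xE0) {     /* 3-byte sequence */
            if (i + 2 >= len || (buf[i+1] & 0xC0) != 0x80
                             || (buf[i+2] & 0xC0) != 0x80)
                return 0;
            i += 3;
        } else
            return 0;                             /* stray byte */
    }
    return 1;
}
```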

He also noted that there was one visible benefit from switching to
UTF-2:  a lot of bugs disappeared.
  Any change in data representations followed by testing will focus the
mind and cause this type of improvement...

This improvement came *without* focussing of minds, as a purely mechanical
result of adopting a more robust encoding.  That's my understanding of
what he said, anyway.  It's largely a side issue, but it does indicate
that UTF-2's claimed minimization of damage to existing code is not
imaginary.

                                         Henry Spencer at U of Toronto Zoology
                                         henry@zoo.toronto.edu   utzoo!henry
