ietf-822

Re: 10646, UTF-2, etc.

1993-02-09 11:22:13
[This is mostly a private note to Masataka, but given his public
claim about what "people in the real world" think, I feel the urge
to cc the ietf-822 list.]

In <9302091525.AA21060@necom830.cc.titech.ac.jp>,
Masataka Ohta wrote:
[Henry Spencer had written:]
>> Not on my pdp11, you can't!

> Plese do read and quote my mail appropriately.
>
>       With 16 bit wchar_t, you can write
>               array[(unsigned)char_code]
>       and your program should work on most modern machines.
>
> Are you saying your pdp11 is a major modern machine?

It's foolish to unilaterally disregard the existence of machines
which one does not personally consider "major and modern" or to
deliberately write code which will not run on them.  Habitually
writing code which is portable to popular yet older machines
makes one's code more portable to newer machines as well.
(I am admittedly biased; I also use a PDP-11 daily.)

But that's beside the point, because the IBM PC is an allegedly
modern machine which does share the PDP-11's restriction of
single objects to <= 64K.

But that's beside the point, because the point of the argument is
that indexing by a wchar_t is even less likely to work if a
wchar_t is 32 bits, so one probably shouldn't be indexing by
wchar_t's in the first place.

But that's beside the point, because whether or not a wchar_t is
16 or 32 bits, and whether or not using it as an array subscript
works, an array with at least 65536 elements is likely to be a
big waste of space and one should be seriously considering using
sparse array techniques anyway.
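One such sparse-array technique is a two-level (paged) table: split a 16-bit code into a high byte and a low byte, and allocate each 256-entry page only when a code in it is first stored. This is a minimal sketch under that assumption; the type and function names here are illustrative, not from any real library.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGES     256
#define PAGE_SIZE 256

typedef struct {
    int *page[PAGES];   /* each entry NULL until a code in that page is set */
    int dflt;           /* value reported for codes never stored */
} sparse_table;

static void sparse_set(sparse_table *t, unsigned code, int value)
{
    unsigned hi = (code >> 8) & 0xFF, lo = code & 0xFF;
    if (t->page[hi] == NULL) {
        int i;
        t->page[hi] = malloc(PAGE_SIZE * sizeof(int));
        if (t->page[hi] == NULL)
            abort();                    /* sketch: no graceful recovery */
        for (i = 0; i < PAGE_SIZE; i++)
            t->page[hi][i] = t->dflt;   /* fresh page starts out "empty" */
    }
    t->page[hi][lo] = value;
}

static int sparse_get(const sparse_table *t, unsigned code)
{
    unsigned hi = (code >> 8) & 0xFF;
    return t->page[hi] ? t->page[hi][code & 0xFF] : t->dflt;
}
```

A property table that touches only a few scripts then costs a few allocated pages rather than a full 65536-element array, which is the space saving being argued for above.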

But in any case, this misses the point.  You can always write code that
depends on details, if you try.  The point is that if you make a modest
effort to write clean code, then 16 bits vs 32 bits *is* a detail...
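The "clean code" point can be sketched concretely: if the character type is named in exactly one place and nothing else assumes its width, then changing 16 bits to 32 bits touches one line. CHAR_T and wstrlen_ below are invented names for illustration, not any standard API.

```c
#include <assert.h>
#include <stddef.h>

/* The character type is defined once; change it to unsigned long
 * (or unsigned short) and nothing below needs to move. */
typedef unsigned short CHAR_T;

/* Length of a zero-terminated wide string, with no assumption
 * anywhere about how many bits a CHAR_T holds. */
static size_t wstrlen_(const CHAR_T *s)
{
    const CHAR_T *p = s;
    while (*p != 0)
        p++;
    return (size_t)(p - s);
}
```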

> You are too much pedantic.
> People in the real world does not think so.

Not at all.  Perhaps I am a pedant, or not in the real world
either, but I certainly think so (i.e. that 16 vs. 32 bits for a
wchar_t is a detail, and that this argument is silly and misses
the point).

> It was the assumptions of Unicoders that a datatype for characters is
> small enough to be able to be represented with 16 bits.
> Now, the assumption was completely unfounded and totaly wrong in
> several ways. But, how can you say that those who use Unicode do not
> assume 16bitness for a datatype for characters?

Your bias against Unicode is well known, but constructing
imaginary arguments against it serves no one's purpose.

First of all, given Unicode's definition of "unification," with
which you happen not to agree, choosing 16 bits was perfectly
reasonable.  (In my opinion, using 32 bits for *character* data
is absolutely ludicrous, but that's beside the point.)

Secondly, Henry's original statement was that

        Rob Pike... said (roughly) "the hard part is making the
        code understand that octets and characters are not
        synonymous".  Once that is done, the details... are very
        much secondary, particularly if libraries etc. are designed
        to hide the implementation details properly.

This was an assertion neither that there would be precisely zero
problems encountered anywhere when making a hypothetical move
from 16 to 32-bit wide characters, nor that no programmers
anywhere might occasionally have it somewhere in their heads that
wide characters might be "sixteen" rather than merely "a lot of"
bits.  It was merely a suggestion (which ought to be accurate)
that once we've taken the plunge, pulled teeth, and divorced
ourselves from the "one character == 8 bits" notion, other
changes should be relatively painless, and "particularly if
libraries etc. are designed to hide the implementation details
properly".

Rewriting code to allow wide characters can be a lot of work;
even programmers who know better than to hardwire data type sizes
find it very easy to make assumptions based on 8-bit characters.
But, while going through the pain of removing those assumptions,
only a fool would introduce new assumptions that 16 bits/character
were equally inviolate.  (Let me hasten to admit, alas, that
there are however a lot of foolish programmers out there.)
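The kind of assumption being described often hides in innocent-looking code. As an invented illustration (WC and the function names are not from the original mail), here is a routine with "one character == one byte" baked in, next to a rewrite that derives the count from the type and so survives a change of character width:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short WC;      /* today's 16-bit wide char; might widen */

/* The bad habit: treats a byte count as a character count, which is
 * only true while characters are 8 bits. */
static size_t bad_count(const WC *s, size_t nbytes)
{
    (void)s;                    /* unused; kept for a matching signature */
    return nbytes;              /* "a character is a byte" */
}

/* No size assumption: the element count follows from the type itself,
 * whether WC is 16 bits, 32 bits, or something else. */
static size_t good_count(const WC *s, size_t nbytes)
{
    return nbytes / sizeof *s;
}
```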

                                                Steve Summit
                                                scs@eskimo.com
