In his presentation of the Plan Nine UTF-2 work at the recent Usenix
conference, Rob Pike made an interesting point that is quite relevant
to the assorted discussions about >8-bit character sets. He said
(roughly) "the hard part is making the code understand that octets
and characters are not synonymous". Once that is done, the details --
how the two are related, whether a character is 16 or 32 bits, etc. --
are very much secondary, particularly if libraries etc. are designed
to hide the implementation details properly.

Untrue. Whether a character is 16 bits or 32 bits is not an
implementation detail.
With a 16-bit wchar_t, you can write
array[(unsigned)char_code]
and your program should work on most modern machines.

With a 32-bit wchar_t, it is often impossible to write
array[(unsigned)char_code]
because such an array would need 2^32 entries, and hardly any machine
has more than 4GB of virtual memory.
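
For illustration, a minimal sketch of the idiom in C; the table name
attr and the function lookup are hypothetical, standing for any
per-character attribute table:

    /* Per-character attribute table indexed directly by character code.
     * With a 16-bit code the table has 65,536 entries (64KB here),
     * which fits comfortably in memory.  The same idiom with a 32-bit
     * code would need 4,294,967,296 entries, beyond the virtual memory
     * of almost any machine. */
    static unsigned char attr[1 << 16];

    unsigned char lookup(unsigned short char_code)
    {
        return attr[(unsigned)char_code];
    }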

The Plan 9 people changed from the old 10646-appendix UTF to UTF-2 in
an afternoon, precisely because both encodings carry the same 16-bit
character codes.
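
For reference, a minimal sketch of how a 16-bit code is laid out in
UTF-2 bytes (the same layout later standardized as UTF-8, restricted
here to the 16-bit range; the function name utf2_encode is
hypothetical):

    /* Encode one 16-bit character code into 1 to 3 UTF-2 bytes.
     * Returns the number of bytes written to buf. */
    int utf2_encode(unsigned short c, unsigned char *buf)
    {
        if (c < 0x80) {            /* ASCII range: one byte, unchanged */
            buf[0] = (unsigned char)c;
            return 1;
        }
        if (c < 0x800) {           /* two bytes: 110xxxxx 10xxxxxx */
            buf[0] = (unsigned char)(0xC0 | (c >> 6));
            buf[1] = (unsigned char)(0x80 | (c & 0x3F));
            return 2;
        }
        /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xE0 | (c >> 12));
        buf[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
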
Masataka Ohta