In his presentation of the Plan Nine UTF-2 work at the recent Usenix
conference, Rob Pike made an interesting point that is quite relevant
to the assorted discussions about >8-bit character sets. He said
(roughly) "the hard part is making the code understand that octets
and characters are not synonymous". Once that is done, the details --
how the two are related, whether a character is 16 or 32 bits, etc. --
are very much secondary, particularly if libraries etc. are designed
to hide the implementation details properly.

Untrue. Whether a character is 16 bits or 32 bits is not an
implementation detail.
With a 16-bit wchar_t, you can write
array[(unsigned)char_code]
and your program should work on most modern machines.

With a 32-bit wchar_t, it is often impossible to write
array[(unsigned)char_code]
because such an array would need 2^32 entries, and hardly any machine
has more than 4GB of virtual memory.
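
For illustration, a minimal sketch of the idiom in C; the table name
attr and the function lookup are hypothetical, standing for any
per-character attribute table:

    /* Per-character attribute table indexed directly by character code.
     * With a 16-bit code the table has 65,536 entries (64KB here),
     * which fits comfortably in memory.  The same idiom with a 32-bit
     * code would need 4,294,967,296 entries, beyond the virtual memory
     * of almost any machine. */
    static unsigned char attr[1 << 16];

    unsigned char lookup(unsigned short char_code)
    {
        return attr[(unsigned)char_code];
    }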

The Plan 9 people changed from the old 10646-appendix UTF to UTF-2 in
an afternoon, precisely because both encodings carry the same 16-bit
character codes.
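
For reference, a minimal sketch of how a 16-bit code is laid out in
UTF-2 bytes (the same layout later standardized as UTF-8, restricted
here to the 16-bit range; the function name utf2_encode is
hypothetical):

    /* Encode one 16-bit character code into 1 to 3 UTF-2 bytes.
     * Returns the number of bytes written to buf. */
    int utf2_encode(unsigned short c, unsigned char *buf)
    {
        if (c < 0x80) {            /* ASCII range: one byte, unchanged */
            buf[0] = (unsigned char)c;
            return 1;
        }
        if (c < 0x800) {           /* two bytes: 110xxxxx 10xxxxxx */
            buf[0] = (unsigned char)(0xC0 | (c >> 6));
            buf[1] = (unsigned char)(0x80 | (c & 0x3F));
            return 2;
        }
        /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xE0 | (c >> 12));
        buf[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
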
Masataka Ohta