10646, UTF-2, etc.

In his presentation of the Plan 9 UTF-2 work at the recent Usenix
conference, Rob Pike made an interesting point that is quite relevant
to the assorted discussions about >8-bit character sets.  He said
(roughly) "the hard part is making the code understand that octets
and characters are not synonymous".  Once that is done, the details --
how the two are related, whether a character is 16 or 32 bits, etc. --
are very much secondary, particularly if libraries etc. are designed
to hide the implementation details properly.  Pike's group changed from
the old 10646-appendix UTF to UTF-2 in an afternoon:  header files and
libraries were replaced, a big recursive "make" was done to rebuild
the software, and a little program ran around finding UTF disk files
and converting them in place.
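
To make the octets-versus-characters point concrete, here is a minimal
sketch in C.  It assumes the string holds well-formed UTF-8, the encoding
that UTF-2 was later standardized as; the function name utf_charcount is
hypothetical and is not taken from the Plan 9 libraries.

```c
#include <stddef.h>
#include <string.h>

/*
 * Count characters (code points) in a NUL-terminated string.
 * Assumption: the string is well-formed UTF-8.  Every octet except a
 * continuation octet (bit pattern 10xxxxxx) starts a new character,
 * so counting the non-continuation octets counts the characters.
 * utf_charcount is a hypothetical name used only for illustration.
 */
size_t utf_charcount(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the six-octet string "na\xc3\xafve" (the word "naive" with a
diaeresis on the i), strlen() returns 6 while utf_charcount() returns 5.
Code that only ever asks such a library routine for a character count
does not need to change when the encoding underneath it does.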

He also noted that there was one visible benefit from switching to
UTF-2:  a lot of bugs disappeared.

                                         Henry Spencer at U of Toronto Zoology
                                          
henry(_at_)zoo(_dot_)toronto(_dot_)edu   utzoo!henry
