In his presentation of the Plan 9 UTF-2 work at the recent Usenix
conference, Rob Pike made an interesting point that is quite relevant
to the assorted discussions about >8-bit character sets. He said
(roughly) "the hard part is making the code understand that octets
and characters are not synonymous". Once that is done, the details --
how the two are related, whether a character is 16 or 32 bits, etc. --
are very much secondary, particularly if libraries etc. are designed
to hide the implementation details properly. They changed from the old
10646-appendix UTF to UTF-2 in an afternoon: header files and
libraries were replaced, a big recursive "make" was done to rebuild
the software, and a little program ran around finding UTF disk files
and converting them in place.

He also noted that there was one visible benefit from switching to
UTF-2: a lot of bugs disappeared.
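
To make the octets-versus-characters distinction concrete, here is a minimal
sketch in C. It assumes the UTF-2 bit layout as later standardized under the
name UTF-8, and it assumes well-formed input. The function names octet_count
and char_count are hypothetical, made up for illustration; they are not the
Plan 9 library interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of octets in a NUL-terminated string: exactly what strlen() reports. */
size_t octet_count(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/*
 * Number of characters in s, assuming s holds well-formed UTF-8.
 * Continuation octets have the bit pattern 10xxxxxx; every other
 * octet begins a new character, so counting those counts characters.
 * (Hypothetical helper: it does no validation of the input.)
 */
size_t char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For plain ASCII text the two functions agree, which is how code that confuses
the two notions survives for so long. For the five-character word "naive"
written with a diaeresis on the i (encoded as the two octets 0xC3 0xAF),
octet_count gives 6 and char_count gives 5.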
Henry Spencer at U of Toronto Zoology
henry(_at_)zoo(_dot_)toronto(_dot_)edu utzoo!henry