On 13 Sep 2000, at 11:57, Mark Leisher wrote:
True, UTF-16 is not known as UCS-2. However, UTF-16 still consists
of 2-byte chunks. It is essentially UCS-2 plus high and low
surrogates (see the Unicode Standard 3.0 page 19).
Yes, but if you just have a high surrogate, you can't do much with it --
it doesn't represent a Unicode character but only half of one. So you
need a high surrogate plus a low surrogate to display a character
beyond U+FFFF, leading to a 32-bit representation for such
characters (two 2-byte chunks).
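
To make the pairing concrete, here is a rough sketch of the arithmetic that maps a code point beyond U+FFFF to a high and low surrogate and back. The function names to_surrogates and from_surrogates are made up for illustration; they are not from any library.

```python
# Illustrative sketch only; function names are hypothetical.

def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000              # 20-bit value
    high = 0xD800 + (v >> 10)     # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits -> U+DC00..U+DFFF
    return high, low

def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

Neither half means anything alone: from_surrogates needs both values before it can yield a character, which is the point made above.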
Combining surrogates constitutes a UCS-4 encoding (or UTF-32 until
unavailable 10646 private use regions are removed).
I'm not sure what you mean by this. UTF-32 is always 4 bytes per
character; UTF-16 is 2 bytes or 4 bytes per character, depending on
the code point (variable-length, just as UTF-8 is).
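
A quick way to see the difference, assuming a Python 3 interpreter is at hand: encode a few characters and count the bytes. The helper name encoded_lengths is only for illustration, and the big-endian codec variants are used so that no byte order mark is counted.

```python
# Illustrative sketch only; the helper name is hypothetical.

def encoded_lengths(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),   # -be: no BOM added
        "utf-32": len(ch.encode("utf-32-be")),
    }

for ch in ("A", "\u20ac", "\U0001D11E"):
    print("U+%04X" % ord(ch), encoded_lengths(ch))
```

Only the UTF-32 column stays fixed at 4; the UTF-16 column jumps from 2 to 4 once the code point is beyond U+FFFF.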
Cheers,
philip