Philip> Yes, but if you just have a high surrogate, you can't do much with
Philip> it -- it doesn't represent a Unicode character but only half of
Philip> one. So you need a high surrogate plus a low surrogate to display
Philip> a character beyond U+FFFF, leading to a 32-bit representation for
Philip> such characters (2 2-byte chunks).
The point is that UTF-16 can be re-encoded into a 32-bit representation, but
is not itself a 32-bit representation: a surrogate pair is two 16-bit code
units, not a single 32-bit one.
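To make the re-encoding concrete, here is a minimal sketch in Python of
combining surrogates into single code points. The function names are mine and
purely illustrative, and unpaired surrogates are simply passed through rather
than treated as errors.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine one high and one low surrogate into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the result is offset past U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf16_to_code_points(units: list[int]) -> list[int]:
    """Re-encode 16-bit code units as one integer per character."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # Two 16-bit units collapse into one value above U+FFFF.
            out.append(combine_surrogates(unit, units[i + 1]))
            i += 2
        else:
            # BMP character (or an unpaired surrogate, passed through as-is).
            out.append(unit)
            i += 1
    return out
```

For example, the units 0x0041 0xD834 0xDD1E come out as the two values 0x41
and 0x1D11E: three 16-bit units in, two characters out.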
>> Combining surrogates constitutes a UCS-4 encoding (or UTF-32 until
>> unavailable 10646 private use regions are removed).
Philip> I'm not sure what you mean by this. UTF-32 is always 4 bytes per
Philip> char; UTF-16 is 2 bytes or 4 bytes per char, depending on the code
Philip> point (variable-length, just as UTF-8 is).
ISO 10646 has code ranges that cannot be represented using UTF-16. Thus, the
term UTF-32 was introduced to indicate the 21-bit range reachable by UTF-16
(U+0000 through U+10FFFF), encoded with the surrogates combined. UTF-32 also
indicates Unicode semantics.
These 10646 code ranges outside the reach of UTF-16 are scheduled to be
removed. When they are removed, the term UTF-32 will be deprecated and the
term UCS-4 will be used instead.
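To put numbers on those ranges, here is a rough sketch in Python. The constant
and function names are my own, and the UCS-4 limit assumes the 31-bit code
space that 10646 originally defined.

```python
# Largest offsets a high and a low surrogate can each carry (10 bits apiece).
HIGH_MAX_OFFSET = 0xDBFF - 0xD800
LOW_MAX_OFFSET = 0xDFFF - 0xDC00

# Highest code point a surrogate pair can express.
UTF16_MAX = 0x10000 + (HIGH_MAX_OFFSET << 10) + LOW_MAX_OFFSET

# Assumption: the 31-bit UCS-4 code space as originally defined in ISO 10646.
UCS4_MAX = 0x7FFFFFFF


def reachable_by_utf16(code_point: int) -> bool:
    """True if the value is within the range UTF-16 can address.

    Only the range is checked; surrogate values themselves are not excluded.
    """
    return 0 <= code_point <= UTF16_MAX


print(f"UTF-16 reaches up to U+{UTF16_MAX:X} ({UTF16_MAX.bit_length()} bits)")
print(f"UCS-4 reaches up to {UCS4_MAX:#x} ({UCS4_MAX.bit_length()} bits)")
```

Anything from 0x110000 up to the UCS-4 limit is in the 10646 space but out of
reach of UTF-16, which is the region at issue above.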
-----------------------------------------------------------------------------
Mark Leisher
Computing Research Lab            Cinema, radio, television, magazines are a
New Mexico State University       school of inattention: people look without
Box 30001, Dept. 3CRL             seeing, listen without hearing.
Las Cruces, NM 88003                  -- Robert Bresson