perl-unicode

Re: UCS-2 and UTF-16 [was Re: Encode, take five]

2000-09-14 07:35:29

    Philip> Yes, but if you just have a high surrogate, you can't do much with
    Philip> it -- it doesn't represent a Unicode character but only half of
    Philip> one. So you need a high surrogate plus a low surrogate to display
    Philip> a character beyond U+FFFF, leading to a 32-bit representation for
    Philip> such characters (2 2-byte chunks).

The point is that UTF-16 can be re-encoded into a 32-bit representation, but
is not itself a 32-bit representation.
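
To make the re-encoding concrete, here is a minimal sketch in Python of
combining one high and one low surrogate into a single scalar value.  The
function name combine_surrogates is only an illustration and does not come
from Encode or any other module; the arithmetic is the standard UTF-16
surrogate formula.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one code point (illustrative helper)."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: 0x%04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: 0x%04X" % low)
    # Each surrogate carries 10 bits; the pair addresses the range
    # U+10000..U+10FFFF, hence the 0x10000 offset.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    print("U+%04X" % combine_surrogates(0xD834, 0xDD1E))
```

The two 16-bit code units go in and one integer comes out; that integer is
what gets stored in a 32-bit cell, while the UTF-16 form stays a sequence of
16-bit units.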

    >> Combining surrogates constitutes a UCS-4 encoding (or UTF-32 until
    >> unavailable 10646 private use regions are removed).

    Philip> I'm not sure what you mean by this. UTF-32 is always 4 bytes per
    Philip> char; UTF-16 is 2 bytes or 4 bytes per char, depending on the code
    Philip> point (variable-length, just as UTF-8 is).

ISO 10646 has code ranges (everything above U+10FFFF) that cannot be
represented using UTF-16.  The term UTF-32 was therefore introduced to denote
the 21-bit range that UTF-16 can reach, stored with the surrogates combined
into a single 32-bit unit.  UTF-32 also implies Unicode semantics.
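
As a rough illustration of where the reach of UTF-16 ends, the sketch below
encodes a code point as UTF-16 code units and rejects anything above
U+10FFFF.  The names utf16_units and UTF16_MAX are hypothetical, chosen for
this example only; the 0x10FFFF limit is the top of the surrogate-addressable
range.

```python
UTF16_MAX = 0x10FFFF  # highest code point a surrogate pair can address


def utf16_units(cp: int) -> list:
    """Return the UTF-16 code units for a code point (illustrative helper)."""
    if cp < 0 or cp > UTF16_MAX:
        # Values in this range exist in the 31-bit UCS-4 space but have no
        # UTF-16 form.
        raise ValueError("outside the reach of UTF-16: 0x%X" % cp)
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not characters: 0x%X" % cp)
    if cp < 0x10000:
        return [cp]  # one 16-bit unit
    v = cp - 0x10000  # 20 bits, split 10/10 across the pair
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


if __name__ == "__main__":
    print([hex(u) for u in utf16_units(0x10FFFF)])
```

A value such as 0x110000 fits easily in a 32-bit cell but raises here, which
is the distinction between the full 10646 code space and the part that UTF-16
can express.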

These 10646 code ranges outside the reach of UTF-16 are scheduled to be
removed.  When they are removed, the term UTF-32 will be deprecated and the
term UCS-4 will be used instead.
-----------------------------------------------------------------------------
Mark Leisher
Computing Research Lab            Cinema, radio, television, magazines are a
New Mexico State University       school of inattention: people look without
Box 30001, Dept. 3CRL             seeing, listen without hearing.
Las Cruces, NM  88003                            -- Robert Bresson
