perl-unicode

UCS-2 and UTF-16 [was Re: Encode, take five]

2000-09-13 10:57:29

    Philip> On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote:
    >> UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks,

    Philip> As I understand it, that's not true -- UTF-16 is 2-byte *or*
    Philip> 4-byte chunks, since UTF-16 contains surrogates (high-surrogate +
    Philip> low-surrogate [or the other way around?] = 1 character,
    Philip> represented with four bytes). UCS-2, OTOH, is always two bytes.

True, UTF-16 is not the same thing as UCS-2.  However, UTF-16 still consists
of 2-byte chunks (16-bit code units).  It is essentially UCS-2 plus high and
low surrogates (see the Unicode Standard 3.0, page 19); in a pair, the high
surrogate comes first.  Combining the two surrogates of a pair yields a UCS-4
value (or UTF-32, until the unavailable 10646 private use regions are removed).
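
To make the surrogate arithmetic concrete, here is a minimal sketch in Python
(not Perl, and not taken from any Encode code).  The function names
to_utf16_units and combine_surrogates are hypothetical, chosen only for
illustration.  It assumes the usual ranges: high surrogates D800..DBFF, low
surrogates DC00..DFFF, and code points up to 10FFFF.

```python
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high (leading) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low (trailing) surrogates


def to_utf16_units(code_point):
    """Return the 16-bit code units (one or two) for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of UTF-16 range")
    if HIGH_START <= code_point <= LOW_END:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        # Same single 2-byte chunk as UCS-2.
        return [code_point]
    offset = code_point - 0x10000            # 20 bits remain
    high = HIGH_START | (offset >> 10)       # top 10 bits
    low = LOW_START | (offset & 0x3FF)       # bottom 10 bits
    return [high, low]                       # high surrogate comes first


def combine_surrogates(high, low):
    """Combine a high/low surrogate pair into one UCS-4 value."""
    if not HIGH_START <= high <= HIGH_END:
        raise ValueError("not a high surrogate")
    if not LOW_START <= low <= LOW_END:
        raise ValueError("not a low surrogate")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)
```

For example, to_utf16_units(0x1D11E) gives the pair D834 DD1E, and
combine_surrogates(0xD834, 0xDD1E) gives back 1D11E.  Each unit is still a
2-byte chunk; only the character takes four bytes.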
-----------------------------------------------------------------------------
Mark Leisher
Computing Research Lab            Cinema, radio, television, magazines are a
New Mexico State University       school of inattention: people look without
Box 30001, Dept. 3CRL             seeing, listen without hearing.
Las Cruces, NM  88003                            -- Robert Bresson
