perl-unicode

Re: UCS-2 and UTF-16 [was Re: Encode, take five]

2000-09-14 03:10:37
On 13 Sep 2000, at 11:57, Mark Leisher wrote:

> True, UTF-16 is not known as UCS-2.  However, UTF-16 still consists
> of 2-byte chunks.  It is essentially UCS-2 plus high and low
> surrogates (see the Unicode Standard 3.0 page 19).

Yes, but if you just have a high surrogate, you can't do much with it:
it doesn't represent a Unicode character, only half of one. So you
need a high surrogate plus a low surrogate to represent a character
beyond U+FFFF, which gives a 32-bit representation for such
characters (two 2-byte chunks).
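
To make the pairing concrete, here is a minimal Python sketch of the
arithmetic that maps a code point beyond U+FFFF to a high and a low
surrogate and back. The function names to_surrogates and from_surrogates
are made up for illustration; they are not part of Encode or any other
module discussed in this thread.

```python
def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary range")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return high, low


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogates(0x1D11E)
    print(f"U+1D11E -> {high:04X} {low:04X}")
```

Neither half can be decoded alone: from_surrogates needs both values,
which is why a lone high surrogate is only half a character.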

> Combining surrogates constitutes a UCS-4 encoding (or UTF-32 until
> unavailable 10646 private use regions are removed).

I'm not sure what you mean by this. UTF-32 is always 4 bytes per 
char; UTF-16 is 2 bytes or 4 bytes per char, depending on the code 
point (variable-length, just as UTF-8 is).
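
As a quick check of those sizes, the following Python sketch counts the
bytes each encoding uses for a single code point. The helper name
encoded_lengths is hypothetical, and the big-endian codec variants are
used only so that no byte order mark is added to the counts.

```python
def encoded_lengths(cp):
    """Return (utf8, utf16, utf32) byte counts for one code point."""
    ch = chr(cp)
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),   # 2 bytes, or 4 for a surrogate pair
        len(ch.encode("utf-32-be")),   # always 4 bytes
    )


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1D11E):
        utf8, utf16, utf32 = encoded_lengths(cp)
        print(f"U+{cp:04X}: UTF-8={utf8} UTF-16={utf16} UTF-32={utf32}")
```

Only the last code point, which lies beyond U+FFFF, takes 4 bytes in
UTF-16; the UTF-32 column stays at 4 throughout.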

Cheers,
philip
