Re: UCS-2 and UTF-16 [was Re: Encode, take five]

On 13 Sep 2000, at 11:57, Mark Leisher wrote:

> True, UTF-16 is not known as UCS-2.  However, UTF-16 still consists
> of 2-byte chunks.  It is essentially UCS-2 plus high and low
> surrogates (see the Unicode Standard 3.0 page 19).


Yes, but a high surrogate on its own is not much use: it doesn't 
represent a Unicode character, only half of one. You need a high 
surrogate followed by a low surrogate to represent a character 
beyond U+FFFF, which gives a 32-bit representation for such 
characters (two 2-byte chunks).
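
To make the pairing concrete, here is a minimal sketch in Python of how 
a code point beyond U+FFFF splits into a high and a low surrogate and 
how the pair combines back. The function names are my own, picked for 
illustration, and are not taken from Encode or any other module.

```python
def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000                 # 20 bits remain after removing the offset
    high = 0xD800 + (v >> 10)        # top 10 bits go into the high surrogate
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogates(high, low):
    """Combine a high and a low surrogate back into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogates(0x1D11E)
    print(f"U+1D11E -> {high:04X} {low:04X}")
```

Neither half decodes to anything by itself, which is why 
from_surrogates refuses a value outside its expected range.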

> Combining surrogates constitutes a UCS-4 encoding (or UTF-32 until
> unavailable 10646 private use regions are removed).


I'm not sure what you mean by this. UTF-32 is always 4 bytes per 
char; UTF-16 is 2 bytes or 4 bytes per char, depending on the code 
point (variable-length, just as UTF-8 is).
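
As a quick check of those sizes, the sketch below uses Python's built-in 
codecs to count the bytes each encoding needs for a single code point. 
The helper name encoded_lengths is made up for this example, and the 
big-endian codec variants are used only so that no byte order mark is 
counted.

```python
def encoded_lengths(cp):
    """Return the number of bytes one code point takes in each encoding."""
    ch = chr(cp)
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),   # -be variant emits no BOM
        "utf-32": len(ch.encode("utf-32-be")),
    }


if __name__ == "__main__":
    # One ASCII character, one other BMP character, one beyond U+FFFF.
    for cp in (0x41, 0x20AC, 0x1D11E):
        print(f"U+{cp:04X}", encoded_lengths(cp))
```

UTF-32 stays at 4 bytes throughout, while UTF-16 goes from 2 to 4 
bytes once the code point passes U+FFFF.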

Cheers,
philip
