On Wed, 13 Sep 2000, Philip Newton wrote:
On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote:
UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks,
As I understand it, that's not true -- UTF-16 is 2-byte *or* 4-byte
chunks, since UTF-16 contains surrogates (high-surrogate + low-
surrogate [or the other way around?] = 1 character, represented
with four bytes). UCS-2, OTOH, is always two bytes.
Until someone extends the Unicode character set beyond the current range,
UCS-2 and UTF-16 have a one-to-one mapping. I assume that's the
point being made. An excerpt from the book I'm currently tech reviewing:
Nonetheless Unicode does provide a means of representing code points
beyond 65,535 by recognizing certain two-byte sequences as half of a
surrogate pair. A Unicode document that uses UCS-2 plus surrogate
pairs is said to be in the UTF-16 encoding. Since no software
currently supports or produces surrogate pairs, and since no scripts
are encoded in Unicode with code points above 65,535, the
distinction between UCS-2 and UTF-16 is mostly academic at this
point in time.
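
To make the surrogate arithmetic concrete, here is a minimal Python sketch.
It is not from the book, and the function name utf16_code_units is made up
for illustration. It assumes the standard UTF-16 rule: a code point up to
65,535 is stored as a single 16-bit unit (the same as UCS-2), and anything
above that is split into a high surrogate followed by a low surrogate.

```python
def utf16_code_units(code_point):
    """Return the 16-bit UTF-16 code units for one Unicode code point.

    Illustrative helper only: it does not reject code points that fall
    inside the surrogate range 0xD800-0xDFFF themselves.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point < 0x10000:
        # At or below 65,535: one unit, identical to UCS-2.
        return [code_point]
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]                   # high surrogate comes first


# A code point below 65,535 needs one unit (two bytes).
print([hex(u) for u in utf16_code_units(0x263A)])    # ['0x263a']

# A code point above 65,535 needs two units (four bytes).
print([hex(u) for u in utf16_code_units(0x1D11E)])   # ['0xd834', '0xdd1e']
```

So the two encodings only diverge once code points above 65,535 actually
appear in the data.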
--
<Matt/>
Fastnet Software Ltd. High Performance Web Specialists
Providing mod_perl, XML, Sybase and Oracle solutions
Email for training and consultancy availability.
http://sergeant.org | AxKit: http://axkit.org