perl-unicode

Re: Encode, take five

2000-09-13 12:26:50
On Wed, 13 Sep 2000, Philip Newton wrote:

On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote:

UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks,

As I understand it, that's not true -- UTF-16 is 2-byte *or* 4-byte 
chunks, since UTF-16 contains surrogates (high-surrogate + low-
surrogate [or the other way around?] = 1 character, represented 
with four bytes). UCS-2, OTOH, is always two bytes.
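To make the surrogate arithmetic concrete, here is a minimal sketch in Python. The function name to_surrogates and the sample code point U+10400 are illustrative choices, not anything from this thread. It follows the standard UTF-16 rule: subtract 0x10000, then put the top ten bits in the high surrogate and the bottom ten bits in the low surrogate. The high surrogate comes first.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits, range D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits, range DC00..DFFF
    return high, low


# U+10400 is just a sample code point beyond the 16-bit range.
high, low = to_surrogates(0x10400)

# Python's own encoder produces the same two 16-bit units, high first.
utf16_bytes = chr(0x10400).encode("utf-16-be")

# A code point inside the 16-bit range still takes exactly two bytes.
bmp_bytes = "A".encode("utf-16-be")
```

With these definitions, utf16_bytes is four bytes long and bmp_bytes is two, which is the 2-byte *or* 4-byte behaviour described above.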

Until someone extends the Unicode character set beyond the current range,
UCS-2 and UTF-16 have a one-to-one mapping. I assume that's the
point being made. An excerpt from the book I'm currently tech reviewing:

  Nonetheless Unicode does provide a means of representing code points
  beyond 65,535 by recognizing certain two-byte sequences as half of a
  surrogate pair. A Unicode document that uses UCS-2 plus surrogate
  pairs is said to be in the UTF-16 encoding. Since no software
  currently supports or produces surrogate pairs, and since no scripts
  are encoded in Unicode with code points above 65,535, the
  distinction between UCS-2 and UTF-16 is mostly academic at this
  point in time.
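The one-to-one mapping can be checked mechanically. The sketch below, in Python, uses two hypothetical helpers (fits_in_ucs2 and encode_ucs2_be, named here only for illustration) and assumes big-endian byte order with no byte-order mark. For text whose code points all fit in 16 bits, a plain two-bytes-per-character encoding yields the same bytes as a UTF-16 encoder.

```python
def fits_in_ucs2(text):
    """True if every character is a 16-bit code point and not a surrogate."""
    return all(
        ord(ch) <= 0xFFFF and not 0xD800 <= ord(ch) <= 0xDFFF
        for ch in text
    )


def encode_ucs2_be(text):
    """Encode as fixed-width two-byte units, big-endian (illustrative only)."""
    if not fits_in_ucs2(text):
        raise ValueError("text contains characters outside the UCS-2 range")
    return b"".join(ord(ch).to_bytes(2, "big") for ch in text)


sample = "Perl \u263a"                       # all code points below U+10000
same_bytes = encode_ucs2_be(sample) == sample.encode("utf-16-be")

beyond = chr(0x10400)                        # sample code point above U+FFFF
needs_surrogates = not fits_in_ucs2(beyond)
```

Here same_bytes is True and needs_surrogates is True: the two encodings only diverge once a code point above 65,535 appears.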

-- 
<Matt/>

Fastnet Software Ltd. High Performance Web Specialists
Providing mod_perl, XML, Sybase and Oracle solutions
Email for training and consultancy availability.
http://sergeant.org | AxKit: http://axkit.org
