Re: Encode, take five

On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote:

> UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks,


As I understand it, that's not true -- UTF-16 is 2-byte *or* 4-byte 
chunks, since UTF-16 contains surrogates (a high surrogate followed 
by a low surrogate = 1 character, represented with four bytes). 
UCS-2, OTOH, is always two bytes.
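To make the 2-byte-or-4-byte point concrete, here is a minimal sketch in Python (not Perl, and not anything from Encode; the function name utf16_code_units is made up for illustration) of how one code point maps to UTF-16 code units:

```python
def utf16_code_units(cp):
    """Return the UTF-16 code units (as ints) for a single code point.

    Hypothetical helper for illustration only.
    """
    # Reject values outside the Unicode range and lone surrogate code points.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not encodable in UTF-16: %#x" % cp)
    if cp < 0x10000:
        # BMP character: one 16-bit unit, identical to its UCS-2 form.
        return [cp]
    v = cp - 0x10000                 # 20 bits remain to be split
    high = 0xD800 | (v >> 10)        # high surrogate carries the top 10 bits
    low = 0xDC00 | (v & 0x3FF)       # low surrogate carries the bottom 10 bits
    return [high, low]               # high first, then low


print([hex(u) for u in utf16_code_units(0x41)])     # ['0x41']
print([hex(u) for u in utf16_code_units(0x10400)])  # ['0xd801', '0xdc00']
```

Anything below 0x10000 comes out as a single unit, which is the only case UCS-2 can represent at all.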

> and UTF-32 as UCS-4, 32-bit or 4-byte chunks.


Note that ISO 10646 really uses 31 bits, not 32 -- probably for 
signed/unsigned legacy reasons. One difference is that UCS-4 has 
code points from 0 to 0x7fffffff, while UTF-32 only goes to 0x10ffff, the 
largest possible Unicode code point. But Unicode got ISO to agree 
not to allocate characters that are not representable with UTF-16, i.e. 
above 0x10ffff.[1]  
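The 0x10ffff limit falls straight out of the surrogate arithmetic. A small Python sketch of the two ranges follows; the constant and function names are illustrative assumptions, and the range checks deliberately ignore the surrogate code points:

```python
UCS4_MAX = 0x7FFFFFFF    # 31-bit code space of ISO/IEC 10646
UTF32_MAX = 0x10FFFF     # largest Unicode code point


def largest_utf16_code_point():
    """Code point encoded by the highest possible surrogate pair."""
    high_bits = 0xDBFF - 0xD800      # top 10 bits, all set (0x3FF)
    low_bits = 0xDFFF - 0xDC00       # bottom 10 bits, all set (0x3FF)
    return 0x10000 + (high_bits << 10) + low_bits


def in_ucs4(cp):
    """True if cp lies in the original UCS-4 range."""
    return 0 <= cp <= UCS4_MAX


def in_utf32(cp):
    """True if cp lies in the UTF-32 range (surrogates not excluded here)."""
    return 0 <= cp <= UTF32_MAX


print(hex(largest_utf16_code_point()))        # 0x10ffff
print(in_ucs4(0x110000), in_utf32(0x110000))  # True False
```

So 0x110000 is a legal UCS-4 value under the old definition but can be expressed in neither UTF-16 nor UTF-32.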

Apparently there are also differences in that UCS-* are "encoding 
forms" specified by ISO/IEC 10646, while UTF-* are "Unicode 
transformation formats" specified by Unicode, Inc., and the latter 
convey additional semantics. Take, for example, this quote from 
http://www.unicode.org/unicode/reports/tr19/ :

"Over and above ISO 10646, the Unicode Standard adds a number of 
conformance constraints on character semantics (see The Unicode 
Standard, Version 3.0, Chapter 3). Declaring UTF-32 instead of UCS-4 
allows implementations to explicitly commit to Unicode semantics."

This may or may not be significant for Perl.

Cheers,
Philip

[1] Another, longer quote from TR 19:

"Relation to ISO/IEC 10646 and UCS-4

ISO/IEC 10646 defines a 4-byte encoding form called UCS-4. Since 
UTF-32 is simply a subset of UCS-4 characters, it is conformant to 
ISO/IEC 10646 as well as to the Unicode Standard.  

As of the recent publication of the second edition of ISO/IEC 10646-1, 
UCS-4 still assigns private use codepoints (E00000_16..FFFFFF_16 
and 60000000_16..7FFFFFFF_16) that are not in the range of valid 
Unicode codepoints. To promote interoperability among the Unicode 
encoding forms JTC1/SC2/WG2 has approved a motion removing 
those private use assignments:  

Resolution M38.6 (Restriction of encoding space) [adopted 
unanimously]  

"WG2 accepts the proposal in document N2175 towards removing the 
provision for Private Use Groups and Planes beyond Plane 16 in 
ISO/IEC 10646, to ensure internal consistency in the standard 
between UCS-4, UTF-8 and UTF-16 encoding formats, and instructs its 
project editor [to] prepare suitable text for processing as a future 
Technical Corrigendum or an Amendment to 10646-1:2000."  

While this resolution must still be turned into a Technical Corrigendum 
or an Amendment to 10646-1:2000, the Unicode Technical Committee 
has every expectation that once the text for that Technical 
Corrigendum or Amendment starts its formal balloting it will proceed 
smoothly to formal approval and publication as part of that standard.  

Until the formal balloting is concluded, the term UTF-32 can be used 
to refer to the subset of UCS-4 characters that are in the range of 
valid Unicode code points. After it passes, UTF-32 will then simply be 
an alias for UCS-4 (with the extra requirement that Unicode semantics 
are observed)."