On 12 Sep 2000, at 18:42, Jarkko Hietaniemi wrote:
> UTF-16 is also known as UCS-2, 16 bit or 2-byte chunks,
As I understand it, that's not true -- UTF-16 is 2-byte *or* 4-byte
chunks, since UTF-16 contains surrogates (high-surrogate + low-
surrogate [or the other way around?] = 1 character, represented
with four bytes). UCS-2, OTOH, is always two bytes.
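
To make the 2-byte-or-4-byte point concrete, here is a minimal sketch in Python (not Perl, purely for illustration). The function name utf16_code_units is made up for this example; the arithmetic is the standard surrogate-pair mapping, in which the high surrogate comes first and the low surrogate second.

```python
def utf16_code_units(cp):
    """Return the 16-bit code units UTF-16 uses for code point cp.

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x10000:
        # Basic Multilingual Plane: one 16-bit unit, same as UCS-2.
        return [cp]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = cp - 0x10000
    high = 0xD800 | (offset >> 10)    # high surrogate, comes first
    low = 0xDC00 | (offset & 0x3FF)   # low surrogate, comes second
    return [high, low]


# U+0041 needs one unit (two bytes); U+1D11E needs two units (four bytes).
bmp_units = utf16_code_units(0x41)
supplementary_units = utf16_code_units(0x1D11E)
```

UCS-2 has no equivalent of the second branch: anything at or above 0x10000 simply cannot be represented.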
> and UTF-32 as UCS-4, 32-bit or 4-byte chunks.
Note that ISO 10646 really uses only 31 bits, not 32 -- probably for
signed/unsigned legacy reasons. One difference is that UCS-4 has
code points from 0 to 0x7fffffff, while UTF-32 only goes to 0x10ffff, the
largest possible Unicode code point. But Unicode got ISO to agree
not to allocate characters that are not representable with UTF-16, i.e.
above 0x10ffff.[1]
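
As a rough check of where the 0x10ffff limit comes from, the following Python sketch computes the largest code point reachable with a surrogate pair and compares it with the 31-bit UCS-4 ceiling. The variable names are mine and only illustrative; the surrogate ranges are the standard ones.

```python
# Upper bound of the 31-bit UCS-4 code space.
UCS4_MAX = 0x7FFFFFFF

# Surrogate ranges used by UTF-16.
HIGH_FIRST, HIGH_LAST = 0xD800, 0xDBFF
LOW_FIRST, LOW_LAST = 0xDC00, 0xDFFF

# Each surrogate contributes 10 bits; pairs start counting at 0x10000.
utf16_max = 0x10000 + ((HIGH_LAST - HIGH_FIRST) << 10) + (LOW_LAST - LOW_FIRST)

# Number of bits needed to hold the largest value of each code space.
ucs4_bits = UCS4_MAX.bit_length()
utf16_bits = utf16_max.bit_length()
```

Here utf16_max works out to 0x10ffff, so anything above that is representable in UCS-4 but not in UTF-16.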
Apparently there are also differences since UCS-* are "encoding
forms" specified by ISO/IEC 10646, while UTF-* are "Unicode
transformation formats" specified by Unicode, Inc., and convey
additional semantics. Take, for example, this quote from
http://www.unicode.org/unicode/reports/tr19/ :
"Over and above ISO 10646, the Unicode Standard adds a number of
conformance constraints on character semantics (see The Unicode
Standard, Version 3.0, Chapter 3). Declaring UTF-32 instead of UCS-4
allows implementations to explicitly commit to Unicode semantics."
This may or may not be significant for Perl.
Cheers,
Philip
[1] Another, longer quote from TR 19:
"Relation to ISO/IEC 10646 and UCS-4
ISO/IEC 10646 defines a 4-byte encoding form called UCS-4. Since
UTF-32 is simply a subset of UCS-4 characters, it is conformant to
ISO/IEC 10646 as well as to the Unicode Standard.
As of the recent publication of the second edition of ISO/IEC 10646-1,
UCS-4 still assigns private use codepoints (E00000_16..FFFFFF_16
and 60000000_16..7FFFFFFF_16) that are not in the range of valid
Unicode codepoints. To promote interoperability among the Unicode
encoding forms JTC1/SC2/WG2 has approved a motion removing
those private use assignments:
Resolution M38.6 (Restriction of encoding space) [adopted
unanimously]
"WG2 accepts the proposal in document N2175 towards removing the
provision for Private Use Groups and Planes beyond Plane 16 in
ISO/IEC 10646, to ensure internal consistency in the standard
between UCS-4, UTF-8 and UTF-16 encoding formats, and instructs its
project editor [to] prepare suitable text for processing as a future
Technical Corrigendum or an Amendment to 10646-1:2000."
While this resolution must still be turned into a Technical Corrigendum
or an Amendment to 10646-1:2000, the Unicode Technical Committee
has every expectation that once the text for that Technical
Corrigendum or Amendment starts its formal balloting it will proceed
smoothly to formal approval and publication as part of that standard.
Until the formal balloting is concluded, the term UTF-32 can be used
to refer to the subset of UCS-4 characters that are in the range of
valid Unicode code points. After it passes, UTF-32 will then simply be
an alias for UCS-4 (with the extra requirement that Unicode
semantics are observed)."