RE: perlunitut - feedback appreciated

Thanks. The Perl implementors and you have done a very good job. I have a
few suggestions and one complaint.

The most important issue is chr().

Note that C<chr(...)> for arguments less than 0x100 (decimal 256) will
return an eight-bit character for backward compatibility with older
Perls (in ISO 8859-1 platforms it can be argued to be producing
Unicode even then, just not Unicode encoded in UTF-8 -- the ISO 8859-1
is equivalent to the first 256 characters of Unicode).  For C<chr()>
arguments of 0x100 or more, Unicode will always be produced.


My complaint: There should be a pure Unicode alternative to this kludge.
Obviously, it is not hard to write one in Perl, but it should be part of the
implementation.
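
A minimal sketch of the kind of alternative I have in mind follows. The
name uchr is hypothetical and not part of Perl. It assumes utf8::upgrade
and utf8::is_utf8 are available, which I believe is the case from 5.8
onward. It only changes the internal representation; Perl still compares
the result equal to what chr() returns.

```perl
use strict;
use warnings;

# uchr() is a hypothetical helper, not a Perl builtin.
# It returns the same character as chr(), but always stored
# internally as UTF-8, even for code points below 0x100.
sub uchr {
    my ($code_point) = @_;
    my $char = chr($code_point);
    utf8::upgrade($char);    # does nothing if already UTF-8 internally
    return $char;
}

my $e_acute = uchr(0xE9);
printf "U+%04X, length %d, utf8 flag %s\n",
    ord($e_acute), length($e_acute),
    utf8::is_utf8($e_acute) ? "on" : "off";
```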

ISO Latin-1 characters encoded as 80-FF in single bytes are not Unicode.
There is no Unicode transformation format or other encoding that permits
this. The code point range is actually 000080-0000FF, and the encodings
of its first and last code points are

0000000010000000  0000000011111111 UTF-16 Big Endian
1000000000000000  1111111100000000 UTF-16 Little Endian
00000000000000000000000010000000  00000000000000000000000011111111 UCS-4 BE
10000000000000000000000000000000  11111111000000000000000000000000 UCS-4 LE
1100001010000000  1100001110111111 UTF-8
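
Bit patterns like these can be checked with the Encode module. This is
only a sketch: bits_for is a hypothetical helper name, and I am assuming
that Encode's UTF-32BE and UTF-32LE can stand in for UCS-4 BE and LE,
which holds for code points in this range.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the encoded form of one code point as a string of bits.
sub bits_for {
    my ($encoding, $code_point) = @_;
    my $bytes = encode($encoding, chr($code_point));
    return unpack("B*", $bytes);
}

# Print the encodings of U+0080 and U+00FF side by side.
for my $encoding (qw(UTF-16BE UTF-16LE UTF-32BE UTF-32LE UTF-8)) {
    printf "%-8s  %s  %s\n", $encoding,
        bits_for($encoding, 0x80), bits_for($encoding, 0xFF);
}
```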

Character ranges in regular expression character classes [a-z]
and in the tr///, aka y///, operator are not affected by Unicode.


This could mean that they extend gracefully to Unicode, for example
something like [\x{0300}-\x{03FF}], or that they cannot be used outside the
00-FF range (or would it be 00-7F?). Clarification is needed.
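
For what it is worth, a small experiment along the following lines would
settle the question on a given Perl. The helper names are hypothetical.
My understanding is that Perl 5.8 and later interpret such ranges by code
point, so that both the character class and the tr/// range work beyond
FF, but this should be confirmed against the release in question.

```perl
use strict;
use warnings;

# True if the single character falls in U+0300..U+03FF.
sub in_range_0300_03ff {
    my ($char) = @_;
    return $char =~ /\A[\x{0300}-\x{03FF}]\z/ ? 1 : 0;
}

# Map Greek capitals U+0391..U+03A9 onto small letters U+03B1..U+03C9.
sub greek_lower {
    my ($text) = @_;
    (my $copy = $text) =~ tr/\x{0391}-\x{03A9}/\x{03B1}-\x{03C9}/;
    return $copy;
}

print in_range_0300_03ff("\x{03B1}") ? "alpha: in range\n" : "alpha: not in range\n";
print in_range_0300_03ff("a")        ? "a: in range\n"     : "a: not in range\n";
```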

Unicode is a standard that defines a unique number for every character.


Unique: Some characters are encoded in Unicode twice. Examples include
A-ring, also encoded as the Angstrom symbol, and a number of
full-width/half-width variants from Japanese standards.
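
The duplication can be seen with the Unicode::Normalize module. This is a
sketch only: the helper names are hypothetical, and the module is assumed
to be installed (I believe it ships with Perl from 5.8 on). The Angstrom
sign is canonically equivalent to A-ring, while the full-width letters
are only compatibility equivalents of the ordinary ones.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

my $angstrom_sign = "\x{212B}";    # ANGSTROM SIGN
my $a_ring        = "\x{00C5}";    # LATIN CAPITAL LETTER A WITH RING ABOVE
my $fullwidth_a   = "\x{FF21}";    # FULLWIDTH LATIN CAPITAL LETTER A

# Compare two strings after canonical (NFC) or compatibility (NFKC)
# normalization.
sub same_after_nfc  { return NFC($_[0])  eq NFC($_[1])  ? 1 : 0 }
sub same_after_nfkc { return NFKC($_[0]) eq NFKC($_[1]) ? 1 : 0 }

printf "NFC:  Angstrom sign vs A-ring: %s\n",
    same_after_nfc($angstrom_sign, $a_ring) ? "same" : "different";
printf "NFC:  full-width A vs A:       %s\n",
    same_after_nfc($fullwidth_a, "A") ? "same" : "different";
printf "NFKC: full-width A vs A:       %s\n",
    same_after_nfkc($fullwidth_a, "A") ? "same" : "different";
```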

Number: Please say "code point" rather than number.

Every character: Unicode and ISO/IEC 10646 are coordinated standards that
provide code points for the characters in almost all modern character set
standards, covering more than 30 writing systems and hundreds of languages,
including all commercially important modern languages. All characters in the
largest Chinese, Japanese, and Korean dictionaries are also encoded. The
standards will eventually cover almost all characters in more than 250
writing systems and thousands of languages, but will not include proprietary
characters, personal-use characters, and some others.


Note that no platform today (Java, Unix, Mac, Windoze) includes rendering
capability for all of the writing systems defined in Unicode, even where
appropriate fonts are available. The greatest deficits are in Armenian,
Georgian, Ethiopic, and writing systems of Asia, including India, Tibet,
Mongolia, Sri Lanka, Burma, and Cambodia.

Since Unicode 3.1 Unicode characters have been defined all the way
up to 21 bits...


Unicode 1.0 began as a 16-bit character set, defining code points in the
range 0000-FFFF. ISO/IEC 10646 defines its corresponding region
00000000-0000FFFF as the Basic Multilingual Plane (Plane 0). Since Unicode
2.0, the Unicode code space has been defined to be 000000-10FFFF, adding 16
more planes. This is often described as a 20.5 bit encoding. A set of
language tag characters was defined in Plane 14. Their use is strongly
discouraged.
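
The arithmetic behind the code space can be sketched as follows. The
helper names are hypothetical. The "20.5 bit" figure is a loose
description: the exact base-2 logarithm of the number of code points is a
little over 20.

```perl
use strict;
use warnings;

# 17 planes (0 through 16) of 0x10000 code points each.
sub code_space_size { return 17 * 0x10000 }

# The plane number is whatever lies above the low 16 bits.
sub plane_of {
    my ($code_point) = @_;
    return $code_point >> 16;
}

printf "code points: %d\n", code_space_size();
printf "bits needed: %.2f\n", log(code_space_size()) / log(2);
printf "U+E0001 is in Plane %d\n", plane_of(0xE0001);
```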

In Unicode 3.1 characters were defined in Planes 1 and 2, and there are
plans for Plane 3, at least, to be populated in Unicode 4.0. ISO plans to
vote soon to restrict 10646 to the corresponding range, 00000000-0010FFFF.


Some mention should be made of surrogates. They do not appear in UTF-8, but
many people are unclear on this point. They are also not characters.
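
To make the point concrete, here is a sketch that computes the UTF-16
surrogate pair for a code point outside Plane 0 and shows that UTF-8
encodes the same code point directly, with no surrogates involved. The
helper name surrogate_pair is hypothetical, and the Encode module is
assumed to be available.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Split a supplementary code point into its UTF-16 surrogate pair.
sub surrogate_pair {
    my ($code_point) = @_;
    die "not a supplementary code point\n"
        if $code_point < 0x10000 || $code_point > 0x10FFFF;
    my $offset = $code_point - 0x10000;
    return (0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF));
}

my $code_point = 0x10400;
printf "UTF-16 code units: %04X %04X\n", surrogate_pair($code_point);
printf "UTF-16BE bytes:    %s\n",
    uc(unpack("H*", encode("UTF-16BE", chr($code_point))));
printf "UTF-8 bytes:       %s\n",
    uc(unpack("H*", encode("UTF-8", chr($code_point))));
```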

Mention should be made of the rule requiring the use of shortest-length
UTF-8 representations. Violations of this rule constitute a security hazard
in communications. I hope that Perl observes this rule.
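
As an illustration of the hazard: the two bytes C0 AF are an overlong
form of "/" (U+002F), which a lenient decoder might accept after a filter
has already looked for the one-byte form. The sketch below assumes that
the Encode module's "UTF-8" encoding is the strict one, which I believe
is true of current Encode releases. The helper name is_strict_utf8 is
hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Returns 1 if the byte string decodes as strict UTF-8, 0 otherwise.
sub is_strict_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        Encode::decode("UTF-8", $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my $shortest = "\x2F";        # "/" in its one-byte form
my $overlong = "\xC0\xAF";    # the same code point padded to two bytes

printf "shortest form: %s\n", is_strict_utf8($shortest) ? "accepted" : "rejected";
printf "overlong form: %s\n", is_strict_utf8($overlong) ? "accepted" : "rejected";
```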

-----Original Message-----
From: Jarkko Hietaniemi [mailto:jhi(_at_)iki(_dot_)fi]
Sent: Saturday, November 10, 2001 10:54 AM
To: perl-unicode(_at_)perl(_dot_)org
Cc: Markus Kuhn; linux-utf8(_at_)nl(_dot_)linux(_dot_)org
Subject: perlunitut - feedback appreciated


For the upcoming Perl 5.8.0 release I just recently wrote the
following little introductory text.  Any feedback appreciated.

      http://www.iki.fi/jhi/perlunitut.pod

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen