perl-unicode

Re: perlunitut - feedback appreciated

2001-11-11 14:24:14
On Sun, Nov 11, 2001 at 12:57:27PM -0800, Edward Cherlin wrote:
Thanks. The Perl implementors and you have done a very good job. I have a
few suggestions and one complaint.

The most important issue is chr().

Note that C<chr(...)> for arguments less than 0x100 (decimal 256) will
return an eight-bit character for backward compatibility with older
Perls (on ISO 8859-1 platforms it can be argued to be producing
Unicode even then, just not Unicode encoded in UTF-8 -- ISO 8859-1
is equivalent to the first 256 characters of Unicode).  For C<chr()>
arguments of 0x100 or more, Unicode will always be produced.
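
A tiny illustration of that rule (a sketch only; utf8::is_utf8() is
assumed to be available, which requires a fairly new perl):

    my $byte_char = chr(0xFF);    # below 0x100: an eight-bit (Latin-1) character
    my $uni_char  = chr(0x100);   # 0x100 and above: always Unicode, UTF-8 internally

    # Inspect the internal representation of each string.
    print utf8::is_utf8($byte_char) ? "utf8\n" : "bytes\n";   # prints "bytes"
    print utf8::is_utf8($uni_char)  ? "utf8\n" : "bytes\n";   # prints "utf8"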

My complaint: There should be a pure Unicode alternative to this kludge.

You mean chr() producing UTF-8?  There has been talk about uchr() or
the like.  Maybe I'll just implement it in some module.

Obviously, it is not hard to write one in Perl, but it should be part of the
implementation.
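
For instance, a minimal uchr() might look like this (the name and the
pack-based approach are just my sketch, nothing like this is in the core):

    sub uchr {
        my $code = shift;
        # pack "U" always yields a character string that is UTF-8
        # internally, even for code points below 0x100.
        return pack("U", $code);
    }

    my $y_diaeresis = uchr(0xFF);   # one character, stored as UTF-8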

ISO Latin-1 characters encoded as 80-FF in single bytes are not Unicode.
There is no Unicode transformation format or other encoding that permits
this. The code point range is actually x000080-x0000FF, and the encodings
(for U+0080 and U+00FF, respectively) are

0000000010000000  0000000011111111 UTF-16 Big Endian
1000000000000000  1111111100000000 UTF-16 Little Endian
00000000000000000000000010000000  00000000000000000000000011111111 UCS-4 BE
10000000000000000000000000000000  11111111000000000000000000000000 UCS-4 LE
1100001010000000  1100001110111111 UTF-8

Okay.
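
For what it is worth, the UTF-8 byte pair can be demonstrated from perl
itself; a small sketch (utf8::encode() is assumed to exist, which is true
only on quite recent perls):

    my $char  = chr(0xFF);     # U+00FF, a single character
    my $bytes = $char;
    utf8::encode($bytes);      # convert in place to its UTF-8 byte sequence
    printf "%vX\n", $bytes;    # prints C3.BF, matching the table above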

Character ranges in regular expression character classes [a-z]
and in the tr///, aka y///, operator are not affected by Unicode.

This could mean that they extend gracefully to Unicode, for example
something like [\x{0300}-\x{03FF}], or that they cannot be used outside the
00-FF range (or would it be 00-7F?). Clarification is needed.

Hmmm.  They extend but they may not do what people are expecting them
to do: [a-z] will most certainly not mean "alphabetic characters".
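
A sketch of the difference, using the Unicode-aware property classes that
do mean "alphabetic":

    my $str   = "perl \x{3B1}\x{3B2}\x{3B3}";      # "perl" plus Greek alpha, beta, gamma
    my @ascii = $str =~ /([a-z])/g;                 # p e r l -- the Greek letters do not match
    my @alpha = $str =~ /(\p{IsAlpha})/g;           # all seven letters
    printf "[a-z]: %d  \\p{IsAlpha}: %d\n",
           scalar @ascii, scalar @alpha;            # 4 and 7

    (my $upper = $str) =~ tr/a-z/A-Z/;  # tr ranges are code point ranges, too:
                                        # only the ASCII letters get uppercased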

Unicode is a standard that defines a unique number for every character.

Unique: Some characters are encoded in Unicode twice. Examples include
A-ring, also encoded as the Angstrom symbol, and a number of
full-width/half-width variants from Japanese standards.

Argh.  This has been the most contested point of the document :-)
My take is that too many buts, ifs, and furthermores muddle the
message.
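
For the record, the A-ring/Angstrom pair is easy to poke at (a sketch;
the Unicode::Normalize module is assumed to be installed):

    use Unicode::Normalize qw(NFC);

    my $a_ring   = "\x{00C5}";    # LATIN CAPITAL LETTER A WITH RING ABOVE
    my $angstrom = "\x{212B}";    # ANGSTROM SIGN
    print $a_ring eq $angstrom      ? "same\n" : "different\n";  # different code points
    print NFC($angstrom) eq $a_ring ? "same\n" : "different\n";  # same after normalization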

Number: Please say "code point" rather than number.

http://www.unicode.org/unicode/standard/WhatIsUnicode.html
 
Every character: Unicode and ISO/IEC 10646 are coordinated standards that
provide code points for the characters in almost all modern character set
standards, covering more than 30 writing systems and hundreds of languages,
including all commercially important modern languages. All characters in the
largest Chinese, Japanese, and Korean dictionaries are also encoded. The
standards will eventually cover almost all characters in more than 250
writing systems and thousands of languages, but will not include proprietary
characters, personal-use characters, and some others.

Nice chunk of text.  Can I borrow?  Though the 'proprietary characters'
part is a bit debatable.  What is a proprietary character?  Is, say,
HP's roman-8 proprietary?  All its characters are in Unicode (AFAIK).

Note that no platform today (Java, Unix, Mac, Windoze) includes rendering
capability for all of the writing systems defined in Unicode, even where
appropriate fonts are available. The greatest deficits are in Armenian,
Georgian, Ethiopic, and writing systems of Asia, including India, Tibet,
Mongolia, Sri Lanka, Burma, and Cambodia.

Hmmm.  I probably have to mention something about the display of
Unicode but I'd rather keep it short and just refer to nice URLs.

Since Unicode 3.1 Unicode characters have been defined all the way
up to 21 bits...

Unicode 1.0 began as a 16-bit character set, defining code points in the
range 0000-FFFF. ISO/IEC 10646 defines its corresponding region
00000000-0000FFFF as the Basic Multilingual Plane (Plane 0). Since Unicode
2.0, the Unicode code space has been defined to be 000000-10FFFF, adding 16
more planes. This is often described as needing slightly more than 20 bits. A set of
language tag characters was defined in Plane 14. Their use is highly
deprecated.

In Unicode 3.1 characters were defined in Planes 1 and 2, and there are
plans for Plane 3, at least, to be populated in Unicode 4.0. ISO plans to
vote soon to restrict 10646 to the corresponding range, 00000000-0010FFFF.

Uhhh, that's quite an information overload for an introductory
document.  Remember, this is not intended as a comprehensive retelling
of the Unicode FAQ, just the bare essentials needed to start learning more.
But saying a bit more about the history of Unicode is probably a good
idea.

Some mention should be made of surrogates. They do not appear in UTF-8, but
many people are unclear on this point. They are also not characters.

In the latest version (the copy at http://www.iki.fi/jhi/perlunitut.pod is
constantly updated) I mention surrogates, but I just point to
perlunicode (the actual reference).
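
If it helps, the relationship between surrogates and the real code points
is plain arithmetic; a small sketch (no decoding APIs involved):

    # Surrogates are UTF-16 bookkeeping, not characters: the pair
    # D800 DC00 stands for U+10000, which UTF-8 writes directly as
    # the four bytes F0 90 80 80 -- encoded surrogates never appear.
    my ($hi, $lo) = (0xD800, 0xDC00);
    my $cp = 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
    printf "U+%04X\n", $cp;    # prints U+10000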

Mention should be made of the rule requiring the use of shortest-length
UTF-8 representations. Violations of this rule constitute a security hazard
in communications. I hope that Perl observes this rule.

Yes, we have a regression test in our test suite based on the relevant
tests from Markus Kuhn.  Perl generates only shortest-length UTF-8, and
non-shortest-form UTF-8 will generate a warning.
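
To make the security angle concrete, here is what a non-shortest form
looks like (pure arithmetic, a sketch only):

    # '/' is U+002F; its only legal UTF-8 form is the single byte 0x2F.
    # The two-byte sequence 0xC0 0xAF is an overlong form of the same
    # code point, which a sloppy decoder would happily accept:
    my ($b1, $b2) = (0xC0, 0xAF);
    my $cp = (($b1 & 0x1F) << 6) | ($b2 & 0x3F);
    printf "C0 AF -> U+%04X (%c)\n", $cp, $cp;    # U+002F, i.e. "/"
    # A byte-level filter scanning for "/" never sees 0x2F in the
    # overlong form, which is why decoders must reject or warn about it.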

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
