[edward(_at_)webforhumans(_dot_)com: RE: perlunitut - feedback appreciated]

----- Forwarded message from Edward Cherlin <edward(_at_)webforhumans(_dot_)com> -----

Subject: RE: perlunitut - feedback appreciated
From: Edward Cherlin <edward(_at_)webforhumans(_dot_)com>
Date: Sun, 11 Nov 2001 23:54:17 -0800
Message-id: <004701c16b4f$3b7caf20$1e00a8c0(_at_)mcp>
To: "'Jarkko Hietaniemi'" <jhi(_at_)iki(_dot_)fi>
Cc: linux-utf8(_at_)nl(_dot_)linux(_dot_)org
In-reply-to: <20011111232402(_dot_)I602(_at_)alpha(_dot_)hut(_dot_)fi>
Importance: Normal

I am unable to post to perl-unicode(_at_)perl(_dot_)org. Please forward.

-----Original Message-----
From: Jarkko Hietaniemi [mailto:jhi(_at_)iki(_dot_)fi]
Sent: Sunday, November 11, 2001 1:24 PM
To: Edward Cherlin
Cc: perl-unicode(_at_)perl(_dot_)org; linux-utf8(_at_)nl(_dot_)linux(_dot_)org

On Sun, Nov 11, 2001 at 12:57:27PM -0800, Edward Cherlin wrote:

Thanks. The Perl implementors and you have done a very good job. I have a
few suggestions and one complaint.

The most important issue is chr().

Note that C<chr(...)> for arguments less than 0x100 (decimal 256) will
return an eight-bit character for backward compatibility with older
Perls (in ISO 8859-1 platforms it can be argued to be producing
Unicode even then, just not Unicode encoded in UTF-8 -- the ISO 8859-1
is equivalent to the first 256 characters of Unicode).  For C<chr()>
arguments of 0x100 or more, Unicode will always be produced.


My complaint: There should be a pure Unicode alternative to this kludge.

You mean chr() producing UTF-8?  There has been talk about uchr() or
the like.  Maybe I'll just implement it in some module.


Good. Thanks.

Character ranges in regular expression character classes [a-z]
and in the tr///, aka y///, operator are not affected by Unicode.


This could mean that they extend gracefully to Unicode, for example
something like [\x{0300}-\x{03FF}], or that they cannot be used outside the
00-FF range (or would it be 00-7F?). Clarification is needed.


Hmmm.  They extend but they may not do what people are expecting them
to do: [a-z] will most certainly not mean "alphabetic characters".


Definitely. They will have to include characters in Latin 1, Latin Extended
A, Latin Extended B, at least.
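
The point about ranges can be tried out in a few lines. This is only a
sketch, written in Python rather than Perl so that it is easy to run; the
variable names are made up for the illustration, and Perl's own behaviour
should be checked against perlunicode. It shows that a range in a character
class is a range of code points, so [a-z] does not mean "alphabetic", while
a range above 0xFF such as 0300-03FF still works as a plain code point range.

```python
import re

# A range inside a character class is a range of code points, nothing more.
ascii_lower = re.compile(r"[a-z]")
# Hypothetical name; the range is the 0300-03FF example from the message.
marks_and_greek = re.compile("[\u0300-\u03FF]")

e_acute = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE (Latin-1)
greek_alpha = "\u03B1"  # GREEK SMALL LETTER ALPHA

in_ascii_range = ascii_lower.fullmatch("e") is not None
accented_in_ascii_range = ascii_lower.fullmatch(e_acute) is not None
accented_is_alphabetic = e_acute.isalpha()
alpha_in_wide_range = marks_and_greek.fullmatch(greek_alpha) is not None

print(in_ascii_range, accented_in_ascii_range,
      accented_is_alphabetic, alpha_in_wide_range)
```

The accented letter is alphabetic by its Unicode properties but falls
outside [a-z], which is the mismatch discussed above.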

Unicode is a standard that defines a unique number for every character.


Just say: Unicode is a character set standard with plans to cover all of the
writing systems of the world, plus many other symbols.

Unique: Some characters are encoded in Unicode twice. Examples include
A-ring, also encoded as the Angstrom symbol, and a number of
full-width/half-width variants from Japanese standards.


Argh.  This has been the most contested point of the document :-)
My take is that too many buts, ifs, and furthermores muddle the
message.
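
The A-ring and Angstrom case mentioned above can be checked directly. The
following is a small sketch in Python (used only because it is easy to run;
the variable names are illustrative). It relies on the standard unicodedata
module and shows that the two code points are distinct but become identical
under canonical normalization, and that a full-width letter folds to its
ordinary form under compatibility normalization.

```python
import unicodedata

a_ring = "\u00C5"       # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"     # ANGSTROM SIGN
fullwidth_a = "\uFF21"  # FULLWIDTH LATIN CAPITAL LETTER A

# Two different code points for what readers see as the same character.
distinct_code_points = a_ring != angstrom

# Canonical normalization (NFC) maps the Angstrom sign onto A-ring.
same_after_nfc = unicodedata.normalize("NFC", angstrom) == a_ring

# Compatibility normalization (NFKC) folds the full-width variant to "A".
folded_fullwidth = unicodedata.normalize("NFKC", fullwidth_a)

print(distinct_code_points, same_after_nfc, folded_fullwidth)
```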

Number: Please say "code point" rather than number.


http://www.unicode.org/unicode/standard/WhatIsUnicode.html

Every character: Unicode and ISO/IEC 10646 are coordinated standards that
provide code points for the characters in almost all modern character set
standards, covering more than 30 writing systems and hundreds of languages,
including all commercially important modern languages. All characters in the
largest Chinese, Japanese, and Korean dictionaries are also encoded. The
standards will eventually cover almost all characters in more than 250
writing systems and thousands of languages, but will not include proprietary
characters, personal-use characters, and some others.


Nice chunk of text.  Can I borrow?


Certainly.

Though the 'proprietary characters' part is a bit debatable.  What is a
proprietary character?  Is, say, HP's roman-8 proprietary?  All its
characters are in Unicode (AFAIK).


The Apple Open-Apple character is proprietary. Roman-8 is just an
arrangement of pre-existing characters.

Note that no platform today (Java, Unix, Mac, Windoze) includes rendering
capability for all of the writing systems defined in Unicode, even where
appropriate fonts are available. The greatest deficits are in Armenian,
Georgian, Ethiopic, and writing systems of Asia, including India, Tibet,
Mongolia, Sri Lanka, Burma, and Cambodia.


Hmmm.  I probably have to mention something about the display of
Unicode but I'd rather keep it short and just refer to nice URLs.


I don't know of one. Maybe I should do that.

Since Unicode 3.1 Unicode characters have been defined all the way
up to 21 bits...


Just say: Since Unicode 2.0, Unicode characters have been defined up to 21
bits.

Unicode 1.0 began as a 16-bit character set, defining code points in the
range 0000-FFFF. ISO/IEC 10646 defines its corresponding region
00000000-0000FFFF as the Basic Multilingual Plane (Plane 0). Since Unicode
2.0, the Unicode code space has been defined to be 000000-10FFFF, adding 16
more planes. This is often described as a 20.5 bit encoding. A set of
language tag characters was defined in Plane 14. Their use is highly
deprecated.

In Unicode 3.1 characters were defined in Planes 1 and 2, and there are
plans for Plane 3, at least, to be populated in Unicode 4.0. ISO plans to
vote soon to restrict 10646 to the corresponding range, 00000000-0010FFFF.

Uhhh, that's quite an information overload for an introductory
document.  Remember, this is not intended as a comprehensive retelling
of the Unicode FAQ, just the bare essentials to start learning more.
But saying a bit more about the history of Unicode is probably a good
idea.
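
The arithmetic behind the code space described above (000000-10FFFF, Plane
0 plus 16 more) is short enough to sketch. This is an illustration in
Python, not part of perlunitut; plane_of and the constant names are
hypothetical helpers introduced here.

```python
MAX_CODE_POINT = 0x10FFFF  # top of the Unicode code space
PLANE_SIZE = 0x10000       # 65536 code points per plane


def plane_of(code_point):
    """Return the plane number (0-16) that holds the given code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16


plane_count = (MAX_CODE_POINT + 1) // PLANE_SIZE  # Plane 0 plus 16 more
bits_needed = MAX_CODE_POINT.bit_length()         # width of the largest value

print(plane_count, bits_needed)
print(plane_of(0x0041), plane_of(0x1D11E), plane_of(0x20000), plane_of(0xE0001))
```

The last line prints the planes for an ASCII letter, a Plane 1 character, a
Plane 2 character, and the first language tag character in Plane 14.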

Some mention should be made of surrogates. They do not appear in UTF-8, but
many people are unclear on this point. They are also not characters.


In the latest version (the http://www.iki.fi/jhi/perlunitut.pod is
constantly updated) I mention surrogates, but I just point to
perlunicode (the actual reference).
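
For readers unclear on the surrogate point, here is a brief sketch in
Python (chosen only for ease of running; to_surrogates is a hypothetical
helper written for this illustration). It shows that a character outside
Plane 0 is a surrogate pair in UTF-16 but a single four-byte sequence in
UTF-8, and that a lone surrogate is rejected by a strict UTF-8 encoder.

```python
def to_surrogates(code_point):
    """Split a code point above FFFF into its UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


g_clef = chr(0x1D11E)                  # a Plane 1 character
utf16 = g_clef.encode("utf-16-be")     # two 16-bit surrogate code units
utf8 = g_clef.encode("utf-8")          # one four-byte sequence, no surrogates
high, low = to_surrogates(0x1D11E)

# A surrogate on its own is not a character and cannot be encoded as UTF-8.
try:
    "\ud800".encode("utf-8")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True

print(utf16.hex(), utf8.hex(), hex(high), hex(low), lone_surrogate_rejected)
```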

Mention should be made of the rule requiring the use of shortest-length
UTF-8 representations. Violations of this rule constitute a security hazard
in communications. I hope that Perl observes this rule.


Yes, we have a regression test in our test suite that uses Markus
Kuhn's appropriate tests.  Perl generates only shortest-length, and
non-shortest UTF-8 will generate a warning.


Excellent.
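
The shortest-length rule can be illustrated with the classic overlong
encoding of "/". This is a sketch in Python using its built-in strict UTF-8
decoder; it does not show how Perl itself reports the problem, and
is_valid_utf8 is a hypothetical helper name.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode under strict UTF-8 rules."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


shortest_slash = b"\x2F"          # the only legal encoding of U+002F
overlong_two = b"\xC0\xAF"        # same bits padded out to two bytes
overlong_three = b"\xE0\x80\xAF"  # same bits padded out to three bytes

print(is_valid_utf8(shortest_slash),
      is_valid_utf8(overlong_two),
      is_valid_utf8(overlong_three))
```

A decoder that accepted the overlong forms would let a "/" slip past any
check that only looks for the byte 2F, which is the security hazard
mentioned above.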

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen


----- End forwarded message -----

