perl-unicode

[edward(_at_)webforhumans(_dot_)com: RE: perlunitut - feedback appreciated]

2001-11-12 06:55:25
----- Forwarded message from Edward Cherlin <edward(_at_)webforhumans(_dot_)com> -----

Subject: RE: perlunitut - feedback appreciated
From: Edward Cherlin <edward(_at_)webforhumans(_dot_)com>
Date: Sun, 11 Nov 2001 23:54:17 -0800
Message-id: <004701c16b4f$3b7caf20$1e00a8c0(_at_)mcp>
To: "'Jarkko Hietaniemi'" <jhi(_at_)iki(_dot_)fi>
Cc: linux-utf8(_at_)nl(_dot_)linux(_dot_)org
In-reply-to: <20011111232402(_dot_)I602(_at_)alpha(_dot_)hut(_dot_)fi>
Importance: Normal

I am unable to post to perl-unicode(_at_)perl(_dot_)org. Please forward.

-----Original Message-----
From: Jarkko Hietaniemi [mailto:jhi(_at_)iki(_dot_)fi]
Sent: Sunday, November 11, 2001 1:24 PM
To: Edward Cherlin
Cc: perl-unicode(_at_)perl(_dot_)org; linux-utf8(_at_)nl(_dot_)linux(_dot_)org

On Sun, Nov 11, 2001 at 12:57:27PM -0800, Edward Cherlin wrote:
Thanks. The Perl implementors and you have done a very good job. I have
a few suggestions and one complaint.

The most important issue is chr().

Note that C<chr(...)> for arguments less than 0x100 (decimal 256) will
return an eight-bit character for backward compatibility with older
Perls (in ISO 8859-1 platforms it can be argued to be producing
Unicode even then, just not Unicode encoded in UTF-8 -- the ISO 8859-1
is equivalent to the first 256 characters of Unicode). For C<chr()>
arguments of 0x100 or more, Unicode will always be produced.

My complaint: There should be a pure Unicode alternative to this kludge.

You mean chr() producing UTF-8?  There has been talk about uchr() or
the like.  Maybe I'll just implement it in some module.

Good. Thanks.
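
For illustration, a minimal Perl sketch of the chr() behavior quoted
above. It assumes Perl 5.8 or later. The name uchr_sketch is
hypothetical (no such function is confirmed in this thread), and
pack("U", ...) is only one possible way to get the effect discussed.

```perl
use strict;
use warnings;

# Hypothetical helper: always return a UTF-8 flagged string,
# even for code points below 0x100. Not an existing Perl builtin.
sub uchr_sketch {
    my ($code_point) = @_;
    return pack("U", $code_point);
}

my $low    = chr(0xE9);          # below 0x100: stored as one byte
my $high   = chr(0x100);         # 0x100 or more: stored as UTF-8
my $forced = uchr_sketch(0xE9);  # same character as $low, UTF-8 flagged

for my $item ([ 'chr(0xE9)', $low ], [ 'chr(0x100)', $high ],
              [ 'uchr_sketch(0xE9)', $forced ]) {
    my ($label, $string) = @$item;
    printf "%-18s length %d, UTF-8 flag %s\n", $label, length($string),
        utf8::is_utf8($string) ? "on" : "off";
}

# The two representations still compare equal as character strings.
print "equal: ", ($low eq $forced ? "yes" : "no"), "\n";
```

The internal flag differs between the first and third string, but both
hold the same single character.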


Character ranges in regular expression character classes [a-z]
and in the tr///, aka y///, operator are not affected by Unicode.

This could mean that they extend gracefully to Unicode, for example
something like [\x{0300}-\x{03FF}], or that they cannot be used outside
the 00-FF range (or would it be 00-7F?). Clarification is needed.

Hmmm.  They extend but they may not do what people are expecting them
to do: [a-z] will most certainly not mean "alphabetic characters".

Definitely. They will have to include characters in Latin 1, Latin Extended
A, Latin Extended B, at least.
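
A short sketch of both points, assuming Perl 5.8 or later on an ASCII
platform. The sample characters and variable names are chosen only for
illustration, and \p{L} is shown as one possible way to ask for "any
letter", not as what perlunitut recommends.

```perl
use strict;
use warnings;

sub yn { return $_[0] ? "yes" : "no" }

my $alpha   = "\x{03B1}";   # GREEK SMALL LETTER ALPHA
my $e_acute = "\x{00E9}";   # LATIN SMALL LETTER E WITH ACUTE
my $a_macr  = "\x{0101}";   # LATIN SMALL LETTER A WITH MACRON

# A range above 0xFF behaves as a plain code point interval.
my $in_range = ($alpha =~ /^[\x{0300}-\x{03FF}]$/);

# [a-z] covers only the 26 unaccented lowercase letters.
my $e_in_az = ($e_acute =~ /^[a-z]$/);
my $a_in_az = ($a_macr  =~ /^[a-z]$/);

# A Unicode property class matches letters from any script.
my $a_is_letter = ($a_macr =~ /^\p{L}$/);

print 'alpha in [\x{0300}-\x{03FF}]: ', yn($in_range), "\n";
print 'e-acute in [a-z]:  ', yn($e_in_az), "\n";
print 'a-macron in [a-z]: ', yn($a_in_az), "\n";
print 'a-macron is \p{L}: ', yn($a_is_letter), "\n";
```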

Unicode is a standard that defines a unique number for every character.

Just say: Unicode is a character set standard with plans to cover all of the
writing systems of the world, plus many other symbols.

Unique: Some characters are encoded in Unicode twice. Examples include
A-ring, also encoded as the Angstrom symbol, and a number of
full-width/half-width variants from Japanese standards.
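
The duplicates can be seen through normalization, which this message
does not mention; the sketch below brings it in only as an illustration.
It assumes the Unicode::Normalize module is installed (it ships with
Perl 5.8 and later). The variable names are hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

my $angstrom = "\x{212B}";   # ANGSTROM SIGN
my $a_ring   = "\x{00C5}";   # LATIN CAPITAL LETTER A WITH RING ABOVE
my $half_ka  = "\x{FF76}";   # HALFWIDTH KATAKANA LETTER KA
my $ka       = "\x{30AB}";   # KATAKANA LETTER KA

# The two code points are distinct as raw strings...
my $raw_equal = ($angstrom eq $a_ring);

# ...but canonical normalization folds the Angstrom sign into A-ring.
my $nfc_equal = (NFC($angstrom) eq $a_ring);

# Half-width forms are compatibility variants: NFC keeps them apart,
# NFKC folds them together.
my $ka_nfc_equal  = (NFC($half_ka)  eq $ka);
my $ka_nfkc_equal = (NFKC($half_ka) eq $ka);

printf "raw:  %s\nNFC:  %s\n", $raw_equal ? "same" : "different",
    $nfc_equal ? "same" : "different";
printf "half-width KA under NFC: %s, under NFKC: %s\n",
    $ka_nfc_equal ? "same" : "different",
    $ka_nfkc_equal ? "same" : "different";
```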

Argh.  This has been the most contested point of the document :-)
My take is that too many buts, ifs, and furthermores muddle the
message.

Number: Please say "code point" rather than number.

http://www.unicode.org/unicode/standard/WhatIsUnicode.html

Every character: Unicode and ISO/IEC 10646 are coordinated standards
that provide code points for the characters in almost all modern
character set standards, covering more than 30 writing systems and
hundreds of languages, including all commercially important modern
languages. All characters in the largest Chinese, Japanese, and Korean
dictionaries are also encoded. The standards will eventually cover
almost all characters in more than 250 writing systems and thousands
of languages, but will not include proprietary characters,
personal-use characters, and some others.

Nice chunk of text.  Can I borrow?

Certainly.


Though the 'proprietary characters' part is a bit debatable.  What is a
proprietary character?  Is, say, HP's roman-8 proprietary?  All its
characters are in Unicode (AFAIK).

The Apple Open-Apple character is proprietary. Roman-8 is just an
arrangement of pre-existing characters.


Note that no platform today (Java, Unix, Mac, Windoze) includes
rendering capability for all of the writing systems defined in Unicode,
even where appropriate fonts are available. The greatest deficits are
in Armenian, Georgian, Ethiopic, and writing systems of Asia, including
India, Tibet, Mongolia, Sri Lanka, Burma, and Cambodia.

Hmmm.  I probably have to mention something about the display of
Unicode but I'd rather keep it short and just refer to nice URLs.

I don't know of one. Maybe I should do that.


Since Unicode 3.1 Unicode characters have been defined all the way
up to 21 bits...

Just say: Since Unicode 2.0, Unicode characters have been defined up to 21
bits.

Unicode 1.0 began as a 16-bit character set, defining code points in
the range 0000-FFFF. ISO/IEC 10646 defines its corresponding region
00000000-0000FFFF as the Basic Multilingual Plane (Plane 0). Since
Unicode 2.0, the Unicode code space has been defined to be
000000-10FFFF, adding 16 more planes. This is often described as a
20.5 bit encoding. A set of language tag characters was defined in
Plane 14. Their use is highly deprecated.

In Unicode 3.1 characters were defined in Planes 1 and 2, and there are
plans for Plane 3, at least, to be populated in Unicode 4.0. ISO plans
to vote soon to restrict 10646 to the corresponding range,
00000000-0010FFFF.
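
The plane arithmetic above can be checked with a few lines of Perl.
This is only a sketch; plane_of is a hypothetical helper, and the
sample code points were picked for illustration.

```perl
use strict;
use warnings;

my $max_code_point = 0x10FFFF;
my $code_points    = $max_code_point + 1;      # size of the code space
my $planes         = $code_points / 0x10000;   # 65536 code points per plane

# Smallest number of bits that can hold the largest code point.
my $bits = 0;
$bits++ while (1 << $bits) <= $max_code_point;

# Hypothetical helper: the plane is everything above the low 16 bits.
sub plane_of { return $_[0] >> 16 }

printf "code points: %d, planes: %d, bits needed: %d\n",
    $code_points, $planes, $bits;
printf "log2 of code space size: %.3f\n", log($code_points) / log(2);

for my $cp (0x0041, 0x1D11E, 0x20000, 0xE0001) {
    printf "U+%04X is in Plane %d\n", $cp, plane_of($cp);
}
```

The base-2 logarithm of the code space size is just over 20, which is
why the space needs 21 bits even though it uses only a fraction of the
21-bit range.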

Uhhh, that's quite an information overload for an introductory
document.  Remember, this is not intended as a comprehensive retelling
of the Unicode FAQ, just the bare essentials to start learning more.
But saying a bit more about the history of Unicode is probably a good
idea.

Some mention should be made of surrogates. They do not appear in UTF-8,
but many people are unclear on this point. They are also not characters.

In the latest version (the http://www.iki.fi/jhi/perlunitut.pod is
constantly updated) I mention surrogates, but I just point to
perlunicode (the actual reference).
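
To make the surrogate point concrete, here is a small sketch that
assumes the Encode module (bundled with Perl 5.8 and later). The sample
character and the variable names are chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $code_point = 0x1D11E;            # MUSICAL SYMBOL G CLEF, in Plane 1
my $char       = chr($code_point);

# UTF-16 splits a supplementary code point into two surrogate code units.
my $offset = $code_point - 0x10000;
my $high   = 0xD800 + ($offset >> 10);
my $low    = 0xDC00 + ($offset & 0x3FF);
my $pair   = sprintf "%04x%04x", $high, $low;

# Compare the hand-computed pair with what Encode produces.
my $utf16_hex = unpack "H*", encode("UTF-16BE", $char);

# UTF-8 encodes the code point directly as four bytes, with no surrogates.
my $utf8_hex = unpack "H*", encode("UTF-8", $char);

print "computed surrogate pair: $pair\n";
print "UTF-16BE bytes:          $utf16_hex\n";
print "UTF-8 bytes:             $utf8_hex\n";
```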


Mention should be made of the rule requiring the use of shortest-length
UTF-8 representations. Violations of this rule constitute a security
hazard in communications. I hope that Perl observes this rule.

Yes, we have a regression test in our test suite that uses Markus
Kuhn's appropriate tests.  Perl generates only shortest-length, and
non-shortest UTF-8 will generate a warning.

Excellent.
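
As an illustration of why the rule matters, the sketch below rejects a
non-shortest encoding of "/". It assumes the Encode module with its
strict "UTF-8" encoding, which is not necessarily the mechanism the
regression test mentioned above uses. The helper is_strict_utf8 is
hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the bytes are well-formed, shortest-form UTF-8.
sub is_strict_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() may modify its argument when a check is set
    return eval { decode("UTF-8", $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my $slash    = "\x2F";           # '/' in its one-byte, shortest form
my $overlong = "\xC0\xAF";       # the same code point padded to two bytes
my $euro     = "\xE2\x82\xAC";   # U+20AC, a valid three-byte sequence

printf "shortest '/': %s\n", is_strict_utf8($slash)    ? "accepted" : "rejected";
printf "overlong '/': %s\n", is_strict_utf8($overlong) ? "accepted" : "rejected";
printf "euro sign:    %s\n", is_strict_utf8($euro)     ? "accepted" : "rejected";
```

A filter that scans raw bytes for "/" would miss the overlong form, so
a decoder that accepted it could let a path separator slip past the
check.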


--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen

----- End forwarded message -----

