perl-unicode

Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-12 13:50:51
On Wed, 12 Mar 2008 Juerd Waalboer wrote
Chris Hall skribis 2008-03-12 13:20 (+0000):
....
     String literals are represented by UCS code points.  Which
     reinforces the feeling that characters in Perl are Unicode.

Yes!

OK.  For the avoidance of doubt:

  a. are you saying that characters in Perl are Unicode ?

  b. or are you agreeing that characters in Perl take values
     0..0x7FFF_FFFF (or beyond), which are generally interpreted as
     UCS, where required and possible ?

If (a) then characters with ordinals beyond 0x10_FFFF should throw warnings (at least) since they clearly are not Unicode !

....[in the context of U+D800..U+DFFF]
            "Isolated surrogate code units have no interpretation on
             their own."
(...)
           Clearly these are illegal in UTF-8.

They have no interpretation, but this also doesn't say they're illegal.

The Unicode Standard defines the set of 'Unicode scalar values' which consists of U+0000..U+D7FF and U+E000..U+10_FFFF. All Unicode encodings, including UTF-8, encode only the 'Unicode scalar values'.
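To make that definition concrete, here is a minimal sketch in Python (not Perl, so it says nothing about what Perl itself does). The helper names `is_unicode_scalar_value` and `strict_utf8_accepts` are made up for illustration. The second helper simply asks Python's own strict UTF-8 codec, which happens to follow the same rule.

```python
def is_unicode_scalar_value(n: int) -> bool:
    """True if n is in U+0000..U+D7FF or U+E000..U+10FFFF."""
    return 0 <= n <= 0xD7FF or 0xE000 <= n <= 0x10FFFF


def strict_utf8_accepts(n: int) -> bool:
    """Ask Python's strict UTF-8 codec whether it will encode ordinal n."""
    try:
        chr(n).encode("utf-8")
        return True
    except (ValueError, OverflowError):
        # chr() rejects ordinals above 0x10FFFF; the codec rejects
        # surrogates (UnicodeEncodeError is a subclass of ValueError).
        return False


for n in (0x41, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0xFFFF, 0x10FFFF, 0x110000):
    print(f"{n:#08x} scalar={is_unicode_scalar_value(n)} "
          f"utf8={strict_utf8_accepts(n)}")
```

Note that U+FFFF passes both checks here: it is a noncharacter, but it is still a Unicode scalar value.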

The code points U+D800..U+DFFF exist, but do "not contain any character assignments". Given that no Unicode encoding form can represent these code points, it's unclear how one would ever end up with one of these things on one's hands !

....[in the context of U+FFFE, U+FFFF etc.]
            "Applications are free to use any of these noncharacter code
             points internally but should never attempt to exchange
             them."

I think it's not Perl's job to prevent exchange. Simply because the
exchange could be internal, but between processes of the same program.

Well, 'UTF-8' (strict) is jumping all over U+FFFF (at least). The warnings thrown by chr() and "\x{h...h}" suggest that Perl feels that exchanging these values ain't kosher.

I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
friends) in the same way as U+FFFF (and friends).

My gut says it's out of ignorance of the "rules", and certainly not an
intentional deviation.

Well... I'm running some more tests on UTF-8 to see what it thinks is illegal.

.....................................
>The result is Unicode.
IMHO the result of chr(n) should just be a character.

We call that a Unicode character in Perl. It is true that Perl allows
ordinal values outside the currently existing range, but it is still
called Unicode by Perl's documentation.

OK.  This is the hair which I am splitting.

IMHO the things in strings, and the things that chr() and ord() return or process, should be plain characters (ordinal U_INT) -- so that these are general purpose. Only when it's necessary to attach meaning to the characters in a string should Perl treat them as Unicode code points -- I accept that this is most of the time (but not *all* the time).

FWIW I note that printf "%vX" is suggested as a means to render IPv6
addresses.  This implies the use of a string containing eight characters
0..0xFFFF as the packed form of IPv6.  Building one of those using
chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF !

Interesting point.
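The data shape behind the IPv6 point can be sketched as follows. This is Python rather than Perl, so it shows only the idea of a packed string of eight ordinals in 0..0xFFFF rendered as hex; it does not reproduce Perl's "%vX" or its warnings. The helper `render_v` and the sample address (taken from the 2001:db8::/32 documentation prefix) are made up for illustration.

```python
def render_v(s: str, sep: str = ":") -> str:
    """Join the ordinal of each character in s, in upper-case hex."""
    return sep.join(f"{ord(c):X}" for c in s)


# Eight 16-bit groups, deliberately including 0xFFFE and 0xFFFF.
groups = [0x2001, 0xDB8, 0, 0, 0, 0xFFFE, 0xFFFF, 1]

# One character per group: the "packed form" is an 8-character string.
packed = "".join(chr(g) for g in groups)

print(len(packed))
print(render_v(packed))
```

Any 16-bit group may legitimately be 0xFFFE or 0xFFFF here, which is why a warning keyed to those ordinals is unhelpful when the string is not being treated as text.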

What's more, the Unicode standard suggests various *internal* uses for U+FFFE and U+FFFF (and friends), including, but not limited to, terminators and separators. This will also generate spurious warnings from chr() or "\x{...}" !
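For reference, the "and friends" are the 66 noncharacter code points: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. A minimal sketch in Python (not Perl), with the hypothetical helper name `is_noncharacter`:

```python
def is_noncharacter(n: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+xFFFE / U+xFFFF in planes 0..16."""
    if not 0 <= n <= 0x10FFFF:
        return False
    # (n & 0xFFFE) == 0xFFFE holds exactly when the low 16 bits
    # are 0xFFFE or 0xFFFF, whatever the plane.
    return 0xFDD0 <= n <= 0xFDEF or (n & 0xFFFE) == 0xFFFE


noncharacters = [n for n in range(0x110000) if is_noncharacter(n)]
print(len(noncharacters))  # 32 + 2 * 17 = 66
```

Every one of these is still a Unicode scalar value, so a check for noncharacters is a separate question from whether a value can be encoded at all.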

Chris
--
Chris Hall               highwayman.com
