Re: utf8::valid and \x14_000

On Wed, 12 Mar 2008 Juerd Waalboer wrote

Chris Hall skribis 2008-03-12 13:20 (+0000):

....

     String literals are represented by UCS code points.  Which
     reinforces the feeling that characters in Perl are Unicode.

Yes!


OK.  For the avoidance of doubt:

  a. are you saying that characters in Perl are Unicode ?

  b. or are you agreeing that characters in Perl take values
     0..0x7FFF_FFFF (or beyond), which are generally interpreted as
     UCS, where required and possible ?

If (a) then characters with ordinals beyond 0x10_FFFF should throw
warnings (at least) since they clearly are not Unicode !


....[in the context of U+D800..U+DFFF]

            "Isolated surrogate code units have no interpretation on
             their own."
(...)
           Clearly these are illegal in UTF-8.

They have no interpretation, but this also doesn't say it's illegal.

The Unicode Standard defines the set of 'Unicode scalar values' which
consists of U+0000..U+D7FF and U+E000..U+10_FFFF.  All Unicode
encodings, including UTF-8, encode only the 'Unicode scalar values'.
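The two ranges above can be written as a simple range test. What follows is a minimal sketch only; the function name is_unicode_scalar_value is made up for illustration and is not part of Perl or of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $ord lies in U+0000..U+D7FF or
# U+E000..U+10FFFF, i.e. it is a Unicode scalar value.
sub is_unicode_scalar_value {
    my ($ord) = @_;
    return 0 if $ord < 0 || $ord > 0x10FFFF;        # outside the code space
    return 0 if $ord >= 0xD800 && $ord <= 0xDFFF;   # surrogate code points
    return 1;
}

# Report on a few boundary values.
for my $ord (0x41, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0xFFFF, 0x10FFFF, 0x110000) {
    printf "U+%04X: %s\n", $ord,
        is_unicode_scalar_value($ord) ? 'scalar value' : 'not a scalar value';
}
```

Note that noncharacters such as U+FFFF pass this test: they are scalar values, just not assigned characters.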

The code points U+D800..U+DFFF exist, but do "not contain any character
assignments".  Given that no Unicode encoding exists that allows these
code points, it's unclear how one would ever end up with one of these
things on its hands !


....[in the context of U+FFFE, U+FFFF etc.]

            "Applications are free to use any of these noncharacter code
             points internally but should never attempt to exchange
             them."

I think it's not Perl's job to prevent exchange. Simply because the
exchange could be internal, but between processes of the same program.

Well, UTF-8 is jumping all over U+FFFF (at least).  The warnings thrown
by chr() and "\x{h...h}" suggest that Perl feels that exchanging these
values ain't kosher.

I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
friends) in the same way as U+FFFF (and friends).

My gut says it's out of ignorance of the "rules", and certainly not an
intentional deviation.

Well... I'm running some more tests on UTF-8 to see what it thinks is
illegal.
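A probe of that kind might look like the sketch below. It is not the test actually run here. It assumes the Encode module, takes 'UTF-8' to be the strict encoding, and uses Encode::FB_CROAK so that a rejected code point dies. The helper name strict_utf8_accepts is made up for illustration. The script only reports what the installed Encode does, and the results may differ between Encode versions.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if the strict 'UTF-8' encoding will
# encode chr($ord), 0 if Encode dies on it under FB_CROAK.
sub strict_utf8_accepts {
    my ($ord) = @_;
    no warnings 'utf8';          # keep chr() quiet on odd ordinals
    my $char = chr $ord;         # work on a copy; encode() may modify it
    my $ok = eval { Encode::encode('UTF-8', $char, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

# A surrogate, some noncharacters, and the two ends of the code space.
for my $ord (0x41, 0xD800, 0xFDD0, 0xFFFE, 0xFFFF, 0x1FFFE, 0x10FFFF, 0x110000) {
    printf "U+%04X %s\n", $ord,
        strict_utf8_accepts($ord) ? 'accepted' : 'rejected';
}
```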


.....................................

>The result is Unicode.
IMHO the result of chr(n) should just be a character.

We call that a unicode character in Perl. It is true that Perl allows
ordinal values outside the currently existing range, but it is still
called unicode by Perl's documentation.


OK.  This is the hair which I am splitting.

IMHO the things in strings and the things that chr() and ord() return or
process should be plain characters (ordinal U_INT) -- so that these are
general purpose.  Only when it's necessary to attach meaning to the
characters in a string, should Perl treat them as Unicode code points --
I accept that this is most of the time (but not *all* the time).

FWIW I note that printf "%vX" is suggested as a means to render IPv6
addresses.  This implies the use of a string containing eight characters
0..0xFFFF as the packed form of IPv6.  Building one of those using
chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF !
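For concreteness, here is a minimal sketch of that packed form. It assumes the '%*vX' form of the vector flag described in the sprintf documentation, with ':' supplied as the join string. The helper names pack_ipv6_groups and format_ipv6 are made up for illustration. Whether chr() warns on 0xFFFE and 0xFFFF depends on the Perl version, so the sketch switches the 'utf8' warning category off around it.

```perl
use strict;
use warnings;

# Hypothetical helper: pack eight 16-bit groups into an eight-character
# string, one character per group.
sub pack_ipv6_groups {
    my @groups = @_;
    die "expected eight groups\n" unless @groups == 8;
    no warnings 'utf8';    # some perls warn from chr() on 0xFFFE/0xFFFF
    return join '', map { chr($_) } @groups;
}

# Hypothetical helper: render the packed string as hex ordinals joined
# by ':' using the vector flag.
sub format_ipv6 {
    my ($packed) = @_;
    return sprintf '%*vX', ':', $packed;
}

print format_ipv6(pack_ipv6_groups(0x2001, 0xDB8, 0, 0, 0, 0, 0xFFFE, 0xFFFF)), "\n";
```

Only the formatted text is printed, never the packed string itself, so no output-layer warnings come into play.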

Interesting point.

What's more, the Unicode standard suggests various *internal* uses for
U+FFFE and U+FFFF (and friends), including, but not limited to,
terminators and separators.  This will also generate spurious warnings
from chr() or "\x{...}" !


Chris
--
Chris Hall               highwayman.com
