perl-unicode

Re: utf8::valid and \x14_000 - \x1F_0000

2008-03-11 11:49:38
On Tue, 11 Mar 2008 you wrote
Chris Hall skribis 2008-03-11 13:30 (+0000):
I suggest utf8::valid() is broken.
    my $s = chr($c) ;
    my $v = utf8::valid($s) ? 1 : 0 ;

Agreed. utf8::valid(chr $foo) should ALWAYS return true. (Please note
that utf8::valid tests the internal consistency of a string - on the
outside, it has little to do with UTF8.)
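
For anyone wanting to repeat the test, here is a minimal sketch of the check quoted above, wrapped in a loop. The helper name valid_after_chr and the choice of sample values are assumptions for illustration; no results are claimed, since they depend on the perl version in use.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): build a one-character
# string with chr() and ask utf8::valid() about it.
sub valid_after_chr {
    my ($c) = @_;
    no warnings;    # chr() may warn for some values on some perls
    my $s = chr($c);
    return utf8::valid($s) ? 1 : 0;
}

# Sample values only: a plain ASCII character, U+FFFD, the UCS maximum,
# and a few values beyond it.
for my $c (0x41, 0xFFFD, 0x10_FFFF, 0x11_0000, 0x14_0000, 0x1F_0000) {
    printf "chr(0x%X): utf8::valid = %d\n", $c, valid_after_chr($c);
}
```

If utf8::valid() only checks internal consistency, every line should report 1.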

I'm comfortable with the notion that perl characters are unsigned
integers that overlap UCS, and happen to be held internally as a
superset of UTF-8.

I wonder if perl is completely comfortable.

chr(n) throws various runtime warnings where 'n' isn't kosher UCS, and
"\x{h...h}" throws the same ones at compile time.

Now there are HUGE areas of UCS code space that are essentially
meaningless.  There are VAST areas of perl character space that are way
beyond UCS.  I'm not sure I see the point of picking on a few values to
warn about.
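
To pin down what "not kosher UCS" can mean, here is a rough sketch that classifies an integer by the Unicode rules as I understand them. The function ucs_class is hypothetical, not a perl built-in, and it describes the standard's categories, not what chr() actually warns about. The U+FDD0..U+FDEF noncharacter range is not discussed in this message; it is included here from the Unicode standard.

```perl
use strict;
use warnings;

# Hypothetical helper: classify an integer against the UCS code space.
# 'ok' only means "not singled out below"; the code point may still be
# unassigned.
sub ucs_class {
    my ($n) = @_;
    return 'negative'     if $n < 0;
    return 'beyond UCS'   if $n > 0x10_FFFF;
    return 'surrogate'    if $n >= 0xD800 && $n <= 0xDFFF;
    # U+xxFFFE and U+xxFFFF are noncharacters in every plane.
    return 'noncharacter' if ($n & 0xFFFE) == 0xFFFE;
    # U+FDD0..U+FDEF are noncharacters too.
    return 'noncharacter' if $n >= 0xFDD0 && $n <= 0xFDEF;
    return 'ok';
}

for my $n (-1, 0xD800, 0xFFFD, 0xFFFE, 0xFFFF, 0x10_FFFF, 0x11_0000) {
    printf "%s: %s\n", ($n < 0 ? $n : sprintf("0x%X", $n)), ucs_class($n);
}
```

Everything above 0x10_FFFF lands in a single 'beyond UCS' bucket here, which is the point: UCS itself has nothing to say about individual values out there.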

In any case, is chr(n) supposed to be utf8 or UTF-8 ?  AFAIKS, it's
neither.

I have tried the following on 5.10.0 and 5.8.8, and where these differ I
have noted it:

  chr(-1)

     5.10.0: No warning, returned "\x{FFFD}"
     5.8.8:  Warning 'Unicode character 0xffffffffffffffff is illegal',
             returned "\x{FFFF_FFFF_FFFF_FFFF}"

     Neither of these seems very sensible.

     If chr(-1) doesn't exist, then undef looks like a reasonable
     return value -- returning "\x{FFFD}" makes chr(-1)
     indistinguishable from chr(0xFFFD) -- where the first is
     nonsense and the second is entirely proper.

  chr(0xD800) Warns 'UTF-16 surrogate 0xd800', returns "\x{D800}"

  chr(0xFFFD) No warning, returns "\x{FFFD}"

  chr(0xFFFE) Warns 'Unicode character 0xfffe is illegal',
              returns "\x{FFFE}",

     NB: both Encode::encode('UTF-8', "\x{FFFE}")
          and Encode::decode('UTF-8', "\xEF\xBF\xBE")

         are perfectly happy !  This appears inconsistent ?

     All the UCS planes appear to be treated like this.

  chr(0xFFFF) Warns 'Unicode character 0xffff is illegal',
              returns "\x{FFFF}",

     NB: both Encode::encode('UTF-8', "\x{FFFF}")
          and Encode::decode('UTF-8', "\xEF\xBF\xBF")

         consider this to be illegal, and replace it by "\x{FFFD}"

     All the UCS planes appear to be treated like this.

  chr(0x11_0000) No warning, returns "\x{11_0000}"

     This is now outside the UCS range, so I suppose we don't care
     that this is no more useful than chr(0xFFFE) ?

     Modern (RFC 3629 & Unicode Consortium) UTF-8 is defined to
     exclude sequences that exceed the (current) UCS maximum of
     U+10_FFFF.

  chr(0x14_0000) No warning, returns "\x{14_0000}"

     Modern UTF-8 (RFC 3629 & Unicode Consortium) is defined to
     exclude any sequence containing any of the bytes 0xC0, 0xC1
     or 0xF5-0xFF.  This is the first character whose encoding
     contains a byte in the range 0xF5-0xFF !

  chr(0xzzzz_FFFE) Warns 'Unicode character 0xzzzzfffe is illegal',
                   returns "\x{zzzz_FFFE}"
  chr(0xzzzz_FFFF) Warns 'Unicode character 0xzzzzffff is illegal',
                   returns "\x{zzzz_FFFF}"

     For all values of zzzz from 0x0011 onwards.

     Now, it's known that 0xFFFE and 0xFFFF are non-characters in all
     UCS planes...  but we're beyond UCS here ?

     [I confess this baffled me at first, because 0x7FFF_FFFF
      generates a warning, but 0x8000_0000 doesn't....  But that's
      another story.]

  chr(0x0020_0000) No warning, returns "\x{0020_0000}"

     This is the first character with an encoding > 4 bytes.

     Modern UTF-8 (RFC 3629 & Unicode Consortium) stops at 4 bytes.

  chr(0x8000_0000) No warning, returns "\x{8000_0000}"

     This is the first character with an encoding > 6 bytes.

     Actually, not even 'old-style' UTF-8 supported anything longer
     than the 6 byte form.  (Because bytes 0xFE and 0xFF were defined
     not to appear in a UTF-8 sequence -- to guarantee no confusion
     with UTF-16.)
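
The byte-level thresholds in the list above can be checked with plain arithmetic. This is only a sketch of the 'old-style' (RFC 2279 era) length rule and of the lead byte of a 4-byte sequence; the helper names utf8_len and lead_byte_4 are made up for the purpose, and nothing here inspects what perl really stores internally.

```perl
use strict;
use warnings;

# Hypothetical helper: byte count under the old-style UTF-8 scheme,
# which stops at 6 bytes. Returns undef when more than 6 would be needed.
sub utf8_len {
    my ($cp) = @_;
    return 1 if $cp < 0x80;
    return 2 if $cp < 0x800;
    return 3 if $cp < 0x1_0000;
    return 4 if $cp < 0x20_0000;
    return 5 if $cp < 0x400_0000;
    return 6 if $cp < 0x8000_0000;
    return undef;
}

# Hypothetical helper: lead byte of a 4-byte sequence, i.e. for code
# points 0x1_0000 .. 0x1F_FFFF (11110xxx, xxx = the top three bits).
sub lead_byte_4 {
    my ($cp) = @_;
    return 0xF0 | ($cp >> 18);
}

for my $cp (0x10_FFFF, 0x1F_FFFF, 0x20_0000, 0x7FFF_FFFF, 0x8000_0000) {
    my $len = utf8_len($cp);
    printf "0x%X: %s\n", $cp,
        defined $len ? "$len bytes" : "more than 6 bytes";
}

for my $cp (0x10_FFFF, 0x13_FFFF, 0x14_0000) {
    printf "0x%X: lead byte 0x%02X\n", $cp, lead_byte_4($cp);
}
```

The lead byte stays at 0xF4 up to 0x13_FFFF and becomes 0xF5 at 0x14_0000, and the length steps from 4 to 5 at 0x20_0000 and past 6 at 0x8000_0000, which matches the boundaries listed above.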

Compile time warnings for "\x{h...h}" appear to complain or not complain
about the same things.

Could you please report this bug with perlbug?

Done.

Chris
-- 
Chris Hall               highwayman.com            +44 7970 277 383
