perl-unicode

utf8::valid and \x14_0000 - \x1F_0000

2008-03-11 06:32:33

It appears that utf8::valid() and Encode::encode('utf8', ...)
do not agree for characters 0x14_0000 - 0x1F_0000.

I suggest utf8::valid() is broken.

The following:

  use strict ;

  use Encode qw(FB_QUIET LEAVE_SRC) ;

  printf "Perl v%vd & Encode %s\n", $^V, $Encode::VERSION ;

  my $c = 0xFFFF ;
  while ($c < 0x8000_0000) {
    my $s = chr($c) ;

    my $v = utf8::valid($s) ? 1 : 0 ;
    my $o = Encode::encode('utf8', $s, FB_QUIET() | LEAVE_SRC()) ;

    my $r = $o ? 1 : 0 ;

    if ($v != $r) {
      printf "0x%04X_%04X: utf8::valid=%d but Encode::encode=%d  ",
                                    ($c >> 16), $c & 0xFFFF, $v, $r ;
      # Turn the UTF8 flag off so that split sees the internal bytes.
      Encode::_utf8_off($s) ;
      print map { sprintf '\x%02X', ord($_) } split(//, $s) ;
      print "\n" ;
    } ;

    # Visit only the first (XXXX_0000) and last (XXXX_FFFF) code point
    # of each 64K block.
    if ($c & 0xFFFF) { $c += 1 ; } else { $c += 0xFFFF ; } ;
  } ;

Produces:

  Perl v5.8.8 & Encode 2.23
  0x0014_0000: utf8::valid=0 but Encode::encode=1  \xF5\x80\x80\x80
  0x0014_FFFF: utf8::valid=0 but Encode::encode=1  \xF5\x8F\xBF\xBF
  0x0015_0000: utf8::valid=0 but Encode::encode=1  \xF5\x90\x80\x80
  0x0015_FFFF: utf8::valid=0 but Encode::encode=1  \xF5\x9F\xBF\xBF
  0x0016_0000: utf8::valid=0 but Encode::encode=1  \xF5\xA0\x80\x80
  0x0016_FFFF: utf8::valid=0 but Encode::encode=1  \xF5\xAF\xBF\xBF
  0x0017_0000: utf8::valid=0 but Encode::encode=1  \xF5\xB0\x80\x80
  0x0017_FFFF: utf8::valid=0 but Encode::encode=1  \xF5\xBF\xBF\xBF
  0x0018_0000: utf8::valid=0 but Encode::encode=1  \xF6\x80\x80\x80
  0x0018_FFFF: utf8::valid=0 but Encode::encode=1  \xF6\x8F\xBF\xBF
  0x0019_0000: utf8::valid=0 but Encode::encode=1  \xF6\x90\x80\x80
  0x0019_FFFF: utf8::valid=0 but Encode::encode=1  \xF6\x9F\xBF\xBF
  0x001A_0000: utf8::valid=0 but Encode::encode=1  \xF6\xA0\x80\x80
  0x001A_FFFF: utf8::valid=0 but Encode::encode=1  \xF6\xAF\xBF\xBF
  0x001B_0000: utf8::valid=0 but Encode::encode=1  \xF6\xB0\x80\x80
  0x001B_FFFF: utf8::valid=0 but Encode::encode=1  \xF6\xBF\xBF\xBF
  0x001C_0000: utf8::valid=0 but Encode::encode=1  \xF7\x80\x80\x80
  0x001C_FFFF: utf8::valid=0 but Encode::encode=1  \xF7\x8F\xBF\xBF
  0x001D_0000: utf8::valid=0 but Encode::encode=1  \xF7\x90\x80\x80
  0x001D_FFFF: utf8::valid=0 but Encode::encode=1  \xF7\x9F\xBF\xBF
  0x001E_0000: utf8::valid=0 but Encode::encode=1  \xF7\xA0\x80\x80
  0x001E_FFFF: utf8::valid=0 but Encode::encode=1  \xF7\xAF\xBF\xBF
  0x001F_0000: utf8::valid=0 but Encode::encode=1  \xF7\xB0\x80\x80
  0x001F_FFFF: utf8::valid=0 but Encode::encode=1  \xF7\xBF\xBF\xBF

Perl v5.10.0 & Encode 2.23 produce the same output.
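
For reference, the byte sequences in the output can be reproduced by hand with
the usual four-byte pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. The sketch
below is only an illustration: four_byte_form() and four_byte_hex() are
hypothetical helpers written for this note, not part of utf8:: or Encode, and
it assumes a code point in 0x1_0000 .. 0x1F_FFFF.

```perl
use strict ;
use warnings ;

# Hand-encode a code point in 0x1_0000 .. 0x1F_FFFF using the four-byte
# pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
# (Hypothetical helper for illustration; not a utf8:: or Encode function.)
sub four_byte_form {
  my ($c) = @_ ;
  die sprintf("0x%X is outside 0x1_0000 .. 0x1F_FFFF\n", $c)
    if $c < 0x1_0000 || $c > 0x1F_FFFF ;
  return ( 0xF0 |  ($c >> 18),            # lead byte: top 3 bits
           0x80 | (($c >> 12) & 0x3F),    # next 6 bits
           0x80 | (($c >>  6) & 0x3F),    # next 6 bits
           0x80 |  ($c        & 0x3F) ) ; # low 6 bits
}

# Render the bytes in the same \xNN style as the report above.
sub four_byte_hex {
  return join '', map { sprintf '\x%02X', $_ } four_byte_form(@_) ;
}

for my $c (0x14_0000, 0x14_FFFF, 0x1F_0000, 0x1F_FFFF) {
  printf "0x%04X_%04X: %s\n", ($c >> 16), $c & 0xFFFF, four_byte_hex($c) ;
}
```

For every code point listed in the output the lead byte comes out as \xF5,
\xF6 or \xF7, which matches the bytes printed by the test script.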
-- 
Chris Hall               highwayman.com
