perl-unicode

Perl 5.8.0 utf8/regex bug

2004-02-21 16:30:05
I'm not sure which list this belongs on: it relates to Perl's Unicode support, but it also looks like a bug of sorts...

Below are the test script that reproduces the problem, the perl -V report from the machine that exhibits it, and the script's output on that machine. After that come the perl -V report and the script's output from another Linux machine that does not exhibit the problem. (That machine now has Encode 1.99 but was previously running an older version of Encode, 1.75 if I remember correctly, and the older version did not exhibit the problem either.)

------------- test script -------------------

#!/usr/bin/perl

use strict;
use warnings;

require Encode;

my $text = 'a';
Encode::_utf8_on($text);

print "Perl version = $]; Encode version = $Encode::VERSION\n";

print "character info: string = [$text]; length = ".length($text)."\n";
{
  use bytes;
  print "byte info: string = [$text]; length = ".length($text)."\n";
}

print "** UTF8 flag on **\n";
tests($text);

Encode::_utf8_off($text);
print "** UTF8 flag off **\n";
tests($text);

sub tests {
  my ($text) = @_;
  my @tests = (
      qr/^([^\s]*?)$/,
      qr/([^\s]*)?/,
      qr/([^\s]*)/,
      qr/([^\s]?)/,
      qr/([^\s])?/,
      qr/([^\s]+?)/,
      qr/([^\s]+)?/,
      qr/([^\s]+)/,
      qr/([^\s]{1,})/,
      qr/([^\s])/
  );

  my $i = 1;
  foreach (@tests) {
    printf "test %02d: expression: %20s; ", $i++, $_;
    if ($text =~ m/$_/) {
      print "\$1 = [$1]; length of \$1 = ".length($1)."\n";
    } else {
      print "didn't match\n";
    }
  }
}


------------- perl -V from problem machine -------------------

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.4.21-1.1931.2.382.entsmp, archname=i386-linux-thread-multi
    uname='linux stripples.devel.redhat.com 2.4.21-1.1931.2.382.entsmp #1 smp wed aug 6 17:18:52 edt 2003 i686 i686 i386 gnulinux '
    config_args='-des -Doptimize=-O2 -g -pipe -march=i386 -mcpu=i686 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Dotherlibdirs=/usr/lib/perl5/5.8.0 -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O2 -g -pipe -march=i386 -mcpu=i686',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='3.2.2 20030222 (Red Hat Linux 3.2.2-5)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lpthread -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lpthread -lc -lcrypt -lutil
    libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'


Characteristics of this binary (from libperl):
  Compile-time options: DEBUGGING MULTIPLICITY USE_ITHREADS USE_LARGE_FILES PERL_IMPLICIT_CONTEXT
  Locally applied patches:
        MAINT18379
  Built under linux
  Compiled at Aug 13 2003 11:47:58
  @INC:
    /usr/lib/perl5/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/5.8.0
    /usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.0
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.0
    /usr/lib/perl5/vendor_perl
    /usr/lib/perl5/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/5.8.0
    .


------------- Test script results from problem machine -------------------

Perl version = 5.008; Encode version = 1.83
character info: string = [a]; length = 1
byte info: string = [a]; length = 1
** UTF8 flag on **
test 01: expression: (?-xism:^([^\s]*?)$); didn't match
test 02: expression:   (?-xism:([^\s]*)?); $1 = []; length of $1 = 0
test 03: expression:    (?-xism:([^\s]*)); $1 = []; length of $1 = 0
test 04: expression:    (?-xism:([^\s]?)); $1 = []; length of $1 = 0
test 05: expression:    (?-xism:([^\s])?); $1 = []; length of $1 = 0
test 06: expression:   (?-xism:([^\s]+?)); didn't match
test 07: expression:   (?-xism:([^\s]+)?); $1 = []; length of $1 = 0
test 08: expression:    (?-xism:([^\s]+)); didn't match
test 09: expression: (?-xism:([^\s]{1,})); didn't match
test 10: expression:     (?-xism:([^\s])); $1 = [a]; length of $1 = 1
** UTF8 flag off **
test 01: expression: (?-xism:^([^\s]*?)$); $1 = [a]; length of $1 = 1
test 02: expression:   (?-xism:([^\s]*)?); $1 = [a]; length of $1 = 1
test 03: expression:    (?-xism:([^\s]*)); $1 = [a]; length of $1 = 1
test 04: expression:    (?-xism:([^\s]?)); $1 = [a]; length of $1 = 1
test 05: expression:    (?-xism:([^\s])?); $1 = [a]; length of $1 = 1
test 06: expression:   (?-xism:([^\s]+?)); $1 = [a]; length of $1 = 1
test 07: expression:   (?-xism:([^\s]+)?); $1 = [a]; length of $1 = 1
test 08: expression:    (?-xism:([^\s]+)); $1 = [a]; length of $1 = 1
test 09: expression: (?-xism:([^\s]{1,})); $1 = [a]; length of $1 = 1
test 10: expression:     (?-xism:([^\s])); $1 = [a]; length of $1 = 1


(The "problem" machine is running RedHat 8; I also reproduced this on a RedHat 9 machine running Perl 5.8.0 with the same version of Encode. Both of those Perl builds had the MAINT18379 patch applied, and I'm beginning to think that has something to do with this...)


------------- perl -V from working machine -------------------

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.4.20-ci, archname=i686-linux
    uname='linux rd.propagation.net 2.4.20-ci #12 smp fri dec 13 22:52:24 cst 2002 i686 unknown '
    config_args='-Umyalloc -des'
    hint=previous, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -I/usr/local/include -I/usr/include/gdbm -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm'
    ccversion='', gccversion='2.96 20000731 (Red Hat Linux 7.2 2.96-112.7.2)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lc -lcrypt -lutil
    libc=/lib/libc-2.2.4.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.2.4'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic -Wl,-rpath,/usr/local/lib/perl5/5.8.0/i686-linux/CORE'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'


Characteristics of this binary (from libperl):
  Compile-time options: USE_LARGE_FILES
  Built under linux
  Compiled at Jul  5 2003 10:09:25
  @INC:
    /usr/local/lib/perl5/5.8.0/i686-linux
    /usr/local/lib/perl5/5.8.0
    /usr/local/lib/perl5/site_perl/5.8.0/i686-linux
    /usr/local/lib/perl5/site_perl/5.8.0
    /usr/local/lib/perl5/site_perl
    .


------------- Test script results from working machine -------------------

Perl version = 5.008; Encode version = 1.99
character info: string = [a]; length = 1
byte info: string = [a]; length = 1
** UTF8 flag on **
test 01: expression: (?-xism:^([^\s]*?)$); $1 = [a]; length of $1 = 1
test 02: expression:   (?-xism:([^\s]*)?); $1 = [a]; length of $1 = 1
test 03: expression:    (?-xism:([^\s]*)); $1 = [a]; length of $1 = 1
test 04: expression:    (?-xism:([^\s]?)); $1 = [a]; length of $1 = 1
test 05: expression:    (?-xism:([^\s])?); $1 = [a]; length of $1 = 1
test 06: expression:   (?-xism:([^\s]+?)); $1 = [a]; length of $1 = 1
test 07: expression:   (?-xism:([^\s]+)?); $1 = [a]; length of $1 = 1
test 08: expression:    (?-xism:([^\s]+)); $1 = [a]; length of $1 = 1
test 09: expression: (?-xism:([^\s]{1,})); $1 = [a]; length of $1 = 1
test 10: expression:     (?-xism:([^\s])); $1 = [a]; length of $1 = 1
** UTF8 flag off **
test 01: expression: (?-xism:^([^\s]*?)$); $1 = [a]; length of $1 = 1
test 02: expression:   (?-xism:([^\s]*)?); $1 = [a]; length of $1 = 1
test 03: expression:    (?-xism:([^\s]*)); $1 = [a]; length of $1 = 1
test 04: expression:    (?-xism:([^\s]?)); $1 = [a]; length of $1 = 1
test 05: expression:    (?-xism:([^\s])?); $1 = [a]; length of $1 = 1
test 06: expression:   (?-xism:([^\s]+?)); $1 = [a]; length of $1 = 1
test 07: expression:   (?-xism:([^\s]+)?); $1 = [a]; length of $1 = 1
test 08: expression:    (?-xism:([^\s]+)); $1 = [a]; length of $1 = 1
test 09: expression: (?-xism:([^\s]{1,})); $1 = [a]; length of $1 = 1
test 10: expression:     (?-xism:([^\s])); $1 = [a]; length of $1 = 1

(working machine is running RedHat 7.2)


I also ran the same script on Perl 5.8.3 (on Mac OS X 10.3.2), and the results matched those from the working machine.

I was originally using Encode::decode to produce my utf8-encoded string and thought that was the problem, so I replaced it with a simple 7-bit string and just flipped on the utf8 flag. I was shocked to see the same results.
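
For anyone who wants to confirm that flipping the flag on a 7-bit string changes nothing but the flag itself, here is a minimal sketch. The helper name flag_report is hypothetical (it is not part of Encode); it only uses Encode::_utf8_on, Encode::_utf8_off and Encode::is_utf8.

```perl
#!/usr/bin/perl
use strict;
use warnings;

require Encode;

# Hypothetical helper: report the UTF8 flag, the character length, and the
# underlying bytes (as hex) of a string, without modifying the caller's copy.
sub flag_report {
    my ($s) = @_;
    my $raw = $s;                 # work on a copy
    Encode::_utf8_off($raw);      # view the internal buffer as plain bytes
    return sprintf "flag=%d chars=%d bytes=%s",
        (Encode::is_utf8($s) ? 1 : 0), length($s), unpack('H*', $raw);
}

my $plain   = 'a';
my $flagged = 'a';
Encode::_utf8_on($flagged);       # same flag flip as in the test script

print "plain:   ", flag_report($plain),   "\n";
print "flagged: ", flag_report($flagged), "\n";
print "eq:      ", ($plain eq $flagged ? "yes" : "no"), "\n";
```

If the two reports differ only in the flag field, the regex engine is being handed the same single byte in both cases, so any difference in matching comes from how the engine treats the flag.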

Can anyone explain the discrepancy between these two machines? I'm thinking it's a bug in Perl itself rather than in Encode, because I upgraded one of the machines exhibiting the problem to Encode 1.99 and the problem persists.

Ideally, I would like to find a solution that lets me use Encode and utf8 strings wherever they will NOT produce these oddball side effects, and not bother with utf8 strings on systems where they would be an issue.
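
One possible approach is a runtime probe that repeats two of the failing patterns against a flagged 7-bit string and falls back to raw bytes when they misbehave. This is only a sketch: utf8_regex_ok and maybe_decode are hypothetical helper names, not part of Encode, and I'm assuming that the patterns from the test script are representative of the failure.

```perl
#!/usr/bin/perl
use strict;
use warnings;

require Encode;

# Hypothetical helper: true when a flagged 7-bit string still matches two of
# the patterns that failed on the problem machine (tests 01 and 08).
sub utf8_regex_ok {
    my $probe = 'a';
    Encode::_utf8_on($probe);     # same flag flip as in the test script
    return 0 unless $probe =~ /^([^\s]*?)$/ && $1 eq 'a';
    return 0 unless $probe =~ /([^\s]+)/    && $1 eq 'a';
    return 1;
}

# Hypothetical helper: decode to a character string only when the probe
# passes; otherwise hand back the bytes unchanged.
sub maybe_decode {
    my ($bytes) = @_;
    return utf8_regex_ok() ? Encode::decode('utf8', $bytes) : $bytes;
}

my $text = maybe_decode("caf\xc3\xa9");
print "regex engine looks ", (utf8_regex_ok() ? "sane" : "broken"),
      "; length = ", length($text), "\n";
```

The probe runs on every call here to keep the sketch short; in real code its result could be computed once at startup and cached.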

-Brad


--
http://bradchoate.com/
