perl-unicode

perlunicode.pod mention of utf8::upgrade questionable

2002-03-21 02:07:35
Larry's recent favorite bug posting has yielded fruit, very nice
indeed, thanks. But now I read the recently edited paragraph from
perlunicode.pod:

    If the keys of a hash are "mixed", that is, some keys are Unicode,
    while some keys are "byte", the keys may behave differently in regular
    expressions since the definition of character classes like C</\w/>
    is different for byte strings and character strings.  This problem can
    sometimes be helped by using an appropriate locale (see L<perllocale>).
    Another way is to force all the strings to be character encoded by
    using utf8::upgrade() (see L<utf8>).

My headache starts with the last sentence. The whole truth would be:

    Another way is to force all the strings to be character encoded by
    using utf8::upgrade() WHENEVER YOU ARE GOING TO USE A REGULAR
    EXPRESSION WITH CHARACTER SEMANTICS.

Without the locale thingy, it is not enough to make sure that all
strings are upgraded to Unicode; you also need to make sure they are
*still* upgraded whenever you use a regular expression with a
character class.

Demonstration:

% /usr/local/perl-5.7.3@15380/bin/perl -e '
  my $u = "f\x{df}";
  require utf8;
  utf8::upgrade($u);
  my %u = ( $u => $u );            # might happen in a module too
  for (keys %u){
    my $m1 = /^\w*$/;
    my $m2 = $u{$_}=~/^\w*$/;
    print $m1==$m2 ? "ok\n" : "not ok\n";                
  }
'
not ok
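
The "not ok" is what you would expect if one of the two strings is no
longer character encoded by the time it is matched. Here is a minimal
sketch of that underlying difference, assuming a perl where the utf8::
functions are available without loading anything (on the snapshot above
you would need the "require utf8;" from the demonstration) and where
neither a locale nor the unicode_strings feature is in effect. The
variable names are made up for illustration:

```perl
use strict;
use warnings;

# Two copies of the same two characters, "f" followed by U+00DF.
my $bytes = "f\xdf";        # stored as native 8-bit bytes, UTF8 flag off
my $chars = $bytes;
utf8::upgrade($chars);      # same characters, stored as UTF-8, flag on

# The strings compare equal; only the internal representation differs.
print "equal:       ", ($bytes eq $chars ? "yes" : "no"), "\n";

# Without a locale or the unicode_strings feature, \w only matches U+00DF
# in the character-encoded copy.
print "bytes match: ", ($bytes =~ /^\w*$/ ? "yes" : "no"), "\n";
print "chars match: ", ($chars =~ /^\w*$/ ? "yes" : "no"), "\n";
```

Under those assumptions the two strings compare equal, yet only the
upgraded copy matches.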


So upgrading once is not enough. You need to upgrade again everywhere
you use a regular expression with character semantics:

% /usr/local/perl-5.7.3@15380/bin/perl -e '
  my $u = "f\x{df}";
  require utf8;
  utf8::upgrade($u);
  my %u = ( $u => $u );            # might happen in a module too
  for (keys %u){
    utf8::upgrade($_);             ####
    utf8::upgrade($u{$_});         ####  2 lines added
    my $m1 = /^\w*$/;            
    my $m2 = $u{$_}=~/^\w*$/;            
    print $m1==$m2 ? "ok\n" : "not ok\n";
  }
'
ok
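
The same "upgrade at the point of use" idea can be tucked into a small
wrapper, so the two added lines do not have to be repeated at every
match. This is only a sketch: is_all_word_chars() is a hypothetical
helper of my own naming, not something perl or the utf8 module provides,
and it assumes the utf8:: functions are available without an explicit
"require utf8;".

```perl
use strict;
use warnings;

# Hypothetical helper: upgrade a private copy of the string right before
# matching, so \w gets character semantics however the caller's string
# happens to be stored.
sub is_all_word_chars {
    my ($s) = @_;           # copy; the caller's scalar is left untouched
    utf8::upgrade($s);
    return $s =~ /^\w*$/ ? 1 : 0;
}

my $u = "f\x{df}";
utf8::upgrade($u);
my %u = ( $u => $u );       # might happen in a module too

for my $key (keys %u) {
    my $m1 = is_all_word_chars($key);
    my $m2 = is_all_word_chars($u{$key});
    print $m1 == $m2 ? "ok\n" : "not ok\n";
}
```

Because both the key and the value are upgraded inside the helper, this
should print "ok" just like the second run above. It hides the
workaround but does not remove it: every match still has to go through
the wrapper.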


I'm sure everybody will agree that this is not only unperlish, it is
unbearable and a step back behind 5.005_50. For that reason I would
suggest dropping the mention of utf8::upgrade here, maybe thusly:

--- pod/perlunicode.pod~        Thu Mar 21 08:15:43 2002
+++ pod/perlunicode.pod Thu Mar 21 09:59:33 2002
@@ -966,9 +966,7 @@
 while some keys are "byte", the keys may behave differently in regular
 expressions since the definition of character classes like C</\w/>
 is different for byte strings and character strings.  This problem can
-sometimes be helped by using an appropriate locale (see L<perllocale>).
-Another way is to force all the strings to be character encoded by
-using utf8::upgrade() (see L<utf8>).
+be helped by using a UTF-8 locale (see L<perllocale>).
 
 Some functions are slower when working on UTF-8 encoded strings than
 on byte encoded strings. All functions that need to hop over




Another possibility is, of course, that the demonstrated behaviour is
a vanilla bug and gets fixed before 5.8.0.  :-/



-- 
andreas
