Re: perlunicode.pod mention of utf8::upgrade questionable


Hello.

On Thu, 21 Mar 2002 10:07:09 +0100
andreas(_dot_)koenig(_at_)anima(_dot_)de (Andreas J. Koenig) wrote:

Larry's recent favorite bug posting has yielded fruit, very nice
indeed, thanks. But now I read the recently edited paragraph from
perlunicode.pod:

    If the keys of a hash are "mixed", that is, some keys are Unicode,
    while some keys are "byte", the keys may behave differently in regular
    expressions since the definition of character classes like C</\w/>
    is different for byte strings and character strings.  This problem can
    sometimes be helped by using an appropriate locale (see L<perllocale>).
    Another way is to force all the strings to be character encoded by
    using utf8::upgrade() (see L<utf8>).

My headache starts with the last sentence. The whole truth would be

    Another way is to force all the strings to be character encoded by
    using utf8::upgrade() WHENEVER YOU ARE GOING TO USE A REGULAR
    EXPRESSION WITH CHARACTER SEMANTICS.

Without the locale thingy, it will not suffice to make sure all
strings are upgraded to Unicode; you will also need to make sure they
are *still* upgraded whenever you use a regular expression with a
character class.
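
The underlying difference can be seen without any hash involved. The following is a minimal sketch, not part of the original report: the helper name all_word_chars is made up for illustration, and it assumes no locale and no other pragma that changes regular expression semantics is in effect. It matches the same two characters once as a byte string and once as an upgraded character string.

```perl
use strict;
use warnings;

# Hypothetical helper: does the whole string consist of \w characters?
sub all_word_chars {
    my ($s) = @_;
    return $s =~ /^\w*$/ ? 1 : 0;
}

my $bytes = "f\x{df}";      # byte string, UTF8 flag off
my $chars = $bytes;
utf8::upgrade($chars);      # same characters, UTF8 flag on

# The two strings compare equal, yet \w may treat them differently.
print "equal as strings: ", ($bytes eq $chars ? "yes" : "no"), "\n";
print "byte string:      ", all_word_chars($bytes), "\n";
print "character string: ", all_word_chars($chars), "\n";
```

If the last two lines print different values, the two strings are equal under eq but are not interchangeable as far as \w is concerned.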

Demonstration:

% /usr/local/perl-5.7.3@15380/bin/perl -e '
  my $u = "f\x{df}";
  require utf8;
  utf8::upgrade($u);
  my %u = ( $u => $u );            # might happen in a module too
  for (keys %u){
    my $m1 = /^\w*$/;
    my $m2 = $u{$_}=~/^\w*$/;
    print $m1==$m2 ? "ok\n" : "not ok\n";                
  }
'
not ok


Hmm, but a test like the following one says ok.

#!perl
  my $u = "f\x{df}";
  utf8::upgrade($u);
  my %u = ( $u => $u );    # might happen in a module too
  
  my $m1 = $u =~ /^\w*$/;
  my $m2 = $u{$u} =~ /^\w*$/;
  print $m1==$m2 ? "ok\n" : "not ok\n";

__END__


See, upgrading once is not enough, you need to upgrade everywhere you
use a regular expression with character semantics:

% /usr/local/perl-5.7.3@15380/bin/perl -e '
  my $u = "f\x{df}";
  require utf8;
  utf8::upgrade($u);
  my %u = ( $u => $u );            # might happen in a module too
  for (keys %u){
    utf8::upgrade($_);             ####
    utf8::upgrade($u{$_});         ####  2 lines added
    my $m1 = /^\w*$/;            
    my $m2 = $u{$_}=~/^\w*$/;            
    print $m1==$m2 ? "ok\n" : "not ok\n";
  }
'
ok
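
One way to avoid scattering utf8::upgrade() calls around is to do the upgrade inside a small wrapper, right where the match happens. This is only a sketch under my own assumptions: the name is_all_word is hypothetical, and the wrapper works on a private copy so the caller's scalar keeps whatever encoding it had.

```perl
use strict;
use warnings;

# Hypothetical helper: upgrade a private copy, then match.
sub is_all_word {
    my ($s) = @_;          # copy, so the caller's scalar is untouched
    utf8::upgrade($s);     # force character semantics for this match
    return $s =~ /^\w*$/ ? 1 : 0;
}

my $u = "f\x{df}";
my %u = ( $u => $u );      # never upgraded by the caller

for my $key (keys %u) {
    my $m1 = is_all_word($key);
    my $m2 = is_all_word($u{$key});
    print $m1 == $m2 ? "ok\n" : "not ok\n";
}
```

Since both the key and the value are upgraded just before matching, the two results no longer depend on how the hash happened to store the key.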


Hash keys seem to be stored in downgraded form...
In that case, only one added line is necessary, isn't it?

#!perl
   my $u = "f\x{df}";
   utf8::upgrade($u);
   my %u = ( $u => $u );
   for (keys %u){
     utf8::upgrade($_);
     my $m1 = /^\w*$/;
     my $m2 = $u{$_}=~/^\w*$/;
     print $m1==$m2 ? "ok\n" : "not ok\n";
  }

__END__
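
To check directly whether the keys come back downgraded, the UTF8 flag can be inspected. A rough sketch follows; it assumes a perl where utf8::is_utf8() is available (5.8.1 or later, as far as I know), and the helper name flag_state is made up. What it prints for the key depends on the perl version, so no expected output is given.

```perl
use strict;
use warnings;

# Hypothetical helper: report the internal UTF8 flag of a scalar.
sub flag_state {
    my ($s) = @_;
    return utf8::is_utf8($s) ? "upgraded" : "downgraded";
}

my $u = "f\x{df}";
utf8::upgrade($u);
my %u = ( $u => $u );

# Compare the flag on each key with the flag on its value.
for my $key (keys %u) {
    printf "key: %s, value: %s\n", flag_state($key), flag_state($u{$key});
}
```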

Nevertheless, we shouldn't distinguish Unicode-ness of hash keys;
otherwise we'd be upset more... :-)

#!perl
use charnames qw(:full);

my $alpha = "\N{GREEK SMALL LETTER ALPHA}";
   # "\x{3B1}" (decimal 945) = "\xCE\xB1" in UTF-8

my $latin =
  "\N{LATIN CAPITAL LETTER I WITH CIRCUMFLEX}\N{PLUS-MINUS SIGN}";
   # "\xCE\xB1" Bytes

my %hash;
$hash{$alpha} = "foo";
$hash{$latin} = "bar";

print $hash{$alpha} eq $hash{$latin} ? "not ok" : "ok";

# Perl 5.6.1 says "not ok",
# while Perl 5.7.3 says "ok".
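
The reason the two keys can be confused at all is that the UTF-8 encoding of alpha is the same octet sequence as the two Latin-1 characters. A small sketch to make that visible, using utf8::encode() on a copy (the variable names and the yn helper are mine):

```perl
use strict;
use warnings;

my $alpha = "\x{3B1}";            # GREEK SMALL LETTER ALPHA, one character
my $latin = "\x{CE}\x{B1}";       # two Latin-1 characters

my $alpha_bytes = $alpha;
utf8::encode($alpha_bytes);       # in place: characters -> UTF-8 octets

sub yn { $_[0] ? "yes" : "no" }

# As character strings the two differ; as octets they coincide.
print "same characters: ", yn($alpha eq $latin), "\n";
print "same octets once alpha is UTF-8 encoded: ", yn($alpha_bytes eq $latin), "\n";
```

So a hash that compares keys only by their stored octets, without looking at the UTF8 flag, cannot tell the two keys apart.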

I'm sure everybody will agree that this is not only unperlish, it is
unbearable and falls back behind 5.005_50. For that reason I would
suggest dropping the mention of utf8::upgrade here, maybe thusly:


\p{Word} seems to always behave as a Unicode-oriented \w.
Could it be a solution?

#!perl
  my $u = "f\x{df}";
  my %u = ( $u => $u );
  for (keys %u){
     my $m1 = /^\p{Word}*$/;
     my $m2 = $u{$_}=~/^\p{Word}*$/;
     print $m1 && $m2 ? "ok\n" : "not ok\n";
  }
  # naturally, we never want both $m1 and $m2 to be false.
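
For a side-by-side view, here is a sketch that runs both \w and \p{Word} over the downgraded and the upgraded form of the same string. The helper names word_w and word_p are hypothetical, and I am assuming a reasonably recent perl with no locale in effect; the \w column is the one that may differ between the two rows.

```perl
use strict;
use warnings;

# Hypothetical helpers: whole-string match with \w and with \p{Word}.
sub word_w { my ($s) = @_; return $s =~ /^\w*$/        ? 1 : 0 }
sub word_p { my ($s) = @_; return $s =~ /^\p{Word}*$/  ? 1 : 0 }

my $down = "f\x{df}";       # byte string
my $up   = $down;
utf8::upgrade($up);         # character string with the same content

for my $case ( [ downgraded => $down ], [ upgraded => $up ] ) {
    my ($label, $s) = @$case;
    printf "%-10s \\w: %d  \\p{Word}: %d\n", $label, word_w($s), word_p($s);
}
```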

--- pod/perlunicode.pod~      Thu Mar 21 08:15:43 2002
+++ pod/perlunicode.pod       Thu Mar 21 09:59:33 2002
@@ -966,9 +966,7 @@
 while some keys are "byte", the keys may behave differently in regular
 expressions since the definition of character classes like C</\w/>
 is different for byte strings and character strings.  This problem can
-sometimes be helped by using an appropriate locale (see L<perllocale>).
-Another way is to force all the strings to be character encoded by
-using utf8::upgrade() (see L<utf8>).
+be helped by using an UTF-8 locale (see L<perllocale>).
 
 Some functions are slower when working on UTF-8 encoded strings than
 on byte encoded strings. All functions that need to hop over




Another possibility is, of course, that the demonstrated behaviour is
a vanilla bug and gets fixed before 5.8.0.  :-/



-- 
andreas


Sincerely
SADAHIRO Tomoyuki