Larry's recent favorite bug posting has borne fruit; very nice
indeed, thanks. But now I have read the recently edited paragraph from
perlunicode.pod:
    If the keys of a hash are "mixed", that is, some keys are Unicode,
    while some keys are "byte", the keys may behave differently in regular
    expressions since the definition of character classes like C</\w/>
    is different for byte strings and character strings. This problem can
    sometimes be helped by using an appropriate locale (see L<perllocale>).
    Another way is to force all the strings to be character encoded by
    using utf8::upgrade() (see L<utf8>).
My headache starts with the last sentence. The whole truth would be:
    Another way is to force all the strings to be character encoded by
    using utf8::upgrade() WHENEVER YOU ARE GOING TO USE A REGULAR
    EXPRESSION WITH CHARACTER SEMANTICS.
Without the locale thingy, it will not suffice to make sure all
strings are upgraded to Unicode; you will also need to make sure they
are *still* upgraded whenever you use a regular expression with a
character class.
Demonstration:
% /usr/local/perl-5.7.3@15380/bin/perl -e '
  my $u = "f\x{df}";
  require utf8;
  utf8::upgrade($u);
  my %u = ( $u => $u ); # might happen in a module too
  for (keys %u){
    my $m1 = /^\w*$/;
    my $m2 = $u{$_}=~/^\w*$/;
    print $m1==$m2 ? "ok\n" : "not ok\n";
  }
'
not ok
See, upgrading once is not enough; you need to upgrade everywhere you
use a regular expression with character semantics:
% /usr/local/perl-5.7.3@15380/bin/perl -e '
  my $u = "f\x{df}";
  require utf8;
  utf8::upgrade($u);
  my %u = ( $u => $u ); # might happen in a module too
  for (keys %u){
    utf8::upgrade($_);     ####
    utf8::upgrade($u{$_}); #### 2 lines added
    my $m1 = /^\w*$/;
    my $m2 = $u{$_}=~/^\w*$/;
    print $m1==$m2 ? "ok\n" : "not ok\n";
  }
'
ok
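For anyone who wants to see which of the two strings is no longer
upgraded, here is a minimal sketch. It assumes a perl that provides
utf8::is_utf8() (later perls do; I have not checked whether the snapshot
used above does), and flag_report() is a made-up helper name, not part
of any module. What it prints depends on the perl version, so I give no
expected output.

```perl
use strict;
use warnings;
require utf8;

# Hypothetical helper: for each key/value pair of a hash, record whether
# perl currently holds the string in its upgraded (character)
# representation. Assumes utf8::is_utf8() is available.
sub flag_report {
    my ($hash) = @_;
    my @rows;
    for my $key (sort keys %$hash) {
        push @rows, {
            key_is_utf8   => utf8::is_utf8($key)          ? 1 : 0,
            value_is_utf8 => utf8::is_utf8($hash->{$key}) ? 1 : 0,
        };
    }
    return \@rows;
}

# Same setup as in the demonstration above.
my $u = "f\x{df}";
utf8::upgrade($u);
my %u = ( $u => $u );

for my $row (@{ flag_report(\%u) }) {
    print "key upgraded: $row->{key_is_utf8}, ",
          "value upgraded: $row->{value_is_utf8}\n";
}
```

If the key line shows 0 while the value line shows 1, the upgrade was
lost on the way through the hash key, which would explain why the two
matches disagree.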
I'm sure everybody will agree that this is not only unperlish but
unbearable, and a step back behind 5.005_50. For that reason I would
suggest dropping the mention of utf8::upgrade() here, maybe thusly:
--- pod/perlunicode.pod~ Thu Mar 21 08:15:43 2002
+++ pod/perlunicode.pod Thu Mar 21 09:59:33 2002
@@ -966,9 +966,7 @@
 while some keys are "byte", the keys may behave differently in regular
 expressions since the definition of character classes like C</\w/>
 is different for byte strings and character strings. This problem can
-sometimes be helped by using an appropriate locale (see L<perllocale>).
-Another way is to force all the strings to be character encoded by
-using utf8::upgrade() (see L<utf8>).
+be helped by using an UTF-8 locale (see L<perllocale>).
 
 Some functions are slower when working on UTF-8 encoded strings than
 on byte encoded strings. All functions that need to hop over
Another possibility is, of course, that the demonstrated behaviour is
a vanilla bug and gets fixed before 5.8.0. :-/
--
andreas