On Thu, Mar 21, 2002 at 10:07:09AM +0100, Andreas J. Koenig wrote:
Larry's recent favorite bug posting has yielded fruit, very nice
indeed, thanks. But now I read the recently edited paragraph from
perlunicode.pod:
If the keys of a hash are "mixed", that is, some keys are Unicode,
while some keys are "byte", the keys may behave differently in regular
expressions since the definition of character classes like C</\w/>
is different for byte strings and character strings. This problem can
sometimes be helped by using an appropriate locale (see L<perllocale>).
Another way is to force all the strings to be character encoded by
using utf8::upgrade() (see L<utf8>).
My headache starts with the last sentence. The whole truth would be
Another way is to force all the strings to be character encoded by
using utf8::upgrade() WHENEVER YOU ARE GOING TO USE A REGULAR
EXPRESSION WITH CHARACTER SEMANTICS.
Without the locale thingy, it will not suffice to make sure that all
strings are upgraded to Unicode; you will also need to make sure that
they are *still* upgraded whenever you use a regular expression with a
character class.
Demonstration:
% /usr/local/perl-5.7.3@15380/bin/perl -e '
my $u = "f\x{df}";
require utf8;
utf8::upgrade($u);
my %u = ( $u => $u ); # might happen in a module too
for (keys %u){
Yes, I meant utf8::upgrade here for the $_, since the keys are
the mixed ones.
The problem is that the keys get downgraded and stored as such;
once that's done there's no knowing that the key was originally
in UTF-8. I agree that it's very un-Perlish, but the problem is this:
%u = ();
$u{"\xFF"} = 42;
$a = "\xFF\x{100}";
chop $a;
$u{$a} = 43;
I think you agree that there should be only one key in the %u hash now,
that ord() of that key should be 0xFF, and that the corresponding value
should be 43. The current implementation achieves this by downgrading
the keys.
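The expectation above can be checked with a minimal sketch along these
lines. The variable names $s and @k and the output format are only
illustrative, not taken from any core test; $s stands in for $a to stay
clear of sort's special variable.

```perl
use strict;
use warnings;

my %u;
$u{"\xFF"} = 42;             # key given as a byte string

my $s = "\xFF\x{100}";       # U+0100 forces character (UTF-8) storage
chop $s;                     # leaves only U+00FF, still stored as UTF-8
$u{$s} = 43;                 # same character, so it must hit the same key

my @k = keys %u;
printf "keys=%d ord=0x%X value=%d\n", scalar(@k), ord($k[0]), $u{"\xFF"};
```

This should print keys=1 ord=0xFF value=43: the two spellings of the
key compare equal as strings, so the second store overwrites the first.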
my $m1 = /^\w*$/;
my $m2 = $u{$_}=~/^\w*$/;
print $m1==$m2 ? "ok\n" : "not ok\n";
}
'
not ok
See, upgrading once is not enough, you need to upgrade everywhere you
use a regular expression with character semantics:
% /usr/local/perl-5.7.3@15380/bin/perl -e '
my $u = "f\x{df}";
require utf8;
utf8::upgrade($u);
my %u = ( $u => $u ); # might happen in a module too
for (keys %u){
utf8::upgrade($_); ####
utf8::upgrade($u{$_}); #### 2 lines added
my $m1 = /^\w*$/;
my $m2 = $u{$_}=~/^\w*$/;
print $m1==$m2 ? "ok\n" : "not ok\n";
}
'
ok
I'm sure everybody will agree that this is not only unperlish, it is
unbearable and falls back behind 5.005_50. For that reason I would
suggest dropping the mention of utf8::upgrade here, maybe thusly:
--- pod/perlunicode.pod~ Thu Mar 21 08:15:43 2002
+++ pod/perlunicode.pod Thu Mar 21 09:59:33 2002
@@ -966,9 +966,7 @@
while some keys are "byte", the keys may behave differently in regular
expressions since the definition of character classes like C</\w/>
is different for byte strings and character strings. This problem can
-sometimes be helped by using an appropriate locale (see L<perllocale>).
-Another way is to force all the strings to be character encoded by
-using utf8::upgrade() (see L<utf8>).
+be helped by using an UTF-8 locale (see L<perllocale>).
Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings. All functions that need to hop over
Another possibility is, of course, that the demonstrated behaviour is
a vanilla bug and gets fixed before 5.8.0. :-/
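For reference, the difference the paragraph describes can be isolated
from hashes altogether. The following is only a sketch, assuming no
locale is in effect and nothing else switches the regex to character
semantics; the names $bytes and $chars are made up for illustration.

```perl
use strict;
use warnings;
require utf8;                # as in the demonstrations above

my $bytes = "f\xDF";         # byte string: U+00DF held as a single byte
my $chars = "f\xDF";
utf8::upgrade($chars);       # same characters, now stored as UTF-8

my $same    = ($bytes eq $chars)   ? 1 : 0;   # string equality ignores the encoding
my $m_bytes = ($bytes =~ /^\w*$/)  ? 1 : 0;   # \w under byte semantics
my $m_chars = ($chars =~ /^\w*$/)  ? 1 : 0;   # \w under character semantics

print "equal=$same bytes_match=$m_bytes chars_match=$m_chars\n";
```

The two strings compare equal, yet the two match results may differ
depending on the perl in use. That is exactly why a key that has been
silently downgraded can stop matching.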
--
andreas
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen