perl-unicode

Re: perlunicode.pod mention of utf8::upgrade questionable

2002-03-21 07:54:06
On Thu, Mar 21, 2002 at 10:07:09AM +0100, Andreas J. Koenig wrote:
Larry's recent favorite bug posting has yielded fruit, very nice
indeed, thanks. But now I read the recently edited paragraph from
perlunicode.pod:

    If the keys of a hash are "mixed", that is, some keys are Unicode,
    while some keys are "byte", the keys may behave differently in regular
    expressions since the definition of character classes like C</\w/>
    is different for byte strings and character strings.  This problem can
    sometimes be helped by using an appropriate locale (see L<perllocale>).
    Another way is to force all the strings to be character encoded by
    using utf8::upgrade() (see L<utf8>).

My headache starts with the last sentence. The whole truth would be

    Another way is to force all the strings to be character encoded by
    using utf8::upgrade() WHENEVER YOU ARE GOING TO USE A REGULAR
    EXPRESSION WITH CHARACTER SEMANTICS.

Without the locale thingy, it will not suffice to make sure that all
strings are upgraded to Unicode; you will also need to make sure they
are *still* upgraded whenever you use a regular expression with a
character class.

Demonstration:

% /usr/local/perl-5.7.3@15380/bin/perl -e '
  my $u = "f\x{df}";
  require utf8;
  utf8::upgrade($u);
  my %u = ( $u => $u );            # might happen in a module too
  for (keys %u){

Yes, I meant utf8::upgrade here for the $_, since the keys are
the mixed ones.

The problem is that the keys get downgraded and stored as such;
once that's done there's no knowing that the key was originally
in UTF-8.  I agree that it's very un-Perlish, but the problem is this:

        %u = ();
        $u{"\xFF"} = 42;
        $a = "\xFF\x{100}";
        chop $a;
        $u{$a} = 43;

I think you agree that there should be only one key in the %u hash now,
that ord() of that key should be 0xFF, and that the corresponding value
should be 43.  The current implementation achieves this by downgrading
the keys.
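
The snippet above can be turned into a standalone check of that
one-key expectation. This is only a minimal sketch: the variable $k
(used instead of $a, to stay clear of sort's special variable) and the
printf line are my additions, not part of the original example.

```perl
use strict;
use warnings;

my %u;
$u{"\xFF"} = 42;              # key given as a plain byte string

my $k = "\xFF\x{100}";        # U+0100 forces the character (UTF-8) representation
chop $k;                      # drop U+0100; $k is "\xFF" again, still stored as UTF-8
$u{$k} = 43;                  # should overwrite the entry above, not add a second one

my @keys = keys %u;
printf "keys=%d ord=0x%X value=%d\n", scalar(@keys), ord($keys[0]), $u{"\xFF"};
```

I would expect this to print "keys=1 ord=0xFF value=43", that is, both
spellings of the key land in the same hash slot.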

    my $m1 = /^\w*$/;
    my $m2 = $u{$_}=~/^\w*$/;
    print $m1==$m2 ? "ok\n" : "not ok\n";                
  }
'
not ok


See, upgrading once is not enough, you need to upgrade everywhere you
use a regular expression with character semantics:

% /usr/local/perl-5.7.3@15380/bin/perl -e '
  my $u = "f\x{df}";
  require utf8;
  utf8::upgrade($u);
  my %u = ( $u => $u );            # might happen in a module too
  for (keys %u){
    utf8::upgrade($_);             ####
    utf8::upgrade($u{$_});         ####  2 lines added
    my $m1 = /^\w*$/;            
    my $m2 = $u{$_}=~/^\w*$/;            
    print $m1==$m2 ? "ok\n" : "not ok\n";
  }
'
ok
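
The byte-versus-character difference that both runs depend on can be
isolated without a hash. A minimal sketch, assuming no locale is in
effect and neither the /u modifier nor the unicode_strings feature is
enabled; the variable names are illustrative and not taken from the
demonstration:

```perl
use strict;
use warnings;

my $bytes = "f\xDF";          # stays in the byte representation
my $chars = "f\xDF";
utf8::upgrade($chars);        # same characters, character (UTF-8) representation

my $same    = $bytes eq $chars  ? 1 : 0;   # string equality ignores the representation
my $m_bytes = $bytes =~ /^\w*$/ ? 1 : 0;   # byte semantics: \xDF is not \w
my $m_chars = $chars =~ /^\w*$/ ? 1 : 0;   # character semantics: U+00DF is \w

print "same=$same bytes=$m_bytes chars=$m_chars\n";
```

Under those assumptions this should print "same=1 bytes=0 chars=1": the
two strings compare equal, yet /\w/ treats them differently, which is
why a single upgrade that later gets undone is not enough.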


I'm sure everybody will agree that this is not only unperlish, it is
unbearable and falls back behind 5.005_50. For that reason I would
suggest dropping the mention of utf8::upgrade here, maybe thusly:

--- pod/perlunicode.pod~      Thu Mar 21 08:15:43 2002
+++ pod/perlunicode.pod       Thu Mar 21 09:59:33 2002
@@ -966,9 +966,7 @@
 while some keys are "byte", the keys may behave differently in regular
 expressions since the definition of character classes like C</\w/>
 is different for byte strings and character strings.  This problem can
-sometimes be helped by using an appropriate locale (see L<perllocale>).
-Another way is to force all the strings to be character encoded by
-using utf8::upgrade() (see L<utf8>).
+be helped by using an UTF-8 locale (see L<perllocale>).
 
 Some functions are slower when working on UTF-8 encoded strings than
 on byte encoded strings. All functions that need to hop over




Another possibility is, of course, that the demonstrated behaviour is
a vanilla bug and gets fixed before 5.8.0.  :-/



-- 
andreas

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen