Re: Another Unicode s/// buglet?

SADAHIRO Tomoyuki <bqw10602(_at_)nifty(_dot_)com> wrote:
:With Perl 5.8.0 RC2 (or plus Change 17353),
:there is something strange.
:
:In $unicode =~ s/$regex/$bytes/,
:$bytes is not upgraded,
:and a malformed Unicode string is generated.
:
:$unicode =~ s/$regex/$bytes/e is ok, though.

As far as I can tell, this is missing code rather than buggy code:
coping with a non-utf8 replacement string does not seem to have
been catered for in this class of cases.

Attached patch passes all existing tests here, as well as some new ones.

Due to the current RC status, I've taken the simplest approach I could
see, but there may be higher performance alternatives: the upgrade is
done regardless of whether the replacement string is ever needed, and
since it is not done in place, the upgrade will be repeated each time
it is needed. That means if you expect to perform the same substitution
on many utf8 strings, it would probably be faster if you ensure that
the replacement string is utf8.

Hugo
--- pp_hot.c.old        Tue Jun 25 17:21:07 2002
+++ pp_hot.c    Wed Jun 26 17:32:55 2002
@@ -1983,8 +1983,16 @@
 
     /* known replacement string? */
     if (dstr) {
-        c = SvPV(dstr, clen);
        doutf8 = DO_UTF8(dstr);
+       if (doutf8 || !PL_reg_match_utf8) {
+           c = SvPV(dstr, clen);
+       } else {
+           SV* sv = sv_newmortal();
+           SvSetMagicSV(sv, dstr);
+           sv_utf8_upgrade(sv);
+           c = SvPV(sv, clen);
+           doutf8 = TRUE;
+       }
     }
     else {
         c = Nullch;
--- t/op/subst.t.old    Tue Jun 25 17:21:07 2002
+++ t/op/subst.t        Wed Jun 26 17:34:25 2002
@@ -7,7 +7,7 @@
 }
 
 require './test.pl';
-plan( tests => 92 );
+plan( tests => 106 );
 
 $x = 'foo';
 $_ = "x";
@@ -401,3 +401,43 @@
     like($a, qr/単/, "use utf8 LHS and RHS");
 }
 
+{
+    # subst with mixed utf8/non-utf8 type
+    my($ua, $ub, $uc, $ud) = ("\x{101}", "\x{102}", "\x{103}", "\x{104}");
+    my($na, $nb) = ("\x{ff}", "\x{fe}");
+    my $a = "$ua--$ub";
+    my $b;
+    ($b = $a) =~ s/--/$na/;
+    is($b, "$ua$na$ub", "s///: replace non-utf8 into utf8");
+    ($b = $a) =~ s/--/--$na--/;
+    is($b, "$ua--$na--$ub", "s///: replace long non-utf8 into utf8");
+    ($b = $a) =~ s/--/$uc/;
+    is($b, "$ua$uc$ub", "s///: replace utf8 into utf8");
+    ($b = $a) =~ s/--/--$uc--/;
+    is($b, "$ua--$uc--$ub", "s///: replace long utf8 into utf8");
+    $a = "$na--$nb";
+    ($b = $a) =~ s/--/$ua/;
+    is($b, "$na$ua$nb", "s///: replace utf8 into non-utf8");
+    ($b = $a) =~ s/--/--$ua--/;
+    is($b, "$na--$ua--$nb", "s///: replace long utf8 into non-utf8");
+
+    # now with utf8 pattern
+    $a = "$ua--$ub";
+    ($b = $a) =~ s/-($ud)?-/$na/;
+    is($b, "$ua$na$ub", "s///: replace non-utf8 into utf8 (utf8 pattern)");
+    ($b = $a) =~ s/-($ud)?-/--$na--/;
+    is($b, "$ua--$na--$ub", "s///: replace long non-utf8 into utf8 (utf8 
pattern)");
+    ($b = $a) =~ s/-($ud)?-/$uc/;
+    is($b, "$ua$uc$ub", "s///: replace utf8 into utf8 (utf8 pattern)");
+    ($b = $a) =~ s/-($ud)?-/--$uc--/;
+    is($b, "$ua--$uc--$ub", "s///: replace long utf8 into utf8 (utf8 
pattern)");
+    $a = "$na--$nb";
+    ($b = $a) =~ s/-($ud)?-/$ua/;
+    is($b, "$na$ua$nb", "s///: replace utf8 into non-utf8 (utf8 pattern)");
+    ($b = $a) =~ s/-($ud)?-/--$ua--/;
+    is($b, "$na--$ua--$nb", "s///: replace long utf8 into non-utf8 (utf8 
pattern)");
+    ($b = $a) =~ s/-($ud)?-/$na/;
+    is($b, "$na$na$nb", "s///: replace non-utf8 into non-utf8 (utf8 pattern)");
+    ($b = $a) =~ s/-($ud)?-/--$na--/;
+    is($b, "$na--$na--$nb", "s///: replace long non-utf8 into non-utf8 (utf8 
pattern)");
+}