perl-unicode

Re: Odd regexp behavior

2003-02-26 13:30:06

Markus(_dot_)Kuhn(_at_)cl(_dot_)cam(_dot_)ac(_dot_)uk said:
$ perl -e '$x = "\x{2019}\nk"; $x =~ s/(\S)\n(\S)/$1 $2/sg; print "$x\n";'
'    <= this denotes a \x{2019} followed by \n
k

$ perl -e '$x = "b\nk"; $x =~ s/(\S)\n(\S)/$1 $2/sg; print "$x\n";'
b k 

[snip]

$ perl -e 'print (("\x{2019}" =~ /\S/) . "\n");'
1

This behavior certainly does seem to contradict expectations.  I even 
thought that the third test might not be exactly equivalent to the 
first, so I tried this:

$ perl -e '$x = "\x{2019}"; print "x2019 matches \\S\n" if ( $x =~ /\S/ );'
x2019 matches \S


But since Perl provides many ways of doing the same thing (or at least 
trying to), there is an "idiom" that will produce the expected result:

 require 5.008;

 use Encode;

 $x = encode( "utf8", "\x{2019}\nk" );
 $x =~ s/(\S)\n(\S)/$1 $2/sg;
 print "$x\n";

 __END__

 __OUTPUT__
 ' k

Even in this case, I was puzzled as to why I got the expected behavior
by using the "encode()" function this way, but not when I used "decode()"
instead. (I would have expected it to be the other way around.)
Go figure...
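
For what it's worth, here is a minimal sketch of what encode() does to
the string before the substitution runs. It does not explain the regex
behavior itself; it only shows that after encode() the pattern is matched
against UTF-8 octets (E2 80 99 for U+2019) rather than against a single
wide character. The variable names are mine, and the decode() step at the
end is an addition that was not in the idiom above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Character string: U+2019, a newline, and "k" (3 characters).
my $chars = "\x{2019}\nk";

# encode() turns characters into UTF-8 octets; U+2019 becomes E2 80 99,
# so the octet string is 5 bytes long.
my $bytes = encode("utf8", $chars);

# Same substitution as above, applied to a copy of the octet string.
# Here the first \S matches the trailing octet \x99, not a wide character.
(my $joined = $bytes) =~ s/(\S)\n(\S)/$1 $2/sg;

# decode() goes the other way: octets back to characters.
my $result = decode("utf8", $joined);

printf "chars: %d, octets: %d\n", length($chars), length($bytes);
printf "U+%04X U+%04X U+%04X\n", map { ord($_) } split //, $result;
```

This should print "chars: 3, octets: 5" followed by
"U+2019 U+0020 U+006B", i.e. the newline has been replaced by a space.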

        Dave Graff

