This may be related to an apparent 5.8.0 bug that was discussed back in
February (the thread was "Odd regex behavior", started by Markus Kuhn),
but I'm not sure...
Consider a utf8 file containing just three characters and a line-feed,
where the first character happens to be "wide" (two bytes in utf8, that
represent U+00C0, "LATIN CAPITAL LETTER A WITH GRAVE"):
    $ cat test.utf8
    À>A
I think my mail editor will be giving you the iso-latin1 code "\xC0"
rather than the two-byte utf8 sequence, but rest assured, the file is utf8:
    $ od -t x1 test.utf8
    0000000 c3 80 3e 41 0a
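To double-check that those five bytes really are four characters, here is a minimal sketch. It assumes the Encode module (bundled with 5.8.0) is an acceptable way to decode the bytes; the helper name describe() is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The five octets shown by od: c3 80 3e 41 0a
my $octets = "\xC3\x80\x3E\x41\x0A";

# Decode the octets into a character string.
my $string = decode("utf8", $octets);

# Hypothetical helper: list the code points of a string as U+XXXX.
sub describe {
    my ($s) = @_;
    return join " ", map { sprintf "U+%04X", ord($_) } split //, $s;
}

print "octets: ", length($octets), "\n";   # length in bytes
print "chars:  ", length($string), "\n";   # length in characters
print describe($string), "\n";
```

If decoding works as documented, the first length should be 5 and the second 4, with U+00C0 as the first code point.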
According to the 5.8.0 perlunicode man page:
    o   Regular expressions match characters instead of bytes.
        "." matches a character instead of a byte. ...
So, given the data sample in "test.utf8", the following substitutions
should produce the same result (assuming that each is applied to the
original string of course), but they don't:
    s/(.)/: $1 :/;          # produces ": À :>A\n"
    s/(.)(.)/: $1 :$2/;     # same as above
    s/(.)>/: $1 :>/;        # HAS NO EFFECT (left side did not match?!)
Or these:
    s/(..)(.)/$1: $2 :/;    # produces "À>: A :\n"
    s/(..)(A)/$1: $2 :/;    # HAS NO EFFECT (left side did not match?!)
I get the same results whether or not the "use utf8;" pragma is present;
the string is being read and printed after doing binmode( ..., ":utf8")
on STDIN and STDOUT.
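
For anyone who wants to try this without creating the file, here is a minimal sketch of a self-contained test. It is not the exact script I ran: it assumes that decoding the bytes with Encode is an acceptable stand-in for reading through a ":utf8" layer on STDIN, and the names @cases and run_cases() are made up for this example. It reports, for each substitution, whether the left side matched and what the string looks like afterwards.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes of test.utf8 as shown by od: c3 80 3e 41 0a
my $bytes = "\xC3\x80\x3E\x41\x0A";

# Assumption: decoding here stands in for binmode(STDIN, ":utf8").
my $orig = decode("utf8", $bytes);

# Each case is a label plus a code ref applying one substitution to $_.
my @cases = (
    [ 's/(.)/: $1 :/',       sub { s/(.)/: $1 :/ } ],
    [ 's/(.)(.)/: $1 :$2/',  sub { s/(.)(.)/: $1 :$2/ } ],
    [ 's/(.)>/: $1 :>/',     sub { s/(.)>/: $1 :>/ } ],
    [ 's/(..)(.)/$1: $2 :/', sub { s/(..)(.)/$1: $2 :/ } ],
    [ 's/(..)(A)/$1: $2 :/', sub { s/(..)(A)/$1: $2 :/ } ],
);

# Apply every substitution to a fresh copy of the input string.
# Returns a list of [label, matched (1 or 0), resulting string].
sub run_cases {
    my ($input) = @_;
    my @results;
    for my $case (@cases) {
        my ($label, $code) = @$case;
        local $_ = $input;
        my $matched = $code->() ? 1 : 0;
        push @results, [ $label, $matched, $_ ];
    }
    return @results;
}

binmode(STDOUT, ":utf8");
for my $r (run_cases($orig)) {
    my ($label, $matched, $out) = @$r;
    my $status = $matched ? "matched" : "NO MATCH";
    printf "%-22s %-9s %s", $label, $status, $out;
}
```

If "." matches characters as documented, all five lines should say "matched", with the first three giving the same output and the last two giving the same output.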
Is this something that will be fixed in 5.8.1?
Dave Graff