When regex "dot" doesn't work on Unicode characters


This may be related to an apparent 5.8.0 bug that was discussed back in
February (the thread was "Odd regex behavior", started by Markus Kuhn), 
but I'm not sure...

Consider a utf8 file containing just three characters and a line-feed, 
where the first character happens to be "wide" (two bytes in utf8, 
representing U+00C0, "LATIN CAPITAL LETTER A WITH GRAVE"):

$ cat test.utf8
À>A

I think my mail editor will be giving you the iso-latin1 code "\xC0"
rather than the two-byte utf8 sequence, but rest assured, the file is utf8:

$ od -t x1 test.utf8
0000000 c3 80 3e 41 0a
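
For anyone who would rather check those bytes in perl than take the dump
on faith, here is a minimal sketch. It assumes the Encode module (bundled
with 5.8) and hard-codes the five bytes from the od output. The variable
names are just my own scaffolding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The five bytes from the od dump: c3 80 3e 41 0a
my $bytes = "\xC3\x80\x3E\x41\x0A";

# Decode the byte string into a character string.
# (Assumption: this stands in for reading the file through a :utf8 layer.)
my $chars = decode('UTF-8', $bytes);

# Render each decoded character as a U+XXXX code point.
my @code_points = map { sprintf 'U+%04X', ord } split //, $chars;

print "bytes: ", length($bytes), "\n";
print "chars: ", length($chars), "\n";
print "code points: @code_points\n";
```

If the data really is utf8, the character count should be one less than
the byte count, and the first code point should be U+00C0.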

According to the 5.8.0 perlunicode man page:

     o   Regular expressions match characters instead of bytes.
         "." matches a character instead of a byte. ...

So, given the data sample in "test.utf8", the following substitutions 
should produce the same result (assuming that each is applied to the 
original string of course), but they don't:

  s/(.)/: $1 :/;  # produces ": À :>A\n"

  s/(.)(.)/: $1 :$2/;  # same as above

  s/(.)>/: $1 :>/;  # HAS NO EFFECT (left side did not match?!)

Or these:

  s/(..)(.)/$1: $2 :/;  # produces "À>: A :\n"

  s/(..)(A)/$1: $2 :/;  # HAS NO EFFECT (left side did not match?!)

I get the same results whether or not the "use utf8;" pragma is present.
In both cases the string is read and printed after calling
binmode( ..., ":utf8") on STDIN and STDOUT.
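
To try this without creating the file, the following sketch builds the
same string from the bytes shown by od and applies each substitution to a
fresh copy. It is only an approximation of my setup: it uses
Encode::decode in place of binmode on STDIN, and the @cases list, the
labels and the %result hash are illustrative helpers of my own. On a perl
where "." matches characters as perlunicode describes, I would expect all
five cases to report a match.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Same data as test.utf8, decoded to a character string.
# (Assumption: decode() here plays the role of the :utf8 input layer.)
my $orig = decode('UTF-8', "\xC3\x80\x3E\x41\x0A");

# Each case pairs a label with a closure that runs one substitution
# on its argument in place (via the @_ alias).
my @cases = (
    [ 's/(.)/: $1 :/',       sub { $_[0] =~ s/(.)/: $1 :/ } ],
    [ 's/(.)(.)/: $1 :$2/',  sub { $_[0] =~ s/(.)(.)/: $1 :$2/ } ],
    [ 's/(.)>/: $1 :>/',     sub { $_[0] =~ s/(.)>/: $1 :>/ } ],
    [ 's/(..)(.)/$1: $2 :/', sub { $_[0] =~ s/(..)(.)/$1: $2 :/ } ],
    [ 's/(..)(A)/$1: $2 :/', sub { $_[0] =~ s/(..)(A)/$1: $2 :/ } ],
);

binmode(STDOUT, ':utf8');

my %result;
for my $case (@cases) {
    my ($label, $code) = @$case;
    my $copy   = $orig;                       # always start from the original
    my $status = $code->($copy) ? 'matched' : 'NO MATCH';
    $result{$label} = $copy;
    printf "%-22s %-9s %s", $label, $status, $copy;
}
```

Any line that reports NO MATCH shows the behaviour I am describing. The
first three cases should leave identical strings behind, and so should
the last two.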

Is this something that will be fixed in 5.8.1?

        Dave Graff