perl-unicode

When regex "dot" doesn't work on unicode characters

2003-07-03 14:30:05

This may be related to an apparent 5.8.0 bug that was discussed back in
February (the thread was "Odd regex behavior", started by Markus Kuhn), 
but I'm not sure...

Consider a utf8 file containing just three characters and a line-feed,
where the first character happens to be "wide" (it takes two bytes in
utf8, which encode U+00C0, "LATIN CAPITAL LETTER A WITH GRAVE"):

$ cat test.utf8
À>A

I think my mail editor will be giving you the iso-latin1 code "\xC0"
rather than the two-byte utf8 sequence, but rest assured, the file is utf8:

$ od -t x1 test.utf8
0000000 c3 80 3e 41 0a

According to the 5.8.0 perlunicode man page:

     o   Regular expressions match characters instead of bytes.
         "." matches a character instead of a byte. ...
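
To make the character-versus-byte distinction concrete, here is a minimal sketch that decodes the five bytes from the od dump and checks what "." captures. It uses the core Encode module rather than a file read, which is an assumption made only to keep the example self-contained; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The five bytes shown in the od dump: c3 80 3e 41 0a
my $bytes = "\xc3\x80\x3e\x41\x0a";

# Decode the UTF-8 byte sequence into a Perl character string.
my $chars = decode('UTF-8', $bytes);

# Five bytes, but only four characters once decoded.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# If "." matches a character rather than a byte, the first capture
# should be the single code point U+00C0, not the lone byte 0xC3.
if ($chars =~ /^(.)/) {
    printf "first character matched by '.': U+%04X\n", ord($1);
}
```

If the documented behavior holds, the capture is one character with code point U+00C0.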

So, given the data sample in "test.utf8", the following substitutions 
should produce the same result (assuming that each is applied to the 
original string of course), but they don't:

  s/(.)/: $1 :/;  # produces ": À :>A\n"

  s/(.)(.)/: $1 :$2/;  # same as above

  s/(.)>/: $1 :>/;  # HAS NO EFFECT (left side did not match?!)

Or these:

  s/(..)(.)/$1: $2 :/;  # produces "À>: A :\n"

  s/(..)(A)/$1: $2 :/;  # HAS NO EFFECT (left side did not match?!)

I get the same results whether or not the "use utf8;" pragma is present.
The string is read and printed after calling binmode( ..., ":utf8")
on STDIN and STDOUT.
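
For anyone who wants to try this on their own perl, the five substitutions above can be collected into one script, roughly as sketched here. The in-memory filehandle standing in for test.utf8, and the names @cases and %matched, are my own assumptions for the sake of a self-contained example; they are not how the original test was run.

```perl
use strict;
use warnings;

# Bytes of test.utf8 as shown by od: U+00C0, ">", "A", line feed.
my $bytes = "\xc3\x80\x3e\x41\x0a";

# Read the line through a :utf8 layer, as is done for STDIN above.
# An in-memory filehandle stands in for the real file (assumption).
open my $in, '<:utf8', \$bytes or die "cannot open in-memory handle: $!";
my $line = <$in>;
close $in;

binmode STDOUT, ':utf8';

# Each case pairs a label with a sub that applies the substitution
# to its argument in place and returns true if the pattern matched.
my @cases = (
    [ 's/(.)/: $1 :/',       sub { $_[0] =~ s/(.)/: $1 :/ } ],
    [ 's/(.)(.)/: $1 :$2/',  sub { $_[0] =~ s/(.)(.)/: $1 :$2/ } ],
    [ 's/(.)>/: $1 :>/',     sub { $_[0] =~ s/(.)>/: $1 :>/ } ],
    [ 's/(..)(.)/$1: $2 :/', sub { $_[0] =~ s/(..)(.)/$1: $2 :/ } ],
    [ 's/(..)(A)/$1: $2 :/', sub { $_[0] =~ s/(..)(A)/$1: $2 :/ } ],
);

my %matched;
for my $case (@cases) {
    my ($label, $subst) = @$case;
    my $copy = $line;    # always start from the original string
    $matched{$label} = $subst->($copy) ? 1 : 0;
    printf "%-22s %-8s %s", $label,
        ($matched{$label} ? 'matched' : 'NO MATCH'), $copy;
}
```

On a perl where "." really matches one character, every case should report a match, and the third and fifth should produce the same output as the first and fourth respectively.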

Is this something that will be fixed in 5.8.1?

        Dave Graff

