perl-unicode

Re: Matching upper ASCII characters in RE patterns

2010-11-30 13:22:45
Thanks very much for your further information about this issue.

I'll be happy to file a bug report, but I should also mention that the 
problematic behavior not only exists with "use encoding 'utf8'" and "use utf8", 
but differs between them. Both produce wrong results, but different wrong 
results:

With “use encoding 'utf8'”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is matched by / / (a no-break space)
The NBS is matched by /[\x7f-\x80]/
The NBS is NOT matched by /[\xa0]/
The NBS is NOT matched by /\xa0/
The NBS is NOT matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/

With “use utf8”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is matched by / / (a no-break space)
The NBS is matched by /[\x7f-\x80]/
The NBS is matched by /[\xa0]/
The NBS is matched by /\xa0/
The NBS is matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/

With neither “use encoding 'utf8'” nor “use utf8”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is NOT matched by /Â / (a no-break space)
The NBS is matched by /[\x7f-\x80]/
The NBS is matched by /[\xa0]/
The NBS is matched by /\xa0/
The NBS is matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/

(The 3rd and 7th patterns, out of 7, should fail.)

(If I include both statements, the behavior is the same as if "use encoding 
'utf8'" alone is present. This testing is with "<:encoding(utf8)".)
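
For comparison, here is a minimal sketch of the results one would expect when neither pragma is involved. It is an illustration, not one of the attached test files. It assumes the NBS is obtained by decoding the bytes C2 A0 with Encode, and it writes every pattern in pure ASCII so that the encoding of the script itself cannot affect the outcome. The names $nbs, @patterns and %matched are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the input is the UTF-8 byte pair C2 A0, decoded to U+00A0.
my $nbs = decode('UTF-8', "\xC2\xA0");

# Each pattern is ASCII-only, so the script's own encoding is irrelevant.
my @patterns = (
    [ '\N{U+00A0}',  qr/\N{U+00A0}/  ],
    [ '[\xa0]',      qr/[\xa0]/      ],
    [ '\xa0',        qr/\xa0/        ],
    [ '[\x7f-\x80]', qr/[\x7f-\x80]/ ],
    [ 'junk',        qr/junk/        ],
);

my %matched;
for my $p (@patterns) {
    my ($label, $re) = @$p;
    $matched{$label} = ($nbs =~ $re) ? 1 : 0;
    printf "The NBS is %s by /%s/\n",
        ($matched{$label} ? 'matched' : 'NOT matched'), $label;
}
```

On a Perl that compares code points correctly, the first three patterns should match, and the range [\x7f-\x80] and /junk/ should not.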

So, I'm confused as to whether this is 1 bug or more than 1, and how best to 
document it (or them). Could you advise me on this?

On 30 Nov 2010, at 10:25, karl williamson wrote:

Jonathan Pool wrote:
Let's say the character NO-BREAK SPACE (U+00A0) appears in a UTF8-encoded 
text file (so it appears there as C2A0), and I want to match strings that 
contain this character.
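
The relationship between the code point U+00A0 and the byte pair C2 A0 can be checked with a short sketch like the following. It uses the core Encode module; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $nbs   = "\x{00A0}";                 # one character: NO-BREAK SPACE
my $bytes = encode('UTF-8', $nbs);      # its UTF-8 encoding (two bytes)
my $hex   = uc unpack('H*', $bytes);    # hex dump of those bytes

my $decoded = decode('UTF-8', $bytes);  # back to a single character

printf "U+%04X encodes to %s (%d bytes)\n", ord($nbs), $hex, length($bytes);
printf "decoded length: %d character(s)\n", length($decoded);
```

A pattern such as /\xa0/ is meant to be compared with the decoded character, not with the individual bytes C2 and A0.
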
I write a script (itself encoded with UTF8) in Perl 5.10.0 (on OS X 10.6.5) 
with:
use encoding 'utf8';
use charnames ':full';
The script opens the file with:
open FH, '<:utf8', 'filename.txt';

You should always use '<:encoding(utf8)' instead to get utf8 validation.
But that's not the problem here.
I tested it on the very latest development code, and it still fails. The 
problem is a bug or bugs in Perl with parsing files encoded in utf8.  I 
converted the .pl to latin1 and removed the "use encoding 'utf8'", and it 
works.
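
As a rough illustration of the '<:encoding(utf8)' advice, the following sketch writes the bytes C2 A0 to a temporary file and reads them back through the decoding layer. The temporary file (created with File::Temp) and the variable names are assumptions made for the example; they stand in for the real input file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the raw bytes C2 A0 (UTF-8 for U+00A0) followed by a newline.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xC2\xA0\n";
close $out or die "close: $!";

# Read the file back through the decoding layer suggested above.
open my $in, '<:encoding(utf8)', $path or die "open: $!";
my $line = <$in>;
close $in;
chomp $line;

my $length    = length($line);                 # characters, not bytes
my $codepoint = sprintf 'U+%04X', ord($line);
print "length=$length codepoint=$codepoint\n";
```

After decoding, the line holds a single character, U+00A0, which is what the patterns in the script are then matched against.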

I believe it is known that there are issues with 'use encoding', but I 
suggest filing a bug report by sending email to perlbug@perl.org. Attached 
are two files I created to test. These should be attached to the bug report 
so that the tests do not have to be written again.
It reads lines in with:
while (<FH>) {}
Then, in a regular expression in the script, I can match the NO-BREAK SPACE 
with any of these patterns:
1. /\N{NO-BREAK SPACE}/
2. / / (where the character between slashes looks like a space but is a 
no-break space)
3. /[\x7f-\x80]/
Patterns 1 and 2 make sense, but pattern 3 is mysterious to me, because the 
range specified in pattern 3 includes DELETE and an unnamed character but 
does not include NO-BREAK SPACE.
Moreover, I expect to be able to match the NO-BREAK SPACE with these 
patterns, but I cannot:
4. /[\xa0]/
5. /\xa0/
In the related documentation, I have not found anything explaining why 
pattern 3 works, or anything explaining why patterns 4 and 5 do not work.
I have replicated these anomalies in Perl 5.8.8 under Red Hat Enterprise 
Linux 5.
I would be delighted to receive explanations or references to documentation 
that I have overlooked or misunderstood.
<nobreak_latin1.pl><nobreak_utf8.pl> 
