perl-unicode

Re: Matching upper ASCII characters in RE patterns

2010-11-30 12:25:53
Jonathan Pool wrote:
Let's say the character NO-BREAK SPACE (U+00A0) appears in a UTF8-encoded text 
file (so it appears there as C2A0), and I want to match strings that contain 
this character.
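
The byte sequence mentioned above can be checked with the core Encode module. This is only a small sketch; the variable names are illustrative and not taken from the original script.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00A0 NO-BREAK SPACE as a one-character Perl string.
my $nbsp = "\x{A0}";

# Encoding the character as UTF-8 produces two bytes.
my $bytes = encode('UTF-8', $nbsp);
my $hex   = uc(unpack('H*', $bytes));
print "UTF-8 bytes: $hex\n";              # prints "UTF-8 bytes: C2A0"

# Decoding those bytes gives back one character with code point 0xA0.
my $chars = decode('UTF-8', $bytes);
printf "characters: %d, code point: U+%04X\n", length($chars), ord($chars);
```

So a pattern run against the decoded text sees one character, U+00A0, while a pattern run against the raw bytes sees C2 followed by A0.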

I write a script (itself encoded with UTF8) in Perl 5.10.0 (on OS X 10.6.5) 
with:

use encoding 'utf8';
use charnames ':full';

The script opens the file with:

open FH, '<:utf8', 'filename.txt';

You should always use '<:encoding(utf8)' instead to get utf8 validation.
But that's not the problem here.
I tested it on the very latest development code, and it still fails. The problem appears to be one or more bugs in how Perl parses source files encoded in UTF-8. I converted the .pl to Latin-1 and removed the "use encoding 'utf8'", and it works.
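
For illustration, here is a rough sketch of the working arrangement: an ASCII-only script with no "use encoding", reading through a decoding layer. It is not one of the attached test files; the temporary file and its contents are made up for the example.

```perl
use strict;
use warnings;
use charnames ':full';           # needed for \N{...} on perls older than 5.16
use File::Temp qw(tempfile);

# Hypothetical test data: write the raw bytes 'a', C2 A0, 'b' to a temp file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "a\xC2\xA0b\n";
close $out or die "close failed: $!";

# Decode on input through the encoding layer.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;

# After decoding, NO-BREAK SPACE is the single character U+00A0.
my %matched = (
    named   => scalar($line =~ /\N{NO-BREAK SPACE}/),   # pattern 1
    odd     => scalar($line =~ /[\x7f-\x80]/),          # pattern 3
    class   => scalar($line =~ /[\xa0]/),               # pattern 4
    literal => scalar($line =~ /\xa0/),                 # pattern 5
);
printf "%-7s %s\n", $_, ($matched{$_} ? 'matches' : 'no match')
    for sort keys %matched;
```

Written this way, I would expect patterns 1, 4 and 5 to match and pattern 3 not to, which is the behavior the original poster was expecting.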

I believe it is known that there are issues with 'use encoding', but I suggest filing a bug report by sending email to perlbug(_at_)perl(_dot_)org. Attached are two files I created to test. These should be attached to the bug report so that the tests do not have to be recreated.

It reads lines in with:

while (<FH>) {}

Then, in a regular expression in the script, I can match the NO-BREAK SPACE 
with any of these patterns:

1. /\N{NO-BREAK SPACE}/

2. / / (where the character between slashes looks like a space but is a 
no-break space)

3. /[\x7f-\x80]/

Patterns 1 and 2 make sense, but pattern 3 is mysterious to me, because the 
range specified in pattern 3 includes DELETE and an unnamed character but does 
not include NO-BREAK SPACE.

Moreover, I expect to be able to match the NO-BREAK SPACE with these patterns, 
but I cannot:

4. /[\xa0]/

5. /\xa0/

In the related documentation, I have not found anything explaining why pattern 
3 works, or anything explaining why patterns 4 and 5 do not work.

I have replicated these anomalies in Perl 5.8.8 under Red Hat Enterprise
Linux 5.

I would be delighted to receive explanations or references to documentation 
that I have overlooked or misunderstood.

Attachment: nobreak_latin1.pl
Description: Perl program

Attachment: nobreak_utf8.pl
Description: Perl program