perl-unicode

Re: Matching upper ASCII characters in RE patterns

2010-11-30 15:19:15
Jonathan Pool wrote:
Thanks very much for your further information about this issue.

I'll be happy to file a bug report, but I should also mention that the problematic behavior not 
only exists with "use encoding 'utf8'" and "use utf8", but differs between 
them. Both produce wrong results, but different wrong results:


Just one bug report will be fine. I don't have a Perl 5.10 lying around to test on, but I can say that the files I sent you did what I said on 5.13.7. I think that the one that was supposedly in latin1 could have gotten converted to utf8 in the email process. There have been many significant bug fixes in Perl since 5.10.0.

With “use encoding 'utf8'”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is matched by / / (a no-break space)
The NBS is matched by /[\x7f-\x80]/
The NBS is NOT matched by /[\xa0]/
The NBS is NOT matched by /\xa0/
The NBS is NOT matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/

With “use utf8”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is matched by / / (a no-break space)
The NBS is matched by /[\x7f-\x80]/
The NBS is matched by /[\xa0]/
The NBS is matched by /\xa0/
The NBS is matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/

With neither “use encoding 'utf8'” nor “use utf8”:
The NBS is matched by /\N{NO-BREAK SPACE}/
The NBS is NOT matched by /Â / (a no-break space)
The NBS is matched by /[\x7f-\x80]/
The NBS is matched by /[\xa0]/
The NBS is matched by /\xa0/
The NBS is matched by /\N{U+00a0}/
The NBS is NOT matched by /junk/

(The 3rd and 7th patterns, out of 7, should fail.)

(If I include both statements, the behavior is the same as if "use encoding 'utf8'" alone is 
present. This testing is with "<:encoding(utf8)".)

So, I'm confused as to whether this is 1 bug or more than 1, and how best to 
document it (or them). Could you advise me on this?

On 30 Nov 2010, at 10:25, karl williamson wrote:

Jonathan Pool wrote:
Let's say the character NO-BREAK SPACE (U+00A0) appears in a UTF8-encoded text 
file (so it appears there as C2A0), and I want to match strings that contain 
this character.
I write a script (itself encoded with UTF8) in Perl 5.10.0 (on OS X 10.6.5) 
with:
use encoding 'utf8';
use charnames ':full';
The script opens the file with:
open FH, '<:utf8', 'filename.txt';
You should always use '<:encoding(utf8)' instead to get utf8 validation.
But that's not the problem here.
I tested it on the very latest development code, and it still fails. The problem is a bug 
or bugs in Perl with parsing files encoded in utf8.  I converted the .pl to latin1 and 
removed the "use encoding 'utf8'", and it works.
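
For illustration, here is a minimal sketch of reading through the validating layer mentioned above. It deliberately avoids "use encoding", and it uses an in-memory filehandle as a stand-in for filename.txt (that substitution and the variable names are assumptions made for the demo, not part of the original script). ':encoding(UTF-8)' is the strict spelling of the layer.

```perl
use strict;
use warnings;

# Raw UTF-8 bytes for "a", NO-BREAK SPACE (C2 A0), "b", newline.
my $bytes = "a\xC2\xA0b\n";

# In-memory filehandle standing in for filename.txt (assumption for the demo).
open my $fh, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $line = <$fh>;
close $fh;

# After decoding, C2 A0 has become the single character U+00A0.
my $second = ord(substr($line, 1, 1));
printf "length=%d second=U+%04X\n", length($line), $second;
```

The point is only that the layer turns the two bytes C2 A0 into one character, so any later pattern should be written against U+00A0 rather than against the bytes.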

I believe it is known that there are issues with 'use encoding', but I suggest 
filing a bug report, by sending email to perlbug@perl.org. 
Attached are two files I created to test.  These should be attached to the bug 
report so as to not have to be done again.
It reads lines in with:
while (<FH>) {}
Then, in a regular expression in the script, I can match the NO-BREAK SPACE 
with any of these patterns:
1. /\N{NO-BREAK SPACE}/
2. / / (where the character between slashes looks like a space but is a 
no-break space)
3. /[\x7f-\x80]/
Patterns 1 and 2 make sense, but pattern 3 is mysterious to me, because the 
range specified in pattern 3 includes DELETE and an unnamed character but does 
not include NO-BREAK SPACE.
Moreover, I expect to be able to match the NO-BREAK SPACE with these patterns, 
but I cannot:
4. /[\xa0]/
5. /\xa0/
In the related documentation, I have not found anything explaining why pattern 
3 works, or anything explaining why patterns 4 and 5 do not work.
I have replicated these anomalies in Perl 5.8.8 under Red Hat Enterprise Linux 5.
I would be delighted to receive explanations or references to documentation 
that I have overlooked or misunderstood.
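
As a point of comparison, the following sketch runs the escape-based patterns against an already-decoded string in a script that uses neither "use encoding" nor "use utf8". The literal no-break-space pattern (pattern 2) is left out because it depends on the script's own encoding. The harness, labels, and variable names are hypothetical; this is not the attached test script, and it says nothing about what Perl 5.10.0 does under "use encoding".

```perl
use strict;
use warnings;
use charnames ':full';

# A decoded character string containing NO-BREAK SPACE (U+00A0).
my $s = "a\x{A0}b";

# Label / compiled pattern pairs, mirroring the patterns discussed above.
my @patterns = (
    [ '\N{NO-BREAK SPACE}', qr/\N{NO-BREAK SPACE}/ ],
    [ '[\x7f-\x80]',        qr/[\x7f-\x80]/        ],
    [ '[\xa0]',             qr/[\xa0]/             ],
    [ '\xa0',               qr/\xa0/               ],
    [ '\N{U+00a0}',         qr/\N{U+00a0}/         ],
    [ 'junk',               qr/junk/               ],
);

my %matched;
for my $p (@patterns) {
    my ($label, $re) = @$p;
    $matched{$label} = ($s =~ $re) ? 1 : 0;
    printf "The NBS is %s by /%s/\n",
        $matched{$label} ? 'matched' : 'NOT matched', $label;
}
```

Under these assumptions, the range [\x7f-\x80] and /junk/ should be the only patterns that fail, which is the behavior the question expects.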
<nobreak_latin1.pl><nobreak_utf8.pl>
