Re: \W and [\W]

On 31 Dec 03, at 16:28, terry(_at_)eatoni(_dot_)com wrote:


Why are you using:

use encoding 'utf8';

?


To keep the snippet short: it tells Perl that my
character constant is in UTF-8 and that the "print"
statements should output UTF-8 as well. I typed the
source code in a UTF-8 editor and used a UTF-8 terminal
to run it. I apologize for not making this clear.


Without it, perl 5.8.1, I see output:

1 ß
2 ß
3 Gro


Without "use encoding", Perl just works on bytes: you
lose the Unicode character semantics and end up with
"3 Gro", which is wrong, since Großbritannien is one word.
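
To illustrate the difference between byte and character semantics, here is a minimal sketch. It assumes a Perl with the core Encode module and none of the pragmas that change regex semantics (no "use locale", no "unicode_strings" feature). The variable names are mine, and the string is written with \x escapes so that the result does not depend on the encoding of the source file:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Großbritannien" as raw UTF-8 bytes, written with escapes so the
# example does not depend on how this file itself is encoded.
my $bytes = "Gro\xc3\x9fbritannien";

# Decoding turns the byte string into a string of Unicode characters.
my $chars = decode('UTF-8', $bytes);

# On the byte string, \w+ stops at the first byte of the encoded ß.
my ($from_bytes) = $bytes =~ /(\w+)/;

# On the decoded string, ß is a word character and the match runs on.
my ($from_chars) = $chars =~ /(\w+)/;

binmode(STDOUT, ':encoding(UTF-8)');
print "bytes: $from_bytes\n";
print "chars: $from_chars\n";
```

The first match is the "3 Gro" case from the quoted output; the second covers the whole word.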

When I run with your use encoding 'utf8'; I get an error from perl:
Malformed UTF-8 character (unexpected non-continuation byte 0x62, immediately after start byte 0xdf) in pattern match (m//) at /tmp/w.pl line 9.


So you have 0xdf 0x62, which is ßb in Latin-1. My sample
assumes UTF-8, and in UTF-8 ßb is 0xc3 0x9f 0x62.
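
The two byte sequences can be checked with a short sketch like the following. It assumes the core Encode module; the variable names and the %hex hash are only there for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00DF is ß, followed by a plain "b".
my $text = "\x{df}b";

# Encode the same two characters both ways and record the bytes as hex.
my %hex;
for my $name ('iso-8859-1', 'UTF-8') {
    my $octets = encode($name, $text);
    $hex{$name} = join ' ', map { sprintf '0x%02x', $_ } unpack 'C*', $octets;
    print "$name: $hex{$name}\n";
}
```

A 0xdf byte followed directly by 0x62 is not valid UTF-8, because 0xdf announces a two-byte sequence and 0x62 is not a continuation byte. That is what the error message is complaining about.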

In other words, you're not running the same code as I am.
With a Latin-1 source file like yours, and of course
without the "use encoding" line, the character constant
needs to be explicitly decoded to Unicode:

use Encode;
my $x = Encode::decode("iso-8859-1", "Großbritannien");

...which yields the same results of course:

1
2 ß
3 Großbritannien


------------------------------------------
#!/usr/bin/perl -w

use strict;
use encoding 'utf8';

my $x = 'Großbritannien';
$\ = "\n";

print '1 ', $x =~ /(\W+)/;
print '2 ', $x =~ /([\W]+)/;
print '3 ', $x =~ /(\w+)/;

exit(0);

--
Eric Cholet