On 31 Dec 03, at 16:28, terry(_at_)eatoni(_dot_)com wrote:
> Why are you using:
> use encoding 'utf8';
> ?
To keep the snippet short: it tells Perl that my character
constant is in UTF-8, and makes the "print" statements output
UTF-8 as well. I typed the source code in a UTF-8 editor
and ran it in a UTF-8 terminal.
I apologize for not making this clear.
> Without it, perl 5.8.1, I see output:
> 1 ß
> 2 ß
> 3 Gro
Without the "use encoding" line Perl is just doing bytes:
you lose the Unicode character semantics and end up with
"3 Gro", which is wrong, since Großbritannien is one word.
> When I run with your use encoding 'utf8'; I get an error from perl:
> Malformed UTF-8 character (unexpected non-continuation byte 0x62,
> immediately after start byte 0xdf) in pattern match (m//) at /tmp/w.pl
> line 9.
So your file contains 0xdf 0x62, which is ßb in Latin-1. My
sample assumes UTF-8, where ßb is 0xc3 0x9f 0x62.
In other words, you're not running the same code as I am.
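For what it's worth, those byte values are easy to check with Encode.
This is only a sketch: hex_bytes is a helper name made up for the
illustration, and the string is built from code points (\x{df}) so
that the snippet does not depend on the encoding of the file it is
saved in.

```perl
#!/usr/bin/perl -w
use strict;
use Encode qw(encode);

# U+00DF (sharp s) followed by "b", written with an escape so that this
# file stays pure ASCII whatever editor is used.
my $chars = "\x{df}b";

# hex_bytes is a hypothetical helper, not part of Encode: it encodes a
# character string and returns the resulting bytes as hex values.
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);
    return join ' ', map { sprintf '0x%02x', ord($_) } split //, $bytes;
}

print 'iso-8859-1: ', hex_bytes('iso-8859-1', $chars), "\n";  # 0xdf 0x62
print 'utf-8:      ', hex_bytes('utf-8', $chars), "\n";       # 0xc3 0x9f 0x62
```

A 0xdf byte directly followed by 0x62 is valid Latin-1 but not valid
UTF-8, which is consistent with the "Malformed UTF-8 character" message
quoted above.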
With such a Latin-1 source file, and of course after dropping
the "use encoding" line, the character constant needs to
be explicitly decoded to Unicode:
$x = Encode::decode("iso-8859-1", "Großbritannien");
...which yields the same results of course:
1
2 ß
3 Großbritannien
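
To make the bytes versus characters difference concrete, here is a
small sketch of the \w case only. It assumes a script with no
"use encoding", no "use locale" and no "use feature 'unicode_strings'";
first_word is a helper name invented for the example, and the word is
spelled with \xdf so the file itself is plain ASCII. It prints lengths
rather than the strings, to stay independent of the terminal encoding.

```perl
#!/usr/bin/perl -w
use strict;
use Encode qw(decode);

# The bytes a Latin-1 editor would save for the word; \xdf is the sharp s.
my $bytes = "Gro\xdfbritannien";

# first_word is a hypothetical helper: it returns the first run of
# word characters in its argument, or '' if there is none.
sub first_word {
    my ($string) = @_;
    return $string =~ /(\w+)/ ? $1 : '';
}

# Undecoded, the string is handled as bytes and 0xdf is not a word character.
my $from_bytes = first_word($bytes);

# Decoded, it is a character string and the sharp s counts as a letter.
my $from_chars = first_word(decode('iso-8859-1', $bytes));

print length($from_bytes), "\n";   # 3, i.e. "Gro"
print length($from_chars), "\n";   # 14, the whole word
```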
------------------------------------------
#!/usr/bin/perl -w
use strict;
use encoding 'utf8';
my $x = 'Großbritannien';
$\ = "\n";
print '1 ', $x =~ /(\W+)/;
print '2 ', $x =~ /([\W]+)/;
print '3 ', $x =~ /(\w+)/;
exit(0);
--
Eric Cholet