On 10.03.2010 at 11:02, Juerd Waalboer wrote:
Michael Ludwig wrote on 2010-03-10 10:34 (+0100):
Okay. Let me try to see if I have understood correctly. Without the utf8
pragma in scope, "so\xa0ein\xa0Käse" with a-Umlaut stored as a sequence
of two bytes in my source code will be stored internally as a sequence
of 12 integers. With the utf8 pragma in scope, only 11 integers.
I think I got confused about bytes and integers now, because I misread
an earlier post by Aristoteles. What I meant is:
With the utf8 pragma in scope, "so\xa0ein\xa0Käse" with a-Umlaut stored
as a sequence of two bytes in my source code will be stored internally as
a sequence of 11 integers. (But I shouldn't care about the integers, that's
an implementation detail.) Without the utf8 pragma in scope, the string will
be stored as a sequence of 12 bytes; and 11 bytes if I convert the source to
Latin-1.
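The three counts above can be checked with a small sketch. It is not part of the original script: to stay independent of how the file is saved, it spells out with escapes what the parser would see in each case, and the helper name lengths is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# ASCII-only source: every non-ASCII character is written as an escape,
# so the result does not depend on the encoding of this file.

# Without "use utf8": the source bytes c3 a4 stay two separate characters.
my $without_pragma = "so\xa0ein\xa0K\xc3\xa4se";

# With "use utf8": c3 a4 is decoded into the single character U+00E4.
my $with_pragma = "so\xa0ein\xa0K\x{e4}se";

# Source converted to Latin-1, no pragma: one byte e4 for the umlaut.
my $latin1_source = encode( 'ISO-8859-1', $with_pragma );

# Hypothetical helper: length of each argument, in characters.
sub lengths { return map { length $_ } @_ }

print join( ' ', lengths( $without_pragma, $with_pragma, $latin1_source ) ), "\n";
# prints: 12 11 11
```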
In the broken perl versions, like 5.8.9 and 5.10.0, with the utf8 pragma
in scope I get the wrong sequence of 11 integers, as per your illustration
quoted below: I get a0 where I should get c2-a0, because those perl versions
don't handle character escapes correctly.
"so\xa0ein\xa0Käse" must be stored as either:
l1: 73 6f a0 65 69 6e a0 4b e4 73 65 (UTF8 flag off)
or:
u8: 73 67 c2-a0 65 69 6e c2-a0 4b c3-a4 73 65 (UTF8 flag on)
Yes (modulo typo):
so ein Käse: 73 6f c2-a0 65 69 6e c2-a0 4b c3-a4 73 65
so?ein?Käse: 73 6f c2-a0 65 69 6e c2-a0 4b c3-83 c2-a4 73 65
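The c3-83 c2-a4 pair in the second line looks like the usual double encoding. Assuming it comes from the two source bytes c3 a4 being taken as two Latin-1 characters and then encoded to UTF-8 on output, a minimal sketch reproduces it (hex_octets is a made-up helper, not from the script below):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of an octet string, space separated.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $octets;
}

# "Kaese" with the umlaut left as the undecoded UTF-8 source bytes c3 a4:
# two characters (U+00C3, U+00A4) instead of one U+00E4.
my $undecoded = "K\xc3\xa4se";

# Encoding on output turns each of those two characters into two octets.
print hex_octets( encode( 'UTF-8', $undecoded ) ), "\n";     # 4b c3 83 c2 a4 73 65

# The properly decoded string holds a single U+00E4, encoded once.
print hex_octets( encode( 'UTF-8', "K\x{e4}se" ) ), "\n";    # 4b c3 a4 73 65
```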
----
use common::sense; # includes utf8 pragma
use open OUT => qw/:encoding(UTF-8) :std/;
use Encode;
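# Return each character of $str as its UTF-8 octets in hex:
# octets joined by '-', characters separated by spaces.
sub show_bytes {
    my $str = shift;
    my $out = '';
    for ( split '', $str ) {
        my $octets = Encode::encode( 'UTF-8', $_ );
        $out .= join '-', map sprintf( '%x', ord ), split '', $octets;
        $out .= ' ';
    }
    return $out;
}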
print STDERR "Broken in Perl 5.8.9 and 5.10.0!\n"; # fixed in 5.10.1
my $sok = "so\xa0ein\xa0Käse";
print $_, ":\t", show_bytes( $_ ), "\n" for $sok;
----
Both strings should be semantically equal, and have 11 characters, each
of which has an integer ordinal value.
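That equality can be tried out on a perl without the bug. The following is only a sketch: it uses the core utf8::upgrade and utf8::is_utf8 functions to produce both internal representations of the same string, and ordinals is a made-up helper name.

```perl
use strict;
use warnings;

# Latin-1 style storage: all code points below 256, UTF8 flag off.
my $l1 = "so\xa0ein\xa0K\xe4se";

# Same characters, but forced into the internal UTF-8 representation (flag on).
my $u8 = $l1;
utf8::upgrade($u8);

# Hypothetical helper: the ordinal of each character, in hex.
sub ordinals { return join ' ', map { sprintf '%x', ord } split //, $_[0] }

printf "flag:   %d vs %d\n", ( utf8::is_utf8($l1) ? 1 : 0 ), ( utf8::is_utf8($u8) ? 1 : 0 );
printf "length: %d vs %d\n", length($l1), length($u8);
print  "equal:  ", ( $l1 eq $u8 ? 'yes' : 'no' ), "\n";
print  ordinals($l1), "\n", ordinals($u8), "\n";
```

Only the flag line should differ between the two strings; length, equality and the ordinals are the same, which is why the internal representation is an implementation detail.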
What happens is the following:
73 6f a0 65 69 6e a0 4b c3-a4 73 65 (UTF8 flag on)
      l1          l1    u8
This is wrong. It is a bug.
Very graphical and palpable exposition, thanks!
--
Michael.Ludwig (#) XING.com