Re: Character (or byte?) escapes under utf8 pragma

On 10.03.2010 at 11:02, Juerd Waalboer wrote:

Michael Ludwig wrote on 2010-03-10 10:34 (+0100):

Okay. Let me try to see if I have understood correctly. Without the utf8
pragma in scope, "so\xa0ein\xa0Käse" with a-Umlaut stored as a sequence
of two bytes in my source code will be stored internally as a sequence
of 12 integers. With the utf8 pragma in scope, only 11 integers.


I think I got confused about bytes and integers now, because I misread
an earlier post by Aristoteles. What I meant is:

With the utf8 pragma in scope, "so\xa0ein\xa0Käse" with a-Umlaut stored
as a sequence of two bytes in my source code will be stored internally as
a sequence of 11 integers. (But I shouldn't care about the integers, that's
an implementation detail.) Without the utf8 pragma in scope, the string will
be stored as a sequence of 12 bytes; and 11 bytes if I convert the source to
Latin-1.
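
To check these counts without depending on how a given mail client or editor encodes the a-umlaut, here is a minimal sketch that spells every non-ASCII byte as an escape. The two variables are hypothetical stand-ins, not literals from the thread: $no_pragma mimics what perl sees in a UTF-8 source file without the utf8 pragma, and $with_pragma mimics what a fixed perl should produce with the pragma in scope. The helper name ordinals is likewise made up for this illustration.

```perl
use strict;
use warnings;

# Return the ordinal of every character in $str as space-separated hex.
sub ordinals {
    my ($str) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $str;
}

# Assumption: stand-in for the literal WITHOUT "use utf8" in a UTF-8 encoded
# source file. The a-umlaut arrives as the two bytes c3 a4.
my $no_pragma = "so\xa0ein\xa0K\xc3\xa4se";

# Assumption: stand-in for the same literal WITH "use utf8" on a fixed perl.
# The a-umlaut is the single character U+00E4.
my $with_pragma = "so\xa0ein\xa0K\xe4se";

printf "%2d: %s\n", length($no_pragma),   ordinals($no_pragma);
printf "%2d: %s\n", length($with_pragma), ordinals($with_pragma);
# 12: 73 6f a0 65 69 6e a0 4b c3 a4 73 65
# 11: 73 6f a0 65 69 6e a0 4b e4 73 65
```

The second line is also what a Latin-1 encoded source without the pragma would give, since there the a-umlaut is the single byte e4.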

In the broken perl versions, like 5.8.9 and 5.10.0, with the utf8 pragma
in scope I get the wrong sequence of 11 integers, as per your illustration
quoted below: I get a0 where I should get c2-a0, because those perl versions
don't handle character escapes correctly.

"so\xa0ein\xa0Käse" must be stored as either:

   l1: 73 6f a0 65 69 6e a0 4b e4 73 65 (UTF8 flag off)

or:

   u8: 73 67 c2-a0 65 69 6e c2-a0 4b c3-a4 73 65 (UTF8 flag on)


Yes (modulo typo):

so ein Käse:    73 6f c2-a0 65 69 6e c2-a0 4b c3-a4 73 65
so?ein?Käse:    73 6f c2-a0 65 69 6e c2-a0 4b c3-83 c2-a4 73 65

----
use common::sense; # includes utf8 pragma
use open OUT => qw/:encoding(UTF-8) :std/;
use Encode;

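# For each character of $str, encode it to UTF-8 and return the octets in
# hex, joined by '-' within a character and separated by spaces.
sub show_bytes {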
    my $str = shift;
    my $out = '';
    for ( split '', $str ) {
        my $octets = Encode::encode( 'UTF-8', $_ );
        $out .= join '-', map sprintf( '%x', ord), split '', $octets;
        $out .= ' ';
    }
    return $out;
}

print STDERR "Broken in Perl 5.8.9 and 5.10.0!\n"; # fixed in 5.10.1

my $sok = "so\xa0ein\xa0Käse";

print $_, ":\t", show_bytes( $_ ), "\n" for $sok;
----

Both strings should be semantically equal, and have 11 characters, each
of which has an integer ordinal value.

What happens is the following:

   73 6f a0 65 69 6e a0 4b c3-a4 73 65 (UTF8 flag on)
         l1          l1     u8

This is wrong. It is a bug.
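
The point that the l1 and u8 storages are the same string at the character level can be illustrated on a current perl. This is only a sketch of the two valid representations, not a reproduction of the bug: it uses the core functions utf8::upgrade and utf8::is_utf8 plus bytes::length to peek at the internal storage, and the variable names $l1 and $u8 are borrowed from the labels above for illustration.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without switching on byte semantics

my $l1 = "so\xa0ein\xa0K\xe4se";   # stored as Latin-1, UTF8 flag off
my $u8 = $l1;
utf8::upgrade($u8);                # same characters, stored as UTF-8, flag on

for my $s ($l1, $u8) {
    printf "flag %-3s  chars %d  internal bytes %d\n",
        (utf8::is_utf8($s) ? 'on' : 'off'), length($s), bytes::length($s);
}
print "equal\n" if $l1 eq $u8;
# flag off  chars 11  internal bytes 11
# flag on   chars 11  internal bytes 14
# equal
```

The 14 internal bytes of the upgraded copy are the 11 characters with a0, a0 and e4 each taking two bytes, which matches the u8 line above once its typo is corrected.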


Very graphic and tangible explanation, thanks!

-- 
Michael.Ludwig (#) XING.com