perl-unicode

Re: is it utf8 or unicode?

2005-03-14 05:14:20
Andreas,

On Wed, 09 Mar 2005 20:03:08 +0000, 
unicode(_at_)ftumsh(_dot_)demon(_dot_)co(_dot_)uk said:

  > I don't understand what the [UTF8 "\x{c4}"]

"\x{c4}" is valid Perl notation for the Unicode character 0xc4.

Yes, but it isn't UTF-8. OK, I can live with the Devel::Peek label
being incorrect.

% perl -le '
my $data = "\xC4";
binmode STDOUT, ":utf8";
print $data ;
' | od -t x1
0000000 c3 84 0a
0000003

But my data is _not_ \xC4. My data is \xC3\x84, i.e. valid UTF-8.
I expect that when I turn on the utf8 flag for that byte sequence
it is treated as UTF-8. For some strange reason it is being converted
to \xC4, which isn't what I'd expect.
I do admit to being a Unicode noob, so perhaps my expectations need
adjusting :)
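
A minimal sketch of the byte/character distinction in play here, using the
core Encode module rather than flipping the utf8 flag by hand. The variable
names are illustrative only. The assumption is that the two bytes \xC3\x84
are decoded as UTF-8, which yields a single character whose code point is
0xC4; that may be why a dump of the string shows "\x{c4}" even though the
bytes underneath are still c3 84.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two bytes as they might arrive from the database:
# the UTF-8 encoding of U+00C4.
my $bytes = "\xC3\x84";

# Decoding turns the two-byte sequence into a one-character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length=%d\n", length($bytes);
printf "chars: length=%d ord=0x%X\n", length($chars), ord($chars);

# Encoding goes the other way and gives back the original two bytes.
my $again = encode('UTF-8', $chars);
print "round trip: ", unpack('H*', $again), "\n";
```

Seen this way, "\xC4" in a character dump and "\xC3\x84" on disk can both
describe the same data, one as a code point and the other as its encoding.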

Here's the problem:
I have the data in a db. It is UTF-8 encoded, so I get it into Perl
as \xC3\x84. I turn on the utf8 flag and then output it as XML
using the module XML::LibXML. The module XML::LibXML has two output
methods, toFH and toString.
If I generate XML from the above data with an encoding of UTF-8,
I get two different files. One is correct (using toFH), the other
isn't (it contains \xC4, which is invalid UTF-8).
toFH does not use Perl's IO; toString does.
I thought at first that the module might be at fault. However,
when the XML created by toString is parsed in memory, it passes OK,
i.e. the error occurs during the output, which means the module is OK.

Now, in spite of Devel::Peek's label, it seems that Perl's internal data
is UTF-8. I am just curious as to why a :raw binmode would change the
data, if indeed it does. I am, after all, just guessing here.
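
A hedged sketch of one way the output step alone could produce a lone \xC4.
It uses in-memory filehandles and does not involve XML::LibXML at all, so it
only illustrates how Perl's IO layers treat a character string; whether this
is what happens with toString is an assumption. The call to utf8::upgrade is
there to mimic a string that is stored internally as UTF-8.

```perl
use strict;
use warnings;

my $char = "\x{C4}";     # one character, U+00C4
utf8::upgrade($char);    # store it internally as UTF-8 (bytes c3 84)

# Handle without an encoding layer: characters below 0x100
# are written as single bytes.
my $raw = '';
open my $fh_raw, '>', \$raw or die "open: $!";
binmode $fh_raw;
print {$fh_raw} $char;
close $fh_raw;

# Handle with the :utf8 layer: the character is written UTF-8 encoded.
my $utf8 = '';
open my $fh_utf8, '>:utf8', \$utf8 or die "open: $!";
print {$fh_utf8} $char;
close $fh_utf8;

print "no layer: ", unpack('H*', $raw),  "\n";   # c4
print ":utf8:    ", unpack('H*', $utf8), "\n";   # c384
```

If that is the mechanism, the data is not being changed so much as written
out in a different encoding, depending on the layer on the filehandle.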

John





