perl-unicode

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-19 06:02:51
David E. Wheeler wrote on 16.06.2010 at 13:59 (-0700):
On Jun 16, 2010, at 9:05 AM, David E. Wheeler wrote:
On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote:

In order to print Unicode text strings (as opposed to octet
strings) correctly to a terminal (UTF-8 or not), add the following
line before the first output:

binmode STDOUT, ':utf8';

But note that STDOUT is global.
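
Since STDOUT is global, one way to see what the :utf8 layer does without
touching it is to put the layer on a separate handle. The following is only a
sketch: the in-memory handle and the names $text, $buffer and $fh are
illustrative assumptions, not part of the script under discussion.

```perl
use strict;
use warnings;

# A text string containing one wide character (U+010D, "c with caron").
my $text = "Laurinavi\x{010d}ius";

# Write through a :utf8 layer into an in-memory buffer instead of STDOUT,
# so the global handle stays untouched and the result can be inspected.
my $buffer = '';
open my $fh, '>:utf8', \$buffer or die "cannot open in-memory handle: $!";
print {$fh} $text;
close $fh;

# The layer turned the single wide character into two octets.
printf "%d characters became %d octets\n", length $text, length $buffer;
```

The same layer applied with binmode to STDOUT affects every later print to
STDOUT, anywhere in the program.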

Yes, I do this all the time. Surprisingly, I don't get warnings for
this script, even though it is outputting multibyte characters.

This is key. If I set the binmode on STDOUT to :utf8, the bogus
characters print out bogus. If I set it to :raw, they come out right
after processing by both Encode and XML::LibXML (I'm assuming they're
interpreted as latin-1).

Yes, or as raw, which is equivalent. Any octet is valid Latin-1.
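
A small sketch of why Latin-1 never rejects anything: each of the 256 possible
octet values decodes to the code point with the same number, and encoding
gives the original octets back. The variable names are made up for
illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Build a string holding every possible octet value, 0x00 through 0xFF.
my $all_octets = join '', map { chr } 0 .. 255;

# Decoding as Latin-1 maps each octet to the code point of the same number.
my $text = decode('iso-8859-1', $all_octets);
my $same = join(',', map { ord } split //, $text) eq join(',', 0 .. 255);

# Encoding again returns the original octets unchanged.
my $back = encode('iso-8859-1', $text);

print "one-to-one mapping: ", ($same ? "yes" : "no"), "\n";
print "round trip intact:  ", ($back eq $all_octets ? "yes" : "no"), "\n";
```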

So my question is this: Why isn't Encode dying when it runs into these
characters? They're not valid utf-8, AFAICT. Are they somehow valid
utf8 (that is, valid in Perl's internal format)? Why would they be?

Assuming we're talking about the same thing here: They're not
characters, they're octets. (The Perl documentation seems to make
an effort to conceptually distinguish between *octets* and *bytes*,
but they map to the same thing.) I found it helpful to accept that
the notion of a "UTF-8 character" does not make sense: there are
Unicode characters, but UTF-8 is an encoding, and it deals with
octets.
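
To make the distinction concrete, and to touch on the question of why Encode
does not die: as far as I understand the Encode documentation, decode() called
without a CHECK argument replaces malformed input with U+FFFD, and only dies
if asked to, for example with FB_CROAK. The sketch below assumes that
behaviour; the sample strings and variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# "c with caron" (U+010D) as UTF-8 octets: two bytes, 0xC4 0x8D.
my $octets = "Laurinavi\xC4\x8Dius";
my $text   = decode('UTF-8', $octets);
printf "octets: %d, characters: %d\n", length $octets, length $text;

# A lone 0xC4 followed by an ASCII letter is not well-formed UTF-8.
# Without a CHECK argument, decode() substitutes U+FFFD instead of dying.
my $bad     = "Laurinavi\xC4ius";
my $lenient = decode('UTF-8', $bad);
printf "lenient result has U+FFFD: %s\n",
    ($lenient =~ /\x{FFFD}/ ? "yes" : "no");

# With FB_CROAK the same input raises an exception.
# Decode a copy, since decode() with CHECK set may modify its argument.
my $copy   = $bad;
my $strict = eval { decode('UTF-8', $copy, FB_CROAK); 1 };
print "strict decode ", ($strict ? "succeeded\n" : "died: $@");
```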

Here's your script with some modifications to illustrate how things
work:

          \,,,/
          (o o)
------oOOo-(_)-oOOo------
use strict;
use Encode;
use XML::LibXML;
# The script is written in UTF-8, but the utf8 pragma is not turned on.
# So the literals in our script yield octet strings, not text strings.
# (Note that it is probably much more convenient to go with the utf8
# pragma if you write your source code in UTF-8.)
my $octets = '<p>Tomas Laurinavičius</p>';
my $txt    = decode_utf8( $octets );
my $txt2   = "<p>Tomas Laurinavi\x{010d}ius</p>";

die if $txt2 ne $txt;    # they're equal
die if $txt2 eq $octets; # they're not equal

# print raw UTF-8 octets; looks correct on UTF-8 terminal
print $octets, $/;
# print text containing wide character to narrow character filehandle
print "$txt WARN$/"; # triggers a warning: "Wide character in print"
binmode STDOUT, ':utf8'; # set to utf8, accepting wide characters
print $txt, $/; # print text to terminal
print $octets, $/; # double encoding, č as four bytes

my $parser = XML::LibXML->new;
# specify encoding for octet string
my $doc = $parser->parse_html_string($octets, {encoding => 'utf-8'});
print $doc->documentElement->toString, $/;
# no need to specify encoding for text string
my $doc2 = $parser->parse_html_string($txt);
print $doc2->documentElement->toString, $/;
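
The "four bytes" in the double-encoding comment above can be checked directly.
This is just a sketch with made-up variable names: the two UTF-8 octets of
U+010D are treated as two Latin-1 characters and encoded a second time.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "c with caron" already encoded as UTF-8: the octets 0xC4 0x8D.
my $c_caron_octets = "\xC4\x8D";

# Encoding an octet string again treats each octet as a character
# (U+00C4 and U+008D), and each of those takes two octets in UTF-8.
my $double = encode('UTF-8', $c_caron_octets);

printf "%d octets: %s\n", length $double,
    join ' ', map { sprintf '%02X', ord } split //, $double;
```
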
-- 
Michael Ludwig