On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote:
> David E. Wheeler wrote on 15.06.2010 at 22:55 (-0700):
> > But the curious thing is, when I pull the offending string out of
> > the RSS and just stick it in a script, Encode knows how to decode it
> > properly, while XML::LibXML (and my Unicode-aware editors) cannot.
> Try passing the parser options as a hash reference:
> my $doc = $parser->parse_html_string($str, {encoding => 'utf-8'});
WTF! That fixes it! I don't understand why it seems to be ignoring the encoding
set in the constructor. But I've noticed the same thing with other options.
Seems like there's some consistency to be worked out in XML::LibXML options,
still.
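
For the archives, here is a minimal sketch of the per-call form that worked for me. The sample string is hypothetical (I am not reproducing the actual feed content), and I am assuming the input is raw UTF-8 octets with no charset declaration in the markup, which is the case where the HTML parser has to be told the encoding:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input: UTF-8 octets (not a decoded character string),
# with no <meta> charset for the HTML parser to pick up.
my $octets = "<p>Laurinavi\xc4\x8dius</p>";

my $parser = XML::LibXML->new;

# Pass the encoding per call, as a hash reference in the second argument.
my $doc = $parser->parse_html_string($octets, { encoding => 'utf-8' });

# Text pulled out of the DOM is a Perl character string.
my $text = $doc->findvalue('//p');

# Encode on the way out so the terminal receives UTF-8 octets.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

Whether the same option is honored when given to the constructor is exactly what I could not get to work, so the sketch only relies on the per-call argument.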
> In order to print Unicode text strings (as opposed to octet strings)
> correctly to a terminal (UTF-8 or not), add the following line before
> the first output:
> binmode STDOUT, ':utf8';
> But note that STDOUT is global.
Yes, I do this all the time. Surprisingly, I don't get warnings for this
script, even though it is outputting multibyte characters.
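
To make the text-versus-octets point concrete, here is a small sketch that writes a character string through an encoding layer and reads the raw bytes back. It uses a temporary file rather than STDOUT so the layer stays on one handle; the sample string and the choice of ':encoding(UTF-8)' over ':utf8' are mine, not from the script under discussion. As far as I know, Perl only emits "Wide character in print" for code points above 0xFF, which may be part of why a script can stay quiet.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A character string containing one code point above 0xFF (U+010D).
my $text = "Laurinavi\x{10d}ius";

# Put the encoding layer on this handle only, leaving STDOUT alone.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh, ':encoding(UTF-8)';
print {$fh} $text;
close $fh or die "close failed: $!";

# Read the file back as raw octets to see what was actually written.
open my $in, '<:raw', $path or die "open failed: $!";
my $octets = do { local $/; <$in> };
close $in;

printf "%d characters were written as %d octets\n",
    length $text, length $octets;
```

The same layer on STDOUT affects every print in the process, which is the caveat about STDOUT being global.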
Thanks,
David