perl-unicode

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 15:59:43
On Jun 16, 2010, at 9:05 AM, David E. Wheeler wrote:

On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote:

Try passing the parser options as a hash reference:

my $doc = $parser->parse_html_string($str, {encoding => 'utf-8'});

WTF! That fixes it! I don't understand why it seems to be ignoring the
encoding set in the constructor, but I've noticed the same thing with other
options. It seems there's still some inconsistency to be worked out in how
XML::LibXML handles its options.

Okay, a bit more information: this was not quite it, alas.

In order to print Unicode text strings (as opposed to octet strings)
correctly to a terminal (UTF-8 or not), add the following line before
the first output:

binmode STDOUT, ':utf8';

But note that STDOUT is global.
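
Since STDOUT is global, one way to avoid changing it for the whole program is to put the encoding layer on a duplicate handle instead. The following is only a sketch of that idea: the variable names ($text, $out, $mem, $buffer) are made up for illustration, and the in-memory handle is there just to show which octets the layer emits.

```perl
use strict;
use warnings;

# A character string (code points), not octets: U+263A WHITE SMILING FACE.
my $text = "smile: \x{263A}\n";

# Duplicate STDOUT and put the encoding layer on the duplicate only,
# so the global STDOUT handle keeps whatever layers it already had.
open my $out, '>&', \*STDOUT or die "Cannot dup STDOUT: $!";
binmode $out, ':encoding(UTF-8)';
print {$out} $text;
close $out;

# Same idea against an in-memory handle, to inspect the octets produced.
my $buffer = '';
open my $mem, '>:encoding(UTF-8)', \$buffer
    or die "Cannot open in-memory handle: $!";
print {$mem} $text;
close $mem;
```

The ':encoding(UTF-8)' layer is used here rather than ':utf8' because it validates what passes through it; that is a choice made for this sketch, not something suggested in the thread.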

Yes, I do this all the time. Surprisingly, I don't get warnings for this 
script, even though it is outputting multibyte characters.

This is key. If I set the binmode on STDOUT to :utf8, the bogus characters
still print as garbage. If I set it to :raw, they come out right after
processing by both Encode and XML::LibXML (I'm assuming they're then being
interpreted as latin-1).

So my question is this: Why isn't Encode dying when it runs into these 
characters? They're not valid utf-8, AFAICT. Are they somehow valid utf8 (that 
is, valid in Perl's internal format)? Why would they be?
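
One possible factor, which may or may not be what is happening in this script, is that Encode's decode() does not die by default: with the default CHECK value, malformed input is replaced with U+FFFD, and it only croaks when asked to via FB_CROAK. A small sketch with a made-up input string:

```perl
use strict;
use warnings;
use Encode qw(decode);

# 0xE9 is Latin-1 "e acute"; followed by a space it is not well-formed UTF-8.
my $octets = "caf\xE9 au lait";

# Default CHECK (FB_DEFAULT): no exception, the bad byte becomes U+FFFD.
my $lenient = decode('UTF-8', $octets);

# FB_CROAK: decode dies on malformed input. LEAVE_SRC keeps $octets intact.
my $strict = eval {
    decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $error = $@;

print "lenient decode contains U+FFFD\n" if $lenient =~ /\x{FFFD}/;
print "strict decode died\n"             if $error;
```

If the input octets happen to be a well-formed encoding of unexpected code points (for example C1 control characters), neither call will complain, because there is nothing malformed for it to detect.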

I think what I need is some code to strip non-utf8 characters from a string -- 
even if that string has the utf8 bit switched on. I thought that Encode would 
do that for me, but in this case apparently not. Anyone got an example?
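
A minimal sketch of one way to do that, assuming the input is an octet string rather than an already-decoded one. The helper name strip_malformed_utf8 is hypothetical and not part of Encode; it relies only on decode() substituting U+FFFD for malformed sequences under the default CHECK.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): decode octets as UTF-8 and
# drop anything that was not well-formed.
sub strip_malformed_utf8 {
    my ($octets) = @_;

    # With the default CHECK, each malformed sequence becomes U+FFFD.
    my $text = decode('UTF-8', $octets);

    # Remove the substitution characters. Caveat: this also removes any
    # U+FFFD that was legitimately present in the input.
    $text =~ s/\x{FFFD}//g;
    return $text;
}

# "\xC3\xAF" is a valid two-byte sequence; the lone "\xE9" is not.
my $clean = strip_malformed_utf8("na\xC3\xAFve \xE9 text");
```

This will not help with octets that are well-formed UTF-8 but encode the wrong characters; that case needs a different fix, such as decoding from the encoding the data was really in.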

Thanks,

David