On Jun 16, 2010, at 9:05 AM, David E. Wheeler wrote:
On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote:
Try passing the parser options as a hash reference:
my $doc = $parser->parse_html_string($str, {encoding => 'utf-8'});
WTF! That fixes it! I don't understand why it seems to be ignoring the
encoding set in the constructor. But I've noticed the same thing with other
options. Seems like there's still some inconsistency to be worked out in how
XML::LibXML handles options.
Okay, a bit more information: this was not quite it, alas.
In order to print Unicode text strings (as opposed to octet strings)
correctly to a terminal (UTF-8 or not), add the following line before
the first output:
binmode STDOUT, ':utf8';
But note that STDOUT is global.
Yes, I do this all the time. Surprisingly, I don't get warnings for this
script, even though it is outputting multibyte characters.
This is key. If I set the binmode on STDOUT to :utf8, the bogus characters
come out garbled. If I set it to :raw, they come out right after processing by
both Encode and XML::LibXML (I'm assuming they're interpreted as latin-1).
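For what it's worth, here is a minimal sketch of one way that :raw vs. :utf8 difference can arise. It assumes (and this is only an assumption about the data) that the string holds UTF-8 octets that were never decoded, so Perl sees each octet as a separate Latin-1 character. The helper octets_via_layer is hypothetical, and it writes to an in-memory handle so STDOUT is left alone:

```perl
use strict;
use warnings;

# Assumed input: the two UTF-8 octets for "e-acute" (0xC3 0xA9), never decoded,
# so Perl treats them as two separate characters below U+0100.
my $undecoded = "\xC3\xA9";

# Hypothetical helper: print a string through the given layer into an
# in-memory handle and return the octets that were actually written.
sub octets_via_layer {
    my ($layer, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "Cannot close in-memory handle: $!";
    return $buffer;
}

my $raw  = octets_via_layer(':raw',  $undecoded);  # octets pass through as-is
my $utf8 = octets_via_layer(':utf8', $undecoded);  # each character is encoded again

printf "raw:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $raw;
printf "utf8: %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf8;
```

Under that assumption, :raw emits the original two octets, which a UTF-8 terminal renders as the intended character, while :utf8 encodes each of them a second time and produces four octets of mojibake.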
So my question is this: Why isn't Encode dying when it runs into these
characters? They're not valid utf-8, AFAICT. Are they somehow valid utf8 (that
is, valid in Perl's internal format)? Why would they be?
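On the "why isn't Encode dying" part: with the default CHECK argument, decode() quietly substitutes U+FFFD for malformed input instead of throwing, and it only croaks when asked to via FB_CROAK. The sketch below shows both behaviours on a made-up octet string; whether this explains the data in this thread is an assumption, since characters that were already mis-decoded are perfectly valid Unicode and will pass any check.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Made-up sample: Latin-1 octets. A lone 0xE9 followed by a space is not
# a valid UTF-8 sequence.
my $octets = "caf\xE9 au lait";

# Default CHECK: no exception, the malformed octet becomes U+FFFD.
my $lenient = decode('UTF-8', $octets);

# FB_CROAK: decode() dies on the first malformed sequence.
# LEAVE_SRC keeps $octets from being modified in place.
my $strict = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
my $error  = $@;

printf "lenient: character 3 is U+%04X\n", ord substr($lenient, 3, 1);
print defined $strict ? "strict: decoded\n" : "strict: died: $error";
```

Note also that Encode distinguishes 'UTF-8' (strict) from 'utf8' (Perl's laxer variant), so the encoding name passed to decode() matters as well.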
I think what I need is some code to strip non-utf8 characters from a string --
even if that string has the utf8 bit switched on. I thought that Encode would
do that for me, but in this case apparently not. Anyone got an example?
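A rough sketch of one way to do the stripping, assuming the input is still an octet string (not yet decoded). The helper name strip_invalid_utf8 is hypothetical. It leans on decode()'s default behaviour of substituting U+FFFD for malformed sequences and then deletes those substitutes, which also removes any genuine U+FFFD already in the data:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode octets as strict UTF-8 and drop whatever
# could not be decoded. Expects octets, not an already-decoded string.
sub strip_invalid_utf8 {
    my ($octets) = @_;

    # With the default CHECK, each malformed sequence becomes U+FFFD.
    my $text = decode('UTF-8', $octets);

    # Delete the substitution characters (and any real U+FFFD in the input).
    $text =~ tr/\x{FFFD}//d;

    return $text;
}

# Made-up sample: a valid two-octet sequence (0xC3 0xA9) and a stray 0xE9.
my $clean = strip_invalid_utf8("caf\xC3\xA9 \xE9 ok");

printf "%d characters after stripping\n", length $clean;
```

This will not help with a string that was already decoded wrongly: those characters are valid code points, just not the intended ones, so nothing here flags them.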
Thanks,
David