Re: Variation In Decoding Between Encode and XML::LibXML

On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote:

So it may be valid UTF-8, but why does it come out looking like crap? That
is, "LaurinaviÃ≥ÂŸius"? I suppose there's an argument that
"LaurinaviÄŸius" is correct and valid, if ugly. Maybe?


I am unsure if this is the explanation you are looking for but here goes:

I think the original data contained the character \x{010d}. In utf-8, that 
means that it should be represented as the bytes \x{c4} and \x{8d}. If those 
bytes are not marked as in fact being a two-byte utf-8 encoding of a single 
character, or if an application reading the data mistakenly thinks it is not 
encoded (both common errors), somewhere along the transmission an application 
may decide that it needs to re-encode the characters in utf-8. 

So the original character \x{010d} is represented by the bytes \x{c4} and
\x{8d}. An application then thinks those bytes are in fact characters and
encodes them again as \x{c3} + \x{84} and \x{c2} + \x{8d}, respectively, which
I believe is your broken data.
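
For anyone who wants to see the double encoding described above happen, here is a minimal sketch. It is written in Python purely for illustration, not in Perl, and it assumes the confused application reads the bytes as Latin-1 (one byte per character). The function names are made up for this example.

```python
# Illustration only: reproduce the double encoding of U+010D.
# Assumption: the misreading application treats each byte as a Latin-1 character.

def double_encode(text: str) -> bytes:
    """Encode to UTF-8, misread the bytes as characters, then encode again."""
    utf8_bytes = text.encode("utf-8")
    misread = utf8_bytes.decode("latin-1")  # each byte becomes one character
    return misread.encode("utf-8")


def undo_double_encode(data: bytes) -> str:
    """Reverse the extra round. Only valid if data really was double encoded."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


original = "\u010d"                 # LATIN SMALL LETTER C WITH CARON
once = original.encode("utf-8")     # the correct UTF-8 bytes
twice = double_encode(original)     # the broken, double-encoded bytes

print(once.hex(" "))                          # c4 8d
print(twice.hex(" "))                         # c3 84 c2 8d
print(undo_double_encode(twice) == original)  # True
```

Note that the double-encoded bytes are still perfectly valid UTF-8, which is why a parser accepts them even though the text looks wrong.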


I see. That makes sense. FYI, the original source is at:

  
http://pipes.yahoo.com/pipes/pipe.run?Size=Medium&_id=f53b7bed8b88412fab9715a995629722&_render=rss&max=50&nsid=1025993%40N22

Look for "Tomas" in the output. If it doesn't show up, change max=50 to max=75 
or something.

I think the error comes from Perl's handling of utf-8 data and that this 
handling has changed in subtle ways all the way since Perl 5.6. We have 
supported utf-8 in our applications since Perl 5.6 and have experienced this 
repeatedly. Any major upgrade of Perl, or indeed the much-needed upgrade of
DBD::ODBC that Martin Evans provided, has given us a lot of work trying to
sort out these troubles.


Maintaining the backwards compatibility from the pre-utf8 days must make it far 
more difficult than it otherwise would be.

I wonder if your code would work fine in Perl 5.8? We are "only" at 5.10(.1) 
but the upgrade from 5.8 to 5.10 also gave us some utf-8 trouble. If it works 
fine in Perl 5.8 maybe the error is in an assumption somewhere in XML::LibXML?


In my application, I finally got XML::LibXML to choke on the invalid 
characters, and then found that the problem was that I was running 
Encode::CP1252::zap_cp1252 against the string before passing it to XML::LibXML. 
Once I removed that, it stopped choking. So clearly zap_cp1252 was changing 
bytes it should not have. I now have it running fix_cp1252 *after* the parsing, 
when everything is already UTF-8. Now that I think about it, though, I should 
probably change it so that it searches on characters instead of bytes when 
working on a utf8 string. Will have to look into that.
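
To illustrate the bytes-versus-characters point, here is a rough sketch in Python (for illustration only; it is not the code of zap_cp1252 or fix_cp1252). The helpers zap_bytes and zap_chars are hypothetical, and the table covers just two CP1252 curly quotes. It shows how a byte-level replacement can hit a UTF-8 continuation byte, while a character-level one leaves the decoded text alone.

```python
# Hypothetical helpers for illustration; not the real CP1252 fixing code.
# Only two CP1252 code points are handled: 0x93 and 0x94 (curly double quotes).
CP1252_QUOTES = {0x93: '"', 0x94: '"'}


def zap_bytes(data: bytes) -> bytes:
    """Replace CP1252 quote bytes wherever they occur in a byte string."""
    out = data
    for byte, repl in CP1252_QUOTES.items():
        out = out.replace(bytes([byte]), repl.encode("ascii"))
    return out


def zap_chars(text: str) -> str:
    """Replace the same code points, but only as whole decoded characters."""
    return text.translate(CP1252_QUOTES)


# U+0113 encodes to the UTF-8 bytes c4 93, so its second byte looks like
# a CP1252 curly quote to a byte-level search.
utf8 = "\u0113".encode("utf-8")
damaged = zap_bytes(utf8)

try:
    damaged.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

print(still_valid)                        # False: the byte-level pass broke it
print(zap_chars("\u0113") == "\u0113")    # True: the character is untouched
print(zap_chars("\x93hi\x94"))            # "hi"
```

The byte-level version is only safe on data that is known not to be UTF-8 yet, which matches what happened when it was run before parsing.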

In the meantime, I'll just accept that sometimes the characters are valid UTF-8 
and look like shit. Frankly, when I run the above feed through NetNewsWire, the 
offending byte sequence displays as "Ä", just as it does in my app's output. So 
I blame Yahoo.

Thanks for the detailed explanation, Henning, much appreciated.

Best,

David