perl-unicode

RE: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 04:06:23
Hello (loved your PostgreSQL presentation at the most recent OSCON, BTW)

Which editor do you use? When loading the script in Komodo IDE 5.2 the string 
looks broken. When I run the script (ActivePerl 5.10.1 on Windows), only the 
second line is correct - the first (no surprise) and third are broken.

Loading the file in UltraEdit-32 13.20+3, set to not convert the script on 
loading, it becomes obvious that what should have been one character is 
represented by 4 bytes, \xC3 \x84 \xC2 \x8D, which modern editors would 
probably show as 2 characters and as broken.
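
A minimal sketch of how those 4 bytes could arise, assuming the feed's UTF-8 
bytes were misread as Latin-1 and then encoded as UTF-8 a second time (that 
failure mode is my guess, and hex_bytes is just a helper name made up for this 
illustration):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+010D LATIN SMALL LETTER C WITH CARON, the character in question.
my $char = "\x{010D}";

# Correct UTF-8 encoding: two bytes.
my $utf8 = encode('UTF-8', $char);

# Assumed failure mode: the two bytes are read as two Latin-1 characters
# (U+00C4 and U+008D) and then encoded as UTF-8 again.
my $double = encode('UTF-8', decode('ISO-8859-1', $utf8));

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

print hex_bytes($utf8),   "\n";   # C4 8D
print hex_bytes($double), "\n";   # C3 84 C2 8D
```

The second line of output matches the bytes I see in UltraEdit.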

It looks to me like the string is being displayed as a byte representation of 
the characters, if that makes sense. My English isn't perfect :-/ and what I am 
trying to say is that this is a problem that I am quite familiar with. It 
happens whenever the source and the reader do not agree on whether a string is 
encoded in UTF-8 or not.

Apparently Encode fixes the incorrect string which is nice. The interesting 
thing is, where should this be fixed? If it's at Yahoo! Pipes you'll probably 
have to use Encode as a work-around for some time...
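
For what it's worth, a work-around along those lines might look like the sketch 
below. It assumes the damage is exactly one extra round of UTF-8 encoding, and 
it is not necessarily what your attached script does:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The 4 bytes as they appear in the file.
my $mangled = "\xC3\x84\xC2\x8D";

# First decode yields two characters, U+00C4 and U+008D.
my $chars = decode('UTF-8', $mangled);

# Mapping those characters back to single bytes via Latin-1 gives C4 8D ...
my $bytes = encode('ISO-8859-1', $chars);

# ... which is valid UTF-8 for the intended character.
my $fixed = decode('UTF-8', $bytes);

printf "U+%04X\n", ord($fixed);   # U+010D
```

This only makes sense for strings known to be double-encoded; applying it to 
correctly encoded text would damage it.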


Best regards
Henning Michael Møller Just




-----Original Message-----
From: David E. Wheeler [mailto:david(_at_)kineticode(_dot_)com] 
Sent: Wednesday, June 16, 2010 7:56 AM
To: perl-unicode(_at_)perl(_dot_)org
Subject: Variation In Decoding Between Encode and XML::LibXML

Fellow Perlers,

I'm parsing a lot of XML these days, and came upon a Yahoo! Pipes feed that 
appears to mangle an originating Flickr feed. But the curious thing is, when I 
pull the offending string out of the RSS and just stick it in a script, Encode 
knows how to decode it properly, while XML::LibXML (and my Unicode-aware 
editors) cannot.

The attached script demonstrates. $str has the bogus-looking character. 
Encode, however, seems to properly convert it to the "č" in "Laurinavičius" in 
the output. XML::LibXML, OTOH, outputs it as "LaurinaviÄius" -- that is, 
broken. (If things look truly borked in this email too, please look at the 
attached script.)

So my question is, what gives? Is this truly a broken representation of the 
character and Encode just figures that out and fixes it? Or is there something 
off with my editor and with XML::LibXML?

FWIW, the character looks correct in my editor when I load it from the original 
Flickr feed. It's only after processing by Yahoo! Pipes that it comes out 
looking mangled.

Any insights would be appreciated.

Best,

David