Re: Variation In Decoding Between Encode and XML::LibXML

At 13:24 -0700 17/6/10, David E. Wheeler wrote:

On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote:

So the original character \x{010d} is represented by the bytes \x{c4} and \x{8d}; an application thinks those are in fact characters and encodes them again as \x{c3} + \x{84} and \x{c2} + \x{8d}, respectively. Which I believe is your broken data.
I see. That makes sense. FYI, the original source is at:


http://pipes.yahoo.com/pipes/pipe.run?Size=Medium&_id=f53b7bed8b88412fab9715a995629722&_render=rss&max=50&nsid=1025993%40N22

In the meantime, I'll just accept that sometimes the characters are valid UTF-8 and look like shit. Frankly, when I run the above feed through NetNewsWire, the offending byte sequence displays as "Ä", just as it does in my app's output. So I blame Yahoo.

Quite right. Now that I see the file, it is clear that the encoding has been done twice, each of the two bytes for the c-with-caron being encoded again to produce four bytes.
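As a rough illustration of the double encoding described above, here is a minimal sketch. It assumes the second pass treated each byte as a Latin-1 character, which matches the byte values quoted earlier; the helper name hex_bytes is made up for the example and is not part of Encode.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+010D, LATIN SMALL LETTER C WITH CARON, as a one-character string.
my $char = "\x{010d}";

# First encoding: the correct UTF-8 form, two bytes.
my $once = encode("UTF-8", $char);

# Second encoding (the mistake, assumed here to go via Latin-1):
# each byte is taken for a character and encoded again, giving four bytes.
my $twice = encode("UTF-8", decode("iso-8859-1", $once));

# Undoing it: decode once as UTF-8, then map the resulting
# characters back to single bytes.
my $repaired = encode("iso-8859-1", decode("UTF-8", $twice));

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    return join " ", map { sprintf("%02x", ord($_)) } split(//, $_[0]);
}

print "once:     ", hex_bytes($once),     "\n";   # c4 8d
print "twice:    ", hex_bytes($twice),    "\n";   # c3 84 c2 8d
print "repaired: ", hex_bytes($repaired), "\n";   # c4 8d
```

The four bytes in the "twice" line are the sequence that shows up in the feed; one UTF-8 decode pass brings them back to the original two.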


If I save the file and undo the second encoding, I get the proper output:


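#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# Saved copy of the feed; adjust the path as needed.
my $f = "$ENV{HOME}/desktop/pipe.run";
open my $fh, '<:raw', $f or die "Cannot open $f: $!";
binmode STDOUT;    # write the decoded characters out as raw bytes
while (<$fh>) {
        # One decode pass strips the extra layer of UTF-8 encoding.
        print decode("utf-8", $_);
}
close $fh;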



JD
