perl-unicode

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 18:28:27
At 13:24 -0700 17/6/10, David E. Wheeler wrote:

On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote:


So the original character \x{010d} is represented by the bytes \x{c4} and \x{8d}; an application then thinks those are in fact characters and encodes them again as \x{c3} + \x{84} and \x{c2} + \x{8d}, respectively. I believe that is your broken data.

I see. That makes sense. FYI, the original source is at:


http://pipes.yahoo.com/pipes/pipe.run?Size=Medium&_id=f53b7bed8b88412fab9715a995629722&_render=rss&max=50&nsid=1025993%40N22



In the meantime, I'll just accept that sometimes the characters are valid UTF-8 and look like shit. Frankly, when I run the above feed through NetNewsWire, the offending byte sequence displays as "Ä", just as it does in my app's output. So I blame Yahoo.


Quite right. Now that I see the file, it is clear that the encoding has been done twice: each of the two bytes of the c-with-caron has been encoded again, producing four bytes.
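
A minimal sketch of how such double encoding arises, assuming the intermediate application treated each UTF-8 byte as a Latin-1 character (the variable names and the hex_bytes helper are illustrative only, not taken from the feed or any library):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $char  = "\x{010d}";               # c-with-caron as a character
my $once  = encode('UTF-8', $char);   # correct encoding: two bytes
my $twice = encode('UTF-8', $once);   # each byte encoded again as if it were a character

print hex_bytes($once),  "\n";        # C4 8D
print hex_bytes($twice), "\n";        # C3 84 C2 8D
```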

If I save the file and undo the second encoding, I get the proper output:


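#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# Saved copy of the feed; adjust the path as needed.
my $f = "$ENV{HOME}/desktop/pipe.run";
open my $fh, '<:raw', $f or die "Cannot open $f: $!";
binmode STDOUT, ':raw';
while (my $line = <$fh>) {
        # One decode undoes the second encoding. The resulting characters
        # are all below U+0100, so print writes them as single bytes,
        # which are the original, singly encoded UTF-8.
        print decode("utf-8", $line);
}
close $fh;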
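
The same repair can be checked in memory without the saved file. This is only a sketch: the broken data is simulated rather than read from the feed, and it assumes the second encoding treated the bytes as Latin-1 (not, say, Windows-1252). All variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Simulate the broken feed data: U+010D encoded to UTF-8 twice.
my $broken = encode('UTF-8', encode('UTF-8', "\x{010d}"));

# The first decode undoes the spurious second encoding,
# leaving the two characters U+00C4 and U+008D.
my $step = decode('UTF-8', $broken);

# Those characters are all below U+0100, so map them back to bytes...
my $bytes = encode('ISO-8859-1', $step);

# ...and decode once more to recover the real character.
my $fixed = decode('UTF-8', $bytes);

printf "U+%04X\n", ord($fixed);   # U+010D
```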



JD
