At 13:24 -0700 17/6/10, David E. Wheeler wrote:
> On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote:
>
>> So the original character \x{010d} is represented by the bytes
>> \x{c4} and \x{8d}, an application thinks those are in fact
>> characters and encodes them again as \x{c3} + \x{84} and \x{c2} +
>> \x{8d}, respectively. Which I believe is your broken data.
>
> I see. That makes sense. FYI, the original source is at:
>
> http://pipes.yahoo.com/pipes/pipe.run?Size=Medium&_id=f53b7bed8b88412fab9715a995629722&_render=rss&max=50&nsid=1025993%40N22
>
> In the meantime, I'll just accept that sometimes the characters are
> valid UTF-8 and look like shit. Frankly, when I run the above feed
> through NetNewsWire, the offending byte sequence displays as "Ä",
> just as it does in my app's output. So I blame Yahoo.
Quite right. Now that I see the file, it is clear that the encoding
has been done twice: each of the two bytes for the c-with-caron has
been encoded again, producing four bytes.
If I save the file and undo the second encoding, I get the proper output:
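#!/usr/bin/perl
use strict;
use warnings;
no warnings 'utf8';   # silence "Wide character" if a line was not double-encoded
use Encode;
my $f = "$ENV{HOME}/desktop/pipe.run";
# Read raw bytes; the saved feed is UTF-8 that has been encoded twice.
open my $fh, '<', $f or die "Cannot open $f: $!";
while (<$fh>) {
    # Decoding once yields characters below U+0100, which print
    # writes out as single bytes, i.e. the correctly encoded UTF-8.
    print decode("utf-8", $_);
}
close $fh;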
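
For anyone who wants to see the effect in isolation, here is a minimal sketch that reproduces the double encoding for the single character \x{010d} and then reverses it. It assumes the faulty application read the UTF-8 bytes as Latin-1 before encoding again, which matches the bytes quoted above but is not confirmed for Yahoo's pipeline. The helper hex_bytes is a hypothetical name used only for display.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: hex dump of a byte string, e.g. "c4 8d".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $char = "\x{010d}";                 # c with caron
my $once = encode('UTF-8', $char);     # correct encoding: two bytes

# The assumed faulty step: treat each byte as a Latin-1 character
# and encode the result as UTF-8 a second time (four bytes).
my $twice = encode('UTF-8', decode('iso-8859-1', $once));

# The repair: decode once, then map each character back to one byte.
my $fixed = encode('iso-8859-1', decode('UTF-8', $twice));

print 'once:  ', hex_bytes($once),  "\n";   # c4 8d
print 'twice: ', hex_bytes($twice), "\n";   # c3 84 c2 8d
print 'fixed: ', hex_bytes($fixed), "\n";   # c4 8d
```

Unlike the script above, this version re-encodes to Latin-1 explicitly rather than relying on print to emit characters below U+0100 as single bytes, so it would also behave predictably if an output layer were set on STDOUT.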
JD