
Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 02:05:56
At 00:27 +0100 18/6/10, I wrote:

If I save the file and undo the second decoding I get the proper output

In this case all talk of iso-8859-1 and cp1252 is a red herring. I read several Italian websites where this same problem shows up in external material such as ads. The news page proper is encoded correctly and declared as utf-8, but I imagine the web designers have reckoned that whatever they receive from the advertisers is most likely to be windows-1252, and they convert it accordingly rather than bother to verify the encoding. As a result, material that actually arrives as utf-8 undergoes a superfluous encoding.
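To make the effect concrete, here is a minimal sketch of what I take that superfluous step to be: correct utf-8 bytes are misread as windows-1252 and encoded to utf-8 a second time. The sample word and the helper hex_bytes are only for illustration, and the choice of cp1252 as the wrongly assumed encoding is my guess at what these sites do.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join " ", map { sprintf "%02x", ord($_) } split //, $bytes;
}

my $text = "perch\x{e9}";              # sample word ending in e-acute
my $good = encode("UTF-8", $text);     # correct UTF-8 bytes

# The superfluous step (assumed): the bytes are taken for windows-1252
# and encoded to UTF-8 once more.
my $bad = encode("UTF-8", decode("cp1252", $good));

print "correct:        ", hex_bytes($good), "\n";
print "double-encoded: ", hex_bytes($bad),  "\n";
```

The e-acute is two bytes (c3 a9) in the correct version and four (c3 83 c2 a9) after the second encoding, which is the familiar garbage a browser shows in the ads.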

Here's a way to get the file in question properly encoded:

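use strict;
use warnings;
use LWP::Simple;
use Encode;

my $tempdir  = "/tmp";
my $tempfile = "tempfile";
my $f   = "$tempdir/$tempfile";
my $uri = "";   # URL of the page in question
my $encoding = find_encoding("utf-8");   # look the encoding up once, not per line

# getstore returns the HTTP status code, so test it with is_success
if (is_success(getstore($uri, $f))) {
  open my $fh, "<", $f or die "Cannot open $f: $!";
  while (<$fh>) {
    # Undo the superfluous layer of encoding. With no output layer on
    # STDOUT, characters below U+0100 are written back as single bytes,
    # which are the original, correct utf-8.
    my $utf8 = $encoding->decode($_);
    no warnings "utf8";   # avoid wide character warning
    print $utf8;
  }
  close $fh;
}
unlink $f;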
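
The same repair can also be done in memory, with each step spelled out rather than relying on how print treats a decoded string. This is only a sketch: undo_double_encoding is a name made up for illustration, and the default of cp1252 for the wrongly assumed encoding is an assumption (pass "iso-8859-1" if that fits the damaged text better).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: reverse one superfluous round of UTF-8 encoding.
# $intermediate is the encoding the bytes were wrongly assumed to be in.
sub undo_double_encoding {
    my ($bytes, $intermediate) = @_;
    $intermediate ||= "cp1252";
    my $chars = decode("UTF-8", $bytes);        # strip the outer, superfluous layer
    my $inner = encode($intermediate, $chars);  # back to the original UTF-8 bytes
    return decode("UTF-8", $inner);             # the intended character string
}

my $damaged = "perch\xc3\x83\xc2\xa9";          # double-encoded sample word
my $fixed   = undo_double_encoding($damaged);

binmode STDOUT, ":encoding(UTF-8)";             # explicit layer, so no wide character warning
print "$fixed\n";
```

Because the result is a proper character string and STDOUT has an explicit encoding layer, there is no need to silence warnings here.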