LWP, I believe, just ships octets about. But it should have a
mechanism to tell you the metadata that HTTP marked those octets
with - in this case there should be something like a charset
parameter on the Content-Type header that tells _you_ what name to
feed to Encode to turn the bytes into Unicode characters.
You then have to decide how you are going to present the resulting
characters in the HTML you are generating; you probably want to
re-encode as UTF-8 if you are presenting mixed languages.
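To make that concrete, here is a minimal sketch of that round trip
(the header value and the sample bytes are invented for illustration;
LWP would be handing you the real ones):

```perl
use strict;
use warnings;
use Encode qw(find_encoding encode_utf8);

# Hypothetical stand-ins for what LWP would give you: the
# Content-Type header and the raw octets of the body.
my $content_type = 'text/html; charset=EUC-JP';
my $octets       = "\xB4\xC1\xBB\xFA";    # "kanji" written in EUC-JP

# Pull the charset name out of the header and feed it to Encode.
my ($charset) = $content_type =~ /charset=["']?([\w.-]+)/i;
my $enc = find_encoding($charset) or die "unknown charset: $charset";

my $characters = $enc->decode($octets);      # Perl character string
my $as_utf8    = encode_utf8($characters);   # re-encoded for output
```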
Some time ago, following in part Nick's advice, I wrote a script that
retrieves pages and, if it recognizes that they are in a Japanese
encoding, converts them to Unicode and formats them.
I paste it below; perhaps you can recycle the relevant parts for
your script.
Regards,
Marco
**************************************************
print_pages_from_url_list.jp.pl
**************************************************
#!/usr/bin/perl
# to see the documentation, type:
# print_pages_from_url_list.jp.pl -h
use strict;
use warnings;
use Getopt::Std;
use LWP;
use HTML::TreeBuilder;
use Encode;
binmode(STDOUT, ":utf8");
my $usage = <<"_USAGE_";
Usage:
print_pages_from_url_list.jp.pl url_list
print_pages_from_url_list.jp.pl -h | more
This script downloads the HTML of each page in an input URL list and
formats its contents as plain text.
The input list must have one URL per line. Lines that do not begin
with http are ignored.
URLs with obviously non-HTML suffixes (doc, jpg, pdf, etc.) are also
ignored.
For the remaining URLs, the script looks at the charset declared in
the HTML code.
If the charset is recognized as Japanese (via regexps that match
common spellings of the relevant encodings), the script converts the
corresponding page to UTF-8 and prints it out (preceded by the line
"CURRENT URL url").
In future versions, I would like to print warnings about pages that are
not recognized, and to do better formatting of the pages inside the
script.
This program is free software. You may copy or redistribute it under
the same terms as Perl itself.
_USAGE_
my %opts = ();
getopts('h', \%opts);
if ($opts{h}) {
    print $usage;
    exit;
}
my $browser;
my $html_text;
while (<>) {
    if ($_ !~ /^http/) {
        next;
    }
    # Reset these for each URL, so that a page without a charset
    # declaration does not inherit the previous page's encoding.
    my $fileflag = "NA: no charset specified";
    my $charset  = "NA";
    my ($url) = $_;
    chomp $url;
    if ($url !~ /\.(ps|gz|pdf|gif|jpg|jpeg|doc|xls|ppt|rtf)$/i) {
        if ($html_text = do_GET($url)) {
            if ($html_text =~ /charset\s*=\s*["']?([^"'>\s]+)/i) {
                $charset = $1;
                if ($charset =~ /euc.*jp/i) {
                    $fileflag = "euc-jp";
                }
                elsif ($charset =~ /s.*jis/i) {
                    $fileflag = "shiftjis";
                }
                elsif ($charset =~ /utf.*8/i) {
                    $fileflag = "utf8";
                }
                elsif ($charset =~ /iso-2022-jp/i) {
                    $fileflag = "iso-2022-jp";
                }
                else {
                    $fileflag = "NA: unknown charset $charset";
                }
            }
            if ($fileflag =~ /^NA/) {
                next;
            }
            my $encoding = find_encoding($fileflag);
            my $unicode = $encoding->decode($html_text);
            my $tree = HTML::TreeBuilder->new;
            $tree->parse(encode_utf8($unicode));
            $| = 1;
            print "CURRENT URL $url\n";
            print "\n";
            print decode("utf8", $tree->as_text);
            print "\n";
            $tree->delete;    # free the parse tree before the next page
        }
    }
}
sub do_GET {
    # this is adapted from the Perl & LWP book
    unless ($browser) {
        $browser = LWP::UserAgent->new();
        $browser->timeout(10);
        $browser->env_proxy();
    }
    my $response;
    # The eval block is there to try to survive failed NTLM
    # authentications... let's hope it doesn't cause other kinds
    # of trouble...
    eval { $response = $browser->get(@_); };
    if ($@) {
        print STDERR "something went wrong: $@\n";
        return;
    }
    return unless $response->is_success;
    return $response->content;
}
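One possible simplification, for what it's worth: Encode already
knows most of these charset spellings as aliases, so the regexp
cascade in the script could probably be replaced by a single lookup
via Encode::resolve_alias (a sketch; the names below are just
examples of spellings found in the wild):

```perl
use strict;
use warnings;
use Encode ();

# resolve_alias maps the many charset spellings found in web pages
# onto Encode's canonical encoding names; it returns false for
# spellings it does not know.
for my $name ('EUC-JP', 'x-sjis', 'Shift_JIS', 'utf8', 'iso-2022-jp') {
    my $canonical = Encode::resolve_alias($name) || 'unknown';
    print "$name -> $canonical\n";
}
```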