perl-unicode

Re: Does LWP know anything (or need to know anything) about Unicode?

2004-10-11 02:30:07
LWP I believe just ships octets about.

But it should have a mechanism to tell you the meta-data that
HTTP marked those octets with - in this case there should
be something like a content-transfer-encoding header that
tells _you_ what name to feed to Encode to get bytes as Unicode.
You then have to decide how you are going to present the resulting
characters in the HTML you are generating. You probably want
to re-encode as UTF-8 if presenting mixed languages.

Some time ago, following in part Nick's advice, I did a script that retrieves pages and, if it recognizes that they are in a Japanese encoding, converts them to unicode and formats them.

I paste it below.

Perhaps, you can recycle the relevant parts for your script.

Regards,

Marco


**************************************************
print_pages_from_url_list.jp.pl
**************************************************

#!/usr/bin/perl

# to see the documentation, type:
# print_pages_from_url_list.jp.pl -h

use strict;
use warnings;

use Getopt::Std;

use LWP;
use HTML::TreeBuilder;

use Encode;

binmode(STDOUT, ":utf8");

my $usage = <<"_USAGE_";
Usage:

    print_pages_from_url_list.jp.pl url_list
    print_pages_from_url_list.jp.pl -h | more

 This script downloads all the html in the pages corresponding to an
 input url list, and it formats their contents as text.

 The input list must have one url per line. Lines that do not begin
 with http are ignored.

 Urls with obviously non-html suffixes (doc, jpg, pdf, etc.)  are also
 ignored.

 For the remaining urls, the script looks at the charset in the html
 code.

 If charset is recognized as a Japanese charset (via regexps that
 recognize some variations of all encodings), the script converts the
 corresponding page to utf8 and it prints out the corresponding page
 (preceeded by line "CURRENT URL url").

 In future versions, I would like to print warnings about pages that are
not recognized, and to do better formatting of the pages inside the script.

 This program is free software. You may copy or redistribute it under
 the same terms as Perl itself.

_USAGE_

my %opts = ();
getopts('h',\%opts);

if ($opts{h}) {
    print $usage;
    exit;
}


my $browser;
my $html_text;

my $fileflag = "NA: no charset specified";
my $charset = "NA";

while (<>) {

    if ($_ !~/^http/) {
        next;
    }

    my ($url) = $_;
    chomp $url;
if ($url !~/\.(ps)|(gz)|(pdf)|(gif)|(jpg)|(jpeg)|(doc)|(xls)|(ppt)|(rtf)$/i) {
        if ($html_text = do_GET($url)) {
            if ($html_text =~ /charset=([^\"]+)\"/i) {
                $charset = $1;
                if ($charset =~/euc.*jp/i) {
                    $fileflag = "euc-jp";
                }
                elsif ($charset =~/.*s.*jis/i) {
                    $fileflag = "shiftjis";
                }
                elsif ($charset =~/utf.*8/i) {
                    $fileflag = "utf8";
                }
                elsif ($charset =~/iso-2022-jp/) {
                    $fileflag = "iso-2022-jp";
                }
                else {
                    $fileflag = "NA: unknown charset $charset";
                }
            }
        
            if ($fileflag =~/^NA/) {
                next;
            }

            my $encoding = find_encoding($fileflag);
            my $unicode = $encoding->decode($html_text);

            my $tree = HTML::TreeBuilder->new;
            $tree->parse(encode_utf8($unicode));

            $| = 1;

            print "CURRENT URL $url\n";
        
            print "\n";

            print decode("utf8",$tree->as_text);
            print "\n";
        
        }
    }
}

sub do_GET {
    # this is taken from the perl & lwp book (but I changed it a bit)

    $browser = LWP::UserAgent->new() unless $browser;
    $browser->timeout(10);
    $browser->env_proxy();   

    my $response;

    # I've put the following in an eval block to try to survive
    # failed ntlm authentications... let's hope it doesn't
    # cause other kinds of trouble...

    eval {$response = $browser->get(@_);};

    if ($@) {
        print STDERR "something went wrong: $(_at_)\n";
        return;
    }

    return unless $response->is_success;

    return $response->content;
}

<Prev in Thread] Current Thread [Next in Thread>