perl-unicode

utf8, japanese, web-pages, the horror, the horror...

2004-05-07 06:30:05
Dear perl-unicode gurus,

I've been struggling for days with this but I still can't see the light at
the end of the byte/character tunnel... Any advice will be greatly
appreciated.

I have a script that takes a list of urls, retrieves the corresponding
web-pages and prints their contents.

I would like to adapt it to Japanese.

Input is in utf8, output can be in utf8 or any other encoding, as long as
I know in which encoding it is.

Of course, the Japanese web pages can be in any one of many encodings, so I
thought I would read the charset off the HTML (/charset=([^\"]+)\"/) and
recode everything to utf8 using from_to.
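
In isolation, the idea looks roughly like the sketch below. The sample string
is made up for illustration (an HTML meta tag followed by the Shift_JIS bytes
for U+65E5 U+672C); it is not taken from a real page, and the variable names
are not the ones in my script.

```perl
use strict;
use warnings;
use Encode 'from_to';

# Made-up sample: a meta tag followed by two kanji encoded as Shift_JIS bytes
my $sample = '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
           . "\x93\xfa\x96\x7b";

my $converted_hex = '';
if ($sample =~ /charset=([^"]+)"/) {
    my $charset = $1;    # "Shift_JIS" for this sample
    # from_to converts $sample in place and returns the new length in
    # octets, or undef if the conversion failed
    my $octets = from_to($sample, $charset, "utf8");
    die "conversion from $charset failed\n" unless defined $octets;
    # hex dump of the last six octets, i.e. the two converted characters
    $converted_hex = unpack("H*", substr($sample, -6));
}
print "$converted_hex\n";    # prints e697a5e69cac
```

As far as I understand the Encode documentation, from_to works on octets and
leaves octets behind, so $sample here ends up holding the utf8 bytes of the
page rather than a string of characters.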

Here is a simplified version of the script with what I believe to be the 
relevant parts:

******************************************************
******************************************************
#!/usr/bin/perl

# print_pages_from_url_list.jp.pl

use strict;
use warnings;
use LWP;
use Encode 'from_to';

# output will be utf8
binmode(STDOUT, ":utf8");

my $browser;
my $html_text;

my $ifile = shift;
# input will be utf8
open my $in, "<:encoding(utf8)", $ifile or die "cannot open $ifile: $!";
while (<$in>) {

    # input file is in one-url-per-line format

    # just in case there is some ftp left in the list
    if ($_ !~/^http/) {
        next;
    }

    my ($url) = $_;

    chomp $url;

    # avoid ps, pdf, word and the like
    if ($url !~ /\.(?:ps|gz|pdf|gif|jpe?g|doc|xls|ppt|rtf)$/i) {
        # retrieve web pages
        if ($html_text = do_GET($url)) {
            if ($html_text =~ /charset=([^\"]+)\"/) {
                # find out charset and convert to utf8
                my $charset = $1;
                from_to($html_text,$charset,"utf8");
                # here, I used to send $html_text to HTML::TreeBuilder;
                # for debugging purposes I'm now just stripping off
                # tags
                $html_text =~ s/<[^>]*>//g;
                print "CURRENT URL $url\n$html_text\n";
            }
            else {
                # charset was not specified, so we'd better leave the
                # page alone
                next;
            }
        }
    }
}

sub do_GET {
    # this is taken from the perl & lwp book (but I changed it a bit)

    $browser = LWP::UserAgent->new() unless $browser;
    $browser->timeout(10);
    $browser->env_proxy();

    my $response;

    eval {$response = $browser->get(@_);};

    if ($@) {
        print STDERR "something went wrong: $@\n";
        return;
    }

    return unless $response->is_success;

    return $response->content;
}

******************************************************
******************************************************

In this version, it runs, it doesn't complain, but the output doesn't look
like utf8.

For example, I cannot view it with more (which works fine with other utf8
files on the computer I'm using), and if I try the following:

$ recode utf8..euc-jp <newpages

I immediately get the error:

Invalid input in step `UTF-8..EUC-JP'
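
A rough way to run a similar check from Perl itself is sketched below. It only
tests whether a file is well-formed utf8, which is not necessarily the same
test recode applies, and is_valid_utf8 is a helper name I made up for this
sketch.

```perl
use strict;
use warnings;
use Encode ();

# Returns 1 if the byte string is well-formed UTF-8, 0 otherwise.
# (is_valid_utf8 is a hypothetical helper, not part of Encode.)
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        # FB_CROAK makes decode die on the first malformed sequence
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

# Check each file named on the command line, e.g. newpages
for my $file (@ARGV) {
    open my $fh, '<:raw', $file or die "cannot open $file: $!";
    my $bytes = do { local $/; <$fh> };    # slurp the raw bytes
    close $fh;
    printf "%s: %s\n", $file,
        is_valid_utf8($bytes) ? "valid UTF-8" : "NOT valid UTF-8";
}
```

Saved under some name such as check_utf8.pl (also made up), it would be run as
"perl check_utf8.pl newpages".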

What am I doing wrong?

Why do encodings always cause so much pain?

Arigato!!!!

Marco


--
Marco Baroni
SSLMIT, University of Bologna
http://sslmit.unibo.it/~baroni