utf8, japanese, web-pages, the horror, the horror...
2004-05-07 06:30:05
Dear perl-unicode gurus,
I've been struggling for days with this but I still can't see the light at
the end of the byte/character tunnel... Any advice will be greatly
appreciated.
I have a script that takes a list of urls, retrieves the corresponding
web-pages and prints their contents.
I would like to adapt it to Japanese.
Input is in utf8, output can be in utf8 or any other encoding, as long as
I know in which encoding it is.
Of course, the Japanese webpages can be in one of many encodings, so I
thought I would extract the declared encoding from the html
(/charset=([^\"]+)\"/) and recode everything to utf8 using from_to.
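
To make the approach concrete, here is a minimal sketch of that step on an in-memory sample rather than a fetched page. The sample page and its euc-jp declaration are made up for illustration, and the sketch assumes the Japanese encodings that ship with Encode are installed.

```perl
use strict;
use warnings;
use Encode qw(encode from_to);

# Hypothetical sample: a tiny page declared as euc-jp, built as raw octets
# so that no network access is needed.
my $page = qq{<meta http-equiv="Content-Type" content="text/html; charset=euc-jp">}
         . encode('euc-jp', "\x{65E5}\x{672C}\x{8A9E}");   # three kanji

# Same pattern as in the description: take whatever follows charset=
# up to the closing double quote.
my $charset;
if ($page =~ /charset=([^\"]+)\"/) {
    $charset = $1;
}
die "no charset declared\n" unless defined $charset;

# from_to converts in place and works on octets, not characters.
# It returns the new length in octets, or undef on failure.
my $octets = $page;
my $len = from_to($octets, $charset, 'utf8');
die "conversion from $charset failed\n" unless defined $len;

printf "charset: %s, %d octets after conversion\n", $charset, $len;
```

Note that after from_to the scalar still holds octets (here, UTF-8 encoded ones), not a Perl character string.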
Here is a simplified version of the script with what I believe to be the
relevant parts:
******************************************************
******************************************************
#!/usr/bin/perl
# print_pages_from_url_list.jp.pl
use strict;
use warnings;
use LWP;
use Encode 'from_to';
# output will be utf8
binmode(STDOUT, ":utf8");
my $browser;
my $html_text;
my $ifile = shift;
# input will be utf8
open my $in, "<:encoding(utf8)", $ifile or die;
while (<$in>) {
    # input file is in one-url-per-line format
    # just in case there is some ftp left in the list
    if ($_ !~/^http/) {
        next;
    }
    my ($url) = $_;
    chomp $url;
    # avoid ps, pdf, word and the like
    # (extensions grouped so that \. and $ apply to every alternative)
    if ($url
        !~/\.(?:ps|gz|pdf|gif|jpg|jpeg|doc|xls|ppt|rtf)$/i) {
        # retrieve web pages
        if ($html_text = do_GET($url)) {
            if ($html_text =~ /charset=([^\"]+)\"/) {
                # find out charset and convert to utf8
                my $charset = $1;
                from_to($html_text,$charset,"utf8");
                # here, I used to send $html_text to HTML::TreeBuilder
                # for debugging purposes now I'm just stripping off
                # tags
                $html_text =~ s/<[^>]*>//g;
                print "CURRENT URL $url\n$html_text\n";
            }
            else {
                # charset was not specified, we better leave the
                # page alone
                next;
            }
        }
    }
}
sub do_GET {
    # this is taken from the perl & lwp book (but I changed it a bit)
    $browser = LWP::UserAgent->new() unless $browser;
    $browser->timeout(10);
    $browser->env_proxy();
    my $response;
    eval {$response = $browser->get(@_);};
    if ($@) {
        print STDERR "something went wrong: $@\n";
        return;
    }
    return unless $response->is_success;
    return $response->content;
}
******************************************************
******************************************************
In this version, it runs, it doesn't complain, but the output doesn't look
like utf8.
For example, I cannot view it with more (which on the computer I'm
using works fine with other utf8 files), and if I try the following:
$ recode utf8..euc-jp <newpages
I immediately get the error:
Invalid input in step `UTF-8..EUC-JP'
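
One thing that may be worth checking is what reaches the output when a string that already holds UTF-8 octets is printed through a :utf8 layer. The following is only a diagnostic sketch: the helper bytes_written is hypothetical, an in-memory handle stands in for STDOUT, and whether this is what happens in the script above is an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the octets that actually reach a handle
# opened with the :utf8 layer. An in-memory handle stands in for STDOUT.
sub bytes_written {
    my ($string) = @_;
    my $buf = '';
    open my $fh, '>:utf8', \$buf or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return $buf;
}

# UTF-8 octets for one kanji (U+65E5), as from_to(..., "utf8") would leave them
my $octets = "\xE6\x97\xA5";

# Printed as is: the layer sees three separate characters and encodes each one
my $twice = bytes_written($octets);

# Decoded first: the layer sees one character and encodes it once
my $once = bytes_written(decode('utf8', $octets));

printf "printed as octets:    %d bytes\n", length $twice;
printf "printed after decode: %d bytes\n", length $once;
```

If the first count comes out larger than the second, the octets were encoded a second time on the way out.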
What am I doing wrong?
Why do encodings always cause so much pain?
Arigato!!!!
Marco
--
Marco Baroni
SSLMIT, University of Bologna
http://sslmit.unibo.it/~baroni