Re: utf8, japanese, web-pages: beginning to see the light...

Marco Baroni <baroni(_at_)sslmit(_dot_)unibo(_dot_)it> writes:

A few days ago, I queried this list about my problems with a script 
that finds the charset of Japanese web pages and translates their text 
into utf-8.

The following solution, proposed by Nick Ing-Simmons, worked for my 
case:

   use Encode;   # exports find_encoding by default
   binmode STDOUT,":utf8";
   my $encoding = find_encoding($charset);
   my $unicode = $encoding->decode($text);


      Run HTML::FormatText here with chars in Unicode.

   print $unicode;

($charset is the charset as extracted from the html code of the page 
and $text is all the text from the page itself, as returned by the LWP 
agent.)
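
For anyone who wants to try the decoding step in isolation, here is a
minimal sketch using only the core Encode module. The 'shiftjis' value
and the sample string are assumptions made for illustration; in the real
script $charset comes from the page and $text from the LWP response.

```perl
use strict;
use warnings;
use Encode qw(find_encoding encode);

# Stand-in for the bytes LWP would return: three kanji encoded as
# Shift_JIS. Both values are hypothetical sample data.
my $charset = 'shiftjis';
my $text    = encode($charset, "\x{65E5}\x{672C}\x{8A9E}");

# find_encoding returns undef for a name Encode does not know.
my $encoding = find_encoding($charset)
    or die "Unknown charset: $charset\n";

# Bytes in, Perl character string out.
my $unicode = $encoding->decode($text);

# The :utf8 layer encodes characters as UTF-8 on output.
binmode STDOUT, ':utf8';
print $unicode, "\n";
```

Checking the return value of find_encoding is worthwhile here, since
charset names scraped from HTML are not always ones Encode recognises.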

Thanks a lot to Nick and to all the others who responded to my plea for 
help.

Now for a much less pressing issue: Does anybody know of something 
similar to the HTML::FormatText module that can take utf-8 input, and 
generate utf-8 output?


I doubt it. But if you run it on Unicode characters (as indicated above),
then unless it is doing something too clever, it should just work.
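
A rough sketch of that idea, assuming HTML::TreeBuilder and
HTML::FormatText are installed from CPAN. The sample page, the charset
name and the margin values are made up for illustration, and this has
not been tried against real-world Japanese pages.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use HTML::TreeBuilder;
use HTML::FormatText;

# Stand-in for a fetched page: a tiny document as Shift_JIS bytes
# (hypothetical sample data).
my $charset = 'shiftjis';
my $bytes   = encode($charset,
    "<html><body><p>\x{65E5}\x{672C}\x{8A9E}</p></body></html>");

# Decode first, so the parser and formatter only ever see characters,
# never raw bytes in the original encoding.
my $html = decode($charset, $bytes);

my $tree      = HTML::TreeBuilder->new_from_content($html);
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
my $plain     = $formatter->format($tree);
$tree->delete;    # release the parse tree

# Encode to UTF-8 only at the output boundary.
binmode STDOUT, ':utf8';
print $plain;
```

The point of the ordering is that the encoding is dealt with once on
input and once on output, and everything in between works on characters.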

In other words, of a module or command line tool to which I could feed 
my Japanese html pages, or html documents in other non-Latin alphabets, 
and get nicely formatted plain utf-8 text as output?
(HTML::FormatText seems to break with utf-8 and with the Japanese 
encodings.)

Thanks in advance.

Regards,

Marco


---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni