utf8, japanese, web-pages: beginning to see the light...

A few days ago, I queried this list about my problems with a script that finds the charset of Japanese web pages and translates their text into utf-8.

The following solution, proposed by Nick Ing-Simmons, worked for my case:

   use Encode qw(find_encoding);
   binmode STDOUT, ":utf8";
   my $encoding = find_encoding($charset);
   my $unicode = $encoding->decode($text);
   print $unicode;

($charset is the charset as extracted from the html code of the page and $text is all the text from the page itself, as returned by the LWP agent.)
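
For anyone who wants to try this without fetching a page, here is a minimal self-contained sketch of the same idea. It is not Nick's code: decode_page is a hypothetical helper name, the Shift_JIS bytes are produced locally to stand in for what the LWP agent would return, and the die on an unrecognised charset is my own assumption about how one might handle a bad charset label.

```perl
use strict;
use warnings;
use Encode qw(find_encoding encode);

# Hypothetical helper: decode raw page bytes using the declared charset.
# find_encoding() returns undef when the charset name is not recognised,
# so we stop early instead of calling a method on undef.
sub decode_page {
    my ($charset, $bytes) = @_;
    my $encoding = find_encoding($charset)
        or die "Unknown charset: $charset\n";
    return $encoding->decode($bytes);
}

# Stand-in for the bytes the LWP agent would return: three kanji
# encoded as Shift_JIS (an assumption made for the sake of the example).
my $original = "\x{65E5}\x{672C}\x{8A9E}";
my $bytes    = encode('shiftjis', $original);

# $charset would normally come from the page's html code.
my $unicode = decode_page('Shift_JIS', $bytes);

# Mark STDOUT as utf-8 so the decoded characters are written correctly.
binmode STDOUT, ':utf8';
print $unicode, "\n";
```

The only real difference from the snippet above is the check on the return value of find_encoding, which seemed worth having since charset labels found in web pages are not always valid.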

Thanks a lot to Nick and to all the others who responded to my plea for help.

Now for a much less pressing issue: Does anybody know of something similar to the HTML::FormatText module that can take utf-8 input and generate utf-8 output? In other words, a module or command line tool to which I could feed my Japanese html pages, or html documents in other non-Latin scripts, and get nicely formatted plain utf-8 text as output? (HTML::FormatText seems to break with utf-8 and with the Japanese encodings.)


Thanks in advance.

Regards,

Marco


---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni
