
Re: utf8, japanese, web-pages, the horror, the horror...

2004-05-11 01:30:06

I will try the solution you propose, and I will let y'all know whether it
works.

In the meantime, I had ``solved'' the problem by saving pages with
different charset=... declarations to different output files (ofile.sjis,
ofile.euc, etc.), and then using recode to convert everything to the same
encoding.

Unfortunately, this (moving the encoding processing outside perl) seems
to be what I always end up doing, when I have to deal with characters
outside the latin1 range...

As you said, from_to isn't a natural interface, at least for me!


In my opinion Encode's from_to isn't a natural interface.
(With from_to neither the original nor the result is in a form 
in which you can use perl's character semantics.)
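
To make the point about character semantics concrete, here is a minimal
sketch. It assumes the Encode::JP encodings that ship with Encode are
available; the sample string and the variable names are made up for
illustration. from_to() turns octets into other octets, so length() keeps
counting bytes, while decode() returns a character string:

```perl
use strict;
use warnings;
use Encode qw(encode decode from_to);

# Shift_JIS octets for a two-character string (U+65E5 U+672C).
my $octets = encode('shiftjis', "\x{65E5}\x{672C}");

# from_to() converts in place, from octets to octets.
my $copy = $octets;
from_to($copy, 'shiftjis', 'utf8');
my $bytes_len = length $copy;       # counts UTF-8 octets

# decode() returns a Perl character string.
my $chars = decode('shiftjis', $octets);
my $chars_len = length $chars;      # counts characters

print "from_to: $bytes_len, decode: $chars_len\n";
```

This should print "from_to: 6, decode: 2": each of the two characters takes
three octets in UTF-8, and only the decoded string is seen by perl as two
characters.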

It is much better IMHO to use ->decode directly.

That is, use 'decode' to convert the source from whatever encoding it is
in (based on 'charset=' in this case) to Unicode. Then write the Unicode
out using binmode with :utf8 or the :encoding() layer of your choice.

If you must use from_to(), then the appropriate target for a :utf8 stream
is to get the characters into internal Unicode form:

   from_to($text, $charset, 'Unicode') 

I would prefer to use 

   use Encode qw(find_encoding);
   binmode STDOUT,":utf8";
   my $encoding = find_encoding($charset);
   my $unicode = $encoding->decode($text);
   print $unicode;
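
For the multi-page case discussed in this thread, the same idea might be
sketched as below. This is only an illustration under assumptions:
decode_page() is a hypothetical helper, the two "pages" are stand-ins built
in memory, and the output goes to an in-memory buffer rather than a real
file. Each page is decoded according to its own charset= declaration and
everything is written through one :encoding(UTF-8) layer, so no external
recode step is needed:

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Hypothetical helper: pick the charset out of a Content-Type value and
# decode the page body into a Perl character string.
sub decode_page {
    my ($octets, $content_type) = @_;
    my ($charset) = $content_type =~ /charset=["']?([\w.:-]+)/i;
    defined $charset or die "no charset in '$content_type'\n";
    my $encoding = find_encoding($charset)
        or die "unknown charset '$charset'\n";
    return $encoding->decode($octets);
}

# Two stand-in pages holding the same text in different encodings.
my $text  = "\x{65E5}\x{672C}\x{8A9E}";
my @pages = (
    [ encode('shiftjis', $text), 'text/html; charset=Shift_JIS' ],
    [ encode('euc-jp',   $text), 'text/html; charset=EUC-JP'    ],
);

# Write everything to one UTF-8 stream (an in-memory buffer here).
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
for my $page (@pages) {
    my ($octets, $content_type) = @$page;
    print {$out} decode_page($octets, $content_type), "\n";
}
close $out;
```

With real pages, the in-memory buffer would be replaced by an ordinary
output file opened with the same layer.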

Marco Baroni
SSLMIT, University of Bologna