perl-unicode

Re: utf8, japanese, web-pages, the horror, the horror...

2004-05-11 01:30:06
Thanks!

I will try the solution you propose, and I will let y'all know whether it 
works.

In the meantime, I had ``solved'' the problem by saving pages with
different charset=... declarations to different output files (ofile.sjis,
ofile.euc, etc.), and then using recode to convert everything to the same
charset.

Unfortunately, this (moving the encoding processing outside Perl) seems
to be what I always end up doing when I have to deal with characters
outside the Latin-1 range...

As you said, from_to isn't a natural interface, at least for me!

Regards,

Marco
 
In my opinion Encode's from_to isn't a natural interface.
(With from_to neither the original nor the result is in a form 
in which you can use perl's character semantics.)
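
To make the point about character semantics concrete, here is a minimal sketch. The sample octets (two kanji in Shift_JIS) and the variable names are my own assumptions, not taken from the original pages. After from_to the scalar still holds octets, so length counts bytes; after decode it holds characters.

```perl
use strict;
use warnings;
use Encode qw(from_to decode);

# Two kanji encoded as Shift_JIS octets (assumed sample data).
my $octets = "\x93\xfa\x96\x7b";

# from_to converts in place, octets to octets: the result is UTF-8 bytes.
my $copy = $octets;
from_to($copy, 'shiftjis', 'utf-8');
my $len_octets = length $copy;    # counts bytes, not characters

# decode returns a Perl character string instead.
my $chars = decode('shiftjis', $octets);
my $len_chars = length $chars;    # counts characters

print "from_to length: $len_octets, decode length: $len_chars\n";
```

With the decoded string, things like length, substr and regex character classes operate on characters rather than on bytes.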

It is much better IMHO to use ->decode directly.

That is, use 'decode' to convert the source from whatever encoding it is
in (based on 'charset=' in this case) to Unicode. Then write the Unicode
out using binmode with :utf8 or the :encoding() layer of your choice.

If you must use from_to(), then the appropriate target for a :utf8 stream
is to get the characters into internal Unicode form:

   from_to($text, $charset, 'Unicode') 

I would prefer to use 

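   use Encode qw(find_encoding);

   binmode STDOUT, ":utf8";                  # encode characters as UTF-8 on output
   my $encoding = find_encoding($charset);   # undef if the charset is unknown
   my $unicode = $encoding->decode($text);   # octets -> Perl characters
   print $unicode;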
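
For pages arriving with different charset= declarations, the same idea might be sketched as below. This is only an illustration under stated assumptions: decode_page is a hypothetical helper (not part of Encode), the sample octets are made up for the example, and an in-memory filehandle stands in for the real output file.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: decode octets according to the declared charset.
sub decode_page {
    my ($octets, $charset) = @_;
    my $encoding = find_encoding($charset)
        or die "Unknown charset: $charset\n";
    return $encoding->decode($octets);
}

# The same two kanji in two legacy encodings (assumed sample data).
my %pages = (
    'Shift_JIS' => "\x93\xfa\x96\x7b",
    'EUC-JP'    => "\xc6\xfc\xcb\xdc",
);

# Write everything through one UTF-8 layer; $out collects the bytes.
my $out = '';
open my $fh, '>:encoding(UTF-8)', \$out or die "Cannot open: $!";
for my $charset (sort keys %pages) {
    print {$fh} decode_page($pages{$charset}, $charset), "\n";
}
close $fh;
```

Since every page is decoded to characters first, a single output file in one encoding is enough, and no separate per-charset files or external conversion step should be needed.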


--
Marco Baroni
SSLMIT, University of Bologna
http://sslmit.unibo.it/~baroni