Re: utf8, japanese, web-pages, the horror, the horror...

Marco Baroni <baroni(_at_)sslmit(_dot_)unibo(_dot_)it> writes:

Thanks for your advice... the output does look different, this time, 
but it still doesn't look like utf8... (I get the same error with 
recode).


If somebody could suggest a way to convert to another encoding, or a 
better way to identify the encoding of each page, that would also be 
fine (once I have control over the encodings, I think I can find some 
way to convert back to utf8, e.g. via recode).


In my opinion Encode's from_to isn't a natural interface.
(With from_to neither the original nor the result is in a form 
in which you can use perl's character semantics.)

It is much better IMHO to use ->decode directly.

That is, use 'decode' to convert the source from whatever encoding it 
is in (based on 'charset=' in this case) to Unicode. Then write the 
Unicode out using binmode with :utf8 or the :encoding() of your choice.

If you must use from_to(), then the appropriate target for a :utf8 stream
is one that gets the characters into internal Unicode form:

   from_to($text, $charset, 'Unicode') 

I would prefer to use 

   binmode STDOUT,":utf8";
   my $encoding = find_encoding($charset);
   my $unicode = $encoding->decode($text);
   print $unicode;
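
A slightly fuller sketch of the same decode-then-print idea, in case it helps. The sample bytes and the 'euc-jp' charset are made up for illustration; in the real script they would come from the fetched page and its 'charset=' declaration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical input: the three characters U+65E5 U+672C U+8A9E
# ("nihongo") as EUC-JP bytes, plus the charset the page declared.
my $charset = 'euc-jp';
my $bytes   = "\xC6\xFC\xCB\xDC\xB8\xEC";

# find_encoding returns undef for a name it does not know,
# so check before calling ->decode on the result.
my $encoding = find_encoding($charset)
    or die "Unknown charset: $charset\n";

# decode turns the raw bytes into a Perl character string.
my $unicode = $encoding->decode($bytes);

# Character semantics now apply: length counts characters, not bytes.
my $char_count = length $unicode;

# The :utf8 layer performs the one and only encoding step on output.
binmode STDOUT, ':utf8';
print "$unicode ($char_count characters)\n";
```

The point of the design is that the text is encoded exactly once, at the output layer, and everything in between works on characters.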


Thanks again,

Marco

On Saturday, May 8, 2004, at 05:16 Europe/Rome, Edward Batutis wrote:

Marco:

I think you are converting twice:

# output will be utf8
binmode(STDOUT, ":utf8");
...
                from_to($html_text,$charset,"utf8");
...


Here, the print will encode $html_text to utf-8 a second time, because
binmode put a :utf8 layer on STDOUT:

                print "CURRENT URL $url\n$html_text\n";


I think you can just remove the binmode line and it will work.
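
To make the double conversion visible, here is a small sketch. It uses a single ISO-8859-1 byte as stand-in input and an in-memory filehandle in place of STDOUT so the output bytes can be inspected; both are assumptions for illustration and are not taken from the original script.

```perl
use strict;
use warnings;
use Encode qw(from_to decode);

# Write a string through a :utf8 layer into memory and return the bytes.
sub bytes_via_utf8_layer {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer or die "Cannot open in-memory file: $!";
    print {$fh} $string;
    close $fh;
    return $buffer;
}

# Render a byte string as space-separated hex, for display.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# Stand-in input: e-acute as one ISO-8859-1 byte.
my $html_text = "\xE9";

# from_to converts in place; the result is a string of UTF-8 *bytes* (C3 A9).
from_to($html_text, 'iso-8859-1', 'utf8');

# The :utf8 layer treats each of those bytes as a character and encodes again.
my $doubled = bytes_via_utf8_layer($html_text);

# Decoding to characters instead lets the layer do the single encoding step.
my $single = bytes_via_utf8_layer(decode('iso-8859-1', "\xE9"));

print "from_to + :utf8 layer: ", hex_dump($doubled), "\n";
print "decode  + :utf8 layer: ", hex_dump($single), "\n";
```

The first line shows four bytes where two were expected, which is the kind of output that "doesn't look like utf8".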

Why do encodings always cause so much pain?


I hope this helps today's pain, at least :-).

Regards,

=Ed