Re: utf8, japanese, web-pages, the horror, the horror...
2004-05-11 01:30:06
Marco Baroni <baroni(_at_)sslmit(_dot_)unibo(_dot_)it> writes:
Thanks for your advice... the output does look different, this time,
but it still doesn't look like utf8... (I get the same error with
recode).
If somebody could suggest a way to convert to another encoding, or a
better way to identify the encoding of each page, that would also be
fine (once I have control over the encodings, I think I can find some
way to convert back to utf8, e.g., via recode).
In my opinion Encode's from_to isn't a natural interface.
(With from_to neither the original nor the result is in a form
in which you can use perl's character semantics.)
It is much better IMHO to use ->decode directly.
That is, use 'decode' to convert the source from whatever encoding it is
in (based on 'charset=' in this case) to Unicode. Then write the Unicode
out using binmode with :utf8 or the :encoding() layer of your choice.
If you must use from_to() then the appropriate target for a :utf8 stream
is one that gets the characters into perl's internal Unicode form:
from_to($text, $charset, 'Unicode')
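I would prefer to use
use Encode;                                # exports find_encoding by default
binmode STDOUT, ":utf8";                   # encode characters once, on output
my $encoding = find_encoding($charset)     # look up the page's charset
    or die "unknown charset: $charset";
my $unicode = $encoding->decode($text);    # octets in, characters out
print $unicode;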
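To make the point about character semantics concrete, here is a minimal sketch. The sample string and the euc-jp charset are assumptions for illustration only, not taken from the pages being crawled.

```perl
use strict;
use warnings;
use Encode qw(encode decode from_to);

# Assumed sample: the two characters U+65E5 U+672C, as EUC-JP octets.
my $bytes = encode('euc-jp', "\x{65E5}\x{672C}");

# from_to() converts in place, octets to octets: the result is still bytes.
my $octets = $bytes;
from_to($octets, 'euc-jp', 'utf8');
printf "from_to: length %d\n", length $octets;   # 6: three UTF-8 octets per character

# decode() returns a character string, so length, regexes, etc. see characters.
my $chars = decode('euc-jp', $bytes);
printf "decode:  length %d\n", length $chars;    # 2: one per character
```

With the decoded string, things like length, substr and \w in a regex operate on characters; with the from_to result they operate on individual octets.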
Thanks again,
Marco
On Saturday, May 8, 2004, at 05:16 Europe/Rome, Edward Batutis wrote:
Marco:
I think you are converting twice:
# output will be utf8
binmode(STDOUT, ":utf8");
...
from_to($html_text,$charset,"utf8");
...
Here, it will convert html_text to utf-8 again because of binmode with
utf-8:
print "CURRENT URL $url\n$html_text\n";
I think you can just remove the binmode line and it will work.
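The double conversion can be reproduced in a few lines. This is only a sketch under assumptions: the sample character and the euc-jp charset are made up for illustration, and an in-memory filehandle stands in for STDOUT so the written octets can be counted.

```perl
use strict;
use warnings;
use Encode qw(encode from_to);

# Assumed sample input: U+65E5 as EUC-JP octets (hypothetical page content).
my $html_text = encode('euc-jp', "\x{65E5}");
from_to($html_text, 'euc-jp', 'utf8');       # now three UTF-8 octets

# Print through a :utf8 layer into an in-memory buffer standing in for STDOUT.
my $out = '';
open(my $fh, '>:utf8', \$out) or die "cannot open in-memory handle: $!";
print {$fh} $html_text;
close $fh;

# The layer treats each octet as a Latin-1 character and encodes it again.
printf "octets after from_to: %d, octets written: %d\n",
    length $html_text, length $out;
```

The second number comes out larger than the first because every octet above 0x7F is expanded to two octets by the :utf8 layer, which is the garbage seen in the output.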
Why do encodings always cause so much pain?
I hope this helps today's pain, at least :-).
Regards,
=Ed