Re: Need some help in understanding Unicode in Perl...

Sei Heng Ang <lrepdummy(_at_)yahoo(_dot_)com> writes:

2. I am currently writing a mail filter to extract
message body from the email. However, even though the
email charset has been defined as (eg) gb2312, it
still contains standard ASCII characters. Is there a
way where I can sort of convert the entire string into
unicode? Is there a module (library) that can
automatically recognize the individual characters in
the string and convert them accordingly?


Good luck! E-mail charset specifications are very patchy.
Many mail clients "lie" about the content. Microsoft clients
in particular use the names of standard encodings when the mail
actually contains a different encoding, e.g. they widely claim iso-8859-1
when they mean the Microsoft code page (cp1252), which is closely related
but assigns printable characters to 0x80..0x9F where ISO does not.
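
To make the 0x80..0x9F difference concrete, here is a minimal sketch using
the Encode module. The sample bytes are made up for illustration, and
treating mislabelled iso-8859-1 mail as cp1252 is only a heuristic, not
something the mail itself guarantees.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: the word "quoted" wrapped in bytes 0x93 and 0x94,
# which cp1252 uses for curly double quotes.
my $bytes = "\x93quoted\x94";

# Decode the same octets under both labels.
my $as_latin1 = decode('iso-8859-1', $bytes);
my $as_cp1252 = decode('cp1252',     $bytes);

# Under iso-8859-1 the first character is a C1 control code point;
# under cp1252 it is a left double quotation mark.
printf "iso-8859-1: U+%04X\n", ord($as_latin1);   # U+0093
printf "cp1252:     U+%04X\n", ord($as_cp1252);   # U+201C
```

If a message labelled iso-8859-1 contains bytes in that range, decoding it
as cp1252 is usually the better guess.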


The Encode::CN module has this to say about gb2312:

"
When you see C<charset=gb2312> on mails and web pages, they really
mean C<euc-cn> encodings.  To fix that, C<gb2312> is aliased to C<euc-cn>.
Use C<gb2312-raw> when you really mean it.

The ASCII region (0x00-0x7f) is preserved for all encodings, even though
this conflicts with mappings by the Unicode Consortium."

I have had some success displaying email using Encode's euc-cn
encoding and Unicode fonts, but as I can't read many Chinese characters
this was mainly just as an exercise.
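
As a rough sketch of the conversion asked about, something along these lines
should turn a body that mixes ASCII and gb2312 text into a Perl Unicode
string. The message body and variable names are hypothetical, and it assumes
the charset label in the header is actually correct.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical message body: ASCII "Hi " followed by two euc-cn
# double-byte characters (octets D6 D0 and CE C4).
my $body    = "Hi \xD6\xD0\xCE\xC4";
my $charset = 'gb2312';    # as claimed by the mail header

# Decode the octets into a Perl Unicode string. Encode treats
# 'gb2312' as an alias for 'euc-cn', as the quoted docs describe.
my $text = decode($charset, $body);

# List one code point per character; the ASCII part is unchanged.
my @code_points = map { sprintf 'U+%04X', ord($_) } split //, $text;
print "@code_points\n";
# prints: U+0048 U+0069 U+0020 U+4E2D U+6587
```

No per-character detection is needed: the ASCII bytes and the double-byte
characters are handled in a single decode call.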

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/