Sei Heng Ang <lrepdummy(_at_)yahoo(_dot_)com> writes:
> 2. I am currently writing a mail filter to extract the
> message body from the email. However, even though the
> email charset has been defined as (e.g.) gb2312, it
> still contains standard ASCII characters. Is there a
> way I can convert the entire string into
> Unicode? Is there a module (library) that can
> automatically recognize the individual characters in
> the string and convert them accordingly?
Good luck! E-mail charset specifications are very patchy.
Many mail clients "lie" about the content. Microsoft clients
in particular use the names of standard encodings when the mail
actually contains a different encoding, e.g. they widely claim
iso-8859-1 when they mean the Microsoft code page (cp1252), which is
closely related but assigns printable characters to 0x80..0x9F
where ISO does not.
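
To make the difference concrete, here is a minimal sketch using
Encode's decode(). The sample bytes are made up for illustration;
0x93 and 0x94 are the curly quotes that Windows clients often send
in mail labelled iso-8859-1.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: a word wrapped in bytes 0x93 and 0x94.
my $bytes = "\x93quoted\x94";

# Decode the same bytes under the claimed and the likely real charset.
my $as_latin1 = decode('iso-8859-1', $bytes);
my $as_cp1252 = decode('cp1252',     $bytes);

# Show the code point of the first character in each result.
printf "iso-8859-1: U+%04X\n", ord($as_latin1);   # U+0093, a control character
printf "cp1252:     U+%04X\n", ord($as_cp1252);   # U+201C, left double quote
```

A filter that sees bytes in 0x80..0x9F in mail labelled iso-8859-1
can reasonably guess that cp1252 was meant, but that is a heuristic,
not a guarantee.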
The Encode::CN module has this to say about gb2312:
"
When you see C<charset=gb2312> on mails and web pages, they really
mean C<euc-cn> encodings. To fix that, C<gb2312> is aliased to C<euc-cn>.
Use C<gb2312-raw> when you really mean it.
The ASCII region (0x00-0x7f) is preserved for all encodings, even though
this conflicts with mappings by the Unicode Consortium."
I have had some success displaying email using Encode's euc-cn
and Unicode fonts, but as I can't read many Chinese characters
this was mainly just an exercise.
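
On the original question: because euc-cn keeps the ASCII region,
nothing has to recognize the characters individually. Decoding the
whole body in one call handles the ASCII bytes and the two-byte
Chinese sequences together. A minimal sketch, assuming the body
really is euc-cn; the sample string is made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical body: ASCII text around the euc-cn bytes D6 D0 CE C4,
# which encode the two characters U+4E2D U+6587.
my $body = "Subject: \xD6\xD0\xCE\xC4 test";

# 'gb2312' is an alias for euc-cn, as the Encode::CN text above says.
my $text = decode('gb2312', $body);

# ASCII bytes and two-byte sequences each become one character.
printf "%d bytes in, %d characters out\n", length($body), length($text);
printf "U+%04X ", ord($_) for split //, $text;
print "\n";
```

The first line printed is "18 bytes in, 16 characters out". If the
sender lied about the charset, the result will contain U+FFFD
replacement characters or the wrong text, so treat the declared
charset as a hint only.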
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/