Encode hackers (Especially Autrijius)
I am now fairly content with the feature set of Encode so I decided to
write some programs based upon it.
And I have found that most Chinese (Continental; it seems the
Taiwanese are much more technically correct) and Korean mails and web
pages confuse "charset" and "encoding". That is, charset="gb2312"
really means euc-cn and charset="ks_c_5601-1987" really means euc-kr.
Sadly this misconception is embedded in popular browsers.
So when you try something like
my ($encname) = /^Content-Type:.*charset=[\"\']?([0-9A-Za-z_-]+)/o;
....
my $utf8 = decode($encname, $string);
you are in big trouble. Aliases are no salvation because most web
pages in *.cn happily include
<META http-equiv="Content-Type" content="text/html; charset=gb2312">
They seem to take it for granted that an encoding is simply a
charset encoded in EUC. Anton wistfully states this in
Encode::Supported, but I didn't realize the depth of the problem until I
moved Encode from in vitro to in vivo (that is, out of the lab and into
the real world).
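To make the charset/encoding distinction concrete, here is a minimal
sketch, assuming the usual definition of EUC-CN as GB2312 row/cell bytes
with the high bit set on each byte. The helper name gb2312_raw_to_euc_cn
is made up for illustration and is not part of Encode.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Encode): turn 7-bit GB2312 row/cell
# bytes into EUC-CN by shifting each byte from 0x21-0x7E to 0xA1-0xFE.
# It assumes the input is pure two-byte GB2312 data with no ASCII mixed in.
sub gb2312_raw_to_euc_cn {
    my ($raw) = @_;
    (my $euc = $raw) =~ tr/\x21-\x7E/\xA1-\xFE/;
    return $euc;
}

# 0x5650 is the raw GB2312 code for U+4E2D; the EUC-CN form is D6 D0.
my $euc = gb2312_raw_to_euc_cn("\x56\x50");
print unpack("H*", $euc), "\n";
```

What pages labelled charset=gb2312 actually carry is the second form,
not the first.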
So I propose to:
* rename gb2312 to gb2312-raw, ksc5601 to ksc5601-raw
* and alias gb2312 and ksc5601 to euc-(cn|kr)
I know it's technically wrong, but Perl opts more for the practical than
the technical....
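Until something like this is in Encode itself, roughly the same effect
can be had in user code. What follows is only a sketch: the table
%label_to_encoding and the subs resolve_charset and decode_by_header are
hypothetical names of mine, and the table covers just the two labels
discussed above.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical table: charset labels as seen in the wild => the Encode
# encoding they really mean. Only the two labels from this mail are listed.
my %label_to_encoding = (
    'gb2312'         => 'euc-cn',
    'ks_c_5601-1987' => 'euc-kr',
);

# Map a charset label to an encoding name, passing unknown labels through.
sub resolve_charset {
    my ($label) = @_;
    return $label_to_encoding{ lc $label } // $label;
}

# Pull the charset out of a Content-Type header and decode the bytes with
# the encoding it really refers to. Returns nothing when no charset is
# found or Encode does not know the encoding.
sub decode_by_header {
    my ($header, $bytes) = @_;
    my ($label) = $header =~ /^Content-Type:.*charset=["']?([0-9A-Za-z_-]+)/i
        or return;
    my $enc = find_encoding(resolve_charset($label)) or return;
    return $enc->decode($bytes);
}

# Two EUC-CN characters (D6 D0, CE C4) under a header that says gb2312.
my $text = decode_by_header(
    'Content-Type: text/html; charset=gb2312',
    "\xD6\xD0\xCE\xC4",
);
```

The table lives outside Encode, so it works whether or not the rename
above ever happens.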
Dan the Man with Too Many SPAMs from CN and KR