perl-unicode

[Encode] Encoding vs. Charset

2002-03-25 17:07:50
Encode hackers (Especially Autrijius)

I am now fairly content with the feature set of Encode, so I decided to write some programs based upon it. And I have found that most Chinese (continental, that is; the Taiwanese seem to be much more technically correct) and Korean mails and web pages confuse "charset" with "encoding". That is, charset="gb2312" really means euc-cn and charset="ks_c_5601-1987" really means euc-kr. Sadly, this misconception is embedded in popular browsers.
  So when you try something like

  my ($encname) = /^Content-Type:.*charset=[\"\']?([0-9A-Za-z_-]+)/o;
  ....
  my $utf8 = decode($encname, $string);

You are in big trouble. Aliases are no salvation, because most web pages in *.cn happily include

  <META http-equiv="Content-Type" content="text/html; charset=gb2312">

They seem to take it for granted that an encoding is simply a charset encoded in EUC. Anton wistfully states this in Encode::Supported, but I didn't realize the depth of the problem until I moved Encode from in vitro to in vivo (that is, out of the lab and into the real world).
  So I propose to:
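To make the charset/encoding distinction concrete, here is a minimal sketch. It assumes only that Encode knows the name 'euc-cn'; the sample character U+4F60 and the variable names are chosen purely for illustration. EUC-CN stores each GB 2312 row/cell byte with the high bit set, so clearing that bit gives back the bare 7-bit pair of the charset itself.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+4F60 is a character covered by the GB 2312 charset (illustrative choice).
my $char = "\x{4F60}";

# EUC-CN is one encoding of that charset: two bytes, each with the high bit set.
my $euc = encode('euc-cn', $char);

# Clearing the high bit of each byte recovers the bare 7-bit row/cell pair.
my $raw = join '', map { chr(ord($_) & 0x7F) } split //, $euc;

printf "euc-cn bytes: %s\n", unpack('H*', $euc);
printf "7-bit pair  : %s\n", unpack('H*', $raw);
```

A page labelled charset=gb2312 almost always carries the first form, not the second.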

* rename gb2312 to gb2312-raw, ksc5601 to ksc5601-raw
* and alias gb2312 and ksc5601 to euc-(cn|kr)

I know it's technically wrong, but Perl opts for the practical over the technical....
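Until something like this is built in, a script can apply the same mapping locally. The following is only a sketch of that idea using define_alias from Encode::Alias; the header string and sample bytes are made up for illustration, and on an Encode that already resolves these names to EUC the define_alias calls are simply redundant.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);
use Encode::Alias qw(define_alias);

# Map the labels seen in the wild onto the EUC encodings they really mean.
define_alias( 'gb2312'         => 'euc-cn' );
define_alias( 'ksc5601'        => 'euc-kr' );
define_alias( 'ks_c_5601-1987' => 'euc-kr' );

# A made-up header of the kind found on *.cn pages.
my $header = 'Content-Type: text/html; charset=gb2312';
my ($encname) = $header =~ /^Content-Type:.*charset=["']?([0-9A-Za-z_-]+)/i;

# Sample EUC-CN bytes for two Chinese characters (illustrative data).
my $bytes = "\xC4\xE3\xBA\xC3";
my $text  = decode($encname, $bytes);

printf "%s resolves to %s\n", $encname, find_encoding($encname)->name;
printf "U+%04X\n", ord($_) for split //, $text;
```

The raw charsets stay reachable under their own names; only the mislabelled ones are redirected.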

Dan the Man with Too Many SPAMs from CN and KR

