perl-unicode

Re: [Encode] Encoding vs. Charset

2002-03-25 18:59:32
On Tue, Mar 26, 2002 at 09:07:25AM +0900, Dan Kogai wrote:
Encode hackers (especially Autrijus)

   I am now fairly content with the feature set of Encode, so I decided to 
write some programs based upon it.
   And I have found that most Chinese (mainland; the Taiwanese seem to be 
much more technically correct) and Korean mail and web pages confuse 
"charset" and "encoding".  That is, charset="gb2312" really means euc-cn 
and charset="ks_c_5601-1987" really means euc-kr.  
Sadly, this misconception is embedded in popular browsers.
   So when you try something like

   my ($encname) = /^Content-Type:.*charset=[\"\']?([0-9A-Za-z_-]+)/o;
   ....
   my $utf8 = decode($encname, $string);

   You are in big trouble.  Aliases are no salvation because most web 
pages in *.cn happily include

   <META http-equiv="Content-Type" content="text/html; charset=gb2312">

   They seem to take it for granted that an encoding is simply a 
charset encoded in EUC.  Anton has wistfully stated this in 
Encode::Supported, but I didn't realize the depth of the problem until I 
took Encode from in vitro to in vivo (that is, out of the lab and into the 
real world).
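
To make the mismatch concrete, here is a minimal sketch of a workaround on the caller's side: translate the declared charset label into the encoding it usually denotes before handing it to decode(). The lookup table and the names encoding_for_label and charset_from_header are hypothetical, made up for this illustration; they are not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical lookup table: charset labels as they appear in the wild,
# mapped to the encoding they usually denote in practice.
my %LABEL_TO_ENCODING = (
    'gb2312'         => 'euc-cn',
    'ks_c_5601-1987' => 'euc-kr',
);

# Return the encoding name to pass to Encode for a given charset label.
# Labels not in the table are passed through unchanged.
sub encoding_for_label {
    my ($label) = @_;
    my $key = lc $label;
    return exists $LABEL_TO_ENCODING{$key} ? $LABEL_TO_ENCODING{$key} : $label;
}

# Extract the charset label from a Content-Type header line, or undef.
sub charset_from_header {
    my ($header) = @_;
    return $header =~ /^Content-Type:.*charset=["']?([0-9A-Za-z_-]+)/i
        ? $1
        : undef;
}

my $header = 'Content-Type: text/html; charset=gb2312';
my $label  = charset_from_header($header);
my $bytes  = "\xC4\xE3\xBA\xC3";    # two characters encoded as EUC-CN
my $string = decode(encoding_for_label($label), $bytes);
```

The table only covers the two labels mentioned above; anything else falls through to whatever Encode itself makes of the name.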
   So I propose to:

* rename gb2312 to gb2312-raw, ksc5601 to ksc5601-raw

-raw sounds funny, as if it were somehow an "unprocessed" version.
How about -strict?

* and alias gb2312 and ksc5601 to euc-(cn|kr)

   I know it's technically wrong, but Perl opts more for the practical than 
the technical....
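
The aliasing half of the proposal can be tried out locally with define_alias from Encode::Alias. A minimal sketch follows; the alias name x-web-gb2312 is made up for illustration, so that it does not clash with whatever aliases your Encode version already ships.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use Encode::Alias;    # exports define_alias by default

# Hypothetical alias name, chosen only for this illustration.
define_alias('x-web-gb2312' => 'euc-cn');

# find_encoding() consults the alias list, so the new name resolves
# to the euc-cn encoding object.
my $enc = find_encoding('x-web-gb2312');
print $enc->name, "\n";
```

An alias defined this way only lasts for the running process; making it the default is what would need a change in Encode itself.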

Dan the Man with Too Many SPAMs from CN and KR


-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
