Re: [Encode] Encoding vs. Charset

Hello Dan!

DK> ... I have found that most of Chinese (Continental; seems like
DK> Taiwanese are much more technically correct) and Korean mails and web 
DK> pages confuse "charset" and "encodings".

I'm revising a small article on that right now (maybe you have already
read the first edition, but I'm rewriting 75% of it, precisely because
of the "charset" vs. "encoding" terminology). It will probably be ready
by 26 March, 23:00 GMT. I'll post a message to perl-unicode, since
there's interest in the terminology!


DK> That is, charset="gb2312" 
DK> really means euc-cn and

In defense of the mainland Chinese, I must say that this one is okay:
the IANA registry (http://www.iana.org/assignments/character-sets)
has

Name: GB2312  (preferred MIME name)
MIBenum: 2025
Source: Chinese for People's Republic of China (PRC) mixed one byte, 
        two byte set: 
          20-7E = one byte ASCII 
          A1-FE = two byte PRC Kanji 
        See GB 2312-80 
        PCL Symbol Set Id: 18C
Alias: csGB2312

This looks pretty much like EUC-CN (or CN-GB, which Autrijus has
confirmed is an alias for EUC-CN).
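
To make the comparison concrete, here is a minimal sketch that checks
whether a byte string is built only from the two units the registry
entry describes (one byte in 20-7E, or two bytes each in A1-FE). The
function name looks_like_euc_cn is made up for this illustration, and
the check is purely structural: it does not verify that a byte pair is
actually assigned in GB 2312-80.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $bytes consists only of
#   - single bytes in 20-7E (ASCII), or
#   - byte pairs where both bytes are in A1-FE.
# Control characters such as newlines fall outside the registry
# description, so this sketch rejects them; a real checker would
# probably allow them.
sub looks_like_euc_cn {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:[\x20-\x7E]|[\xA1-\xFE][\xA1-\xFE])*\z/ ? 1 : 0;
}

print looks_like_euc_cn("abc"), "\n";               # ASCII only
print looks_like_euc_cn("\xD6\xD0\xCE\xC4"), "\n";  # two byte pairs
print looks_like_euc_cn("\xD6"), "\n";              # dangling lead byte
```

A string that passes this test is at least shaped like EUC-CN, which
is all the registry entry promises.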


DK> charset="ks_c_5601-1987" really means euc-kr.

Here I agree 150%: the IANA registry really has

Name: KS_C_5601-1987                                    [RFC1345,KXS2]
MIBenum: 36
Alias: iso-ir-149
Alias: KS_C_5601-1989
Alias: KSC_5601
Alias: korean
Alias: csKSC56011987

and RFC 1345 really has
  &charset KS_C_5601-1987
  &alias iso-ir-149
  &alias KS_C_5601-1989
  &alias KSC_5601
but this looks to me like a 94x94 character table rather than EUC-KR.
I note this with real sorrow, because people have done the wrong thing.
If only they had used 'KS5601' (like GB2312), it would not have
clashed with the IANA registration and with RFC 1345 :,-(((
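
The difference between the table and the encoding can be sketched in a
few lines. A position in the 94x94 table is a (row, cell) pair, each
in 1..94; an EUC encoding represents it as two bytes obtained by
adding 0xA0 to each number. The function name rowcell_to_euc is
invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: map a position in the 94x94 table (row and
# cell, each 1..94) to the two bytes an EUC encoding uses for it.
sub rowcell_to_euc {
    my ($row, $cell) = @_;
    for my $n ($row, $cell) {
        die "row/cell must be in 1..94\n" if $n < 1 || $n > 94;
    }
    return pack('C2', 0xA0 + $row, 0xA0 + $cell);
}

# Row 16, cell 1 becomes the byte pair B0 A1.
print unpack('H*', rowcell_to_euc(16, 1)), "\n";   # prints b0a1
```

So the table by itself says nothing about bytes on the wire; only the
encoding step does, and that is what a Content-Type charset label is
supposed to name.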


FYI (an interesting detail): Ken Lunde, in his cjk.inf
(http://www.oreilly.com/people/authors/lunde/cjk_inf.html),
says that KS_C_5601-1987, KS_C_5601-1989 and the 1992 version
of this standard are identical as far as the characters and
their code points are concerned.

DK> Sadly this misconception is embedded in popular browsers.
DK>    So when you try something like

DK>    my ($encname) = /^Content-Type:.*charset=[\"\']?([0-9A-Za-z_-]+)/o;
DK>    ....
DK>    my $utf8 = encode($encname, $string);

DK>    You are in big trouble.  Aliases are no salvation because most web 
DK> pages in *.cn happily include

DK>    <META http-equiv="Content-Type" content="text/html; charset=gb2312">

Yup. It's a big problem if people do not send a correct charset in
their Content-Type. The META tag is so much harder to catch!
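
For what it's worth, here is a minimal sketch of coping with such
labels: pull the charset out of either a Content-Type header line or a
META tag, then translate the label into the encoding it usually means.
The table %label_to_encoding and the function name
encoding_from_content_type are invented for this illustration, and the
two mappings are only the ones discussed in this thread, not a
complete list.

```perl
use strict;
use warnings;

# Hypothetical table: charset labels as seen in the wild on the left,
# the encoding they usually mean (per this thread) on the right.
my %label_to_encoding = (
    'gb2312'         => 'euc-cn',
    'ks_c_5601-1987' => 'euc-kr',
);

# Extract the charset label from a header line or a META tag and
# translate it. Unknown labels come back lowercased and otherwise
# unchanged; text without a charset yields undef.
sub encoding_from_content_type {
    my ($text) = @_;
    my ($label) = $text =~ /charset\s*=\s*["']?([0-9A-Za-z_.:-]+)/i
        or return;
    $label = lc $label;
    return $label_to_encoding{$label} // $label;
}

print encoding_from_content_type(
    '<META http-equiv="Content-Type" content="text/html; charset=gb2312">'
), "\n";   # prints euc-cn
```

The returned name would then be handed to the decoder instead of the
raw label from the page.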

DK> ... Anton has wistfully
:-)
DK> stated this in Encode::Supported

I guess that can be removed now, and GB2312 listed as a first-class
preferred MIME name :-)
I will send a patch in about 12 hours, after syncing and finishing that
article on "charset" vs. "encoding", if you do not mind and if
nobody patches it before then!

DK> * rename gb2312 to gb2312-raw, ksc5601 to ksc5601-raw
DK> * and alias gb2312 and ksc5601 to euc-(cn|kr)

I'm very glad that the issue has finally been resolved!
8*)

DK>    I know it's technically wrong
For GB2312 it's OK.
It is OK even for ksc5601.

It is _VERY_ wrong for

ks_c_5601-1987

It is very, very wrong :,-( but if they _do use_ it
as the Content-Type charset and mean EUC-KR, ah, it seems
we have to do ks_c_5601-1987 this wrong favour :-(

Please do tell me once more, so that I can get properly upset:
is it really ks_c_5601-1987, not ks5601?
Is it really EUC-KR (8-bit)?

DK> but perl opts more for practical than 
DK> technical....
The show must go on!

- Anton, really upset by the Koreans' (mis)behavior

P.S. Maybe we should put a BIG notice in Supported.pod, or
somewhere nearby, saying that we _badly_ need a Korean volunteer for
testing and advising?