perl-unicode

Re: GB2312 and EUC-CN : IANA registry

2002-03-26 21:08:00
On Wed, 27 Mar 2002, Anton Tagunov wrote:

 Hi, Anton,

Very glad to hear you on this list :-)

  Me, too :-)

When you say gb2312 and ksc5601, EUC-based encoding is assumed.

JS>   Please, don't help spread this misuse.

 Well, that was not meant to be applied to GB2312 :-). Below
is a more extensive excerpt from where I wrote that sentence:

JS>  Please, don't help spread this misuse. It might be all right
JS> for the (ignorant) public to say KS C 5601 in place of EUC-KR, but Perl
JS> programmers should learn the difference between KS C 5601/KS X 1001 (coded 
JS> character set) and encoding/MIME charset/character set encoding scheme/
JS> character coding. 

JS>   As I wrote before, GB 2312 has been so widely (mis)used that there's
JS> no way to replace it with EUC-CN. The Korean situation is much better,
JS> although not as good as the Japanese case.

  It could have been misunderstood.....


Jungshik, one little point on GB2312.. Maybe I misunderstand
something, but

  No, you're absolutely right about IANA. See below.


IANA registry (http://www.iana.org/assignments/character-sets)
has

Name: GB2312  (preferred MIME name)
MIBenum: 2025
Source: Chinese for People's Republic of China (PRC) mixed one byte, 
        two byte set: 
          20-7E = one byte ASCII 
          A1-FE = two byte PRC Kanji 
        See GB 2312-80 
        PCL Symbol Set Id: 18C
Alias: csGB2312

I do not know when that was put in, but it looks like EUC-CN. Is it?
And if yes, then GB2312 is a perfectly valid charset, isn't it?

  Yes, it's EUC-CN. I was about to add that although
EUC-CN is a better name than GB2312, the former has never been registered
with IANA, while the latter was, as the 'preferred MIME name'. You got there
first :-).  It's unfortunate that the PRC decided to do it this way, but
that's what we've got, and I think we have to respect their decision.
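
The byte layout in the registry entry quoted above can be sketched in a few lines of Python. This is only an illustration, not part of any registry or library: the helper name split_euc_cn is hypothetical, and treating every byte up to 0x7F (rather than only 20-7E) as single-byte is an assumption made so that control characters such as newlines pass through. The only library facts relied on are that Python ships a gb2312 codec and lists euc-cn as an alias for it.

```python
import codecs


def split_euc_cn(data: bytes) -> list:
    """Split bytes into units following the IANA GB2312 layout.

    Hypothetical helper: bytes up to 0x7F are one-byte units (assumption:
    controls below 0x20 are allowed too); A1-FE starts a two-byte unit.
    """
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:
            # One-byte ASCII range
            units.append(data[i:i + 1])
            i += 1
        elif 0xA1 <= lead <= 0xFE:
            # Two-byte GB 2312 character: the trail byte must also be A1-FE
            if i + 1 >= len(data) or not 0xA1 <= data[i + 1] <= 0xFE:
                raise ValueError(f"bad trail byte after offset {i}")
            units.append(data[i:i + 2])
            i += 2
        else:
            raise ValueError(f"byte 0x{lead:02X} at offset {i} is outside the layout")
    return units


# The name registered with IANA and the EUC name resolve to the same codec.
same_codec = codecs.lookup("euc-cn").name == codecs.lookup("gb2312").name

sample = "a\u4e2d".encode("gb2312")  # 'a' followed by U+4E2D
print(same_codec, split_euc_cn(sample))
```

Since the two names resolve to one codec, a program can accept either label on input and still emit the registered name GB2312 on output.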

And thank you for explaining how it happened that Koreans
misuse the name of a CCS for a charset :-)

  You're welcome :-)

Actually, I told you only half the story :-). The other half happened
before the widespread use of the Internet in Korea (i.e., the late 1980s
and early 1990s), when people typically referred to what's now called
EUC-KR as 'KS C 5601 Wansung' (= US-ASCII in GL and KS C 5601 in GR). It
was not technically correct, but it didn't do much harm because there was
no need to exchange data over the Internet. EUC (Extended Unix Code:
it's not Extended Unix Character) for Korean was first specified in KS
C 5861-1992 (now KS X 2901), but the name EUC-KR first appeared in RFC
1557, where ISO-2022-KR was defined. It would have been better if RFC
1557 had been more explicit in its description of EUC-KR, so that the IANA
entry for EUC-KR would be patterned after that for EUC-JP (GB2312 -> EUC-CN),
with all the code sets and their octet ranges. Perhaps they
thought just referring to KS C 5861-1992 was sufficient.
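
The "US-ASCII in GL and KS C 5601 in GR" arrangement described above can be checked with a small Python sketch. It is an illustration only: the helper name gl_gr is hypothetical, and the alias check relies on Python's own codec table, which happens to accept the CCS-style label ks_c_5601-1987 for its euc_kr codec.

```python
import codecs


def gl_gr(data: bytes) -> list:
    """Label each byte 'GL' (high bit clear) or 'GR' (high bit set)."""
    return ["GR" if b & 0x80 else "GL" for b in data]


# The CCS-style label is accepted as an alias for the EUC-KR codec.
alias_target = codecs.lookup("ks_c_5601-1987").name

# 'A' stays in GL; the Hangul syllable U+AC00 occupies two GR bytes.
encoded = "A\uac00".encode("euc_kr")
print(alias_target, encoded, gl_gr(encoded))
```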

----------
Name: EUC-KR  (preferred MIME name)                     [RFC1557,Choi]
MIBenum: 38
Source: RFC-1557 (see also KS_C_5861-1992)
Alias: csEUCKR

----------
Name: Extended_UNIX_Code_Packed_Format_for_Japanese
MIBenum: 18
Source: Standardized by OSF, UNIX International, and UNIX Systems
        Laboratories Pacific.  Uses ISO 2022 rules to select
               code set 0: US-ASCII (a single 7-bit byte set)
               code set 1: JIS X0208-1990 (a double 8-bit byte set)
                           restricted to A0-FF in both bytes
               code set 2: Half Width Katakana (a single 7-bit byte set)
                           requiring SS2 as the character prefix
               code set 3: JIS X0212-1990 (a double 7-bit byte set)
                           restricted to A0-FF in both bytes
                           requiring SS3 as the character prefix
Alias: csEUCPkdFmtJapanese
Alias: EUC-JP  (preferred MIME name)
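
The four code sets in the EUC-JP entry above can be told apart from the first byte of each character. The following Python sketch shows the idea under stated assumptions: the function name euc_jp_units is hypothetical, bytes up to 0x7F are all treated as code set 0, and the GR range is taken as A1-FE (the registry text says A0-FF, but a 94x94 set only occupies A1-FE). SS2 and SS3 are the single-shift bytes 0x8E and 0x8F.

```python
SS2 = 0x8E  # single shift 2: prefix for code set 2 (half-width katakana)
SS3 = 0x8F  # single shift 3: prefix for code set 3 (JIS X 0212)


def _is_gr(byte: int) -> bool:
    """True if the byte lies in the GR range used by 94x94 sets (A1-FE)."""
    return 0xA1 <= byte <= 0xFE


def euc_jp_units(data: bytes) -> list:
    """Return (code_set, bytes) pairs for an EUC-JP byte string.

    Hypothetical helper; it checks only the byte structure, not whether
    each code point is actually assigned in the underlying standard.
    """
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:
            code_set, length = 0, 1      # US-ASCII, one byte
        elif lead == SS2:
            code_set, length = 2, 2      # SS2 + one katakana byte
        elif lead == SS3:
            code_set, length = 3, 3      # SS3 + two JIS X 0212 bytes
        elif _is_gr(lead):
            code_set, length = 1, 2      # two JIS X 0208 bytes
        else:
            raise ValueError(f"unexpected lead byte 0x{lead:02X} at offset {i}")
        unit = data[i:i + length]
        # Every byte after the lead must be present and lie in GR
        if len(unit) != length or not all(_is_gr(b) for b in unit[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        units.append((code_set, unit))
        i += length
    return units


sample = "A\u3042\uff71".encode("euc_jp")  # ASCII, hiragana, half-width katakana
print(euc_jp_units(sample))
```

An IANA entry for EUC-KR or GB2312 written at the same level of detail would need only code sets 0 and 1 from this sketch.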



  Jungshik Shin
