On Wed, 27 Mar 2002, Anton Tagunov wrote:
Hi, Anton,
Very glad to hear you on this list :-)
Me, too :-)
When you say gb2312 and ksc5601, EUC-based encoding is assumed.
JS> Please, don't help spread this misuse.
Well, that was not meant to be applied to GB2312 :-). Below
is a more extensive excerpt from where I wrote that sentence:
JS> Please, don't help spread this misuse. It might be all right
JS> for the (ignorant) public to say KS C 5601 in place of EUC-KR, but Perl
JS> programmers should learn the difference between KS C 5601/KS X 1001 (coded
JS> character set) and encoding/MIME charset/character set encoding scheme/
JS> character coding.
JS> As I wrote before, GB 2312 has been so widely (mis)used that there's
JS> no way to replace it with EUC-CN. The Korean situation is much better
JS> although not as good as the Japanese case.
It could have been misunderstood.....
Jungshik, one little point on GB2312.. Maybe I misunderstand
something, but
No, you're absolutely right about IANA. See below.
IANA registry (http://www.iana.org/assignments/character-sets)
has
Name: GB2312  (preferred MIME name)
MIBenum: 2025
Source: Chinese for People's Republic of China (PRC) mixed one byte,
        two byte set:
          20-7E = one byte ASCII
          A1-FE = two byte PRC Kanji
        See GB 2312-80
        PCL Symbol Set Id: 18C
Alias: csGB2312
I do not know when that was put in, but it looks like EUC-CN. Is it?
And if yes, then GB2312 is a perfectly valid charset, isn't it?
Yes, it's EUC-CN. I was about to add that although
EUC-CN is a better name than GB2312, the former has never been registered
with IANA while the latter was, as the 'preferred MIME name'. You got there
first :-). It's unfortunate that the PRC decided to do it this way, but that's
what we've got and I think we have to respect their decision.
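
To make the octet ranges in that registry entry concrete, here is a small
sketch in Python (picked only for convenience). It assumes CPython's bundled
'gb2312' codec and its 'euc-cn' alias; byte_ranges_ok is a made-up helper
name, not part of any library. It checks that encoded text stays within
20-7E for one-byte ASCII and A1-FE for both bytes of a two-byte character:

```python
import codecs

def byte_ranges_ok(data: bytes) -> bool:
    """Check that data consists of 20-7E single bytes and A1-FE byte pairs.

    Hypothetical helper: it follows the ranges in the IANA entry only, so
    control characters such as newline are deliberately rejected.
    """
    i = 0
    while i < len(data):
        b = data[i]
        if 0x20 <= b <= 0x7E:          # one-byte ASCII
            i += 1
        elif 0xA1 <= b <= 0xFE:        # lead byte of a two-byte character
            if i + 1 >= len(data) or not (0xA1 <= data[i + 1] <= 0xFE):
                return False
            i += 2
        else:
            return False
    return True

encoded = "Perl 中文".encode("gb2312")
print(encoded)                         # ASCII bytes followed by two byte pairs
print(byte_ranges_ok(encoded))
print(codecs.lookup("euc-cn").name)    # the alias resolves to the same codec
```

In other words, at least in this one implementation, the MIME name GB2312
and the name EUC-CN lead to the same encoding.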
And thank you for explaining how it happened that Korean
misuse the name of a CCS for charset :-)
You're welcome :-)
Actually, I told you only half the story :-). The other half happened
before the widespread use of the Internet in Korea (i.e. the late 1980s and
early 1990s), when people typically referred to what's now called EUC-KR
as 'KS C 5601 Wansung' (= US-ASCII in GL and KS C 5601 in GR). It was
not technically correct, but it didn't do much harm because there was no
need to exchange data over the Internet. EUC (Extended Unix Code,
not Extended Unix Character) for Korean was first specified in KS
C 5861-1992 (now KS X 2901), but the name EUC-KR first appeared in RFC
1557, where ISO-2022-KR was defined. It would have been better if RFC
1557 had been more explicit in its description of EUC-KR, so that the IANA
entry for EUC-KR could have been patterned after that for EUC-JP (GB2312 -> EUC-CN),
with all the code sets and their octet ranges. Perhaps they
thought just referring to KS C 5861-1992 was sufficient.
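
The difference between the coded character set and its encodings can be
sketched in a few lines of Python. This assumes CPython's 'euc_kr' and
'iso2022_kr' codecs; ksc5601_row_cell is a hypothetical helper of mine, not
a library function. The same KS C 5601 position shows up with the high bit
set in EUC-KR (GR) and as 7-bit bytes in ISO-2022-KR:

```python
def ksc5601_row_cell(ch):
    """Return the (row, cell) position of ch in KS C 5601.

    Hypothetical helper: it derives the position from the EUC-KR bytes and
    only works for characters that encode to exactly two bytes.
    """
    hi, lo = ch.encode("euc_kr")
    return hi - 0xA0, lo - 0xA0

euc = "한".encode("euc_kr")            # KS C 5601 invoked in GR (high bit set)
iso = "한".encode("iso2022_kr")        # same character set, 7-bit bytes
seven_bit = bytes(b & 0x7F for b in euc)

print(euc.hex(), seven_bit, iso)
print(ksc5601_row_cell("한"))
print("A".encode("euc_kr"))            # US-ASCII in GL passes through unchanged
```

The row and cell are a property of KS C 5601 itself; how they are turned
into octets is what EUC-KR and ISO-2022-KR each define.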
----------
Name: EUC-KR  (preferred MIME name)            [RFC1557,Choi]
MIBenum: 38
Source: RFC-1557 (see also KS_C_5861-1992)
Alias: csEUCKR
----------
Name: Extended_UNIX_Code_Packed_Format_for_Japanese
MIBenum: 18
Source: Standardized by OSF, UNIX International, and UNIX Systems
        Laboratories Pacific.  Uses ISO 2022 rules to select
               code set 0: US-ASCII (a single 7-bit byte set)
               code set 1: JIS X0208-1990 (a double 8-bit byte set)
                           restricted to A0-FF in both bytes
               code set 2: Half Width Katakana (a single 7-bit byte set)
                           requiring SS2 as the character prefix
               code set 3: JIS X0212-1990 (a double 7-bit byte set)
                           restricted to A0-FF in both bytes
                           requiring SS3 as the character prefix
Alias: csEUCPkdFmtJapanese
Alias: EUC-JP  (preferred MIME name)
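
To show how the four code sets in the EUC-JP entry are told apart in
practice, here is a rough Python sketch. It assumes CPython's 'euc_jp' codec
covers JIS X 0212, and euc_jp_code_sets is a made-up name for illustration.
It splits an EUC-JP byte string by lead byte, with SS2 = 8E and SS3 = 8F:

```python
SS2, SS3 = 0x8E, 0x8F   # single-shift prefixes for code sets 2 and 3

def euc_jp_code_sets(data):
    """Split EUC-JP bytes into (code_set, bytes) pairs by lead byte.

    Hypothetical helper: it assumes well-formed input and does no validation.
    """
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n, cs = 1, 0      # code set 0: one US-ASCII byte
        elif b == SS2:
            n, cs = 2, 2      # code set 2: SS2 + one half-width katakana byte
        elif b == SS3:
            n, cs = 3, 3      # code set 3: SS3 + two JIS X 0212 bytes
        else:
            n, cs = 2, 1      # code set 1: two JIS X 0208 bytes
        out.append((cs, data[i:i + n]))
        i += n
    return out

# ASCII, hiragana, half-width katakana, and a kanji found only in JIS X 0212
text = "Aあｱ丂"
print(euc_jp_code_sets(text.encode("euc_jp")))
```

An entry for EUC-KR written in the same style would need only code sets 0
and 1.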
Jungshik Shin