perl-unicode

Re: Encoding vs Charset

2002-03-26 19:23:23

   And I have found that most Chinese (mainland; Taiwanese seem to be
much more technically correct) and Korean mails and web
pages confuse "charset" and "encoding".  That is, charset="gb2312"

  IMHO, you're also misusing the term 'charset' here. MIME charset
can be used synonymously with 'encoding' (or
'character set encoding scheme': see CJKV Information Processing,
IETF RFC 2130 and RFC 2278). What has to be distinguished
is 'coded character set' on the one hand (JIS X 0208, JIS X 0212,
KS X 1001, KS X 1003, GB 2312, CNS 11xxx, ISO 10646, ISO 646, US-ASCII,
ISO-8859-x) and 'encoding'/'character
set encoding scheme'/'MIME charset' on the other hand (EUC-JP,
EUC-KR, EUC-TW, EUC-CN, ISO-2022-JP, ISO-2022-KR, ISO-2022-CN,
ISO-8859-x, UTF-8, UTF-32, UTF-7, UTF-16, Big5, UHC).
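
  To make the distinction concrete, here is a minimal sketch in Python
(my choice for illustration; the codec names 'euc_kr' and 'iso2022_kr'
are Python's, not anything from this thread). It takes one character
from the coded character set KS X 1001 and serializes it with two
different encodings, EUC-KR and ISO-2022-KR.

```python
# Sketch: one coded character set (KS X 1001), two encodings of it.
text = "한"  # a hangul syllable that is part of KS X 1001

euc = text.encode("euc_kr")      # 8-bit encoding: both bytes have the high bit set
iso = text.encode("iso2022_kr")  # 7-bit encoding: escape sequence plus shift codes

# Clearing the high bit of the EUC-KR bytes gives the 7-bit KS X 1001 code
# for the character, which is what ISO-2022-KR carries between its shift codes.
ksx1001_code = bytes(b & 0x7F for b in euc)

print(euc)
print(iso)
print(ksx1001_code in iso)
```

  The byte sequences differ, yet both decode to the same character,
because the coded character set underneath is the same and only the
encoding scheme changes.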

  All right, in certain contexts 'charset' may have been used
to mean 'coded character set', but it had better be avoided
when you want to contrast it with 'encoding', because 'charset'
(in the MIME context) also means 'encoding' rather than 'coded
character set'.

really means euc-cn and charset="ks_c_5601-1987" really means euc-kr.  
Sadly this misconception is embedded in popular browsers.

  Well, use of 'ks_c_5601-1987' is the result of an 'evil'
act of Microsoft. We furiously objected to it, but M$ went on
to use that name in their products instead of the then-well-established
EUC-KR around 1997. Please refer to Ken Lunde's CJKV Information Processing
about that 'epic war' between the two camps (see p.197 of
the book and http://jshin.net/faq/qa8.html).
We even set up a web page to prevent M$ from spreading that
ill-defined name. Anyway,
their designation couldn't withstand the test of time because
KS C 5601-1987 was renamed KS X 1001:1998. Still, M$ IE,
M$ OE and M$ Frontpage keep producing HTML docs and emails
labeled that way. However,
it also has to be noted that the encoding
designated as 'ks_c_5601-1987' by M$ is NOT the same as
EUC-KR BUT their proprietary extension of EUC-KR, namely
CP949/UHC/(X-)Windows-949.
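
  The difference between EUC-KR and CP949/UHC can be checked with a
short Python sketch. The codec names 'cp949' and 'euc_kr' are Python's,
and the syllable used as the extended example is my assumption of a
character that lies outside KS X 1001; neither comes from this thread.

```python
# Sketch: CP949/UHC extends EUC-KR, so the two labels are not interchangeable.
common = "한"    # in KS X 1001, so representable in plain EUC-KR
extended = "똠"  # assumed example: a hangul syllable outside KS X 1001

# Characters from KS X 1001 get the same bytes under both codecs.
same_bytes = common.encode("cp949") == common.encode("euc_kr")

# The UHC extension uses byte values below the EUC-KR range 0xA1-0xFE.
raw = extended.encode("cp949")
outside_euc_range = any(b < 0xA1 for b in raw)

# A decoder that accepts only EUC-KR is expected to reject those bytes.
try:
    raw.decode("euc_kr")
    decodes_as_euc_kr = True
except UnicodeDecodeError:
    decodes_as_euc_kr = False

print(same_bytes, outside_euc_range, decodes_as_euc_kr)
```

  So a document labeled 'ks_c_5601-1987' by M$ products may contain
byte sequences that a strict EUC-KR decoder cannot handle.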

Sadly this misconception is embedded in popular browsers.

  MS IE certainly counts as a popular browser, but Mozilla/Netscape
never used 'ks_c_5601-1987' to mean EUC-KR. They have always
used 'EUC-KR'.  Mozilla uses 'X-Windows-949' to mean CP949/UHC,
and 'ks_c_5601-1987' is an alias for 'X-Windows-949' (but
Mozilla will never put 'ks_c_5601-1987' in outgoing messages/docs.
It only accepts HTML/emails labeled that way and treats them as
X-Windows-949).

In the case of 'GB2312' in place of 'EUC-CN',
the situation was beyond repair (Ken Lunde's book
came too late, and an error-prone book by a Japanese engineer
working at MS, published a few years earlier, had spread the
misconception too widely), so the name just stuck.
As for Taiwan, the reason there's no confusion between
coded character set and encoding is not that they're
technically correct but that in their case EUC-TW
has never been widely used, while the popular encoding
Big5 has a much more complex relationship with CNS 11xxx
than EUC-KR has with KS X 1001 or EUC-CN with GB 2312.
(Big5 vs CNS 11xxx is similar to Shift_JIS vs JIS X 0208.)

   Jungshik Shin
