On Wed, 3 Apr 2002, Dan Kogai wrote:
Dan,
Thank you for your write-up. Below are some comments.
o The MIME name as defined in IETF RFCs.
UCS-2 ucs2, iso-10646-1 [IANA, et al]
UCS-2le
UTF-8 utf8 [RFC2279]
----------------------------------------------------------------
How about UCS-2BE? Of course, if UCS-2 is network byte order
(big endian), it's not necessary. In that case, you may alias UCS-2
to UCS-2BE.
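
To make the byte-order point concrete, here is a minimal sketch in Python rather than Perl. It assumes that Python's UTF-16 codecs are an acceptable stand-in for UCS-2, which holds only for characters in the BMP; the variable names are illustrative and not part of Encode.

```python
# Sketch only: Python ships no UCS-2 codec, so the UTF-16 codecs stand in
# for it here. For a BMP character the two produce the same bytes.
text = "\uac00"  # HANGUL SYLLABLE GA, U+AC00

big_endian = text.encode("utf-16-be")     # network byte order
little_endian = text.encode("utf-16-le")  # byte-swapped form

print(big_endian.hex())
print(little_endian.hex())
```

If plain UCS-2 is defined as big endian, the first form is what an unqualified "UCS-2" would produce, and UCS-2BE is then only an alias.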
=item Encode::KR -- Korea
----------------------------------------------------------------
euc-kr MacKorean
euc-kr MacKorean [RFC1557, IANA, KS X 2901]
cp949 ks_c_5601-1987
iso-2022-kr [RFC1557]
johab
johab KS X 1001:1998 Annex 3
ksc5601-raw KSC5601 as is
----------------------------------------------------------------
=item Vietnamese encodings VPS
Mozilla supports VPS. See
http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf
http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut
=head1 Encoding vs. Charset
Character encoding (or just "encoding") and Character Set (or just
"charset") are often used interchangeably but they are different
concepts.
=item Character I<Set> (I<charset> for short)
Could you please just say 'Encoding vs Character Set'
and remove the parenthetical 'charset for short' or 'just charset' following
'character set'? I agree with your distinction between 'encoding' and
'character set', but what bothers me is that you treat 'charset'
as a synonym for 'character set'.
Whether you like it or not, 'charset' is overloaded by MIME to mean
'encoding' (Character Encoding Scheme, or CES, as defined in RFC 2130).
Every day numerous HTML documents are produced with meta tags that read
'Content-Type=text/html; charset=XXXX'. The same is true of email
messages with a C-T header like 'text/plain; charset=ISO-2022-JP'.
Therefore 'Encoding vs Charset' can be interpreted as 'Encoding vs
Encoding'. On the other hand, no one with *sufficient understanding*
of the issue uses 'character set' to mean encoding.
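
As a rough illustration of how the MIME 'charset' parameter names an encoding rather than a character set, the sketch below (Python, standard library only) reads the parameter from a Content-Type header and hands it straight to a codec. The header value is the one quoted above; the sample text and variable names are made up for the example.

```python
from email.message import Message

# Build a message carrying the header quoted above.
msg = Message()
msg["Content-Type"] = "text/plain; charset=ISO-2022-JP"

# The 'charset' parameter comes back as an encoding name (lowercased).
charset = msg.get_content_charset()

# It is usable directly as a codec name, i.e. it selects an encoding.
sample = "\u65e5\u672c\u8a9e"   # three kanji, used only as test data
body = sample.encode(charset)
decoded = body.decode(charset)

print(charset)
```

The value of 'charset' selects a complete byte-level encoding (here a 7-bit one with escape sequences), not merely a repertoire such as JIS X 0208.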
Is a collection of characters in which each character is distinguished
with unique ID (in most cases, ID is number).
Some people like to distinguish between a mere collection of characters
and a collection of characters with unique (numeric) IDs/code points.
The former is sometimes referred to as a character repertoire
or a character set whereas the latter is called a 'coded character set'.
=item Character I<Encoding>
A character encoding may also encode character set as-is (also called
a I<raw> encoding. i.e. US-ascii) or processed (i.e. EUC-JP, US-ascii is
as-is, JIS X 0201 is prepended with \x8E, JIS X 0208 is added by
0x8080, and JIS X 0212 is added by 0x8080 then prepended with \x8F).
In a strict sense, the concept of 'raw' or 'as-is' (which you
apparently use to mean a coded character set invoked on GL) is not
appropriate, because JIS X 0208, JIS X 0212 and KS X 1001 don't map
characters to their GL positions when enumerating characters in their
charts. The numeric IDs used in JIS X 0208, JIS X 0212 and KS X 1001
are row (ku) and column (ten?) numbers, while GB 2312-80 appears to use GL
code points. That's why I prefer gb2312-gl and ksx1001-gl to gb2312-raw
and ksx1001-raw. 'gl' doesn't run the risk of being mistaken for row and
column numbers.
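
The difference between row/column numbers, GL code points, and the EUC (GR) form can be sketched as follows. This is only an illustration in Python; the helper names kuten_to_gl and gl_to_gr are hypothetical, and the two sample characters (JIS X 0208 row 4, column 2 and KS X 1001 row 16, column 1) are chosen just as test data.

```python
def kuten_to_gl(row: int, cell: int) -> bytes:
    """Convert row/column numbers (1..94) to a two-byte GL code (0x21..0x7E)."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([row + 0x20, cell + 0x20])


def gl_to_gr(code: bytes) -> bytes:
    """Set the high bit of each byte, i.e. invoke the same set on GR (EUC form)."""
    return bytes(b | 0x80 for b in code)


hiragana_gl = kuten_to_gl(4, 2)    # JIS X 0208 row 4, column 2
hangul_gl = kuten_to_gl(16, 1)     # KS X 1001 row 16, column 1

hiragana = gl_to_gr(hiragana_gl).decode("euc_jp")
hangul = gl_to_gr(hangul_gl).decode("euc_kr")

print(hiragana_gl.hex(), gl_to_gr(hiragana_gl).hex())
print(hangul_gl.hex(), gl_to_gr(hangul_gl).hex())
```

The same character thus has three different numeric forms (04-02, 0x2422, 0xA4A2), which is why a name like 'raw' leaves open which one is meant.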
KS_C_5601-1987
has been registered to IANA but when they are used, they are
EUC-coded. Internet community in Korea is not happy with this.
so C<KS_C_5601-1987> is aliased to C<cp949>, an enhanced version
of C<euc-kr>, with ksc5601-raw for "uncooked".
I'm afraid this could give the impression that
IANA is to blame for the misuse of the CCS name to mean encoding/CES. Whether
ks_c_5601-1987 is registered with IANA or not, nobody had used it in
MIME charset designation (although the general public used KS C 5601 or
Wansung to mean EUC-KR) before Microsoft began to use it in 1997~1998
for their own CP949 (not EUC-KR per se). BTW, I wouldn't call CP949 an
*enhanced* version of EUC-KR. CP949 doesn't have some nice properties
of EUC-KR/JP/CN. Rather, I'd say it's an extension of EUC-KR used
in MS-Windows 9x/ME/NT4/2k/XP. CP949 will never be supported under
Linux/Unix. We'll just go straight to UTF-8.
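
To illustrate what 'extension' means here, the following Python sketch compares the two codecs on a syllable that is in KS X 1001 and on one that is not. It relies on Python's own euc_kr and cp949 codecs, not on Encode; the variable names are illustrative.

```python
common = "\uac00"     # HANGUL SYLLABLE GA, present in KS X 1001
extension = "\ub620"  # HANGUL SYLLABLE DDOM, absent from KS X 1001

# For KS X 1001 characters the two encodings agree byte for byte.
euc_bytes = common.encode("euc_kr")
cp_bytes = common.encode("cp949")

# The extra syllable gets a two-byte CP949 code outside the EUC range.
ext_bytes = extension.encode("cp949")
outside_gr = any(b < 0xA1 for b in ext_bytes)

print(euc_bytes.hex(), cp_bytes.hex(), ext_bytes.hex(), outside_gr)
```

In EUC-KR every byte of a two-byte character lies in 0xA1..0xFE, one of the properties CP949 gives up. How a given EUC-KR codec handles syllables outside KS X 1001 varies between implementations, so the sketch does not rely on it.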
UTF-16
KOI8-U (http://www.faqs.org/rfcs/rfc2319.html)
are IANA-registered (C<UTF-16> even as a preferred MIME name)
but probably should be avoided as encoding for web pages due to
the lack of browser supports.
Not that I'd encourage people to use UTF-16 for their web pages,
but UTF-16 is supported by MS IE and KOI8-U is supported by both MS IE
and Mozilla.
=item CJK.inf
L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
Somewhat obsolete (last update in 1996), but still useful. Also try
Is there any rule against mentioning a book in print as opposed
to online docs :-) ? Why don't you also refer to the successor to
CJK.inf, CJKV Information Processing, which has very comprehensive coverage
of character sets and encodings?
Cheers,
Jungshik