Hello Jungshik!
Our comments go in the same direction, but will you
let me strengthen your statements a bit?
=head1 Encoding vs. Charset
JS> Whether you like it or not, 'charset' is overloaded by MIME to mean
JS> 'encoding' (Character set Encoding Scheme=CES as defined in RFC 2130).
Indeed it is.
RFC 2278 additionally makes it explicit.
JS> On the other hand, no one with *sufficient understanding*
JS> of the issue uses 'character set' to mean encoding.
[ECMA-35, (equivalent of ISO 2022?)]:
coded character set; code
A set of unambiguous rules that establishes a
character set and the one-to-one relationship between the
characters of the set and their coded representation.
[RFC 1345]:
The ISO definition of the term "coded character set" is as
follows: "A set of unambiguous rules that establishes a
character set and the one-to-one relationship between the
characters of the set and their coded representation."
Hmmm... could this lead people to mistake "character set" for
a short form of "coded character set" (in the ISO meaning)?
I see that these definitions themselves distinguish between a
"character set" (= repertoire) and a
"coded character set" (= CCS + encoding = CCS + CES).
Do you read them the same way, Jungshik?
Is a collection of characters in which each character is distinguished
by a unique ID (in most cases, the ID is a number).
JS> Some people like to distinguish between a mere collection of characters
JS> and a collection of characters with unique (numeric) IDs / code points.
JS> The former is sometimes referred to as a character repertoire
JS> or a character set whereas the latter is called a 'coded character set'.
Or rather "CCS", to rule out the ISO understanding.
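
To make the CCS/CES split concrete, here is a minimal sketch. It assumes Unicode in the role of the coded character set and uses UTF-8 and UTF-16BE as two encoding schemes for it; the choice of character is arbitrary and mine, not taken from the draft.

```python
# Illustration only: one character, one code point (CCS level),
# two different octet sequences (CES level).
char = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

code_point = ord(char)                   # CCS: character -> number
utf8_bytes = char.encode("utf-8")        # CES: number -> octets
utf16_bytes = char.encode("utf-16-be")   # another CES for the same CCS

print(hex(code_point))    # 0xe9
print(utf8_bytes.hex())   # c3a9
print(utf16_bytes.hex())  # 00e9
```

The code point stays the same while the octets differ, which is the distinction a MIME 'charset' label glosses over.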
=item Character I<Encoding>
A character encoding may also encode a character set as-is (also called
a I<raw> encoding, e.g. US-ASCII) or processed (e.g. EUC-JP: US-ASCII is
kept as-is, JIS X 0201 is prefixed with \x8E, JIS X 0208 has 0x8080
added, and JIS X 0212 has 0x8080 added and is then prefixed with \x8F).
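
The EUC-JP rules in this item can be sketched roughly as follows. The function name and the charset labels are hypothetical, chosen only for this illustration, and the sketch assumes that JIS X 0208 / JIS X 0212 codes are given as two-byte GL values and JIS X 0201 katakana as 8-bit values (0xA1..0xDF).

```python
def to_euc_jp(charset: str, code: int) -> bytes:
    """Build EUC-JP octets from a code in one of the four component sets.

    The labels "ascii", "jisx0201", "jisx0208" and "jisx0212" are
    made-up names for this sketch, not registered identifiers.
    """
    if charset == "ascii":
        return bytes([code])                       # kept as-is
    if charset == "jisx0201":
        return bytes([0x8E, code])                 # prefixed with 0x8E
    if charset == "jisx0208":
        return (code + 0x8080).to_bytes(2, "big")  # 0x8080 added
    if charset == "jisx0212":
        # 0x8080 added, then prefixed with 0x8F
        return b"\x8f" + (code + 0x8080).to_bytes(2, "big")
    raise ValueError(f"unknown charset label: {charset!r}")


# HIRAGANA LETTER A is 0x2422 in JIS X 0208 (GL form).
hiragana_a = to_euc_jp("jisx0208", 0x2422)
print(hiragana_a.hex())                      # a4a2
print(hiragana_a.decode("euc_jp") == "\u3042")
```

Python's own euc_jp codec is used here only as a cross-check on the hand-built octets.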
JS> In a strict sense, the concept of 'raw' or 'as-is' (which you
JS> apparently use to mean a coded character set invoked on GL) is not
JS> appropriate. Because JIS X 0208, JIS X 0212 and KS X 1001 don't map
JS> characters to their GL position when enumerating characters in their
JS> charts.
Looks like RFC 1345 has made one big pile:
JIS_C6226-1978, JIS_C6226-1978 = JIS_C6226-1983
GB_1988-80
KS_C_5601-1987
are all listed in a similar manner there. Does this RFC change
anything?
JS> The numeric IDs used in JIS X 0208, JIS X 0212 and KS X 1001
JS> are row (ku) and column (ten?), while GB 2312-80 appears to use GL
JS> code points.
Thanks a lot! I would never have caught this subtlety from the
reading I have done.
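
For the record, the relation between row/column numbers and GL code points in a 94x94 set is simple arithmetic. A minimal sketch, with function names of my own invention:

```python
def kuten_to_gl(ku: int, ten: int) -> int:
    """Convert a row (ku) and column (ten) pair, each 1..94, to a GL code."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must both be in 1..94")
    # Each number is offset by 0x20 so that it lands in 0x21..0x7E.
    return ((ku + 0x20) << 8) | (ten + 0x20)


def gl_to_kuten(code: int) -> tuple[int, int]:
    """Inverse of kuten_to_gl: split a GL code back into (ku, ten)."""
    return (code >> 8) - 0x20, (code & 0xFF) - 0x20


print(hex(kuten_to_gl(4, 2)))   # 0x2422
print(gl_to_kuten(0x2422))      # (4, 2)
```

So the same character can be cited as "4-2" or as 0x2422, which is exactly why a name should say which of the two it means.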
JS> That's why I prefer gb2312-gl and ksx1001-gl to gb2312-raw
JS> and ksx1001-raw. 'gl' doesn't have a risk of being mistaken for row and
JS> column numbers.
I used to advocate the RFC 1345 names, but apparently they
did not ease the situation (too long and too complex :)
KS_C_5601-1987
has been registered with IANA, but when it is used, the data is
EUC-coded. The Internet community in Korea is not happy with this,
so C<KS_C_5601-1987> is aliased to C<cp949>, an enhanced version
of C<euc-kr>, with ksc5601-raw for "uncooked".
JS> I'm afraid this could give an impression that
JS> IANA is to blame for misuse of the CCS name to mean encoding/CES. Whether
JS> ks_c_5601-1987 is registered with IANA or not, nobody had used it in
JS> MIME charset designation (although the general public used KS C 5601 or
JS> Wansung to mean EUC-KR) before Microsoft began to use it in 1997~1998
JS> for their own CP949 (not EUC-KR per se). BTW, I wouldn't call CP949 an
JS> *enhanced* version of EUC-KR. CP949 doesn't have some nice properties
JS> of EUC-KR/JP/CN. Rather, I'd say it's an extension of EUC-KR used
JS> in MS-Windows 9x/ME/NT4/2k/XP. CP949 will never be supported under
JS> Linux/Unix. We'll just go straight to UTF-8.
I have incorporated your ideas into a patch. Let's see what Dan
thinks of it! (The patch was sent in reply to Dan's core message on
the Supported.pod renewal.)
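
As a quick check of the "extension" wording, here is a rough sketch using Python's cp949 and euc_kr codecs. The helper name is hypothetical, and the sample syllable U+B620 is my own pick of a Hangul syllable that is not among the 2350 precomposed syllables of KS X 1001.

```python
def decodes_as_euc_kr(data: bytes) -> bool:
    """Return True if `data` is accepted by Python's euc_kr decoder."""
    try:
        data.decode("euc_kr")
    except UnicodeDecodeError:
        return False
    return True


# U+D55C U+AE00 ("hangul"): both syllables are in KS X 1001.
common = "\ud55c\uae00".encode("cp949")
# U+B620: only reachable through the CP949 extension area.
extended = "\ub620".encode("cp949")

print(common == "\ud55c\uae00".encode("euc_kr"))
print(decodes_as_euc_kr(common))
print(decodes_as_euc_kr(extended))
```

Text limited to KS X 1001 should come out byte-identical in both encodings, while the extra syllable yields octets that an EUC-KR decoder rejects.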
UTF-16
Awaiting more comments from you (see below)
KOI8-U (http://www.faqs.org/rfcs/rfc2319.html)
Haven't tested this one myself :-(
No objections to changing its status. My patch has that.
are IANA-registered (C<UTF-16> even as a preferred MIME name)
but probably should be avoided as encodings for web pages due to
the lack of browser support.
JS> Not that I'd encourage people to use UTF-16 for their web pages,
JS> but UTF-16 is supported by MS IE and KOI8-U is supported by both MS IE
JS> and Mozilla.
Hmm.. My attempts to use UTF-16 failed with IE5.5..
Has anyone demonstrated it to work?
=item CJK.inf
L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
Somewhat obsolete (last update in 1996), but still useful. Also try
JS> Is there any rule against mentioning a book in print as opposed
JS> to online docs :-) ? Why don't you also refer to a successor to
JS> CJK.inf, CJKV Information Processing with a very comprehensive coverage
JS> on character sets and encodings.
http://www.oreilly.com/catalog/cjkvinfo/ is the link for the book
"CJKV Information Processing" is the name
But someone has to write a good recommendation for that.
Let it be someone who has the book ;-)
Or could it be
Ken Lunde's book "CJKV Information Processing"
http://www.oreilly.com/catalog/cjkvinfo/
Successor to CJK.inf. Features very comprehensive coverage
of CJKV character sets and encodings.
?
Heartiest regards, Anton